Datasets

Modules:

Name	Description
`dataset_config`
`dataset_manager`
`managers`

Classes:

Name	Description
`DatasetConfig`	Configuration for a dataset.
`DatasetManager`	A protocol for managing datasets.
`DatasetManagerRegistry`	A registry for dataset managers.
`DatasetMetadata`	The metadata for a dataset.
`DatasetRecord`	A record identifying a dataset.
`FileBasedDatasetManager`	An abstract class for managing datasets.
`FileMetadata`	The metadata for a dataset file.
`LocalSource`	The local source of the dataset file(s).
`OnlineSource`	The online source of the dataset file(s).
`SplitMetadata`	The metadata for a dataset split.
`VersionMetadata`	The metadata for a dataset version.

Functions:

Name	Description
`manager`	Decorator to register a dataset manager.

DatasetConfig

Configuration for a dataset.

Attributes:

Name	Type	Description
`dataset_name`	`str`	The name of the dataset.
`dataset_metadata`	`DatasetMetadata`	The metadata for the dataset.

Methods:

Name	Description
`__init__`	Initializes a new DatasetConfig.
`get_files`	Gets the files for the specified version and splits.
`get_splits`	Gets the dataset splits for the specified version.

Source code in evalsense/datasets/dataset_config.py

class DatasetConfig:
    """Configuration for a dataset.

    Attributes:
        dataset_name (str): The name of the dataset.
        dataset_metadata (DatasetMetadata): The metadata for the dataset.
    """

    def __init__(self, dataset_name: str):
        """Initializes a new DatasetConfig.

        Args:
            dataset_name (str): The name of the dataset.
        """
        self.dataset_name = dataset_name
        config = {}
        for config_path in DATASET_CONFIG_PATHS:
            config_file = config_path / (to_safe_filename(dataset_name) + ".yml")
            if config_file.exists():
                try:
                    with open(config_file, "r") as f:
                        new_config = yaml.safe_load(f)
                    config = deep_update(config, new_config)
                except Exception as e:
                    warnings.warn(
                        f"Failed to load dataset config from {config_file}: {e}"
                    )
                    continue
        self.dataset_metadata = DatasetMetadata(**config)

    def get_files(self, version: str, splits: list[str]) -> dict[str, FileMetadata]:
        """Gets the files for the specified version and splits.

        Args:
            version (str): The name of the version.
            splits (list[str]): The names of the splits.

        Returns:
            (dict[str, FileMetadata]): The files for the version and splits.
        """
        return self.dataset_metadata.get_files(version, splits)

    def get_splits(self, version: str) -> dict[str, SplitMetadata]:
        """Gets the dataset splits for the specified version.

        Args:
            version (str): The name of the version.

        Returns:
            (dict[str, SplitMetadata]): The splits for the version.
        """
        return self.dataset_metadata.get_splits(version)

init

__init__(dataset_name: str)

Initializes a new DatasetConfig.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	The name of the dataset.	required

Source code in evalsense/datasets/dataset_config.py

def __init__(self, dataset_name: str):
    """Initializes a new DatasetConfig.

    Args:
        dataset_name (str): The name of the dataset.
    """
    self.dataset_name = dataset_name
    config = {}
    for config_path in DATASET_CONFIG_PATHS:
        config_file = config_path / (to_safe_filename(dataset_name) + ".yml")
        if config_file.exists():
            try:
                with open(config_file, "r") as f:
                    new_config = yaml.safe_load(f)
                config = deep_update(config, new_config)
            except Exception as e:
                warnings.warn(
                    f"Failed to load dataset config from {config_file}: {e}"
                )
                continue
    self.dataset_metadata = DatasetMetadata(**config)

get_files

get_files(
    version: str, splits: list[str]
) -> dict[str, FileMetadata]

Gets the files for the specified version and splits.

Parameters:

Name	Type	Description	Default
`version`	`str`	The name of the version.	required
`splits`	`list[str]`	The names of the splits.	required

Returns:

Type	Description
`dict[str, FileMetadata]`	The files for the version and splits.

Source code in evalsense/datasets/dataset_config.py

def get_files(self, version: str, splits: list[str]) -> dict[str, FileMetadata]:
    """Gets the files for the specified version and splits.

    Args:
        version (str): The name of the version.
        splits (list[str]): The names of the splits.

    Returns:
        (dict[str, FileMetadata]): The files for the version and splits.
    """
    return self.dataset_metadata.get_files(version, splits)

get_splits

get_splits(version: str) -> dict[str, SplitMetadata]

Gets the dataset splits for the specified version.

Parameters:

Name	Type	Description	Default
`version`	`str`	The name of the version.	required

Returns:

Type	Description
`dict[str, SplitMetadata]`	The splits for the version.

Source code in evalsense/datasets/dataset_config.py

def get_splits(self, version: str) -> dict[str, SplitMetadata]:
    """Gets the dataset splits for the specified version.

    Args:
        version (str): The name of the version.

    Returns:
        (dict[str, SplitMetadata]): The splits for the version.
    """
    return self.dataset_metadata.get_splits(version)

DatasetManager

Bases: Protocol

A protocol for managing datasets.

Attributes:

Name	Type	Description
`priority`	`int`	The priority of the dataset manager. Ranges from 0 (the lowest priority) to 10 (the highest priority). A class attribute.
`name`	`str`	The name of the dataset.
`version`	`str`	The used dataset version.
`splits`	`list[str]`	The dataset splits to retrieve.
`data_path`	`Path`	The top-level directory for storing all datasets.
`dataset`	`Dataset \| None`	The loaded dataset.
`dataset_dict`	`DatasetDict \| None`	The loaded dataset dictionary.

Methods:

Name	Description
`__init__`	Initializes a new DatasetManager.
`can_handle`	Checks if the DatasetManager can handle the given dataset.
`create`	Creates a new dataset manager for the specified dataset.
`is_retrieved`	Checks if the dataset at the specific version is already downloaded.
`load`	Loads the dataset as a HuggingFace dataset.
`remove`	Deletes the dataset at the specific version from disk.
`retrieve`	Downloads and preprocesses a dataset.
`unload`	Unloads the dataset from memory.

Source code in evalsense/datasets/dataset_manager.py

class DatasetManager(Protocol):
    """A protocol for managing datasets.

    Attributes:
        priority (int): The priority of the dataset manager. Ranges from
            0 (the lowest priority) to 10 (the highest priority).
            A class attribute.
        name (str): The name of the dataset.
        version (str): The used dataset version.
        splits (list[str]): The dataset splits to retrieve.
        data_path (Path): The top-level directory for storing all datasets.
        dataset (Dataset | None): The loaded dataset.
        dataset_dict (DatasetDict | None): The loaded dataset dictionary.
    """

    priority: int = 0

    name: str
    version: str
    splits: list[str]
    data_path: Path
    dataset: Dataset | None
    dataset_dict: DatasetDict | None

    @classmethod
    def create(
        cls,
        name: str,
        splits: list[str],
        version: str | None = None,
        data_dir: str | None = None,
        **kwargs: dict,
    ) -> "DatasetManager":
        """Creates a new dataset manager for the specified dataset.

        Args:
            name (str): The name of the dataset.
            splits (list[str]): The dataset splits to retrieve.
            version (str | None): The dataset version to retrieve.
            data_dir (str | None): The top-level directory for storing all datasets.
            **kwargs (dict): Additional keyword arguments.

        Returns:
            (DatasetManager): The created dataset manager.
        """
        manager = DatasetManagerRegistry.get(name)
        if manager is not None:
            return manager(
                name=name,
                splits=splits,
                version=version,
                data_dir=data_dir,
                **kwargs,
            )
        raise ValueError(f"No suitable dataset manager found for {name}")

    def __init__(
        self,
        name: str,
        splits: list[str],
        version: str | None = None,
        data_dir: str | None = None,
        **kwargs: dict,
    ):
        """Initializes a new DatasetManager.

        Args:
            name (str): The name of the dataset.
            splits (list[str]): The dataset splits to retrieve.
            version (str, optional): The dataset version to retrieve.
            data_dir (str, optional): The top-level directory for storing all
                datasets. Defaults to "datasets" in the user cache directory.
            **kwargs (dict): Additional keyword arguments.
        """
        self.name = name
        self.splits = list(sorted(splits))
        self.version = version or DEFAULT_VERSION_NAME
        if data_dir is not None:
            self.data_path = Path(data_dir)
        else:
            self.data_path = DATA_PATH
        self.dataset = None
        self.dataset_dict = None

    @property
    def dataset_path(self) -> Path:
        """The top-level directory for storing this dataset.

        Returns:
            (Path): The dataset directory.
        """
        return self.data_path / to_safe_filename(self.name)

    @property
    def version_path(self) -> Path:
        """The directory for storing a specific version of this dataset.

        Returns:
            (Path): The dataset version directory.
        """
        return self.dataset_path / to_safe_filename(self.version)

    @property
    def main_data_path(self) -> Path:
        """The path for storing the preprocessed dataset files for a specific version.

        Returns:
            (Path): The main dataset directory.
        """
        return self.version_path / "main"

    @property
    def record(self) -> DatasetRecord:
        """Returns a record identifying the dataset.

        Returns:
            (DatasetRecord): The dataset record.
        """
        return DatasetRecord(
            name=self.name,
            version=self.version,
            splits=tuple(self.splits),
        )

    @abstractmethod
    def retrieve(self, **kwargs) -> None:
        """Downloads and preprocesses a dataset.

        Args:
            **kwargs (dict): Additional keyword arguments.
        """
        ...

    def is_retrieved(self) -> bool:
        """Checks if the dataset at the specific version is already downloaded.

        Returns:
            (bool): True if the dataset exists locally, False otherwise.
        """
        return self.main_data_path.exists()

    def remove(self) -> None:
        """Deletes the dataset at the specific version from disk."""
        if self.version_path.exists():
            shutil.rmtree(self.version_path)

    @overload
    def load(
        self,
        *,
        retrieve: bool = True,
        cache: bool = True,
        force_retrieve: bool = False,
        load_as_dict: Literal[False] = ...,
    ) -> Dataset: ...
    @overload
    def load(
        self,
        *,
        retrieve: bool = True,
        cache: bool = True,
        force_retrieve: bool = False,
        load_as_dict: Literal[True],
    ) -> DatasetDict: ...
    def load(
        self,
        *,
        retrieve: bool = True,
        cache: bool = True,
        force_retrieve: bool = False,
        load_as_dict: bool = False,
    ) -> Dataset | DatasetDict:
        """Loads the dataset as a HuggingFace dataset.

        Args:
            retrieve (bool, optional): Whether to retrieve the dataset if it
                does not exist locally. Defaults to True.
            cache (bool, optional): Whether to cache the dataset in memory.
                Defaults to True.
            force_retrieve (bool, optional): Whether to force retrieving and
                reloading the dataset even if it is already cached. Overrides
                the `retrieve` flag if set to True. Defaults to False.
            load_as_dict (bool, optional): Whether to load the dataset with
                multiple splits as a DatasetDict. If False (the default),
                the selected dataset splits are concatenated into a single
                dataset.

        Returns:
            (Dataset | DatasetDict): The loaded dataset.
        """
        # Return quickly if we already have the dataset cached
        if not load_as_dict and self.dataset is not None and not force_retrieve:
            return self.dataset
        if load_as_dict and self.dataset_dict is not None and not force_retrieve:
            return self.dataset_dict

        # Retrieve the dataset if needed
        if (not self.is_retrieved() and retrieve) or force_retrieve:
            self.retrieve()
        elif not self.is_retrieved():
            raise ValueError(
                f"Dataset {self.name} is not available locally and "
                "retrieve is set to False. Either `retrieve` the dataset first or "
                "set the retrieve flag to True."
            )

        # Load the retrieved dataset
        hf_dataset = load_from_disk(self.main_data_path)
        if not isinstance(hf_dataset, DatasetDict):
            raise ValueError(
                "Expected dataset to be DatasetDict, but got regular Dataset."
            )
        try:
            hf_dataset = DatasetDict({sid: hf_dataset[sid] for sid in self.splits})
        except KeyError as e:
            raise ValueError(f"No such split {e}.")

        if load_as_dict:
            # Return the dataset as a dictionary
            if cache:
                self.dataset_dict = hf_dataset
            return hf_dataset

        # Concatenate the splits and return the data as a single Dataset object
        hf_dataset = concatenate_datasets(
            [
                hf_dataset[s].cast(hf_dataset[self.splits[0]].features)
                for s in self.splits
            ]
        )
        if cache:
            self.dataset = hf_dataset
        return hf_dataset

    def unload(self) -> None:
        """Unloads the dataset from memory."""
        self.dataset = None
        self.dataset_dict = None

    @classmethod
    @abstractmethod
    def can_handle(cls, name: str) -> bool:
        """Checks if the DatasetManager can handle the given dataset.

        Args:
            name (str): The name of the dataset.

        Returns:
            (bool): True if the manager can handle the dataset, False otherwise.
        """
        pass

dataset_path `property`

dataset_path: Path

The top-level directory for storing this dataset.

Returns:

Type	Description
`Path`	The dataset directory.

main_data_path `property`

main_data_path: Path

The path for storing the preprocessed dataset files for a specific version.

Returns:

Type	Description
`Path`	The main dataset directory.

record `property`

record: DatasetRecord

Returns a record identifying the dataset.

Returns:

Type	Description
`DatasetRecord`	The dataset record.

version_path `property`

version_path: Path

The directory for storing a specific version of this dataset.

Returns:

Type	Description
`Path`	The dataset version directory.

init

__init__(
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
)

Initializes a new DatasetManager.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required
`splits`	`list[str]`	The dataset splits to retrieve.	required
`version`	`str`	The dataset version to retrieve.	`None`
`data_dir`	`str`	The top-level directory for storing all datasets. Defaults to "datasets" in the user cache directory.	`None`
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Source code in evalsense/datasets/dataset_manager.py

def __init__(
    self,
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
):
    """Initializes a new DatasetManager.

    Args:
        name (str): The name of the dataset.
        splits (list[str]): The dataset splits to retrieve.
        version (str, optional): The dataset version to retrieve.
        data_dir (str, optional): The top-level directory for storing all
            datasets. Defaults to "datasets" in the user cache directory.
        **kwargs (dict): Additional keyword arguments.
    """
    self.name = name
    self.splits = list(sorted(splits))
    self.version = version or DEFAULT_VERSION_NAME
    if data_dir is not None:
        self.data_path = Path(data_dir)
    else:
        self.data_path = DATA_PATH
    self.dataset = None
    self.dataset_dict = None

can_handle `abstractmethod` `classmethod`

can_handle(name: str) -> bool

Checks if the DatasetManager can handle the given dataset.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required

Returns:

Type	Description
`bool`	True if the manager can handle the dataset, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

@classmethod
@abstractmethod
def can_handle(cls, name: str) -> bool:
    """Checks if the DatasetManager can handle the given dataset.

    Args:
        name (str): The name of the dataset.

    Returns:
        (bool): True if the manager can handle the dataset, False otherwise.
    """
    pass

create `classmethod`

create(
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
) -> DatasetManager

Creates a new dataset manager for the specified dataset.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required
`splits`	`list[str]`	The dataset splits to retrieve.	required
`version`	`str \| None`	The dataset version to retrieve.	`None`
`data_dir`	`str \| None`	The top-level directory for storing all datasets.	`None`
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Returns:

Type	Description
`DatasetManager`	The created dataset manager.

Source code in evalsense/datasets/dataset_manager.py

@classmethod
def create(
    cls,
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
) -> "DatasetManager":
    """Creates a new dataset manager for the specified dataset.

    Args:
        name (str): The name of the dataset.
        splits (list[str]): The dataset splits to retrieve.
        version (str | None): The dataset version to retrieve.
        data_dir (str | None): The top-level directory for storing all datasets.
        **kwargs (dict): Additional keyword arguments.

    Returns:
        (DatasetManager): The created dataset manager.
    """
    manager = DatasetManagerRegistry.get(name)
    if manager is not None:
        return manager(
            name=name,
            splits=splits,
            version=version,
            data_dir=data_dir,
            **kwargs,
        )
    raise ValueError(f"No suitable dataset manager found for {name}")

is_retrieved

is_retrieved() -> bool

Checks if the dataset at the specific version is already downloaded.

Returns:

Type	Description
`bool`	True if the dataset exists locally, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

def is_retrieved(self) -> bool:
    """Checks if the dataset at the specific version is already downloaded.

    Returns:
        (bool): True if the dataset exists locally, False otherwise.
    """
    return self.main_data_path.exists()

load

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: Literal[False] = ...,
) -> Dataset

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: Literal[True],
) -> DatasetDict

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: bool = False,
) -> Dataset | DatasetDict

Loads the dataset as a HuggingFace dataset.

Parameters:

Name	Type	Description	Default
`retrieve`	`bool`	Whether to retrieve the dataset if it does not exist locally. Defaults to True.	`True`
`cache`	`bool`	Whether to cache the dataset in memory. Defaults to True.	`True`
`force_retrieve`	`bool`	Whether to force retrieving and reloading the dataset even if it is already cached. Overrides the `retrieve` flag if set to True. Defaults to False.	`False`
`load_as_dict`	`bool`	Whether to load the dataset with multiple splits as a DatasetDict. If False (the default), the selected dataset splits are concatenated into a single dataset.	`False`

Returns:

Type	Description
`Dataset \| DatasetDict`	The loaded dataset.

Source code in evalsense/datasets/dataset_manager.py

def load(
    self,
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: bool = False,
) -> Dataset | DatasetDict:
    """Loads the dataset as a HuggingFace dataset.

    Args:
        retrieve (bool, optional): Whether to retrieve the dataset if it
            does not exist locally. Defaults to True.
        cache (bool, optional): Whether to cache the dataset in memory.
            Defaults to True.
        force_retrieve (bool, optional): Whether to force retrieving and
            reloading the dataset even if it is already cached. Overrides
            the `retrieve` flag if set to True. Defaults to False.
        load_as_dict (bool, optional): Whether to load the dataset with
            multiple splits as a DatasetDict. If False (the default),
            the selected dataset splits are concatenated into a single
            dataset.

    Returns:
        (Dataset | DatasetDict): The loaded dataset.
    """
    # Return quickly if we already have the dataset cached
    if not load_as_dict and self.dataset is not None and not force_retrieve:
        return self.dataset
    if load_as_dict and self.dataset_dict is not None and not force_retrieve:
        return self.dataset_dict

    # Retrieve the dataset if needed
    if (not self.is_retrieved() and retrieve) or force_retrieve:
        self.retrieve()
    elif not self.is_retrieved():
        raise ValueError(
            f"Dataset {self.name} is not available locally and "
            "retrieve is set to False. Either `retrieve` the dataset first or "
            "set the retrieve flag to True."
        )

    # Load the retrieved dataset
    hf_dataset = load_from_disk(self.main_data_path)
    if not isinstance(hf_dataset, DatasetDict):
        raise ValueError(
            "Expected dataset to be DatasetDict, but got regular Dataset."
        )
    try:
        hf_dataset = DatasetDict({sid: hf_dataset[sid] for sid in self.splits})
    except KeyError as e:
        raise ValueError(f"No such split {e}.")

    if load_as_dict:
        # Return the dataset as a dictionary
        if cache:
            self.dataset_dict = hf_dataset
        return hf_dataset

    # Concatenate the splits and return the data as a single Dataset object
    hf_dataset = concatenate_datasets(
        [
            hf_dataset[s].cast(hf_dataset[self.splits[0]].features)
            for s in self.splits
        ]
    )
    if cache:
        self.dataset = hf_dataset
    return hf_dataset

remove

remove() -> None

Deletes the dataset at the specific version from disk.

Source code in evalsense/datasets/dataset_manager.py

def remove(self) -> None:
    """Deletes the dataset at the specific version from disk."""
    if self.version_path.exists():
        shutil.rmtree(self.version_path)

retrieve `abstractmethod`

retrieve(**kwargs) -> None

Downloads and preprocesses a dataset.

Parameters:

Name	Type	Description	Default
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Source code in evalsense/datasets/dataset_manager.py

@abstractmethod
def retrieve(self, **kwargs) -> None:
    """Downloads and preprocesses a dataset.

    Args:
        **kwargs (dict): Additional keyword arguments.
    """
    ...

unload

unload() -> None

Unloads the dataset from memory.

Source code in evalsense/datasets/dataset_manager.py

def unload(self) -> None:
    """Unloads the dataset from memory."""
    self.dataset = None
    self.dataset_dict = None

DatasetManagerRegistry

A registry for dataset managers.

Methods:

Name	Description
`get`	Gets the dataset manager for a specific dataset.
`register`	Registers a new dataset manager.

Source code in evalsense/datasets/dataset_manager.py

class DatasetManagerRegistry:
    """A registry for dataset managers."""

    registry: list[Type["DatasetManager"]] = []

    @classmethod
    def register(cls, manager: Type["DatasetManager"]):
        """Registers a new dataset manager.

        Args:
            manager (Type["DatasetManager"]): The dataset manager to be registered.
        """
        cls.registry.append(manager)

    @classmethod
    def get(cls, name: str) -> Type["DatasetManager"] | None:
        """Gets the dataset manager for a specific dataset.

        Args:
            name (str): The name of the dataset.

        Returns:
            (Type["DatasetManager"] | None): The dataset manager for the dataset, or None if not found.
        """
        for manager in sorted(cls.registry, key=lambda m: m.priority, reverse=True):
            if manager.can_handle(name):
                return manager
        return None

get `classmethod`

get(name: str) -> Type[DatasetManager] | None

Gets the dataset manager for a specific dataset.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required

Returns:

Type	Description
`Type[DatasetManager] \| None`	The dataset manager for the dataset, or None if not found.

Source code in evalsense/datasets/dataset_manager.py

@classmethod
def get(cls, name: str) -> Type["DatasetManager"] | None:
    """Gets the dataset manager for a specific dataset.

    Args:
        name (str): The name of the dataset.

    Returns:
        (Type["DatasetManager"] | None): The dataset manager for the dataset, or None if not found.
    """
    for manager in sorted(cls.registry, key=lambda m: m.priority, reverse=True):
        if manager.can_handle(name):
            return manager
    return None

register `classmethod`

register(manager: Type[DatasetManager])

Registers a new dataset manager.

Parameters:

Name	Type	Description	Default
`manager`	`Type[DatasetManager]`	The dataset manager to be registered.	required

Source code in evalsense/datasets/dataset_manager.py

@classmethod
def register(cls, manager: Type["DatasetManager"]):
    """Registers a new dataset manager.

    Args:
        manager (Type["DatasetManager"]): The dataset manager to be registered.
    """
    cls.registry.append(manager)

DatasetMetadata

Bases: BaseModel

The metadata for a dataset.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset
`versions`	`dict[str, VersionMetadata]`	The dataset versions
`source`	`OnlineSource \| LocalSource`	The immediate source of the dataset (use `effective_source` to access the effective source, which may be inherited)

Methods:

Name	Description
`get_files`	Gets the files for the specified version and splits.
`get_splits`	Gets the dataset splits for the specified version.

Source code in evalsense/datasets/dataset_config.py

class DatasetMetadata(BaseModel):
    """The metadata for a dataset.

    Attributes:
        name (str): The name of the dataset
        versions (dict[str, VersionMetadata]): The dataset versions
        source (OnlineSource | LocalSource, optional): The immediate source of
            the dataset (use `effective_source` to access the effective source,
            which may be inherited)
    """

    name: str
    versions: dict[str, VersionMetadata]
    source: OnlineSource | LocalSource | None = None

    @field_validator("versions", mode="before")
    @classmethod
    def convert_list_to_dict(cls, versions):
        if isinstance(versions, list):
            return {version["name"]: version for version in versions}
        return versions

    @override
    def model_post_init(self, _):
        for version in self.versions.values():
            version.parent = self

    @property
    def effective_source(self) -> OnlineSource | LocalSource:
        """The effective source of the dataset.

        Returns:
            (OnlineSource | LocalSource): The effective source.
        """
        if self.source is not None:
            return self.source
        raise ValueError("No effective source exists.")

    def get_files(self, version: str, splits: list[str]) -> dict[str, FileMetadata]:
        """Gets the files for the specified version and splits.

        Args:
            version (str): The name of the version.
            splits (list[str]): The names of the splits.

        Returns:
            (dict[str, FileMetadata]): The files for the version and splits.
        """
        if version not in self.versions:
            raise ValueError(f"Version '{version}' not found for dataset {self.name}.")
        return self.versions[version].get_files(splits)

    def get_splits(self, version: str) -> dict[str, SplitMetadata]:
        """Gets the dataset splits for the specified version.

        Args:
            version (str): The name of the version.

        Returns:
            (dict[str, SplitMetadata]): The splits for the version.
        """
        if version not in self.versions:
            raise ValueError(f"Version '{version}' not found for dataset {self.name}.")
        return self.versions[version].splits

effective_source `property`

effective_source: OnlineSource | LocalSource

The effective source of the dataset.

Returns:

Type	Description
`OnlineSource \| LocalSource`	The effective source.

get_files

get_files(
    version: str, splits: list[str]
) -> dict[str, FileMetadata]

Gets the files for the specified version and splits.

Parameters:

Name	Type	Description	Default
`version`	`str`	The name of the version.	required
`splits`	`list[str]`	The names of the splits.	required

Returns:

Type	Description
`dict[str, FileMetadata]`	The files for the version and splits.

Source code in evalsense/datasets/dataset_config.py

def get_files(self, version: str, splits: list[str]) -> dict[str, FileMetadata]:
    """Gets the files for the specified version and splits.

    Args:
        version (str): The name of the version.
        splits (list[str]): The names of the splits.

    Returns:
        (dict[str, FileMetadata]): The files for the version and splits.
    """
    if version not in self.versions:
        raise ValueError(f"Version '{version}' not found for dataset {self.name}.")
    return self.versions[version].get_files(splits)

get_splits

get_splits(version: str) -> dict[str, SplitMetadata]

Gets the dataset splits for the specified version.

Parameters:

Name	Type	Description	Default
`version`	`str`	The name of the version.	required

Returns:

Type	Description
`dict[str, SplitMetadata]`	The splits for the version.

Source code in evalsense/datasets/dataset_config.py

def get_splits(self, version: str) -> dict[str, SplitMetadata]:
    """Gets the dataset splits for the specified version.

    Args:
        version (str): The name of the version.

    Returns:
        (dict[str, SplitMetadata]): The splits for the version.
    """
    if version not in self.versions:
        raise ValueError(f"Version '{version}' not found for dataset {self.name}.")
    return self.versions[version].splits

DatasetRecord

Bases: BaseModel

A record identifying a dataset.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset.
`version`	`str`	The version of the dataset.
`splits`	`list[str]`	The used dataset splits.

Methods:

Name	Description
`__eq__`	Checks if this record is equal to another record.
`__hash__`	Returns a hash of the record.
`__lt__`	Checks if this record is less than another record.

Source code in evalsense/datasets/dataset_manager.py

@total_ordering
class DatasetRecord(BaseModel, frozen=True):
    """A record identifying a dataset.

    Attributes:
        name (str): The name of the dataset.
        version (str): The version of the dataset.
        splits (list[str]): The used dataset splits.
    """

    name: str
    version: str
    splits: tuple[str, ...]

    def __eq__(self, other: object) -> bool:
        """Checks if this record is equal to another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            (bool): True if the records are equal, False otherwise.
        """
        if not isinstance(other, DatasetRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.name == other.name
            and self.version == other.version
            and self.splits == other.splits
        )

    def __lt__(self, other: object) -> bool:
        """Checks if this record is less than another record.

        Args:
            other (object): The other record to compare with.

        Returns:
            (bool): True if this record is less than the other, False otherwise.
        """
        if not isinstance(other, DatasetRecord) or type(self) is not type(other):
            return NotImplemented
        return (
            self.name,
            self.version,
            self.splits,
        ) < (
            other.name,
            other.version,
            other.splits,
        )

    def __hash__(self) -> int:
        """Returns a hash of the record.

        Returns:
            (int): The hash of the record.
        """
        return hash((self.name, self.version, self.splits))

eq

__eq__(other: object) -> bool

Checks if this record is equal to another record.

Parameters:

Name	Type	Description	Default
`other`	`object`	The other record to compare with.	required

Returns:

Type	Description
`bool`	True if the records are equal, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

def __eq__(self, other: object) -> bool:
    """Checks if this record is equal to another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        (bool): True if the records are equal, False otherwise.
    """
    if not isinstance(other, DatasetRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.name == other.name
        and self.version == other.version
        and self.splits == other.splits
    )

hash

__hash__() -> int

Returns a hash of the record.

Returns:

Type	Description
`int`	The hash of the record.

Source code in evalsense/datasets/dataset_manager.py

def __hash__(self) -> int:
    """Returns a hash of the record.

    Returns:
        (int): The hash of the record.
    """
    return hash((self.name, self.version, self.splits))

lt

__lt__(other: object) -> bool

Checks if this record is less than another record.

Parameters:

Name	Type	Description	Default
`other`	`object`	The other record to compare with.	required

Returns:

Type	Description
`bool`	True if this record is less than the other, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

def __lt__(self, other: object) -> bool:
    """Checks if this record is less than another record.

    Args:
        other (object): The other record to compare with.

    Returns:
        (bool): True if this record is less than the other, False otherwise.
    """
    if not isinstance(other, DatasetRecord) or type(self) is not type(other):
        return NotImplemented
    return (
        self.name,
        self.version,
        self.splits,
    ) < (
        other.name,
        other.version,
        other.splits,
    )

FileBasedDatasetManager

Bases: DatasetManager

An abstract class for managing datasets.

Attributes:

Name	Type	Description
`priority`	`int`	The priority of the dataset manager. Ranges from 0 (the lowest priority) to 10 (the highest priority). A class attribute.
`name`	`str`	The name of the dataset.
`version`	`str`	The used dataset version.
`splits`	`list[str]`	The dataset splits to retrieve.
`data_path`	`Path`	The top-level directory for storing all datasets.
`dataset`	`Dataset \| None`	The loaded dataset.
`dataset_dict`	`DatasetDict \| None`	The loaded dataset dictionary.
`config`	`DatasetConfig`	The configuration for the dataset.
`all_splits`	`list[str]`	list[str]: All available dataset splits.

Methods:

Name	Description
`__init__`	Initializes a new DatasetManager.
`can_handle`	Checks if the DatasetManager can handle the given dataset.
`create`	Creates a new dataset manager for the specified dataset.
`is_retrieved`	Checks if the dataset at the specific version is already downloaded.
`load`	Loads the dataset as a HuggingFace dataset.
`remove`	Deletes the dataset at the specific version from disk.
`retrieve`	Downloads and preprocesses a dataset.
`unload`	Unloads the dataset from memory.

Source code in evalsense/datasets/dataset_manager.py

class FileBasedDatasetManager(DatasetManager):
    """An abstract class for managing datasets.

    Attributes:
        priority (int): The priority of the dataset manager. Ranges from
            0 (the lowest priority) to 10 (the highest priority).
            A class attribute.
        name (str): The name of the dataset.
        version (str): The used dataset version.
        splits (list[str]): The dataset splits to retrieve.
        data_path (Path): The top-level directory for storing all datasets.
        dataset (Dataset | None): The loaded dataset.
        dataset_dict (DatasetDict | None): The loaded dataset dictionary.
        config (DatasetConfig): The configuration for the dataset.
        all_splits: list[str]: All available dataset splits.
    """

    config: DatasetConfig
    all_splits: list[str]

    def __init__(
        self,
        name: str,
        version: str = DEFAULT_VERSION_NAME,
        splits: list[str] | None = None,
        data_dir: str | None = None,
        **kwargs,
    ):
        """Initializes a new DatasetManager.

        Args:
            name (str): The name of the dataset.
            version (str): The dataset version to retrieve.
            splits (list[str], optional): The dataset splits to retrieve.
            data_dir (str, optional): The top-level directory for storing all
                datasets. Defaults to "datasets" in the user cache directory.
            **kwargs (dict): Additional keyword arguments.
        """
        self.config = DatasetConfig(name)
        self.all_splits = list(self.config.get_splits(version).keys())
        if splits is None:
            splits = self.all_splits

        super().__init__(
            name=name,
            version=version,
            splits=splits,
            data_dir=data_dir,
            **kwargs,
        )

    def _retrieve_files(self, **kwargs) -> None:
        """Retrieves  dataset files.

        This method retrieves all the dataset files for the specified splits
        into the `self.version_path` directory.

        Args:
            **kwargs (dict): Additional keyword arguments.
        """
        for filename, file_metadata in self.config.get_files(
            self.version, self.all_splits
        ).items():
            effective_source = file_metadata.effective_source
            if effective_source is not None and isinstance(
                effective_source, OnlineSource
            ):
                download_file(
                    effective_source.url_template.format(
                        version=self.version, filename=filename
                    ),
                    self.version_path / filename,
                    expected_hash=file_metadata.hash,
                    hash_type=file_metadata.hash_type,
                )

    @abstractmethod
    def _preprocess_files(self, **kwargs) -> None:
        """Preprocesses the downloaded dataset files.

        This method preprocesses the retrieved dataset files and saves them
        as a HuggingFace DatasetDict in the `self.main_data_path` directory.

        Args:
            **kwargs (dict): Additional keyword arguments.
        """
        pass

    @override
    def retrieve(self, **kwargs) -> None:
        """Downloads and preprocesses a dataset.

        Args:
            **kwargs (dict): Additional keyword arguments.
        """
        self.version_path.mkdir(parents=True, exist_ok=True)
        self._retrieve_files(**kwargs)
        self._preprocess_files(**kwargs)

dataset_path `property`

dataset_path: Path

The top-level directory for storing this dataset.

Returns:

Type	Description
`Path`	The dataset directory.

main_data_path `property`

main_data_path: Path

The path for storing the preprocessed dataset files for a specific version.

Returns:

Type	Description
`Path`	The main dataset directory.

record `property`

record: DatasetRecord

Returns a record identifying the dataset.

Returns:

Type	Description
`DatasetRecord`	The dataset record.

version_path `property`

version_path: Path

The directory for storing a specific version of this dataset.

Returns:

Type	Description
`Path`	The dataset version directory.

init

__init__(
    name: str,
    version: str = DEFAULT_VERSION_NAME,
    splits: list[str] | None = None,
    data_dir: str | None = None,
    **kwargs,
)

Initializes a new DatasetManager.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required
`version`	`str`	The dataset version to retrieve.	`DEFAULT_VERSION_NAME`
`splits`	`list[str]`	The dataset splits to retrieve.	`None`
`data_dir`	`str`	The top-level directory for storing all datasets. Defaults to "datasets" in the user cache directory.	`None`
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Source code in evalsense/datasets/dataset_manager.py

def __init__(
    self,
    name: str,
    version: str = DEFAULT_VERSION_NAME,
    splits: list[str] | None = None,
    data_dir: str | None = None,
    **kwargs,
):
    """Initializes a new DatasetManager.

    Args:
        name (str): The name of the dataset.
        version (str): The dataset version to retrieve.
        splits (list[str], optional): The dataset splits to retrieve.
        data_dir (str, optional): The top-level directory for storing all
            datasets. Defaults to "datasets" in the user cache directory.
        **kwargs (dict): Additional keyword arguments.
    """
    self.config = DatasetConfig(name)
    self.all_splits = list(self.config.get_splits(version).keys())
    if splits is None:
        splits = self.all_splits

    super().__init__(
        name=name,
        version=version,
        splits=splits,
        data_dir=data_dir,
        **kwargs,
    )

can_handle `abstractmethod` `classmethod`

can_handle(name: str) -> bool

Checks if the DatasetManager can handle the given dataset.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required

Returns:

Type	Description
`bool`	True if the manager can handle the dataset, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

@classmethod
@abstractmethod
def can_handle(cls, name: str) -> bool:
    """Checks if the DatasetManager can handle the given dataset.

    Args:
        name (str): The name of the dataset.

    Returns:
        (bool): True if the manager can handle the dataset, False otherwise.
    """
    pass

create `classmethod`

create(
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
) -> DatasetManager

Creates a new dataset manager for the specified dataset.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the dataset.	required
`splits`	`list[str]`	The dataset splits to retrieve.	required
`version`	`str \| None`	The dataset version to retrieve.	`None`
`data_dir`	`str \| None`	The top-level directory for storing all datasets.	`None`
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Returns:

Type	Description
`DatasetManager`	The created dataset manager.

Source code in evalsense/datasets/dataset_manager.py

@classmethod
def create(
    cls,
    name: str,
    splits: list[str],
    version: str | None = None,
    data_dir: str | None = None,
    **kwargs: dict,
) -> "DatasetManager":
    """Creates a new dataset manager for the specified dataset.

    Args:
        name (str): The name of the dataset.
        splits (list[str]): The dataset splits to retrieve.
        version (str | None): The dataset version to retrieve.
        data_dir (str | None): The top-level directory for storing all datasets.
        **kwargs (dict): Additional keyword arguments.

    Returns:
        (DatasetManager): The created dataset manager.
    """
    manager = DatasetManagerRegistry.get(name)
    if manager is not None:
        return manager(
            name=name,
            splits=splits,
            version=version,
            data_dir=data_dir,
            **kwargs,
        )
    raise ValueError(f"No suitable dataset manager found for {name}")

is_retrieved

is_retrieved() -> bool

Checks if the dataset at the specific version is already downloaded.

Returns:

Type	Description
`bool`	True if the dataset exists locally, False otherwise.

Source code in evalsense/datasets/dataset_manager.py

def is_retrieved(self) -> bool:
    """Checks if the dataset at the specific version is already downloaded.

    Returns:
        (bool): True if the dataset exists locally, False otherwise.
    """
    return self.main_data_path.exists()

load

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: Literal[False] = ...,
) -> Dataset

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: Literal[True],
) -> DatasetDict

load(
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: bool = False,
) -> Dataset | DatasetDict

Loads the dataset as a HuggingFace dataset.

Parameters:

Name	Type	Description	Default
`retrieve`	`bool`	Whether to retrieve the dataset if it does not exist locally. Defaults to True.	`True`
`cache`	`bool`	Whether to cache the dataset in memory. Defaults to True.	`True`
`force_retrieve`	`bool`	Whether to force retrieving and reloading the dataset even if it is already cached. Overrides the `retrieve` flag if set to True. Defaults to False.	`False`
`load_as_dict`	`bool`	Whether to load the dataset with multiple splits as a DatasetDict. If False (the default), the selected dataset splits are concatenated into a single dataset.	`False`

Returns:

Type	Description
`Dataset \| DatasetDict`	The loaded dataset.

Source code in evalsense/datasets/dataset_manager.py

def load(
    self,
    *,
    retrieve: bool = True,
    cache: bool = True,
    force_retrieve: bool = False,
    load_as_dict: bool = False,
) -> Dataset | DatasetDict:
    """Loads the dataset as a HuggingFace dataset.

    Args:
        retrieve (bool, optional): Whether to retrieve the dataset if it
            does not exist locally. Defaults to True.
        cache (bool, optional): Whether to cache the dataset in memory.
            Defaults to True.
        force_retrieve (bool, optional): Whether to force retrieving and
            reloading the dataset even if it is already cached. Overrides
            the `retrieve` flag if set to True. Defaults to False.
        load_as_dict (bool, optional): Whether to load the dataset with
            multiple splits as a DatasetDict. If False (the default),
            the selected dataset splits are concatenated into a single
            dataset.

    Returns:
        (Dataset | DatasetDict): The loaded dataset.
    """
    # Return quickly if we already have the dataset cached
    if not load_as_dict and self.dataset is not None and not force_retrieve:
        return self.dataset
    if load_as_dict and self.dataset_dict is not None and not force_retrieve:
        return self.dataset_dict

    # Retrieve the dataset if needed
    if (not self.is_retrieved() and retrieve) or force_retrieve:
        self.retrieve()
    elif not self.is_retrieved():
        raise ValueError(
            f"Dataset {self.name} is not available locally and "
            "retrieve is set to False. Either `retrieve` the dataset first or "
            "set the retrieve flag to True."
        )

    # Load the retrieved dataset
    hf_dataset = load_from_disk(self.main_data_path)
    if not isinstance(hf_dataset, DatasetDict):
        raise ValueError(
            "Expected dataset to be DatasetDict, but got regular Dataset."
        )
    try:
        hf_dataset = DatasetDict({sid: hf_dataset[sid] for sid in self.splits})
    except KeyError as e:
        raise ValueError(f"No such split {e}.")

    if load_as_dict:
        # Return the dataset as a dictionary
        if cache:
            self.dataset_dict = hf_dataset
        return hf_dataset

    # Concatenate the splits and return the data as a single Dataset object
    hf_dataset = concatenate_datasets(
        [
            hf_dataset[s].cast(hf_dataset[self.splits[0]].features)
            for s in self.splits
        ]
    )
    if cache:
        self.dataset = hf_dataset
    return hf_dataset

remove

remove() -> None

Deletes the dataset at the specific version from disk.

Source code in evalsense/datasets/dataset_manager.py

def remove(self) -> None:
    """Deletes the dataset at the specific version from disk."""
    if self.version_path.exists():
        shutil.rmtree(self.version_path)

retrieve

retrieve(**kwargs) -> None

Downloads and preprocesses a dataset.

Parameters:

Name	Type	Description	Default
`**kwargs`	`dict`	Additional keyword arguments.	`{}`

Source code in evalsense/datasets/dataset_manager.py

@override
def retrieve(self, **kwargs) -> None:
    """Downloads and preprocesses a dataset.

    Args:
        **kwargs (dict): Additional keyword arguments.
    """
    self.version_path.mkdir(parents=True, exist_ok=True)
    self._retrieve_files(**kwargs)
    self._preprocess_files(**kwargs)

unload

unload() -> None

Unloads the dataset from memory.

Source code in evalsense/datasets/dataset_manager.py

def unload(self) -> None:
    """Unloads the dataset from memory."""
    self.dataset = None
    self.dataset_dict = None

FileMetadata

Bases: BaseModel

The metadata for a dataset file.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset file
`hash`	`str`	The hash of the dataset file
`hash_type`	`str`	The type of hash used for the dataset file
`source`	`OnlineSource \| LocalSource`	The immediate source of the dataset file (use `effective_source` to access the effective source, which may be inherited)
`parent`	`SplitMetadata`	The parent split metadata

Source code in evalsense/datasets/dataset_config.py

class FileMetadata(BaseModel):
    """The metadata for a dataset file.

    Attributes:
        name (str): The name of the dataset file
        hash (str, optional): The hash of the dataset file
        hash_type (str): The type of hash used for the dataset file
        source (OnlineSource | LocalSource, optional): The immediate source of
            the dataset file (use `effective_source` to access the effective source,
            which may be inherited)
        parent (SplitMetadata): The parent split metadata
    """

    name: str
    hash: str | None = None
    hash_type: str = DEFAULT_HASH_TYPE
    source: OnlineSource | LocalSource | None = None
    parent: Optional["SplitMetadata"] = None

    @property
    def effective_source(self) -> OnlineSource | LocalSource:
        """The effective source of the dataset file.

        Returns:
            (OnlineSource | LocalSource): The effective source.
        """
        if self.source is not None:
            return self.source
        if self.parent is None:
            raise RuntimeError("Parent metadata not filled. Please report this issue.")
        return self.parent.effective_source

effective_source `property`

effective_source: OnlineSource | LocalSource

The effective source of the dataset file.

Returns:

Type	Description
`OnlineSource \| LocalSource`	The effective source.

LocalSource

Bases: BaseModel

The local source of the dataset file(s).

Attributes:

Name	Type	Description
`path`	`str`	The path to the dataset file(s)

Source code in evalsense/datasets/dataset_config.py

class LocalSource(BaseModel):
    """The local source of the dataset file(s).

    Attributes:
        path (str): The path to the dataset file(s)
    """

    online: Literal[False]
    path: Path

OnlineSource

Bases: BaseModel

The online source of the dataset file(s).

Attributes:

Name	Type	Description
`url_template`	`str`	The URL template for the dataset file(s), optionally taking a version and filename
`requires_auth`	`bool`	Whether accessing the dataset file(s) requires authentication

Source code in evalsense/datasets/dataset_config.py

class OnlineSource(BaseModel):
    """The online source of the dataset file(s).

    Attributes:
        url_template (str): The URL template for the dataset file(s),
            optionally taking a version and filename
        requires_auth (bool, optional): Whether accessing the dataset file(s)
            requires authentication
    """

    online: Literal[True]
    url_template: str
    requires_auth: bool = False

SplitMetadata

Bases: BaseModel

The metadata for a dataset split.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset split
`files`	`dict[str, FileMetadata]`	The dataset files in the split
`source`	`OnlineSource \| LocalSource`	The immediate source of the dataset split (use `effective_source` to access the effective source, which may be inherited)
`parent`	`VersionMetadata`	The parent version metadata

Source code in evalsense/datasets/dataset_config.py

class SplitMetadata(BaseModel):
    """The metadata for a dataset split.

    Attributes:
        name (str): The name of the dataset split
        files (dict[str, FileMetadata]): The dataset files in the split
        source (OnlineSource | LocalSource, optional): The immediate source of
            the dataset split (use `effective_source` to access the effective source,
            which may be inherited)
        parent (VersionMetadata): The parent version metadata
    """

    name: str
    files: dict[str, FileMetadata]
    source: OnlineSource | LocalSource | None = None
    parent: Optional["VersionMetadata"] = None

    @field_validator("files", mode="before")
    @classmethod
    def convert_list_to_dict(cls, files):
        if isinstance(files, list):
            return {file["name"]: file for file in files}
        return files

    @override
    def model_post_init(self, _):
        for file in self.files.values():
            file.parent = self

    @property
    def effective_source(self) -> OnlineSource | LocalSource:
        """The effective source of the dataset split.

        Returns:
            (OnlineSource | LocalSource): The effective source.
        """
        if self.source is not None:
            return self.source
        if self.parent is None:
            raise RuntimeError("Parent metadata not filled. Please report this issue.")
        return self.parent.effective_source

effective_source `property`

effective_source: OnlineSource | LocalSource

The effective source of the dataset split.

Returns:

Type	Description
`OnlineSource \| LocalSource`	The effective source.

VersionMetadata

Bases: BaseModel

The metadata for a dataset version.

Attributes:

Name	Type	Description
`name`	`str`	The name of the dataset version
`splits`	`dict[str, SplitMetadata]`	The dataset splits in the version
`files`	`dict[str, FileMetadata]`	The dataset files in the version
`source`	`OnlineSource \| LocalSource`	The immediate source of the dataset version (use `effective_source` to access the effective source, which may be inherited)
`parent`	`DatasetMetadata`	The parent dataset metadata

Methods:

Name	Description
`get_files`	Gets the files for the specified splits.

Source code in evalsense/datasets/dataset_config.py

class VersionMetadata(BaseModel):
    """The metadata for a dataset version.

    Attributes:
        name (str): The name of the dataset version
        splits (dict[str, SplitMetadata], optional): The dataset splits in the version
        files (dict[str, FileMetadata], optional): The dataset files in the version
        source (OnlineSource | LocalSource, optional): The immediate source of
            the dataset version (use `effective_source` to access the effective source,
            which may be inherited)
        parent (DatasetMetadata): The parent dataset metadata
    """

    name: str
    splits: dict[str, SplitMetadata]
    files: dict[str, FileMetadata] | None = None
    source: OnlineSource | LocalSource | None = None
    parent: Optional["DatasetMetadata"] = None

    @field_validator("splits", "files", mode="before")
    @classmethod
    def convert_list_to_dict(cls, vs):
        if isinstance(vs, list):
            return {v["name"]: v for v in vs}
        return vs

    @override
    def model_post_init(self, _):
        for split in self.splits.values():
            split.parent = self

    @property
    def effective_source(self) -> OnlineSource | LocalSource:
        """The effective source of the dataset version.

        Returns:
            (OnlineSource | LocalSource): The effective source.
        """
        if self.source is not None:
            return self.source
        if self.parent is None:
            raise RuntimeError("Parent metadata not filled. Please report this issue.")
        return self.parent.effective_source

    def get_files(self, splits: list[str]) -> dict[str, FileMetadata]:
        """Gets the files for the specified splits.

        Args:
            splits (list[str]): The names of the splits.

        Returns:
            (dict[str, FileMetadata]): The files for the splits.
        """
        files = {}
        if self.files is not None:
            files.update(self.files)
        for split_name in splits:
            if split_name not in self.splits:
                raise ValueError(
                    f"Split '{split_name}' not found for version {self.name}."
                )
            files.update(self.splits[split_name].files)
        return files

effective_source `property`

effective_source: OnlineSource | LocalSource

The effective source of the dataset version.

Returns:

Type	Description
`OnlineSource \| LocalSource`	The effective source.

get_files

get_files(splits: list[str]) -> dict[str, FileMetadata]

Gets the files for the specified splits.

Parameters:

Name	Type	Description	Default
`splits`	`list[str]`	The names of the splits.	required

Returns:

Type	Description
`dict[str, FileMetadata]`	The files for the splits.

Source code in evalsense/datasets/dataset_config.py

def get_files(self, splits: list[str]) -> dict[str, FileMetadata]:
    """Gets the files for the specified splits.

    Args:
        splits (list[str]): The names of the splits.

    Returns:
        (dict[str, FileMetadata]): The files for the splits.
    """
    files = {}
    if self.files is not None:
        files.update(self.files)
    for split_name in splits:
        if split_name not in self.splits:
            raise ValueError(
                f"Split '{split_name}' not found for version {self.name}."
            )
        files.update(self.splits[split_name].files)
    return files

manager

manager(
    manager: Type[DatasetManager],
) -> Type[DatasetManager]

Decorator to register a dataset manager.

Parameters:

Name	Type	Description	Default
`manager`	`Type[DatasetManager]`	The dataset manager to register.	required

Returns:

Type	Description
`Type[DatasetManager]`	Type["DatasetManager"]: The registered dataset manager.

Source code in evalsense/datasets/dataset_manager.py

def manager(manager: Type["DatasetManager"]) -> Type["DatasetManager"]:
    """Decorator to register a dataset manager.

    Args:
        manager (Type["DatasetManager"]): The dataset manager to register.

    Returns:
        Type["DatasetManager"]: The registered dataset manager.
    """
    DatasetManagerRegistry.register(manager)
    return manager

Datasets

DatasetConfig

__init__

get_files

get_splits

DatasetManager

dataset_path property

main_data_path property

record property

version_path property

__init__

can_handle abstractmethod classmethod

create classmethod

is_retrieved

load

remove

retrieve abstractmethod

unload

DatasetManagerRegistry

get classmethod

register classmethod

DatasetMetadata

effective_source property

get_files

get_splits

DatasetRecord

__eq__

__hash__

__lt__

FileBasedDatasetManager

dataset_path property

main_data_path property

record property

version_path property

__init__

can_handle abstractmethod classmethod

create classmethod

is_retrieved

load

remove

retrieve

unload

FileMetadata

effective_source property

LocalSource

OnlineSource

SplitMetadata

effective_source property

VersionMetadata

effective_source property

get_files

manager

init

dataset_path `property`

main_data_path `property`

record `property`

version_path `property`

init

can_handle `abstractmethod` `classmethod`

create `classmethod`

retrieve `abstractmethod`

get `classmethod`

register `classmethod`

effective_source `property`

eq

hash

lt

dataset_path `property`

main_data_path `property`

record `property`

version_path `property`

init

can_handle `abstractmethod` `classmethod`

create `classmethod`

effective_source `property`

effective_source `property`

effective_source `property`