ksuit.providers.dataset_config_provider

Classes

DatasetConfigProvider

Defines where datasets are located and how to load them (e.g., copy them to a local SSD before training).

Module Contents

class ksuit.providers.dataset_config_provider.DatasetConfigProvider(global_dataset_paths, local_dataset_path=None, data_source_modes=None)

Defines where datasets are located and how to load them (e.g., copy them to a local SSD before training). It differentiates between a global storage (e.g., slow but big NFS storage where datasets are persistently stored) and a local storage (e.g., fast but relatively small local SSD on a compute node) that might not be persistent (e.g., SLURM jobs can be configured to wipe the local SSD after each job). Efficient dataloading typically requires copying the dataset from the global storage to the local storage before training.

Global paths are the exact folders where a dataset is located (e.g., /data/CIFAR10), whereas the local path is typically shared among all datasets (e.g., /localSSD). Datasets should copy their data to the local root before training (e.g., copy /data/CIFAR10 to /localSSD to create /localSSD/CIFAR10) but are free to implement the copying process in whatever way is optimal for a given dataset.

If a dataset is always present on a compute node (e.g., because it was manually copied before the run), its path on the local storage can simply be used as the global path (e.g., if /data/CIFAR10 was copied to all compute nodes as /localSSD/CIFAR10 before the run, /localSSD/CIFAR10 can be used in global_dataset_paths).

Parameters:
  • global_dataset_paths (dict[str, str]) – Mapping from dataset identifiers (e.g., "cifar10") to the location where each dataset is stored on the global storage (i.e., on a globally accessible, potentially slow, persistent storage).

  • local_dataset_path (str | None) – Path to a location where datasets can be stored locally on a per-node basis. This path is typically the same on each node but points to storage that is local to the current node. If dataloading speed is important, datasets should be copied to this path before training and data should be loaded from the local disk instead of the global one. Optional; if not defined, the global dataset path will be used.

  • data_source_modes (dict[str, Literal['global', 'local']] | None) – Specifies whether a given dataset identifier should be loaded from local_dataset_path (data_source_modes[ds_identifier] = "local") or from the global storage (data_source_modes[ds_identifier] = "global"). Some datasets are small enough to be loaded from the global storage directly into memory, or loading them from the global storage is simply fast enough. If the mode is set to "global" for a dataset identifier, local_dataset_path is ignored (even if it is specified) and the data is loaded from the global storage. If no mode is defined for a dataset identifier, it defaults to "local" if local_dataset_path is defined and to "global" if it is not.
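The defaulting rule for data_source_modes can be summarised in a few lines. The helper below is a hypothetical stand-alone sketch of the behaviour described above, not the ksuit implementation; the function name resolve_data_source_mode is an assumption made for illustration, as is the choice to fall back to "global" whenever no local path is configured.

```python
from __future__ import annotations

from typing import Literal


def resolve_data_source_mode(
    identifier: str,
    local_dataset_path: str | None,
    data_source_modes: dict[str, Literal["global", "local"]] | None,
) -> Literal["global", "local"]:
    """Hypothetical helper mirroring the documented defaulting rule."""
    # Assumption: without a local path there is nowhere to copy to,
    # so the data can only come from the global storage.
    if local_dataset_path is None:
        return "global"
    # No entry for this identifier means "local" once a local path exists.
    mode = (data_source_modes or {}).get(identifier)
    if mode is None:
        return "local"
    return mode


# Example configuration (paths are illustrative only).
modes = {"cifar10": "global"}
cifar_mode = resolve_data_source_mode("cifar10", "/localSSD", modes)
other_mode = resolve_data_source_mode("imagenet", "/localSSD", modes)
```

Here cifar_mode is "global" because it is set explicitly, while other_mode falls back to "local" because a local path is defined and no mode was given for that identifier.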

logger
setup_source_root(identifier, copy_to_local_fn=None)

This method allows the following dataset implementation structure:

```
class MyCifar10Dataset(Dataset):
    def __init__(
        self,
        …,
        source_root: str | None = None,
        dataset_config_provider: DatasetConfigProvider | None = None,
    ):
        …
        if source_root is None:
            source_root = dataset_config_provider.setup_source_root(
                identifier="cifar10",
                copy_to_local_fn=lambda global_root, local_root: shutil.copytree(global_root, local_root),
            )
        # load dataset from source_root, wherever source_root might be
        self.dataset = CIFAR10(root=source_root, …)
```

which will:

  • Load data from the source_root of the MyCifar10Dataset, if it is defined (e.g., useful if the dataset is instantiated from a standalone script/notebook where dataset_config_provider is None).

  • Retrieve the source_root from the DatasetConfigProvider via self.global_dataset_paths[identifier] if source_root is None. This is typical if the dataset is instantiated as part of a ksuit training run, where paths to datasets can be abstracted away into a setup-specific configuration file. This allows easy dataset path configuration for development setups where dataset locations differ from the production environment, for example when developing on a laptop. It also avoids redundant configuration in the configuration file of a training/evaluation run.

  • Automatically copy the dataset from its location on the global storage to the local storage for fast dataloading. This is only done if a copy_to_local_fn is provided and the dataset is configured in the DatasetConfigProvider to be copied to the local disk, i.e., self.local_dataset_path is not None and self.data_source_modes.get(identifier, None) in ["local", None].

This makes automatic copying to the local disk easy to support in a dataset, as only the copy_to_local_fn needs to be implemented.

In multi-GPU and distributed setups, only one process per node (is_data_rank0) invokes copy_to_local_fn. The other processes wait for copy_to_local_fn to finish.

Parameters:
  • identifier (str) – String identifier of the dataset. This identifier is used to retrieve the storage location of the dataset from self.global_dataset_paths[identifier] and to determine whether it should be loaded from the global or local storage via self.data_source_modes[identifier].

  • copy_to_local_fn (collections.abc.Callable[[pathlib.Path, pathlib.Path], None]) – If provided and self.data_source_modes.get(identifier, None) in ["local", None], the provided function will be called with the dataset location on the global storage (e.g., /global_storage/CIFAR10) and the location the dataset should be copied to on the local storage (e.g., /local_ssd/cifar10, where /local_ssd is the self.local_dataset_path and cifar10 is the passed identifier). The function should copy the whole dataset from the global to the local storage. /local_ssd/cifar10 is not created automatically, so its existence can be used to check whether the dataset is already on the local storage (e.g., if the local storage is persistent and a previous run already copied the dataset). However, the existence of the folder does not guarantee that the copying process completed successfully, so further checks are highly recommended.

Returns:

Path to the source_root folder, i.e., the folder from which to load data.

Return type:

pathlib.Path
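Since the existence of the local folder alone does not prove that a previous copy finished, a copy_to_local_fn can record completion itself. The following is a minimal sketch under stated assumptions, not something ksuit provides: the function name copy_dataset_to_local and the marker file name .copy_complete are hypothetical, and the sketch assumes the function is only ever called by one process per node at a time, as described above.

```python
import shutil
from pathlib import Path

# Hypothetical marker file name; it is not part of ksuit.
_COMPLETE_MARKER = ".copy_complete"


def copy_dataset_to_local(global_root: Path, local_root: Path) -> None:
    """Copy a dataset folder and record that the copy finished."""
    global_root, local_root = Path(global_root), Path(local_root)
    marker = local_root / _COMPLETE_MARKER
    if marker.exists():
        # A previous run finished copying, so there is nothing to do.
        return
    if local_root.exists():
        # Folder without a marker: assume an interrupted copy and start over.
        shutil.rmtree(local_root)
    # copytree creates local_root (and missing parent folders) itself.
    shutil.copytree(global_root, local_root)
    # The marker is written last, so it only exists if copytree did not raise.
    marker.touch()
```

Such a function could be passed as copy_to_local_fn=copy_dataset_to_local to setup_source_root in place of the plain shutil.copytree lambda. The marker only shows that the copy call returned; verifying file counts or checksums would be a stronger check.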

property local_dataset_path: pathlib.Path

Returns the path to the local storage as a pathlib.Path object.

Return type:

pathlib.Path