ksuit.providers.dataset_config_provider
=======================================

.. py:module:: ksuit.providers.dataset_config_provider


Classes
-------

.. autoapisummary::

   ksuit.providers.dataset_config_provider.DatasetConfigProvider


Module Contents
---------------

.. py:class:: DatasetConfigProvider(global_dataset_paths, local_dataset_path = None, data_source_modes = None)

   Defines where datasets are located and how to load them (e.g., copy them to a local SSD before training).
   It differentiates between a global storage (e.g., slow but big NFS storage where datasets are persistently
   stored) and a local storage (e.g., fast but relatively small local SSD on a compute node) that might not be
   persistent (e.g., SLURM jobs can be configured to wipe the local SSD after each job). Efficient data loading
   typically requires copying the dataset from the global storage to the local storage before training.

   Global paths are the exact folders where a dataset is located (e.g., /data/CIFAR10), whereas the local path
   is typically shared among all datasets (e.g., /localSSD). Datasets should copy their data to the local root
   before training (e.g., copy /data/CIFAR10 to /localSSD to create /localSSD/CIFAR10) but are free to implement
   the copying process in whatever way is optimal for a given dataset.

   If a dataset is always present on a compute node (e.g., because it was manually copied before the run), one
   can also simply use its path on the local storage as the global path (e.g., if /data/CIFAR10 was copied to
   all compute nodes as /localSSD/CIFAR10 before the run, /localSSD/CIFAR10 can be used in
   global_dataset_paths).

   :param global_dataset_paths: Mapping from dataset identifiers (e.g., "cifar10") to the location where each
      dataset is stored on the global storage (i.e., on a globally accessible, potentially slow, persistent
      storage).
   :param local_dataset_path: Path to a location where datasets can be stored locally on a per-node basis.
      This path is typically the same on each node but points to storage that is local to the current node.
      If dataloading speed is important, datasets should be copied to this path before training and data
      should be loaded from the local disk instead of the global one. Optional; if not defined, the global
      dataset path will be used.
   :param data_source_modes: Specifies whether loading from local_dataset_path is necessary for a given dataset
      identifier (`data_source_modes[ds_identifier] = "local"`) or not
      (`data_source_modes[ds_identifier] = "global"`). Some datasets are so small/lightweight that they can be
      loaded from the global storage directly into memory, or loading them from the global storage is simply
      fast enough. If the mode is set to `"global"` for a dataset identifier, the local_dataset_path is ignored
      (even if it is specified) and the data is loaded from the global storage. If no mode is defined for a
      dataset identifier, it defaults to `local` if `local_dataset_path` is defined and to `global` if
      `local_dataset_path` is not defined.


   .. py:attribute:: logger


   .. py:method:: setup_source_root(identifier, copy_to_local_fn = None)

      This method allows the following dataset implementation structure:

      ```
      class MyCifar10Dataset(Dataset):
          def __init__(
              self,
              ...,
              source_root: str | None = None,
              dataset_config_provider: DatasetConfigProvider | None = None,
          ):
              ...
              if source_root is None:
                  source_root = dataset_config_provider.setup_source_root(
                      identifier="cifar10",
                      copy_to_local_fn=lambda global_root, local_root: shutil.copytree(global_root, local_root),
                  )
              # load dataset from source_root, wherever source_root might be
              self.dataset = CIFAR10(root=source_root, ...)
      ```

      which will:

      - Load data from the `source_root` of the `MyCifar10Dataset` if it is defined (e.g., useful if the
        dataset is instantiated from a standalone script/notebook where `dataset_config_provider is None`).
      - Retrieve the `source_root` from the DatasetConfigProvider via `self.global_dataset_paths[identifier]`
        if `source_root is None`. This is typical if the dataset is instantiated as part of a ksuit training
        run, where paths to datasets can be abstracted away into a setup-specific configuration file. This
        makes it easy to configure dataset paths for development setups where dataset locations differ from
        the production environment, for example when developing on a laptop. It also avoids redundant
        configuration in the configuration file of a training/evaluation run.
      - Automatically copy the dataset from its location on the global storage to the local storage for fast
        dataloading. This is only done if a `copy_to_local_fn` is provided and the dataset is configured in the
        DatasetConfigProvider to be copied to the local disk, i.e., `self.local_dataset_path is not None` and
        `self.data_source_modes.get(identifier, None) in ["local", None]`. Datasets therefore only need to
        implement `copy_to_local_fn` to support automatic copying to the local disk.
        In multi-GPU and distributed setups, only one process per node (`is_data_rank0`) invokes
        `copy_to_local_fn`. The other processes wait for `copy_to_local_fn` to finish.

      :param identifier: String identifier of the dataset. This identifier is used to retrieve the storage
         location of the dataset from `self.global_dataset_paths[identifier]` and to determine whether it
         should be loaded from the global or the local storage via `self.data_source_modes[identifier]`.
      :param copy_to_local_fn: If provided and `self.data_source_modes.get(identifier, None) in ["local", None]`,
         the provided function is called with the dataset location on the global storage (e.g.,
         `/global_storage/CIFAR10`) and the location on the local storage to which the dataset should be copied
         (e.g., `/local_ssd/cifar10`, where `/local_ssd` is `self.local_dataset_path` and `cifar10` is the
         passed `identifier`). The function should copy the whole dataset from the global to the local storage.
         `/local_ssd/cifar10` is not created automatically, so its existence can be used to check whether the
         dataset already exists on the local storage (e.g., if the local storage is persistent and a previous
         run already copied the dataset). However, this does not check whether the copying process completed
         successfully, and further checks are highly recommended.

      :returns: Path to the `source_root` folder, i.e., the folder from which to load data.


   .. py:property:: local_dataset_path
      :type: pathlib.Path

      Returns the path to the local storage as a pathlib.Path object.
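
The rules described for `setup_source_root` can be illustrated with a small standalone sketch. It is not the ksuit implementation: `resolve_source_root`, `copy_with_marker`, and the `.copy_complete` marker file are hypothetical names introduced here, the sketch runs in a single process (no per-node synchronization), and it assumes that the global path is returned whenever no copy takes place. The copy function shows one possible way to add the recommended completeness check on top of the "folder exists" test.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path
from typing import Callable


def copy_with_marker(global_root: Path, local_root: Path) -> None:
    """Hypothetical copy_to_local_fn that records when a copy has finished."""
    marker = local_root / ".copy_complete"  # marker name is an assumption
    if marker.exists():
        return  # a previous run completed the copy
    if local_root.exists():
        shutil.rmtree(local_root)  # leftover of an interrupted copy
    shutil.copytree(global_root, local_root)
    marker.touch()  # written last, so it only exists after a full copy


def resolve_source_root(
    identifier: str,
    global_dataset_paths: dict[str, str | Path],
    local_dataset_path: str | Path | None = None,
    data_source_modes: dict[str, str] | None = None,
    copy_to_local_fn: Callable[[Path, Path], object] | None = None,
) -> Path:
    """Single-process stand-in for the documented rules, not ksuit code."""
    global_root = Path(global_dataset_paths[identifier])
    mode = (data_source_modes or {}).get(identifier)
    use_local = (
        copy_to_local_fn is not None
        and local_dataset_path is not None
        and mode in ("local", None)
    )
    if not use_local:
        return global_root  # assumption: fall back to the global path
    local_root = Path(local_dataset_path) / identifier
    copy_to_local_fn(global_root, local_root)
    return local_root


def demo() -> dict[str, bool]:
    """Exercise both modes inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        global_root = Path(tmp) / "global" / "CIFAR10"
        global_root.mkdir(parents=True)
        (global_root / "train.bin").write_bytes(b"abc")
        local_storage = Path(tmp) / "local_ssd"
        paths = {"cifar10": global_root}

        # No mode configured, local path defined: copy and load locally.
        copied = resolve_source_root(
            "cifar10", paths, local_storage, None, copy_with_marker
        )
        # Explicit "global" mode: the local path is ignored.
        direct = resolve_source_root(
            "cifar10", paths, local_storage, {"cifar10": "global"}, copy_with_marker
        )
        return {
            "copied_is_local": copied == local_storage / "cifar10",
            "file_copied": (copied / "train.bin").read_bytes() == b"abc",
            "direct_is_global": direct == global_root,
        }


results = demo()
print(results)
```

Writing the marker file only after `shutil.copytree` returns means an interrupted copy is detected and redone on the next run, which plain existence of the target folder cannot guarantee.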