ksuit.providers.dataset_config_provider
=======================================

.. py:module:: ksuit.providers.dataset_config_provider


Classes
-------

.. autoapisummary::

   ksuit.providers.dataset_config_provider.DatasetConfigProvider


Module Contents
---------------

.. py:class:: DatasetConfigProvider(global_dataset_paths, local_dataset_path = None, data_source_modes = None)

   Defines where datasets are located and how to load them (e.g., copy them to a local SSD before training).
   It differentiates between a global storage (e.g., slow but big NFS storage where datasets are persistently stored)
   and a local storage (e.g., fast but relatively small local SSD on a compute node) that might not be persistent
   (e.g., SLURM jobs can be configured to wipe the local SSD after each job). For the most efficient dataloading,
   a dataset typically has to be copied from the global storage to the local storage before training.
   Global paths are the exact folders where a dataset is located (e.g., /data/CIFAR10), whereas the local path is
   typically shared among all datasets (e.g., /localSSD). Datasets should copy their data before training to the
   local root (e.g., copy /data/CIFAR10 to /localSSD to create /localSSD/CIFAR10) but are free to implement the
   copying process in whatever way is optimal for a given dataset. If a dataset is always present on a compute node
   (e.g., because it was manually copied before the run), its location on the local storage can simply be used as
   the global path (e.g., if /data/CIFAR10 was copied to all compute nodes as /localSSD/CIFAR10 before the run,
   /localSSD/CIFAR10 can be used in global_dataset_paths).

   :param global_dataset_paths: Mapping from dataset identifiers (e.g., "cifar10") to the locations where the datasets are
                                stored on the global storage (i.e., on a globally accessible, potentially slow, persistent storage).
   :param local_dataset_path: Path to a location where datasets can be stored locally on a per-node basis. This path
                              is typically the same on each node, but points to storage that is local to the current node.
                              If dataloading speed is important, datasets should be copied to this path before training and data should
                              be loaded from the local disk instead of the global one. Optional; if not defined, the global dataset
                              path will be used.
   :param data_source_modes: Specifies whether loading from local_dataset_path is necessary for a given dataset identifier
                             (`data_source_modes[ds_identifier] = "local"`) or not (`data_source_modes[ds_identifier] = "global"`). Some
                             datasets are so small/lightweight that they can be loaded from the global storage directly into memory, or
                             loading them from the global storage is simply fast enough. If this is set to `"global"` for a certain dataset
                             identifier, the local_dataset_path is ignored (if it is specified) and the data is loaded from the global
                             storage. If no entry in `data_source_modes` is defined for a dataset identifier, it defaults to `"local"` if
                             `local_dataset_path` is defined and to `"global"` if `local_dataset_path` is not defined.
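
   The defaulting rule for `data_source_modes` can be illustrated with a minimal, standalone sketch. The helper
   `effective_mode` is hypothetical and not part of ksuit. It also assumes that an explicit `"local"` entry falls
   back to `"global"` when no `local_dataset_path` is configured; the actual implementation may handle that case
   differently.

   ```python
   def effective_mode(identifier, local_dataset_path=None, data_source_modes=None):
       """Hypothetical helper mirroring the documented defaulting rule."""
       # Assumption: without a local path there is nowhere to copy to, so use "global".
       if local_dataset_path is None:
           return "global"
       # An explicit per-dataset entry wins; a missing (or None) entry defaults to "local".
       mode = (data_source_modes or {}).get(identifier)
       return "local" if mode is None else mode


   modes = {"cifar10": "global"}
   print(effective_mode("cifar10", "/localSSD", modes))   # explicit entry is used
   print(effective_mode("imagenet", "/localSSD", modes))  # no entry, local path defined
   print(effective_mode("imagenet", None, modes))         # no entry, no local path
   ```

   With these inputs, a small dataset such as `cifar10` would be read directly from the global storage, while any
   dataset without an entry would be copied to the local storage whenever one is configured.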


   .. py:attribute:: logger


   .. py:method:: setup_source_root(identifier, copy_to_local_fn = None)

      This method allows the following dataset implementation structure:

      ```
      class MyCifar10Dataset(Dataset):
          def __init__(
              self,
              ...,
              source_root: str | None = None,
              dataset_config_provider: DatasetConfigProvider | None = None,
          ):
              ...
              if source_root is None:
                  source_root = dataset_config_provider.setup_source_root(
                      identifier="cifar10",
                      copy_to_local_fn=lambda global_root, local_root: shutil.copytree(global_root, local_root),
                  )
              # load dataset from source_root, wherever source_root might be
              self.dataset = CIFAR10(root=source_root, ...)
      ```

      which will:

      - Load data from the `source_root` passed to `MyCifar10Dataset`, if it is defined (e.g., useful if the dataset is
        instantiated from a standalone script/notebook where `dataset_config_provider is None`).
      - Retrieve the `source_root` from the DatasetConfigProvider via `self.global_dataset_paths[identifier]` if
        `source_root is None`. This is typical if the dataset is instantiated as part of a ksuit training run, where
        paths to datasets can be abstracted away into a setup-specific configuration file. This allows easy dataset
        path configuration for development setups where dataset locations differ from those of the production
        environment, for example to develop on a laptop. It also avoids redundant configuration in the
        configuration file of a training/evaluation run.
      - Automatically copy the dataset from its location on the global storage to the local storage for fast
        dataloading. This is only done if a `copy_to_local_fn` is provided and the dataset is configured in the
        DatasetConfigProvider to be copied to the local disk, i.e., `self.local_dataset_path is not None` and
        `self.data_source_modes.get(identifier, None) in ["local", None]`.

      This makes automatic copying to the local disk easy to support in a dataset, as only the `copy_to_local_fn`
      needs to be implemented.

      In multi-GPU and distributed setups, only one process per node (`is_data_rank0`) will invoke the
      `copy_to_local_fn` function. The other processes will wait for the `copy_to_local_fn` to finish.
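
      The decision flow described above can be sketched as a standalone function. This is an illustration only,
      not the ksuit implementation: `setup_source_root_sketch`, the `is_data_rank0` argument, and the `barrier`
      callable are hypothetical stand-ins, and the local root is assumed to be `local_dataset_path / identifier`.

      ```python
      from pathlib import Path


      def setup_source_root_sketch(
          identifier,
          global_dataset_paths,
          local_dataset_path=None,
          data_source_modes=None,
          copy_to_local_fn=None,
          is_data_rank0=True,    # hypothetical flag: True for exactly one process per node
          barrier=lambda: None,  # hypothetical sync point, e.g. a distributed barrier
      ):
          global_root = Path(global_dataset_paths[identifier])
          mode = (data_source_modes or {}).get(identifier)
          use_local = (
              copy_to_local_fn is not None
              and local_dataset_path is not None
              and mode in ("local", None)
          )
          if not use_local:
              # Load directly from the global storage.
              return global_root
          # Assumption: the local root is the shared local path plus the identifier.
          local_root = Path(local_dataset_path) / identifier
          if is_data_rank0:
              # Only one process per node performs the copy.
              copy_to_local_fn(global_root, local_root)
          # All processes on the node wait here until the copy has finished.
          barrier()
          return local_root
      ```

      In a single-process run, the defaults make the function copy and return immediately; in a distributed run,
      `barrier` would be replaced by whatever synchronization primitive the training framework provides.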

      :param identifier: String identifier of the dataset. This identifier is used to retrieve the storage location
                         of the dataset from `self.global_dataset_paths[identifier]` and to determine whether it should be
                         loaded from the global or local storage via `self.data_source_modes[identifier]`.
      :param copy_to_local_fn: If provided and `self.data_source_modes.get(identifier, None) in ["local", None]`, the
                               provided function will be called with the dataset location on the global storage (e.g.,
                               `/global_storage/CIFAR10`) and the location the dataset should be copied to on the local storage (e.g.,
                               `/local_ssd/cifar10`, where `/local_ssd` is `self.local_dataset_path` and `cifar10` is the
                               passed `identifier`). The function should copy the whole dataset from the global to the local storage.
                               `/local_ssd/cifar10` is not created automatically, so its existence can be used to check whether the
                               dataset already exists on the local storage (e.g., if the local storage is persistent and a previous run
                               already copied the dataset). However, its existence does not guarantee that the copying process completed
                               successfully, so further checks are highly recommended.

      :returns: Path to the `source_root` folder, i.e., the folder from which to load data.
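
      One possible way to add the recommended completeness check is a `copy_to_local_fn` that writes a marker file
      after a successful copy. The sketch below is an assumption about how a dataset might implement this, not
      something ksuit provides; the function name `copy_with_marker` and the marker file name `.copy_complete` are
      hypothetical.

      ```python
      import shutil
      from pathlib import Path


      def copy_with_marker(global_root, local_root):
          """Copy a dataset folder and record completion in a marker file."""
          global_root, local_root = Path(global_root), Path(local_root)
          marker = local_root / ".copy_complete"  # hypothetical marker file name
          if marker.exists():
              # A previous run finished copying; nothing to do.
              return
          if local_root.exists():
              # The folder exists without a marker: treat it as a partial copy and discard it.
              shutil.rmtree(local_root)
          shutil.copytree(global_root, local_root)
          # Written last, so the marker only exists if copytree did not raise.
          marker.touch()
      ```

      Such a function could be passed as `copy_to_local_fn=copy_with_marker`. Note that the marker only shows that a
      copy finished at some point; it does not detect later changes to the dataset on the global storage.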


   .. py:property:: local_dataset_path
      :type: pathlib.Path


      Returns the path to the local storage as a `pathlib.Path` object.