ksuit.utils.data.data_container

Classes

DataContainer

Container that holds datasets and provides utilities for datasets and data loading.

Module Contents

class ksuit.utils.data.data_container.DataContainer(datasets, num_workers=None, pin_memory=True)

Container that holds datasets and provides utilities for datasets and data loading.

Parameters:
  • datasets (dict[str, ksuit.data.Dataset]) – A dictionary with datasets for the training run.

  • num_workers (int | None) – Number of data loading workers to use. If None, will use (#CPUs / #GPUs - 1) workers. The -1 keeps 1 CPU free for the main process. Defaults to None.

  • pin_memory (bool) – Passed as pin_memory to torch.utils.data.DataLoader. Defaults to True.
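
The default worker count described for num_workers can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name default_num_workers, the use of integer division, and the clamp at zero are assumptions. Only the formula (#CPUs / #GPUs - 1) comes from the description above.

```python
import os


def default_num_workers(num_cpus: int, num_gpus: int) -> int:
    """Hypothetical helper: workers per process = #CPUs / #GPUs - 1."""
    # Assumption: integer division, and at least one process to divide by.
    cpus_per_gpu = num_cpus // max(num_gpus, 1)
    # Keep one CPU free for the main process; never go below zero.
    return max(cpus_per_gpu - 1, 0)


if __name__ == "__main__":
    # os.cpu_count() may return None on some platforms, so fall back to 1.
    print(default_num_workers(os.cpu_count() or 1, num_gpus=1))
```

For example, under these assumptions a machine with 16 CPUs and 4 GPUs would get 3 workers per training process.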

logger
datasets
num_workers = None
pin_memory = True
get_dataset(key=None, properties=None, max_size=None, shuffle_seed=None)

Returns the dataset identified by key (or the first dataset if no key is provided), optionally wrapped in a ShuffleWrapper (via shuffle_seed), a SubsetWrapper (via max_size), or a PropertySubsetWrapper (via properties). The wrappers can be used individually or together. When all arguments are provided, they are applied in this order:

Dataset -> ShuffleWrapper(Optional) -> SubsetWrapper(Optional) -> PropertySubsetWrapper(Optional)

Parameters:
  • key (str | None) – Identifier of the dataset. If None, returns the first dataset of the DataContainer. Defaults to None.

  • properties (set[str] | None) – If defined, overrides the properties to load from the dataset. If not defined, uses the properties defined in the dataset itself, or all properties if none are defined. Defaults to None.

  • max_size (int | None) – If defined, wraps the dataset into a SubsetWrapper with the specified max_size. Defaults to None (=no wrapping).

  • shuffle_seed (int | None) – If defined, wraps the dataset into a ShuffleWrapper with the specified shuffle_seed. Defaults to None (=no wrapping).

Returns:

Dataset of the DataContainer optionally wrapped into dataset wrappers.

Return type:

Dataset
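
The wrapping order documented for get_dataset can be illustrated with a small stand-in that tracks only sample indices. This is a sketch under stated assumptions, not the ksuit wrappers themselves: the function wrapped_indices is hypothetical, the shuffle stand-in is assumed to apply a seeded permutation, and the subset stand-in is assumed to keep the first max_size entries. Property selection is left out because, per the description above, it changes which properties are loaded rather than which samples are visible.

```python
import random


def wrapped_indices(dataset_len, shuffle_seed=None, max_size=None):
    """Hypothetical stand-in for the index view produced by the wrapper chain."""
    indices = list(range(dataset_len))
    # Step 1 (ShuffleWrapper stand-in): seeded, reproducible permutation.
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(indices)
    # Step 2 (SubsetWrapper stand-in): assumed to keep the first max_size entries.
    if max_size is not None:
        indices = indices[:max_size]
    return indices


if __name__ == "__main__":
    print(wrapped_indices(10, max_size=3))
    print(wrapped_indices(10, shuffle_seed=0, max_size=3))
```

Under these assumptions, combining shuffle_seed with max_size gives a reproducible random subset, because shuffling happens before the subset is taken. Without shuffle_seed, the subset is simply the leading max_size samples.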

get_main_sampler(train_dataset, shuffle=True)

Creates the main_sampler for data loading.

Parameters:
  • train_dataset (ksuit.data.Dataset) – Dataset that is used for training.

  • shuffle (bool) – Whether to randomly shuffle the sampled indices before every epoch. Defaults to True.

Returns:

Sampler to be used for sampling indices of the train_dataset.

Return type:

Sampler
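
The effect of shuffle=True, reshuffling the indices before every epoch, can be sketched with a minimal index sampler. The class EpochShuffleSampler, its seed parameter, and the seed-plus-epoch scheme are hypothetical and chosen only for illustration. The sampler that get_main_sampler actually returns may differ, for example in how it handles distributed training.

```python
import random


class EpochShuffleSampler:
    """Hypothetical sampler: yields dataset indices, reshuffled every epoch."""

    def __init__(self, dataset_len, shuffle=True, seed=0):
        self.dataset_len = dataset_len
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def __len__(self):
        return self.dataset_len

    def __iter__(self):
        indices = list(range(self.dataset_len))
        if self.shuffle:
            # Seed with (seed + epoch) so each epoch gets a different,
            # reproducible permutation.
            random.Random(self.seed + self.epoch).shuffle(indices)
        # Advance the epoch counter so the next pass reshuffles.
        self.epoch += 1
        return iter(indices)


if __name__ == "__main__":
    sampler = EpochShuffleSampler(5)
    print(list(sampler))  # indices for the first epoch
    print(list(sampler))  # a different order for the second epoch
```

With shuffle=False, every pass over this sketch yields the indices in their original order.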

get_data_loader(train_sampler, train_collator, batch_size, epochs, updates, samples, callback_samplers, start_epoch=None)

Creates a torch.utils.data.DataLoader for efficient data loading. The loader uses an InterleavedSampler built from the main sampler (train_sampler), the callback sampler configs, and the other arguments passed to this method.

Parameters:
  • train_sampler (torch.utils.data.Sampler) – Sampler to be used for the main dataset (i.e., training dataset).

  • train_collator (ksuit.data.pipeline.collator.CollatorType | None) – Collator to collate samples from the main dataset (i.e., training dataset).

  • batch_size (int) – Batch size to use for training.

  • epochs (int | None) – Number of epochs the training lasts.

  • updates (int | None) – Number of updates the training lasts.

  • samples (int | None) – Number of samples the training lasts.

  • callback_samplers (list[ksuit.data.samplers.SamplerIntervalConfig]) – List of SamplerIntervalConfigs to use for callback sampling.

  • start_epoch (int | None) – Epoch at which to start (used for resuming training). Mutually exclusive with start_update and start_sample. Defaults to None.

Returns:

Object from which data can be loaded according to the specified configuration.

Return type:

DataLoader