ksuit.utils.data.data_container

Classes

DataContainer

Container that holds datasets and provides utilities for datasets and data loading.

Module Contents

class ksuit.utils.data.data_container.DataContainer(datasets, num_workers=None, pin_memory=True)

Container that holds datasets and provides utilities for datasets and data loading.

Parameters:

datasets (dict[str, ksuit.data.Dataset]) – A dictionary with datasets for the training run.
num_workers (int | None) – Number of data loading workers to use. If None, uses (#CPUs / #GPUs - 1) workers; the -1 keeps one CPU free for the main process. Defaults to None.
pin_memory (bool) – Passed as pin_memory to torch.utils.data.DataLoader. Defaults to True.
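
The default worker count described for num_workers can be sketched as below. This is only an illustration of the stated formula, not the library's code: the helper name default_num_workers, the use of integer division, and the clamping at zero are all assumptions.

```python
import os


def default_num_workers(num_cpus: int, num_gpus: int) -> int:
    """Hypothetical helper: (#CPUs / #GPUs) - 1, never below zero."""
    # Assumption: integer division, and treating a CPU-only host as one GPU slot.
    cpus_per_gpu = num_cpus // max(num_gpus, 1)
    # Reserve one CPU for the main process; clamping at 0 is an assumption.
    return max(cpus_per_gpu - 1, 0)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print(default_num_workers(cpus, num_gpus=1))
```

Under these assumptions, a host with 16 CPUs and 4 GPUs would get 3 workers per process.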

logger

datasets

num_workers = None

pin_memory = True

get_dataset(key=None, properties=None, max_size=None, shuffle_seed=None)

Returns the dataset identified by key (or the first dataset if no key is provided), optionally wrapped into a ShuffleWrapper (via shuffle_seed), a SubsetWrapper (via max_size), or a PropertySubsetWrapper (via properties). The wrappers can be combined or used individually. When all arguments are provided, they are applied in this order:

Dataset -> ShuffleWrapper(Optional) -> SubsetWrapper(Optional) -> PropertySubsetWrapper(Optional)

Parameters:

key (str | None) – Identifier of the dataset. If None, returns the first dataset of the DataContainer. Defaults to None.
properties (set[str] | None) – If defined, overrides the properties to load from the dataset. If not defined, uses the properties defined in the dataset itself, or all properties if none are defined. Defaults to None.
max_size (int | None) – If defined, wraps the dataset into a SubsetWrapper with the specified max_size. Defaults to None (no wrapping).
shuffle_seed (int | None) – If defined, wraps the dataset into a ShuffleWrapper with the specified shuffle_seed. Defaults to None (no wrapping).

Returns:

Dataset of the DataContainer optionally wrapped into dataset wrappers.

Return type:

Dataset
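
The wrapping order documented for get_dataset can be illustrated with plain-Python stand-ins. Every class and function name below (ListDataset, ShuffleSketch, SubsetSketch, PropertySubsetSketch, get_dataset_sketch) is hypothetical and only mimics the described behaviour; the real ksuit wrappers may differ in interface and semantics.

```python
import random


class ListDataset:
    """Hypothetical stand-in for a dataset: a list of dict samples."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class ShuffleSketch:
    """Applies a fixed permutation of the indices, determined by a seed."""

    def __init__(self, dataset, seed):
        self.dataset = dataset
        self.indices = list(range(len(dataset)))
        random.Random(seed).shuffle(self.indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]


class SubsetSketch:
    """Keeps at most max_size leading samples of the wrapped dataset."""

    def __init__(self, dataset, max_size):
        self.dataset = dataset
        self.size = min(max_size, len(dataset))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return self.dataset[idx]


class PropertySubsetSketch:
    """Returns only the requested keys of each sample."""

    def __init__(self, dataset, properties):
        self.dataset = dataset
        self.properties = set(properties)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        return {key: sample[key] for key in self.properties}


def get_dataset_sketch(dataset, properties=None, max_size=None, shuffle_seed=None):
    # Same order as documented: shuffle, then subset, then property selection.
    if shuffle_seed is not None:
        dataset = ShuffleSketch(dataset, shuffle_seed)
    if max_size is not None:
        dataset = SubsetSketch(dataset, max_size)
    if properties is not None:
        dataset = PropertySubsetSketch(dataset, properties)
    return dataset
```

Because shuffling happens before subsetting in this sketch, combining shuffle_seed with max_size selects a seeded random subset rather than the first max_size samples.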

get_main_sampler(train_dataset, shuffle=True)

Creates the main_sampler for data loading.

Parameters:

train_dataset (ksuit.data.Dataset) – Dataset that is used for training.
shuffle (bool) – Whether to randomly shuffle the sampled indices before every epoch. Defaults to True.

Returns:

Sampler to be used for sampling indices of the train_dataset.

Return type:

Sampler
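
What "shuffle the sampled indices before every epoch" can mean in practice is sketched below in plain Python. The class EpochShuffleSampler, its seed argument, and the seed-plus-epoch scheme are assumptions made for illustration; the set_epoch method only borrows the convention of torch.utils.data.DistributedSampler and is not taken from ksuit.

```python
import random


class EpochShuffleSampler:
    """Hypothetical sampler yielding dataset indices, reshuffled per epoch."""

    def __init__(self, dataset_len, shuffle=True, seed=0):
        self.dataset_len = dataset_len
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Called by the training loop before each epoch (assumed convention).
        self.epoch = epoch

    def __len__(self):
        return self.dataset_len

    def __iter__(self):
        indices = list(range(self.dataset_len))
        if self.shuffle:
            # Seeding with seed + epoch makes each epoch's order reproducible.
            random.Random(self.seed + self.epoch).shuffle(indices)
        return iter(indices)
```

With shuffle=False the sketch yields the indices in their natural order, which is the usual choice for evaluation data.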

get_data_loader(train_sampler, train_collator, batch_size, epochs, updates, samples, callback_samplers, start_epoch=None)

Creates a torch.utils.data.DataLoader for efficient data loading. The loader uses an InterleavedSampler built from the main sampler (train_sampler), the callback sampler configs, and the other arguments passed to this method.

Parameters:

train_sampler (torch.utils.data.Sampler) – Sampler to be used for the main dataset (i.e., training dataset).
train_collator (ksuit.data.pipeline.collator.CollatorType | None) – Collator to collate samples from the main dataset (i.e., training dataset).
batch_size (int) – Batch size to use for training.
epochs (int | None) – Number of epochs the training lasts.
updates (int | None) – Number of updates the training lasts.
samples (int | None) – Number of samples the training lasts.
callback_samplers (list[ksuit.data.samplers.SamplerIntervalConfig]) – List of SamplerIntervalConfigs to use for callback sampling.
start_epoch (int | None) – At which epoch to start (used for resuming training). Mutually exclusive with start_update and start_sample.

Returns:

Object from which data can be loaded according to the specified configuration.

Return type:

DataLoader
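
The epochs, updates and samples arguments express the training duration in three different units. The sketch below shows one way such a duration could be normalized to a number of updates. The helper total_updates is hypothetical, and both the "exactly one of the three is set" rule and the dropping of incomplete batches are assumptions, not documented ksuit behaviour.

```python
def total_updates(dataset_len, batch_size, epochs=None, updates=None, samples=None):
    """Hypothetical helper: express the training duration in updates."""
    given = [value for value in (epochs, updates, samples) if value is not None]
    # Assumption: the duration must be specified in exactly one unit.
    if len(given) != 1:
        raise ValueError("specify exactly one of epochs, updates, samples")
    if updates is not None:
        return updates
    if samples is not None:
        # Assumption: a partial final batch does not count as an update.
        return samples // batch_size
    # Assumption: incomplete batches at the end of an epoch are dropped.
    return epochs * (dataset_len // batch_size)
```

Under these assumptions, 2 epochs over 1000 samples with a batch size of 32 correspond to 62 updates, since each epoch contributes 31 full batches.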