ksuit.utils.data.data_container
Classes
DataContainer | Container that holds datasets and provides utilities for datasets and data loading.
Module Contents
- class ksuit.utils.data.data_container.DataContainer(datasets, num_workers=None, pin_memory=True)
Container that holds datasets and provides utilities for datasets and data loading.
- Parameters:
datasets (dict[str, ksuit.data.Dataset]) – A dictionary with datasets for the training run.
num_workers (int | None) – Number of data loading workers to use. If None, uses (#CPUs / #GPUs) - 1 workers; subtracting 1 keeps one CPU free for the main process. Defaults to None.
pin_memory (bool) – Passed as pin_memory to torch.utils.data.DataLoader. Defaults to True.
- logger
- datasets
- num_workers = None
- pin_memory = True
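The default worker count described for num_workers can be sketched as below. This is a minimal illustration, not the library's implementation: the helper name default_num_workers, the use of integer division, and the guards for zero GPUs and negative results are all assumptions.

```python
def default_num_workers(num_cpus: int, num_gpus: int) -> int:
    """Hypothetical helper mirroring the documented (#CPUs / #GPUs) - 1 rule."""
    # Assumption: a CPU-only run (0 GPUs) is treated like a single process.
    cpus_per_process = num_cpus // max(num_gpus, 1)
    # Keep one CPU free for the main process; never go below 0 workers.
    return max(cpus_per_process - 1, 0)


# 32 CPUs shared by 4 GPU processes -> 8 CPUs each, one reserved for the main process.
print(default_num_workers(num_cpus=32, num_gpus=4))  # 7
```

In practice the CPU count would typically come from something like os.cpu_count(); passing an explicit num_workers bypasses this default entirely.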
- get_dataset(key=None, properties=None, max_size=None, shuffle_seed=None)
Returns the dataset identified by key (or the first dataset if no key is provided), optionally wrapped into a
ShuffleWrapper (via shuffle_seed), a SubsetWrapper (via max_size), or a PropertySubsetWrapper. The wrappers can be used individually or together. When all arguments are provided, the order is: Dataset -> ShuffleWrapper (optional) -> SubsetWrapper (optional) -> PropertySubsetWrapper (optional)
- Parameters:
key (str | None) – Identifier of the dataset. If None, returns the first dataset of the DataContainer. Defaults to None.
properties (set[str] | None) – If defined, overrides the properties to load from the dataset. If not defined, uses the properties defined in the dataset itself or all properties if none are defined. Defaults to None.
max_size (int | None) – If defined, wraps the dataset into a SubsetWrapper with the specified max_size. Defaults to None (no wrapping).
shuffle_seed (int | None) – If defined, wraps the dataset into a ShuffleWrapper with the specified shuffle_seed. Defaults to None (no wrapping).
- Returns:
Dataset of the DataContainer optionally wrapped into dataset wrappers.
- Return type:
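The wrapping order matters: because shuffling is applied before the size limit, max_size combined with shuffle_seed selects a seeded random sample rather than the first max_size items. The sketch below illustrates only that ordering with plain Python. ShuffleSketch, SubsetSketch, and wrap are hypothetical stand-ins written for this example, not the ksuit wrappers, and the property-based wrapper is omitted.

```python
import random


class ShuffleSketch:
    """Hypothetical stand-in for a shuffle wrapper: fixed permutation from a seed."""

    def __init__(self, dataset, seed):
        self.dataset = dataset
        self.indices = list(range(len(dataset)))
        random.Random(seed).shuffle(self.indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]


class SubsetSketch:
    """Hypothetical stand-in for a subset wrapper: keeps the first max_size items."""

    def __init__(self, dataset, max_size):
        self.dataset = dataset
        self.size = min(max_size, len(dataset))

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return self.dataset[idx]


def wrap(dataset, max_size=None, shuffle_seed=None):
    # Same order as documented: shuffle first, then limit the size.
    if shuffle_seed is not None:
        dataset = ShuffleSketch(dataset, shuffle_seed)
    if max_size is not None:
        dataset = SubsetSketch(dataset, max_size)
    return dataset


base = list(range(10))
subset = wrap(base, max_size=3, shuffle_seed=0)
samples = [subset[i] for i in range(len(subset))]
print(samples)  # three distinct items drawn from the shuffled order
```

Reusing the same shuffle_seed reproduces the same subset, which is convenient for evaluating on a fixed random sample across runs.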
- get_main_sampler(train_dataset, shuffle=True)
Creates the main_sampler for data loading.
- Parameters:
train_dataset (ksuit.data.Dataset) – Dataset that is used for training.
shuffle (bool) – Whether to randomly shuffle the sampled indices before every epoch. Defaults to True.
- Returns:
Sampler to be used for sampling indices of the train_dataset.
- Return type:
Sampler
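To make the shuffle flag concrete, here is a minimal sketch of a sampler that draws a fresh permutation for each epoch. EpochShuffleSampler, its seed argument, and set_epoch are hypothetical names chosen for this example (the seed-plus-epoch pattern resembles torch's DistributedSampler); the sampler actually returned by get_main_sampler may work differently.

```python
import random


class EpochShuffleSampler:
    """Hypothetical sampler sketch; not the ksuit implementation."""

    def __init__(self, num_samples, shuffle=True, seed=0):
        self.num_samples = num_samples
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Changing the epoch changes the permutation produced by __iter__.
        self.epoch = epoch

    def __len__(self):
        return self.num_samples

    def __iter__(self):
        indices = list(range(self.num_samples))
        if self.shuffle:
            # Seeding with seed + epoch gives a reproducible order per epoch.
            random.Random(self.seed + self.epoch).shuffle(indices)
        return iter(indices)


sampler = EpochShuffleSampler(num_samples=100)
first_epoch = list(sampler)
sampler.set_epoch(1)
second_epoch = list(sampler)
```

With shuffle=False the sampler yields indices in their natural order, which is usually what evaluation-style passes want.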
- get_data_loader(train_sampler, train_collator, batch_size, epochs, updates, samples, callback_samplers, start_epoch=None)
Creates a torch.utils.data.DataLoader for efficient data loading. The loader uses an InterleavedSampler built from the main_sampler, configs, and the other arguments passed to this method.
- Parameters:
train_sampler (torch.utils.data.Sampler) – Sampler to be used for the main dataset (i.e., training dataset).
train_collator (ksuit.data.pipeline.collator.CollatorType | None) – Collator to collate samples from the main dataset (i.e., training dataset).
batch_size (int) – Batch size to use for training.
epochs (int | None) – Number of epochs the training lasts.
updates (int | None) – Number of updates the training lasts.
samples (int | None) – Number of samples the training lasts.
callback_samplers (list[ksuit.data.samplers.SamplerIntervalConfig]) – List of SamplerIntervalConfigs to use for callback sampling.
start_epoch (int | None) – At which epoch to start (used for resuming training). Mutually exclusive with start_update and start_sample.
- Returns:
Object from which data can be loaded according to the specified configuration.
- Return type:
DataLoader