emmi_data_management.interfaces.huggingface

Attributes

Functions

estimate_hf_repo_size(repo_id[, repo_type, revision, ...])

Estimate total size (bytes) of all files in a HF repo (model or dataset), optionally filtering by file extension.

fetch_huggingface_repo_snapshot(repo_id, local_dir)

Downloads all content from the specified HuggingFace repository.

fetch_huggingface_file(repo_id, filename, local_dir[, ...])

Downloads a specific file from a HuggingFace repository into a local directory.

fetch_huggingface_by_extension(repo_id, extension, ...)

Downloads files with a given extension from a HuggingFace repository.

Module Contents

emmi_data_management.interfaces.huggingface.HFRepoType
emmi_data_management.interfaces.huggingface.estimate_hf_repo_size(repo_id, repo_type='model', revision='main', extension=None)

Estimate total size (bytes) of all files in a HF repo (model or dataset), optionally filtering by file extension.

Parameters:
  • repo_id (str) – HF repo ID, e.g. “bert-base-uncased” or “user/my-dataset”

  • repo_type (HFRepoType) – “model” or “dataset”

  • revision (str) – branch/tag (default “main”)

  • extension (str | None) – if given (e.g. “.jsonl”), only count files whose names end with this extension

Returns:

  • Integer value for the total size in bytes.

Return type:

int
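The extension filter described above can be illustrated with a small, offline sketch. It is not the module's implementation: `total_size_bytes` and the in-memory `listing` are hypothetical names, and the real function presumably obtains file names and sizes from the Hub rather than from a hard-coded list.

```python
from typing import Iterable, Optional, Tuple


def total_size_bytes(
    files: Iterable[Tuple[str, Optional[int]]],
    extension: Optional[str] = None,
) -> int:
    """Sum file sizes, optionally keeping only names that end with `extension`."""
    total = 0
    for name, size in files:
        if extension is not None and not name.endswith(extension):
            continue  # skip files that do not match the requested extension
        total += size or 0  # treat an unknown size (None) as 0 bytes
    return total


# Hypothetical repository listing: (file name, size in bytes).
listing = [
    ("train.jsonl", 1_000),
    ("valid.jsonl", 250),
    ("README.md", 40),
]

all_bytes = total_size_bytes(listing)
jsonl_bytes = total_size_bytes(listing, extension=".jsonl")
```

Passing `extension=None` counts every file, mirroring the documented default.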

emmi_data_management.interfaces.huggingface.fetch_huggingface_repo_snapshot(repo_id, local_dir)

Downloads all content from the specified HuggingFace repository.

Parameters:
  • repo_id (str) – ID of the HuggingFace repository.

  • local_dir (pathlib.Path) – Local directory to download content to.

Returns:

  • None

Return type:

None

emmi_data_management.interfaces.huggingface.fetch_huggingface_file(repo_id, filename, local_dir, repo_type='model', revision='main')

Downloads a specific file from a HuggingFace repository into a local directory.

Parameters:
  • repo_id (str) – ID of the HuggingFace repository.

  • filename (str) – Filename to download.

  • local_dir (pathlib.Path) – Local directory to download the file to.

  • repo_type (HFRepoType) – Repo type, either “model” or “dataset”. Defaults to “model”.

  • revision (str) – Revision of the repository. Defaults to “main”.

Returns:

  • None

Return type:

None
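A call using the signature documented above might look like the following sketch. The wrapper `download_config` and the filename "config.json" are assumptions made for illustration; only the parameter names and defaults come from this reference. The import is deferred into the function body so the snippet can be loaded without the package installed.

```python
from pathlib import Path


def download_config(target: Path) -> None:
    """Fetch one file from a model repo into `target` (hypothetical helper)."""
    # Deferred import: requires emmi_data_management only when actually called.
    from emmi_data_management.interfaces.huggingface import fetch_huggingface_file

    # Creating the directory up front is a precaution; whether the library
    # creates it on its own is not stated in this reference.
    target.mkdir(parents=True, exist_ok=True)

    fetch_huggingface_file(
        repo_id="bert-base-uncased",
        filename="config.json",  # assumed filename, for illustration only
        local_dir=target,
        repo_type="model",
        revision="main",
    )
```

Because the function returns None, the caller has to build the resulting path itself, for example as `target / "config.json"`, assuming the file is stored under its repository-relative name.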

emmi_data_management.interfaces.huggingface.fetch_huggingface_by_extension(repo_id, extension, local_dir, revision='main', repo_type='dataset', max_workers=8)

Downloads files with a given extension from a HuggingFace repository.

Parameters:
  • repo_id (str) – ID of the HuggingFace repository.

  • extension (str) – File extension of the files to download.

  • local_dir (pathlib.Path) – Local directory to download the files to.

  • revision (str) – Revision of the repository. Defaults to “main”.

  • repo_type (HFRepoType) – Repo type, either “model” or “dataset”. Defaults to “dataset”.

  • max_workers (int) – Maximum number of workers to use for downloading. Defaults to 8.

Returns:

  • A list of downloaded files.

Return type:

list[str]
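The role of `max_workers` can be sketched with a thread pool that downloads the matching files concurrently. This is a minimal illustration under stated assumptions, not the module's code: `fetch_matching`, `download_one`, and `fake_download` are hypothetical names, and the stub only builds a local path instead of contacting the Hub.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def fetch_matching(
    filenames: List[str],
    extension: str,
    local_dir: Path,
    download_one: Callable[[str, Path], str],
    max_workers: int = 8,
) -> List[str]:
    """Download every file ending with `extension`, using up to `max_workers` threads."""
    matching = [name for name in filenames if name.endswith(extension)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, regardless of which
        # download finishes first.
        return list(pool.map(lambda name: download_one(name, local_dir), matching))


def fake_download(name: str, local_dir: Path) -> str:
    """Stand-in for a real download: returns the path the file would be saved to."""
    return str(local_dir / name)


downloaded = fetch_matching(
    ["a.csv", "b.txt", "sub/c.csv"],
    ".csv",
    Path("data"),
    fake_download,
    max_workers=2,
)
```

Threads suit this kind of work because downloads are I/O-bound; a larger `max_workers` allows more transfers in flight at once, at the cost of more simultaneous connections.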