emmi_data_management.interfaces.s3

Classes

AWSSecrets

String enum of the environment/config key names used for AWS credentials and settings.

S3Object

Typed dictionary describing one listed S3 object (key, size, etag).

Functions

get_s3_client()

Construct an S3 client from managed credentials (env or config).

list_s3_objects(bucket, prefix[, extension])

List S3 objects under bucket/prefix with an optional extension filter. Skips directory placeholders (keys ending with '/') and normalizes quoted ETags.

estimate_s3_size(bucket, prefix[, extension])

Estimate size of objects under bucket/prefix with an optional extension filter.

fetch_s3_file(bucket, key, local_dir)

Download file from S3 bucket to local directory, preserving the key's subpath.

iter_s3_object_chunks(bucket, key, *[, chunk_size])

Stream an S3 object as chunks of bytes. Intended to be used with higher-level atomic writers / hashing in the CLI.

head_s3_object(bucket, key)

Lightweight HEAD to retrieve content length and etag (if available).

fetch_s3_prefix(bucket, prefix, local_dir[, ...])

Download all objects under bucket/prefix with an optional extension filter into a local directory.

Module Contents

class emmi_data_management.interfaces.s3.AWSSecrets

Bases: str, enum.Enum

AWS_ACCESS_KEY_ID = 'AWS_ACCESS_KEY_ID'
AWS_SECRET_ACCESS_KEY = 'AWS_SECRET_ACCESS_KEY'
AWS_SESSION_TOKEN = 'AWS_SESSION_TOKEN'
AWS_REGION = 'AWS_REGION'
AWS_DEFAULT_REGION = 'AWS_DEFAULT_REGION'
AWS_ENDPOINT_URL = 'AWS_ENDPOINT_URL'

class emmi_data_management.interfaces.s3.S3Object

Bases: TypedDict

key: str
size: int
etag: str | None

emmi_data_management.interfaces.s3.get_s3_client()

Construct an S3 client from managed credentials (env or config). Expected keys (matching env names):

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

  • optional AWS_SESSION_TOKEN

  • AWS_DEFAULT_REGION or AWS_REGION

  • optional AWS_ENDPOINT_URL (for MinIO / on‑prem / custom endpoints)

Falls back to unsigned access for public buckets when no non‑empty credentials are provided.

Return type:

botocore.client.BaseClient
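
The resolution rules above might be sketched as follows. This is a minimal illustration, not the module's implementation: `resolve_s3_settings` is a hypothetical helper, the precedence of AWS_DEFAULT_REGION over AWS_REGION is an assumption, and building the actual client from these settings is left out.

```python
from collections.abc import Mapping


def resolve_s3_settings(env: Mapping[str, str]) -> dict[str, object]:
    """Hypothetical helper: turn env-style keys into client settings."""
    access_key = env.get("AWS_ACCESS_KEY_ID", "").strip()
    secret_key = env.get("AWS_SECRET_ACCESS_KEY", "").strip()
    settings: dict[str, object] = {
        # Assumption: AWS_DEFAULT_REGION wins when both region keys are set.
        "region": env.get("AWS_DEFAULT_REGION") or env.get("AWS_REGION") or None,
        # Custom endpoint, e.g. for MinIO or an on-prem store.
        "endpoint_url": env.get("AWS_ENDPOINT_URL") or None,
    }
    if access_key and secret_key:
        settings["signed"] = True
        settings["access_key"] = access_key
        settings["secret_key"] = secret_key
        settings["session_token"] = env.get("AWS_SESSION_TOKEN") or None
    else:
        # No non-empty credentials: fall back to unsigned (public) access.
        settings["signed"] = False
    return settings


public = resolve_s3_settings({"AWS_ACCESS_KEY_ID": "", "AWS_REGION": "eu-west-1"})
private = resolve_s3_settings(
    {
        "AWS_ACCESS_KEY_ID": "example-id",
        "AWS_SECRET_ACCESS_KEY": "example-secret",
        "AWS_ENDPOINT_URL": "http://localhost:9000",
    }
)
```

Empty strings are treated the same as missing keys, which matches the "no non-empty credentials" wording above.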

emmi_data_management.interfaces.s3.list_s3_objects(bucket, prefix, extension=None)

List S3 objects under bucket/prefix with an optional extension filter. Skips directory placeholders (keys ending with ‘/’) and normalizes quoted ETags.

Parameters:
  • bucket (str) – Name of the S3 bucket.

  • prefix (str) – Key prefix to list under.

  • extension (str | None) – Optional file extension. Defaults to None.

Return type:

list[S3Object]
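
The filtering rules described for this function could look roughly like the sketch below. It works on plain dictionaries shaped like the entries of an S3 listing response (`Key`, `Size`, `ETag`); `filter_listing` is a hypothetical helper, and matching the extension as a key suffix is an assumption.

```python
from typing import Optional, TypedDict


class S3Object(TypedDict):
    key: str
    size: int
    etag: Optional[str]


def filter_listing(contents: list[dict], extension: Optional[str] = None) -> list[S3Object]:
    """Hypothetical helper applying the documented filtering rules."""
    objects: list[S3Object] = []
    for entry in contents:
        key = entry["Key"]
        if key.endswith("/"):
            continue  # directory placeholder, not a real object
        if extension is not None and not key.endswith(extension):
            continue  # assumption: extension is compared as a suffix
        etag = entry.get("ETag")
        objects.append(
            S3Object(
                key=key,
                size=entry["Size"],
                # S3 returns ETags wrapped in double quotes; strip them.
                etag=etag.strip('"') if etag else None,
            )
        )
    return objects


page = [
    {"Key": "data/", "Size": 0, "ETag": '"d41d"'},
    {"Key": "data/a.h5", "Size": 10, "ETag": '"abc123"'},
    {"Key": "data/b.txt", "Size": 5},
]
h5_only = filter_listing(page, extension=".h5")
everything = filter_listing(page)
```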

emmi_data_management.interfaces.s3.estimate_s3_size(bucket, prefix, extension=None)

Estimate size of objects under bucket/prefix with an optional extension filter.

Parameters:
  • bucket (str) – Name of the S3 bucket.

  • prefix (str) – File prefix.

  • extension (str | None) – Optional file extension. Defaults to None.

Returns:

  • A tuple with estimated size in bytes and total number of objects.

Return type:

tuple[int, int]
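
One plausible reading of the returned tuple is a sum over the listed objects. The sketch below assumes exactly that; `summarize_objects` is a hypothetical helper, and the real function may estimate the size differently.

```python
def summarize_objects(objects: list[dict]) -> tuple[int, int]:
    """Hypothetical helper: (total size in bytes, number of objects)."""
    total_bytes = sum(obj["size"] for obj in objects)
    return total_bytes, len(objects)


listing = [
    {"key": "data/a.h5", "size": 10, "etag": "abc123"},
    {"key": "data/b.h5", "size": 5, "etag": None},
]
size_bytes, count = summarize_objects(listing)
```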

emmi_data_management.interfaces.s3.fetch_s3_file(bucket, key, local_dir)

Download file from S3 bucket to local directory, preserving the key’s subpath.

Parameters:
  • bucket (str) – Name of the S3 bucket.

  • key (str) – File key.

  • local_dir (pathlib.Path) – Path to local directory.

Returns:

  • Local file path.

Return type:

pathlib.Path
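
"Preserving the key's subpath" can be pictured as joining the key onto the local directory. The sketch below only computes and prepares the target path; `local_path_for_key` is a hypothetical helper and the download itself is omitted.

```python
import tempfile
from pathlib import Path


def local_path_for_key(local_dir: Path, key: str) -> Path:
    """Hypothetical helper: where an object with this key would land locally."""
    target = local_dir / key
    # Create intermediate directories so the key's subpath exists on disk.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)
    target = local_path_for_key(base, "runs/2024/a.h5")
    relative = target.relative_to(base)
    parent_exists = target.parent.is_dir()
```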

emmi_data_management.interfaces.s3.iter_s3_object_chunks(bucket, key, *, chunk_size=1024 * 1024)

Stream an S3 object as chunks of bytes. Intended to be used with higher-level atomic writers / hashing in the CLI.

Parameters:
  • bucket (str) – S3 bucket name.

  • key (str) – S3 object key.

  • chunk_size (int) – Size of chunks in bytes. Defaults to 1024 * 1024 (1 MiB).

Yields:

Byte chunks from the object body.

Return type:

collections.abc.Iterator[bytes]
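
A consumer of the chunk iterator that hashes while writing atomically might look like the sketch below. `write_atomically` is a hypothetical helper, SHA-256 is an arbitrary choice of hash, and a plain in-memory iterator stands in for the real chunk stream.

```python
import hashlib
import os
import tempfile
from collections.abc import Iterable
from pathlib import Path


def write_atomically(chunks: Iterable[bytes], dest: Path) -> str:
    """Hypothetical consumer: write chunks to dest, return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                digest.update(chunk)
                tmp.write(chunk)
        os.replace(tmp_name, dest)  # atomic when source and target share a filesystem
    except BaseException:
        # Remove the partial file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return digest.hexdigest()


# Stand-in for the iterator returned by iter_s3_object_chunks(bucket, key).
fake_chunks = iter([b"hello ", b"world"])
with tempfile.TemporaryDirectory() as tmp_dir:
    out = Path(tmp_dir) / "nested" / "object.bin"
    checksum = write_atomically(fake_chunks, out)
    written = out.read_bytes()
    leftovers = [p.name for p in out.parent.iterdir() if p.name.endswith(".part")]
```

Because the data only appears under its final name after the rename, a reader never sees a half-written file.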

emmi_data_management.interfaces.s3.head_s3_object(bucket, key)

Lightweight HEAD to retrieve content length and etag (if available).

Parameters:
  • bucket (str) – S3 bucket name.

  • key (str) – S3 object key.

Returns:

  • (size_bytes, etag) with etag normalized (quotes stripped).

Return type:

tuple[int | None, str | None]
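
The shape of the return value can be illustrated by parsing a HEAD-style response dictionary. `parse_head_response` is a hypothetical helper; the `ContentLength` and `ETag` keys follow the S3 HEAD response, and either may be absent.

```python
from typing import Optional


def parse_head_response(response: dict) -> tuple[Optional[int], Optional[str]]:
    """Hypothetical helper: extract (size_bytes, etag) from a HEAD-style response."""
    size = response.get("ContentLength")
    etag = response.get("ETag")
    # ETags arrive wrapped in double quotes; strip them for comparison.
    return size, (etag.strip('"') if etag else None)


full = parse_head_response({"ContentLength": 42, "ETag": '"abc123"'})
empty = parse_head_response({})
```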

emmi_data_management.interfaces.s3.fetch_s3_prefix(bucket, prefix, local_dir, extension=None, max_workers=8)

Download all objects under bucket/prefix with an optional extension filter into a local directory.

Parameters:
  • bucket (str) – Name of the S3 bucket.

  • prefix (str) – File prefix.

  • local_dir (pathlib.Path) – Path to local directory.

  • extension (str | None) – Optional file extension. Defaults to None.

  • max_workers (int) – Number of workers to use for downloading. Defaults to 8.

Returns:

  • A list of relative paths (keys) written.

Return type:

list[str]
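
The role of max_workers can be illustrated with a thread pool. This sketch assumes a thread-based implementation, which the documentation does not confirm; `download_all` and `fake_download` are hypothetical stand-ins and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def download_all(
    keys: list[str],
    download_one: Callable[[str], str],
    max_workers: int = 8,
) -> list[str]:
    """Hypothetical helper: run download_one over keys with a bounded pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises worker exceptions.
        return list(pool.map(download_one, keys))


def fake_download(key: str) -> str:
    # Stand-in for a per-object download; returns the relative path written.
    return key


written_keys = download_all(["p/a.h5", "p/b.h5", "p/c.h5"], fake_download, max_workers=2)
```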