emmi_data_management.diskcache.lru_cache

LRU cache filesystem implementations with size management and cleanup.

This module provides filesystem cache implementations that wrap other filesystems (local or remote) and maintain a local cache of accessed files. The cache is managed using a Least Recently Used (LRU) eviction policy to stay within configurable size limits.

The module includes:
  • LRUCacheFileSystem: A basic LRU cache implementation using filesystem metadata.

  • SqliteLRUCacheFileSystem: An enhanced LRU cache using SQLite for metadata tracking, suitable for concurrent access scenarios.

Key Features:
  • Automatic cache size management with configurable watermarks

  • Thread-safe file locking to prevent race conditions

  • LRU-based eviction of cached files when size limits are exceeded

  • Support for any fsspec-compatible filesystem as the backing store

  • SQLite-based metadata tracking for improved concurrency

Example

Basic usage with an S3 filesystem:

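from emmi_data_management.diskcache.lru_cache import LRUCacheFileSystem

fs = LRUCacheFileSystem(
    storage="/tmp/lru_cache",  # illustrative local cache directory
    cache_size=10**9,  # 1 GB limit
    target_protocol="s3",  # backing filesystem, resolved by fsspec
)

with fs.open("s3://my-bucket/data.csv", "rb") as f:
    data = f.read()  # standard file-like object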

Note

The LRU tracking updates file access times on every read, which may impact performance on some filesystems. The SqliteLRUCacheFileSystem is recommended for multi-threaded or concurrent access patterns.

emmi_data_management.diskcache.lru_cache.LOCK_FILE_MODE

Default file mode for lock files (0o660).

Type:

int

emmi_data_management.diskcache.lru_cache.LOCK_FILE_SUFFIX

Suffix appended to cached file paths for lock files.

Type:

str

emmi_data_management.diskcache.lru_cache.CLEANUP_LOCK_FILE

Name of the global cleanup lock file.

Type:

str
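
The three constants above could fit together roughly as follows. This is a minimal sketch, not the module's implementation: the values are the ones listed under Module Contents (432 is the decimal form of 0o660), while the helpers `lock_path_for` and `create_lock_file` and the placement of the cleanup lock inside the storage directory are assumptions made for illustration.

```python
import os
import tempfile

# Values as listed in the module contents; 432 is the decimal form of 0o660.
LOCK_FILE_MODE = 0o660
LOCK_FILE_SUFFIX = ".lock"
CLEANUP_LOCK_FILE = ".cleanup_lock"


def lock_path_for(cached_path: str) -> str:
    """Hypothetical helper: the per-file lock is the cached path plus the suffix."""
    return cached_path + LOCK_FILE_SUFFIX


def create_lock_file(path: str) -> None:
    """Hypothetical helper: create an empty lock file with the default mode."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, LOCK_FILE_MODE)
    os.close(fd)
    os.chmod(path, LOCK_FILE_MODE)  # set the mode explicitly, ignoring the umask


storage = tempfile.mkdtemp()
per_file_lock = lock_path_for(os.path.join(storage, "data.csv"))
cleanup_lock = os.path.join(storage, CLEANUP_LOCK_FILE)  # assumed location
create_lock_file(per_file_lock)
create_lock_file(cleanup_lock)
```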

Attributes

Classes

LRUCacheFileSystem

Caches whole remote files on first access, with LRU cache eviction.

SqliteLRUCacheFileSystem

Caches whole remote files on first access, with SQLite metadata tracking.

Module Contents

emmi_data_management.diskcache.lru_cache.logger
emmi_data_management.diskcache.lru_cache.LOCK_FILE_MODE = 432
emmi_data_management.diskcache.lru_cache.LOCK_FILE_SUFFIX = '.lock'
emmi_data_management.diskcache.lru_cache.CLEANUP_LOCK_FILE = '.cleanup_lock'
class emmi_data_management.diskcache.lru_cache.LRUCacheFileSystem(storage, cache_size, target_protocol=None, target_options=None, fs=None, cache_storage_dmode=504, cache_storage_mode=432, enforce_size_every_seconds=2, cache_cleanup_high_watermark=0.95, cache_cleanup_low_watermark=0.8, **kwargs)

Bases: fsspec.implementations.cached.CachingFileSystem

Caches whole remote files on first access, with LRU cache eviction. This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This implementation only copies whole files.

The cache is kept within a specified size limit by removing the least recently used files when the limit is exceeded. File access times are updated on each access to ensure accurate LRU tracking. Note that every file access therefore requires a write to the filesystem to update the access time, which may degrade performance on some filesystems. See fsspec.implementations.cached.SimpleCacheFileSystem for a simpler implementation that does not delete old files.
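
The access-time bookkeeping described above can be illustrated with the standard library alone. The sketch below is not taken from this class: `touch_access_time` and `least_recently_used` are hypothetical helpers, and explicit timestamps are used so the resulting order does not depend on filesystem atime settings.

```python
import os
import tempfile


def touch_access_time(path: str, when: float) -> None:
    """Record an access by rewriting atime while keeping mtime unchanged."""
    st = os.stat(path)
    os.utime(path, (when, st.st_mtime))


def least_recently_used(directory: str) -> list:
    """Return file names in `directory`, oldest access time first."""
    entries = [e for e in os.scandir(directory) if e.is_file()]
    entries.sort(key=lambda e: e.stat().st_atime)
    return [e.name for e in entries]


cache_dir = tempfile.mkdtemp()
for name in ("a.bin", "b.bin", "c.bin"):
    with open(os.path.join(cache_dir, name), "wb") as f:
        f.write(b"x" * 10)

# Simulate accesses at explicit timestamps so the order is deterministic.
touch_access_time(os.path.join(cache_dir, "a.bin"), 1_000.0)
touch_access_time(os.path.join(cache_dir, "b.bin"), 3_000.0)
touch_access_time(os.path.join(cache_dir, "c.bin"), 2_000.0)

eviction_order = least_recently_used(cache_dir)  # least recently used first
```

Each recorded access is a metadata write, which is the cost the note above refers to.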


Parameters:
  • fs (fsspec.AbstractFileSystem | None) – The target filesystem to wrap with caching.

  • storage (str) – Path to the local directory where cached files will be stored.

  • cache_size (int | None) – Maximum size of the cache in bytes. If None, no size limit is enforced.

  • cache_storage_dmode – Directory permissions mode for the cache storage directory.

  • cache_storage_mode – File permissions mode for cached files.

  • enforce_size_every_seconds – Minimum time in seconds between cache size enforcement checks.

  • cache_cleanup_high_watermark – Fraction of cache_size at which cleanup is triggered.

  • cache_cleanup_low_watermark – Target fraction of cache_size to reduce to during cleanup.

  • target_protocol (str | None)

  • target_options (dict | None)
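
One way to read the watermark and interval parameters is sketched below. It assumes that cleanup triggers once the cache reaches `cache_cleanup_high_watermark * cache_size` and then frees space down to `cache_cleanup_low_watermark * cache_size`; the exact comparison used by the class is not documented here, and `bytes_to_evict` and `SizeCheckThrottle` are hypothetical names.

```python
import time
from typing import Optional


def bytes_to_evict(current_size: int, cache_size: int,
                   high_watermark: float = 0.95,
                   low_watermark: float = 0.8) -> int:
    """Return how many bytes to free, or 0 if cleanup is not triggered."""
    if current_size < high_watermark * cache_size:
        return 0  # below the trigger threshold, nothing to do
    return max(0, current_size - int(low_watermark * cache_size))


class SizeCheckThrottle:
    """Allow a size check at most once every `interval` seconds."""

    def __init__(self, interval: float = 2.0) -> None:
        self.interval = interval
        self._last_check = float("-inf")

    def should_check(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if now - self._last_check < self.interval:
            return False  # checked too recently
        self._last_check = now
        return True


throttle = SizeCheckThrottle(interval=2.0)
checks = [throttle.should_check(now=t) for t in (0.0, 1.0, 2.5)]
```

With a 1000-byte limit and the default watermarks, a cache holding 960 bytes would be trimmed by 160 bytes to reach 800, while one holding 900 bytes would be left alone.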

protocol = 'lrucache'
local_file = True
fs
storage
cache_storage_size
cache_storage_mode = 432
enforce_size_every_seconds = 2
cache_cleanup_high_watermark = 0.95
cache_cleanup_low_watermark = 0.8
cat_ranges(paths, starts, ends, max_gap=None, on_error='return', **kwargs)

Get the contents of byte ranges from one or more files.

Parameters:
  • paths (list) – A list of file paths on this filesystem.

  • starts, ends (int or list) – Byte limits of the read. If a single int is given, the same value is used for all the specified files.
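
The way a single int is applied to every path can be shown with a small stand-in that reads local files. `read_ranges` is a hypothetical helper written for this illustration, not the fsspec or cache implementation, and it ignores `max_gap` and `on_error`.

```python
import os
import tempfile


def read_ranges(paths, starts, ends):
    """Hypothetical stand-in: read bytes [start, end) from each local file."""
    if not isinstance(starts, list):
        starts = [starts] * len(paths)  # one int applies to every path
    if not isinstance(ends, list):
        ends = [ends] * len(paths)
    if not (len(paths) == len(starts) == len(ends)):
        raise ValueError("paths, starts and ends must have the same length")
    out = []
    for path, start, end in zip(paths, starts, ends):
        with open(path, "rb") as f:
            f.seek(start)
            out.append(f.read(end - start))
    return out


tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, "letters.bin"), os.path.join(tmp, "digits.bin")]
for path, payload in zip(paths, (b"abcdefgh", b"01234567")):
    with open(path, "wb") as f:
        f.write(payload)

same_range = read_ranges(paths, 0, 4)            # same limits for both files
per_file = read_ranges(paths, [0, 2], [3, 5])    # individual limits per file
```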

class emmi_data_management.diskcache.lru_cache.SqliteLRUCacheFileSystem(*args, **kwargs)

Bases: LRUCacheFileSystem

Caches whole remote files on first access, with SQLite metadata tracking.

This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This implementation only copies whole files, and keeps metadata about the download time and file details in a SQLite database. It is therefore safer to use in multi-threaded/concurrent situations.
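
To make the idea of SQLite-backed metadata concrete, here is a minimal sketch using the standard `sqlite3` module. The table name, columns, and helper functions are assumptions for illustration only; the schema actually used by this class is not shown in this reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real cache would use a file on disk
conn.execute(
    """
    CREATE TABLE cache_entries (
        path TEXT PRIMARY KEY,
        size_bytes INTEGER NOT NULL,
        downloaded_at REAL NOT NULL,
        last_access REAL NOT NULL
    )
    """
)


def record_download(path: str, size_bytes: int, now: float) -> None:
    """Insert or refresh the row for a newly cached file."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO cache_entries VALUES (?, ?, ?, ?)",
            (path, size_bytes, now, now),
        )


def record_access(path: str, now: float) -> None:
    """Update the last-access time without touching the cached file itself."""
    with conn:
        conn.execute(
            "UPDATE cache_entries SET last_access = ? WHERE path = ?",
            (now, path),
        )


def pick_victims(bytes_needed: int) -> list:
    """Choose least recently used entries until enough bytes would be freed."""
    victims, freed = [], 0
    rows = conn.execute(
        "SELECT path, size_bytes FROM cache_entries ORDER BY last_access ASC"
    )
    for path, size in rows:
        if freed >= bytes_needed:
            break
        victims.append(path)
        freed += size
    return victims


record_download("a", 100, now=1.0)
record_download("b", 200, now=2.0)
record_download("c", 300, now=3.0)
record_access("a", now=4.0)  # "a" becomes the most recently used entry
victims = pick_victims(250)
```

Keeping access times in a database means each update runs inside a transaction, which is one plausible reason such a design behaves better under concurrent access than per-file timestamp updates.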

Parameters:
  • fs – The target filesystem to wrap with caching.

  • storage – Path to the local directory where cached files will be stored.

  • cache_size – Maximum size of the cache in bytes. If None, no size limit is enforced.

  • cache_storage_dmode – Directory permissions mode for the cache storage directory.

  • cache_storage_mode – File permissions mode for cached files.

  • enforce_size_every_seconds – Minimum time in seconds between cache size enforcement checks.

  • cache_cleanup_high_watermark – Fraction of cache_size at which cleanup is triggered.

  • cache_cleanup_low_watermark – Target fraction of cache_size to reduce to during cleanup.

protocol = 'sqlitecache'