emmi_data_management.diskcache.lru_cache¶
LRU cache filesystem implementations with size management and cleanup.
This module provides filesystem cache implementations that wrap other filesystems (local or remote) and maintain a local cache of accessed files. The cache is managed using a Least Recently Used (LRU) eviction policy to stay within configurable size limits.
- The module includes:
LRUCacheFileSystem: A basic LRU cache implementation using filesystem metadata.
SqliteLRUCacheFileSystem: An enhanced LRU cache using SQLite for metadata tracking, suitable for concurrent access scenarios.
- Key Features:
Automatic cache size management with configurable watermarks
Thread-safe file locking to prevent race conditions
LRU-based eviction of cached files when size limits are exceeded
Support for any fsspec-compatible filesystem as the backing store
SQLite-based metadata tracking for improved concurrency
Example
Basic usage with an S3 filesystem:
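from emmi_data_management.diskcache.lru_cache import LRUCacheFileSystem

fs = LRUCacheFileSystem(
    storage="/tmp/lru-cache",  # local cache directory (illustrative path)
    cache_size=10**9,  # 1 GB limit
    target_protocol="s3",
)
with fs.open("s3://my-bucket/data.csv", "rb") as f:
    data = f.read()  # standard file-like object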
Note
The LRU tracking updates file access times on every read, which may impact performance on some filesystems. The SqliteLRUCacheFileSystem is recommended for multi-threaded or concurrent access patterns.
- emmi_data_management.diskcache.lru_cache.LOCK_FILE_MODE¶
Default file mode for lock files (0o660).
- Type: int
- emmi_data_management.diskcache.lru_cache.LOCK_FILE_SUFFIX¶
Suffix appended to cached file paths for lock files.
- Type: str
- emmi_data_management.diskcache.lru_cache.CLEANUP_LOCK_FILE¶
Name of the global cleanup lock file.
- Type: str
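As a rough illustration of how these constants could fit together, the sketch below derives a lock file path from a cached file path and creates the lock file with the documented mode. The helpers `lock_path_for` and `create_lock_file` are hypothetical names and not part of this module; the constant values are copied from this page, and POSIX permission semantics are assumed.

```python
import os
import stat
import tempfile

# Values as documented for this module (0o660 is 432 in decimal).
LOCK_FILE_MODE = 0o660
LOCK_FILE_SUFFIX = ".lock"


def lock_path_for(cached_path: str) -> str:
    """Hypothetical helper: the lock file sits next to the cached file."""
    return cached_path + LOCK_FILE_SUFFIX


def create_lock_file(cached_path: str) -> str:
    """Hypothetical helper: create the lock file if missing and return its path."""
    lock_path = lock_path_for(cached_path)
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, LOCK_FILE_MODE)
    os.close(fd)
    # os.open applies the process umask, so set the mode explicitly as well.
    os.chmod(lock_path, LOCK_FILE_MODE)
    return lock_path


with tempfile.TemporaryDirectory() as cache_dir:
    demo_lock = create_lock_file(os.path.join(cache_dir, "data.csv"))
    demo_lock_name = os.path.basename(demo_lock)
    demo_lock_mode = stat.S_IMODE(os.stat(demo_lock).st_mode)
```

The explicit `os.chmod` is a design choice in this sketch: without it, a restrictive umask could silently drop the group permissions that mode 0o660 is meant to grant.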
Attributes¶
Classes¶
LRUCacheFileSystem | Caches whole remote files on first access, with LRU cache eviction.
SqliteLRUCacheFileSystem | Caches whole remote files on first access, with SQLite metadata.
Module Contents¶
- emmi_data_management.diskcache.lru_cache.logger¶
- emmi_data_management.diskcache.lru_cache.LOCK_FILE_MODE = 432¶
- emmi_data_management.diskcache.lru_cache.LOCK_FILE_SUFFIX = '.lock'¶
- emmi_data_management.diskcache.lru_cache.CLEANUP_LOCK_FILE = '.cleanup_lock'¶
- class emmi_data_management.diskcache.lru_cache.LRUCacheFileSystem(storage, cache_size, target_protocol=None, target_options=None, fs=None, cache_storage_dmode=504, cache_storage_mode=432, enforce_size_every_seconds=2, cache_cleanup_high_watermark=0.95, cache_cleanup_low_watermark=0.8, **kwargs)¶
Bases:
fsspec.implementations.cached.CachingFileSystem
Caches whole remote files on first access, with LRU cache eviction. This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This implementation only copies whole files.
The cache is kept within a specified size limit by removing the least recently used files when the limit is exceeded. File access times are updated on each access to keep the LRU tracking accurate. Note that every file access therefore requires a write to the filesystem to update the access time, which may cause performance issues on some filesystems. See fsspec.implementations.cached.SimpleCacheFileSystem for a simpler implementation that does not delete old files.
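The eviction idea described above can be sketched with file access times alone. This is a minimal illustration, not the class's actual implementation: `evict_lru` and `touch_access_time` are hypothetical helpers, and the sketch assumes that the access time stored by `os.utime` is what orders the files.

```python
import os
import tempfile


def evict_lru(cache_dir: str, max_bytes: int) -> list[str]:
    """Delete least recently accessed files until the total size fits max_bytes.

    Hypothetical helper for illustration; returns the evicted file names.
    """
    entries = []
    for name in os.listdir(cache_dir):
        info = os.stat(os.path.join(cache_dir, name))
        entries.append((info.st_atime, name, info.st_size))
    entries.sort()  # oldest access time first

    total = sum(size for _, _, size in entries)
    evicted = []
    for _, name, size in entries:
        if total <= max_bytes:
            break
        os.remove(os.path.join(cache_dir, name))
        total -= size
        evicted.append(name)
    return evicted


def touch_access_time(path: str, when: float) -> None:
    """Record an access by setting atime while leaving mtime unchanged."""
    os.utime(path, (when, os.stat(path).st_mtime))


with tempfile.TemporaryDirectory() as cache_dir:
    for i, name in enumerate(["a.bin", "b.bin", "c.bin"]):
        path = os.path.join(cache_dir, name)
        with open(path, "wb") as f:
            f.write(b"x" * 100)
        touch_access_time(path, 1_000_000 + i)  # a is oldest, c is newest
    # Reading b again makes it the most recently used file.
    touch_access_time(os.path.join(cache_dir, "b.bin"), 1_000_010)
    evicted = evict_lru(cache_dir, max_bytes=150)
    remaining = sorted(os.listdir(cache_dir))
```

Setting the access time explicitly, as `touch_access_time` does, is the filesystem write that the note above warns about: each read costs one metadata update.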
Examples
- Parameters:
fs (fsspec.AbstractFileSystem | None) – The target filesystem to wrap with caching.
storage (str) – Path to the local directory where cached files will be stored.
cache_size (int | None) – Maximum size of the cache in bytes. If None, no size limit is enforced.
cache_storage_dmode – Directory permissions mode for the cache storage directory.
cache_storage_mode – File permissions mode for cached files.
enforce_size_every_seconds – Minimum time in seconds between cache size enforcement checks.
cache_cleanup_high_watermark – Fraction of cache_size at which cleanup is triggered.
cache_cleanup_low_watermark – Target fraction of cache_size to reduce to during cleanup.
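target_protocol (str | None) – Protocol of the target filesystem; as in fsspec's CachingFileSystem, provide either this or fs.
target_options (dict | None) – Options passed when instantiating the target filesystem if fs is not given.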
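To make the two watermark parameters concrete, the following sketch computes the byte thresholds they imply. It assumes that cleanup starts once usage reaches `cache_size * cache_cleanup_high_watermark` and stops at `cache_size * cache_cleanup_low_watermark`; the exact comparison used by the class is not stated here, and both function names are hypothetical.

```python
def cleanup_thresholds(
    cache_size: int, high: float = 0.95, low: float = 0.8
) -> tuple[int, int]:
    """Return (trigger_bytes, target_bytes) for a given cache size."""
    if not 0 < low <= high <= 1:
        raise ValueError("expected 0 < low <= high <= 1")
    return round(cache_size * high), round(cache_size * low)


def bytes_to_free(
    current_bytes: int, cache_size: int, high: float = 0.95, low: float = 0.8
) -> int:
    """Return how many bytes a cleanup pass would need to remove."""
    trigger, target = cleanup_thresholds(cache_size, high, low)
    if current_bytes < trigger:
        return 0  # below the high watermark: nothing to do (assumed behavior)
    return current_bytes - target


# Defaults from the class signature, with a 1 GB cache.
trigger_bytes, target_bytes = cleanup_thresholds(10**9)
needed_when_full = bytes_to_free(970_000_000, 10**9)
needed_when_idle = bytes_to_free(900_000_000, 10**9)
```

Keeping a gap between the two watermarks means one cleanup pass frees a sizeable chunk of space, so cleanup does not have to run again after every new download.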
- protocol = 'lrucache'¶
- local_file = True¶
- fs¶
- storage¶
- cache_storage_size¶
- cache_storage_mode = 432¶
- enforce_size_every_seconds = 2¶
- cache_cleanup_high_watermark = 0.95¶
- cache_cleanup_low_watermark = 0.8¶
- cat_ranges(paths, starts, ends, max_gap=None, on_error='return', **kwargs)¶
Get the contents of byte ranges from one or more files.
Parameters¶
- paths: list
A list of file paths on this filesystem.
- starts, ends: int or list
Byte limits of the read. If a single int is given, the same value is used for all of the specified files.
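The argument handling described above can be illustrated on plain local files. The function `read_ranges` below is a hypothetical stand-in written for this sketch, not the `cat_ranges` implementation; it only shows how a single int for `starts` or `ends` is applied to every path.

```python
import os
import tempfile


def read_ranges(paths, starts, ends):
    """Hypothetical stand-in that mirrors the documented argument handling.

    A single int for starts or ends is applied to every path.
    """
    if not isinstance(starts, list):
        starts = [starts] * len(paths)
    if not isinstance(ends, list):
        ends = [ends] * len(paths)
    if len(starts) != len(paths) or len(ends) != len(paths):
        raise ValueError("starts and ends must match paths in length")
    out = []
    for path, start, end in zip(paths, starts, ends):
        with open(path, "rb") as f:
            f.seek(start)
            out.append(f.read(end - start))
    return out


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "wb") as f:
        f.write(b"0123456789")
    with open(second, "wb") as f:
        f.write(b"abcdefghij")
    same_range = read_ranges([first, second], 2, 5)  # one range for all files
    per_file = read_ranges([first, second], [0, 7], [3, 10])  # one range each
```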
- class emmi_data_management.diskcache.lru_cache.SqliteLRUCacheFileSystem(*args, **kwargs)¶
Bases:
LRUCacheFileSystem
Caches whole remote files on first access, with SQLite metadata.
This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This implementation only copies whole files, and keeps metadata about the download time and file details in a SQLite database. It is therefore safer to use in multi-threaded/concurrent situations.
- Parameters:
fs – The target filesystem to wrap with caching.
storage – Path to the local directory where cached files will be stored.
cache_size – Maximum size of the cache in bytes. If None, no size limit is enforced.
cache_storage_dmode – Directory permissions mode for the cache storage directory.
cache_storage_mode – File permissions mode for cached files.
enforce_size_every_seconds – Minimum time in seconds between cache size enforcement checks.
cache_cleanup_high_watermark – Fraction of cache_size at which cleanup is triggered.
cache_cleanup_low_watermark – Target fraction of cache_size to reduce to during cleanup.
- protocol = 'sqlitecache'¶
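The sketch below shows one way SQLite could track LRU metadata for cached files. The table name, columns, and helper functions are all hypothetical; this page says only that download time and file details are kept in a SQLite database, so the actual schema may differ.

```python
import sqlite3

# In-memory database for the demo; a real cache would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache_entries ("
    " path TEXT PRIMARY KEY,"
    " size INTEGER NOT NULL,"
    " last_access REAL NOT NULL)"
)


def record_access(path: str, size: int, now: float) -> None:
    """Insert a new entry or refresh the access time of an existing one."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO cache_entries (path, size, last_access)"
            " VALUES (?, ?, ?)",
            (path, size, now),
        )


def select_victims(bytes_to_free: int) -> list[str]:
    """Return least recently used paths whose sizes cover bytes_to_free."""
    victims, freed = [], 0
    rows = conn.execute(
        "SELECT path, size FROM cache_entries ORDER BY last_access ASC"
    )
    for path, size in rows:
        if freed >= bytes_to_free:
            break
        victims.append(path)
        freed += size
    return victims


record_access("bucket/a.csv", 400, now=1.0)
record_access("bucket/b.csv", 300, now=2.0)
record_access("bucket/c.csv", 300, now=3.0)
record_access("bucket/a.csv", 400, now=4.0)  # a is read again
victims = select_victims(bytes_to_free=500)
```

Keeping access times in a database rather than in file metadata lets each update run inside a transaction, which is one plausible reason this variant suits concurrent use better.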