zarrs-python¶

This project serves as a bridge between zarrs (Rust) and zarr (zarr-python) via PyO3. The main goal of the project is to speed up i/o (see zarr_benchmarks).

To use the project, simply install our package (which depends on zarr-python>=3.0.0), and run:

import zarr
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

You can then use your zarr as normal (with some caveats)!

API¶

We export a ZarrsCodecPipeline class so that zarr-python can use the class but it is not meant to be instantiated and we do not guarantee the stability of its API beyond what is required so that zarr-python can use it. Therefore, it is not documented here.

At the moment, we only support a subset of the zarr-python stores:

LocalStore (local filesystem)
ObjectStore (cloud storage)
HTTPFileSystem via FsspecStore

A NotImplementedError will be raised if a store is not supported.

Configuration¶

ZarrsCodecPipeline options are exposed through zarr.config.

Standard zarr.config options control some functionality (see the defaults in the config.py of zarr-python):

threading.max_workers: the maximum number of threads used internally by the ZarrsCodecPipeline on the Rust side.
- Defaults to the number of threads in the global rayon thread pool if set to None, which is typically the number of logical CPUs.
array.write_empty_chunks: whether or not to store empty chunks.
- Defaults to false if None. Note that checking for emptiness has some overhead, see here for more info.

The ZarrsCodecPipeline specific options are:

codec_pipeline.chunk_concurrent_maximum: the maximum number of chunks stored/retrieved concurrently.
- Defaults to the number of logical CPUs if None. It is constrained by threading.max_workers as well.
codec_pipeline.chunk_concurrent_minimum: the minimum number of chunks retrieved/stored concurrently when balancing chunk/codec concurrency.
- Defaults to 4 if None. See here for more info.
codec_pipeline.validate_checksums: enable checksum validation (e.g. with the CRC32C codec).
- Defaults to True. See here for more info.
codec_pipeline.direct_io: enable O_DIRECT read/write, needs support from the operating system (currently only Linux) and file system.
- Defaults to False.
codec_pipeline.strict: raise exceptions for unsupported operations instead of falling back to the default codec pipeline of zarr-python.
- Defaults to False.

For example:

zarr.config.set({
    "threading.max_workers": None,
    "array.write_empty_chunks": False,
    "codec_pipeline": {
        "path": "zarrs.ZarrsCodecPipeline",
        "validate_checksums": True,
        "chunk_concurrent_maximum": None,
        "chunk_concurrent_minimum": 4,
        "direct_io": False,
        "strict": False
    }
})

If the ZarrsCodecPipeline is pickled, and then un-pickled, and during that time one of chunk_concurrent_minimum, chunk_concurrent_maximum, or num_threads has changed, the newly un-pickled version will pick up the new value. However, once a ZarrsCodecPipeline object has been instantiated, these values are then fixed. This may change in the future as guidance from the zarr community becomes clear.

Concurrency¶

Concurrency can be classified into two types:

chunk (outer) concurrency: the number of chunks retrieved/stored concurrently.
- This is chosen automatically based on various factors, such as the chunk size and codecs.
- It is constrained between codec_pipeline.chunk_concurrent_minimum and codec_pipeline.chunk_concurrent_maximum for operations involving multiple chunks.
codec (inner) concurrency: the number of threads encoding/decoding a chunk.
- This is chosen automatically in combination with the chunk concurrency.

The product of the chunk and codec concurrency will approximately match threading.max_workers.

Chunk concurrency is typically favored because:

parallel encoding/decoding can have a high overhead with some codecs, especially with small chunks, and
it is advantageous to retrieve/store multiple chunks concurrently, especially with high latency stores.

zarrs-python will often favor codec concurrency with sharded arrays, as they are well suited to codec concurrency.

Supported Indexing Methods¶

The following methods will trigger use with the old zarr-python pipeline:

Any oindex or vindex integer np.ndarray indexing with dimensionality >=3 i.e.,

arr[np.array([...]), :, np.array([...])]
arr[np.array([...]), np.array([...]), np.array([...])]
arr[np.array([...]), np.array([...]), np.array([...])] = ...
arr.oindex[np.array([...]), np.array([...]), np.array([...])] = ...

Any vindex or oindex discontinuous integer np.ndarray indexing for writes in 2D

arr[np.array([0, 5]), :] = ...
arr.oindex[np.array([0, 5]), :] = ...

vindex writes in 2D where both indexers are integer np.ndarray indices i.e.,
```
arr[np.array([...]), np.array([...])] = ...
```
Ellipsis indexing. We have tested some, but others fail even with zarr-python’s default codec pipeline. Thus for now we advise proceeding with caution here.
```
arr[0:10, ..., 0:5]
```

Furthermore, using anything except contiguous (i.e., slices or consecutive integer) np.ndarray for numeric data will fall back to the default zarr-python implementation.

Please file an issue if you believe we have more holes in our coverage than we are aware of or you wish to contribute! For example, we have an issue in zarrs for integer-array indexing that would unblock a lot the use of the rust pipeline for that use-case (very useful for mini-batch training perhaps!).

Further, any codecs not supported by zarrs will also automatically fall back to the python implementation.

zarrs-python¶

API¶

Configuration¶

Concurrency¶

Supported Indexing Methods¶

zarrs

Navigation

Related Topics