Overview
While COGs are perfect for 2D imagery (satellite photos), specialized data like weather forecasts, ocean temperatures, and climate models are often N-Dimensional (Latitude, Longitude, Time, Elevation, Variable).
Zarr is a cloud-native specification for storing chunked, compressed, N-dimensional arrays. The specification defines a logical model on top of a simple key-value store rather than prescribing a single physical file layout. This decoupling allows Zarr to work across any key-value store, from local filesystems to S3 and even browser memory.
It is effectively the cloud-native equivalent of NetCDF or HDF5.
Key Concepts
1. Chunking
Similar to COG tiles, Zarr splits a massive array (e.g., a 10TB climate model) into small, equal-sized chunks. Each chunk is saved as a separate file (object) in cloud storage.
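The mapping from an array index to a chunk is simple integer division. A minimal sketch (the `chunk_key` helper is illustrative, not part of the Zarr API):

```python
# Sketch: how a chunked store maps array coordinates to chunk coordinates.

def chunk_key(index, chunk_shape):
    """Return the chunk coordinates holding a given N-D array index."""
    return tuple(i // c for i, c in zip(index, chunk_shape))

# A (time=730, lat=721, lon=1440) array stored in (100, 100, 100) chunks:
# element (365, 400, 1200) lives in chunk (3, 4, 12), which is stored as
# an object named like "3.4.12" (Zarr v2) or "c/3/4/12" (Zarr v3).
print(chunk_key((365, 400, 1200), (100, 100, 100)))  # → (3, 4, 12)
```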
2. Metadata (JSON)
Each Zarr group or array has a lightweight JSON metadata file (`.zgroup`/`.zarray` in v2, `zarr.json` in v3) that describes the shape, data type, and compression of the array. Clients read this small JSON first, then fetch only the chunks they need, in parallel.
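The metadata is small enough to show in full. A hypothetical v2 `.zarray` for a float32 temperature cube might look like this (field values are illustrative):

```json
{
    "zarr_format": 2,
    "shape": [730, 721, 1440],
    "chunks": [100, 100, 100],
    "dtype": "<f4",
    "compressor": {"id": "blosc", "cname": "zstd", "clevel": 5, "shuffle": 1, "blocksize": 0},
    "fill_value": "NaN",
    "order": "C",
    "filters": null
}
```

A client that has read this one small object knows the exact key of every chunk it needs, without listing the store.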
3. Parallel IO
Because every chunk is a separate file (key) on S3, you can read thousands of chunks in parallel using distributed computing tools like Dask. This allows for massive throughput that isn't possible with a single monolithic NetCDF file.
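Dask automates this fan-out, but the underlying access pattern can be sketched with the standard library alone. Here `fetch_chunk` is a hypothetical stand-in for an S3 GET of one chunk object:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(key):
    """Stand-in for an S3 GET, e.g. s3.get_object(Bucket=..., Key=key)."""
    return key, b"\x00" * 16  # dummy chunk bytes

# Eight independent chunk keys (v2-style dot-separated coordinates).
keys = [f"temperature/{t}.{y}.{x}"
        for t in range(2) for y in range(2) for x in range(2)]

# Because each key is an independent object, all GETs can be in flight at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = dict(pool.map(fetch_chunk, keys))

print(len(chunks))  # 8 chunk objects fetched concurrently
```

Dask does the same thing at scale, plus scheduling the computation that consumes each chunk.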
Why Zarr is Important
- Climate & Weather: It has become the standard for large-scale climate data (e.g., CMIP6).
- Write Performance: Proper parallel writing is easier with Zarr than with HDF5 in the cloud.
- No "File Locking": Multiple workers can write to different chunks of the same dataset simultaneously without corruption.
Zarr v3: The Next Generation
The Zarr v3 specification, now stable and published, introduces several features critical for modern cloud-native geospatial pipelines:
1. Sharding
In Zarr v2, every chunk is a separate file. For massive datasets, this creates a "small file problem"—millions of tiny objects that overwhelm cloud storage metadata and increase API costs.
Sharding in v3 allows you to group multiple logical chunks into a single physical "shard" (object). This decouples your logical chunk size (optimized for your analysis) from the physical storage size (optimized for S3 performance).
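The chunk-to-shard mapping is again integer arithmetic. A minimal sketch of the idea (the helper and the 4×4 packing are illustrative, not the Zarr API):

```python
# Sketch of v3 sharding's key idea: many logical chunks per physical object.
# chunks_per_shard is the shard size expressed in chunks (hypothetical values).

def shard_of(chunk_coords, chunks_per_shard):
    """Physical shard holding a logical chunk, plus the chunk's slot inside it."""
    shard = tuple(c // s for c, s in zip(chunk_coords, chunks_per_shard))
    within = tuple(c % s for c, s in zip(chunk_coords, chunks_per_shard))
    return shard, within

# A 16x16 grid of logical chunks packed 4x4 per shard:
# 256 chunk objects collapse into just 16 shard objects on S3.
shard, slot = shard_of((9, 14), (4, 4))
print(shard, slot)  # chunk (9, 14) lives in shard (2, 3), at slot (1, 2)
```

Readers still address data by logical chunk; an index inside each shard tells them which byte range of the shard object to request.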
2. Storage Transformers
V3 introduces a formal extension mechanism for Storage Transformers. This allows for advanced key-value mapping, such as variable-length chunks or alternative directory structures, without breaking compatibility with standard readers.
3. Unified Specification
Unlike v2, which was largely defined by the Python implementation, v3 is a formal, language-agnostic specification. This has led to high-performance implementations in C++, Rust, and JavaScript, making Zarr a first-class citizen in browser-based geospatial tools.
Tools and Libraries
- Xarray: The primary Python library for labeled multi-dimensional arrays. It supports Zarr natively.
- Zarr-Python: The core library.
- Dask: For parallel computing on Zarr chunks.
Practical Exercises
Two exercises demonstrate Zarr's core strengths: writing a chunked dummy climate dataset with xarray, then reading it back selectively to see how lazy loading and chunking work together to avoid unnecessary data transfer.
What happened?
- Chunking: We chunked by time. When we asked for the "January Mean," Xarray knew it only needed the first couple of chunk files.
- Lazy Loading: The `ds.sel(...)` call didn't touch the disk; it only built a task graph. Only `.compute()` executed it. This lazy pattern is critical when working with petabyte-scale datasets on the cloud.
Related Sessions from CNG Conference
<!-- TODO: Verify this YouTube ID is the correct Zarr format video -->
- Zarr: A Cloud-Native Data Format for Large Multi-Dimensional Arrays by Julia Signell [Element 84]
For more, browse the full CNG Conference archive.
Next Steps
We've covered Rasters (COG) and N-D Arrays (Zarr). Now let's tackle the format that powers our "Scout" project's vector analytics: GeoParquet.