Thomas' Learning Hub

Zarr & Chunked Arrays

Multi-dimensional datasets for the cloud using the Zarr v3 specification and sharding.

Techniques Learned

Chunking Strategies, Sharding (Zarr v3), Concurrent Access

Tools Introduced

Zarr, Xarray, Dask

Overview

While COGs are perfect for 2D imagery (satellite photos), specialized data like weather forecasts, ocean temperatures, and climate models are often N-Dimensional (Latitude, Longitude, Time, Elevation, Variable).

Zarr is a cloud-native specification for storing these chunked, compressed, N-dimensional arrays. The specification defines how arrays, chunks, and metadata map to keys and values; it does not dictate where those keys and values live. This decoupling allows Zarr to work across any key-value store, from local filesystems to S3 and even browser memory.

It is effectively the cloud-native equivalent of NetCDF or HDF5.

Key Concepts

1. Chunking

Similar to COG tiles, Zarr splits a massive array (e.g., a 10TB climate model) into small, equal-sized chunks. Each chunk is saved as a separate file (object) in cloud storage.
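To make the chunk idea concrete, here is a minimal sketch in plain Python (no Zarr library involved) of how a reader can work out which chunks a selection overlaps. The array shape, the chunk shape, and the helper names are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import ceil


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each axis (an edge chunk may extend past the array)."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunks))


def chunks_for_selection(selection, chunks):
    """Indices of every chunk overlapping a half-open (start, stop) range per axis."""
    per_axis = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return list(product(*per_axis))


# Hypothetical array: 365 daily steps on a 180 x 360 grid, chunked by 30 days.
shape = (365, 180, 360)
chunks = (30, 180, 360)

grid = chunk_grid_shape(shape, chunks)
# The first 31 days over the whole grid.
needed = chunks_for_selection([(0, 31), (0, 180), (0, 360)], chunks)

print(grid)    # (13, 1, 1)
print(needed)  # [(0, 0, 0), (1, 0, 0)]
```

Only 2 of the 13 chunk objects are needed for this selection, which is the property that makes partial reads from cloud storage cheap.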

2. Metadata (JSON)

In Zarr v2, each group has a lightweight .zgroup JSON file and each array a .zarray JSON file that describes its shape, data type, chunk shape, and compression (v3 replaces both with a single zarr.json per group or array). Clients read this small JSON first, then fetch only the chunks they need.
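As an illustration, the sketch below parses a v2-style array metadata document and derives the chunk grid, the uncompressed chunk size, and a chunk key from it. The field names follow the v2 array metadata layout; the values themselves are made up for this example.

```python
import json
from math import ceil, prod

# Illustrative v2-style array metadata (the kind of content a .zarray file holds).
zarray_text = """
{
  "zarr_format": 2,
  "shape": [365, 180, 360],
  "chunks": [30, 180, 360],
  "dtype": "<f4",
  "compressor": {"id": "zlib", "level": 1},
  "fill_value": "NaN",
  "order": "C",
  "filters": null
}
"""

meta = json.loads(zarray_text)

# Everything a client needs to plan its reads comes from this one small document.
grid = [ceil(n / c) for n, c in zip(meta["shape"], meta["chunks"])]
itemsize = int(meta["dtype"][2:])  # "<f4" is a little-endian 4-byte float
chunk_bytes = prod(meta["chunks"]) * itemsize

# With the default v2 "." separator, chunk (1, 0, 0) is stored under key "1.0.0".
chunk_key = ".".join(str(i) for i in (1, 0, 0))

print(grid, prod(grid), chunk_bytes, chunk_key)
```

The point of the sketch is that no chunk data has to be downloaded to know how many chunks exist, how large they are, or what their keys are.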

3. Parallel IO

Because every chunk is a separate file (key) on S3, you can read thousands of chunks in parallel using distributed computing tools like Dask. This allows for massive throughput that isn't possible with a single monolithic NetCDF file.
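The pattern can be sketched with only the standard library. Here a dictionary stands in for the object store and a thread pool stands in for a Dask cluster; the key names and chunk contents are invented for the example, and a real pipeline would let Dask or Zarr issue these requests.

```python
import zlib
from array import array
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an object store: key -> compressed chunk bytes.
# Each hypothetical chunk holds 1000 float32 values equal to its time index.
store = {
    f"temperature/{t}.0.0": zlib.compress(array("f", [float(t)] * 1000).tobytes())
    for t in range(12)
}


def fetch_and_decode(key):
    """One independent request: get the object, decompress it, decode the floats."""
    raw = zlib.decompress(store[key])
    values = array("f")
    values.frombytes(raw)
    return key, sum(values) / len(values)


# Only the chunks the query needs; the other eight are never touched.
keys = [f"temperature/{t}.0.0" for t in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    means = dict(pool.map(fetch_and_decode, keys))

print(means)
```

Because each key is fetched and decoded independently, adding workers scales the read without any coordination between them.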

Why Zarr is Important

  • Climate & Weather: It has become the standard for large-scale climate data (e.g., CMIP6).
  • Write Performance: Proper parallel writing is easier with Zarr than with HDF5 in the cloud.
  • No "File Locking": Multiple workers can write to different chunks of the same dataset simultaneously without corruption.
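The last point holds only when workers really do write to different chunks: two workers updating parts of the same chunk would each rewrite the whole chunk object and could overwrite one another. A minimal sketch of a pre-flight check along one axis, using hypothetical helper names, might look like this:

```python
def chunks_touched(start, stop, chunk_len):
    """Chunk indices along one axis overlapped by the half-open range [start, stop)."""
    return set(range(start // chunk_len, (stop - 1) // chunk_len + 1))


def writes_conflict(regions, chunk_len):
    """True if any two workers' regions fall in the same chunk."""
    seen = set()
    for start, stop in regions:
        touched = chunks_touched(start, stop, chunk_len)
        if seen & touched:
            return True
        seen |= touched
    return False


# Three workers, each assigned whole 30-step chunks: safe to run concurrently.
aligned = [(0, 30), (30, 60), (60, 90)]
# Two workers splitting 90 steps in half: both touch chunk 1.
misaligned = [(0, 45), (45, 90)]

print(writes_conflict(aligned, 30))     # False
print(writes_conflict(misaligned, 30))  # True
```

Aligning each worker's write region to chunk boundaries is what lets the writes proceed without locks.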

Zarr v3: The Next Generation

The Zarr v3 specification, now stable and published, introduces several critical features for modern cloud-native geospatial pipelines:

1. Sharding

In Zarr v2, every chunk is a separate file. For massive datasets, this creates a "small file problem": millions of tiny objects that slow down listing and drive up per-request API costs.

Sharding in v3 allows you to group multiple logical chunks into a single physical "shard" (object). This decouples your logical chunk size (optimized for your analysis) from the physical storage size (optimized for S3 performance).
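The arithmetic behind that decoupling can be sketched in a few lines. The shapes below are hypothetical, and the helpers only model the chunk-to-shard mapping; a real shard also stores an index of where each inner chunk sits, so a reader can still fetch a single chunk with a byte-range request.

```python
from math import ceil, prod


def object_count(shape, per_object):
    """How many stored objects a regular grid of the given block shape produces."""
    return prod(ceil(n / c) for n, c in zip(shape, per_object))


def locate_chunk(chunk_index, chunks_per_shard):
    """Return (shard index, position inside that shard) for a logical chunk."""
    shard = tuple(i // n for i, n in zip(chunk_index, chunks_per_shard))
    inner = tuple(i % n for i, n in zip(chunk_index, chunks_per_shard))
    return shard, inner


shape = (3650, 1800, 3600)        # hypothetical 10-year daily global grid
chunk_shape = (10, 180, 360)      # logical chunks, sized for analysis
shard_shape = (100, 1800, 3600)   # physical objects, a whole multiple of chunk_shape
chunks_per_shard = tuple(s // c for s, c in zip(shard_shape, chunk_shape))

print(object_count(shape, chunk_shape))              # 36500 objects without sharding
print(object_count(shape, shard_shape))              # 37 objects with sharding
print(locate_chunk((123, 4, 7), chunks_per_shard))   # ((12, 0, 0), (3, 4, 7))
```

In this example the same data goes from 36,500 stored objects to 37 while the logical chunk size seen by the analysis stays unchanged.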

2. Storage Transformers

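V3 defines a formal extension point for storage transformers, which sit between the array and the underlying store and can change how keys and values are mapped, for example to support alternative key layouts. Because a transformer is declared in the array metadata, a reader can detect one it does not support instead of silently misreading the data.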

3. Unified Specification

Unlike v2, which was largely defined by the Python implementation, v3 is a formal, language-agnostic specification. This has led to high-performance implementations in C++, Rust, and JavaScript, making Zarr a first-class citizen in browser-based geospatial tools.

Tools and Libraries

  • Xarray: The primary Python library for labeled multi-dimensional arrays. It supports Zarr natively.
  • Zarr-Python: The core Python implementation of the Zarr specification.
  • Dask: For parallel computing on Zarr chunks.

Practical Exercises

Two exercises demonstrate Zarr's core strengths: writing a chunked dummy climate dataset with xarray, then reading it back selectively to see how lazy loading and chunking work together to avoid unnecessary data transfer.

What happened?

  1. Chunking: We chunked by time. When we asked for the "January Mean," Xarray knew it only needed the first couple of chunk files.
  2. Lazy Loading: The ds.sel(...) call didn't touch the disk; it only built a task graph. Calling .compute() executed it. This lazy pattern is critical when working with petabyte-scale datasets on the cloud.

For more, browse the full CNG Conference archive.

Next Steps

We've covered Rasters (COG) and N-D Arrays (Zarr). Now let's tackle the format that powers our "Scout" project's vector analytics: GeoParquet.

Practical Implementation

Source files from src/exercises/zarr/

Download exercise files from GitHub