Datacube Fundamentals

Overview

In the previous module, we managed vast collections of vector data using Apache Iceberg. Now, we turn our attention back to raster data—specifically, how to manage collections of images as a single multi-dimensional object.

A datacube is a representation of geospatial data in which different datasets (such as Sentinel-2, Landsat, or weather models) are aligned to a common coordinate system and time axis. Datacubes make it straightforward to ask questions like "how did vegetation change over this area across an entire year?" without manually wrangling individual images.

Key Concepts

1. The Datacube Abstraction

A geospatial datacube organizes raster imagery as an N-dimensional array with labeled axes: typically time, band, Y (latitude/northing), and X (longitude/easting). This structure lets you slice, aggregate, and compute across any combination of dimensions using the same xarray syntax, no matter how many images or bands are involved.
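
To make the labeled-axes idea concrete, here is a minimal sketch using a small synthetic cube. The values are random numbers rather than real imagery, and the dates, band names, and array sizes are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube: 6 dates x 3 bands x 4 rows x 5 columns (random values, not real imagery).
rng = np.random.default_rng(0)
cube = xr.DataArray(
    rng.random((6, 3, 4, 5)),
    dims=("time", "band", "y", "x"),
    coords={
        "time": pd.date_range("2023-01-01", periods=6, freq="MS"),
        "band": ["red", "green", "nir"],
        "y": np.arange(4),
        "x": np.arange(5),
    },
)

# Select one band by label instead of by integer position.
nir = cube.sel(band="nir")  # remaining dims: (time, y, x)

# Slice the time axis by date range; label-based slices include both endpoints.
spring = cube.sel(time=slice("2023-03-01", "2023-05-31"))

# Aggregate over time: one mean image per band.
mean_image = cube.mean(dim="time")  # remaining dims: (band, y, x)
```

The same three calls would work unchanged on a cube with hundreds of dates, which is the main appeal of addressing axes by name.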

2. Stacking Imagery

STAC items returned by a search often come with mismatched coordinate systems, different spatial resolutions, and pixel grids that don't align. Stacking puts all of them onto a common grid by reprojecting each image and snapping its pixels to a consistent reference. The stackstac library automates this process, turning a list of STAC items into a single 4D xarray.DataArray that is ready for analysis.
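
The "snapping" step can be illustrated without any geospatial library. The sketch below is not stackstac's internal code; it is a simplified, assumed version of the idea, with hypothetical helper names and made-up scene extents, showing how two slightly offset extents end up on the same 10 m grid.

```python
import math

def snap_bounds(bounds, resolution):
    """Expand (xmin, ymin, xmax, ymax) outward to the nearest multiples of `resolution`."""
    xmin, ymin, xmax, ymax = bounds
    return (
        math.floor(xmin / resolution) * resolution,
        math.floor(ymin / resolution) * resolution,
        math.ceil(xmax / resolution) * resolution,
        math.ceil(ymax / resolution) * resolution,
    )

def grid_shape(bounds, resolution):
    """Number of (rows, columns) needed to cover snapped bounds at the given pixel size."""
    xmin, ymin, xmax, ymax = bounds
    return (round((ymax - ymin) / resolution), round((xmax - xmin) / resolution))

# Two hypothetical scenes (projected coordinates in metres) offset by a few metres.
scene_a = (500003.0, 4180007.0, 500097.0, 4180093.0)
scene_b = (500011.0, 4180002.0, 500104.0, 4180088.0)

snapped_a = snap_bounds(scene_a, 10)
snapped_b = snap_bounds(scene_b, 10)

# After snapping, every edge is a multiple of 10 m, so the two scenes
# differ only by whole pixels and can share one array grid.
shape_a = grid_shape(snapped_a, 10)
shape_b = grid_shape(snapped_b, 10)
```

Real stacking also has to resample pixel values onto the new grid, which this sketch leaves out.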

3. STAC-to-Cube Pipeline

The typical workflow starts with a STAC API search, filters by cloud cover and date, then passes the resulting items to stackstac.stack(). Because stackstac uses Dask internally, the cube is built lazily: only item metadata is used to lay out the array, and no pixel data is downloaded until you trigger computation, for example by calling .compute() or .plot(). This means you can build and inspect the full cube structure before committing to any heavy network I/O.
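
A minimal sketch of this workflow might look like the following. It assumes the pystac-client and stackstac packages, a STAC API that supports the query extension, and a collection named "sentinel-2-l2a" with assets called "red" and "nir"; the helper names, EPSG code, resolution, and cloud-cover threshold are all illustrative choices rather than values taken from this module.

```python
def build_search_params(bbox, start, end, max_cloud):
    """Keyword arguments for a STAC search filtered by area, date range, and cloud cover."""
    return {
        "collections": ["sentinel-2-l2a"],  # assumed collection id
        "bbox": bbox,
        "datetime": f"{start}/{end}",
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
    }

def build_cube(api_url, bbox, start, end, max_cloud=20):
    """Search a STAC API and return a lazy (time, band, y, x) DataArray."""
    # Imported here so the helper above can be used without these packages installed.
    import pystac_client
    import stackstac

    client = pystac_client.Client.open(api_url)
    search = client.search(**build_search_params(bbox, start, end, max_cloud))
    items = search.item_collection()

    # Asset names, EPSG code, and resolution are assumptions for this sketch.
    return stackstac.stack(
        items,
        assets=["red", "nir"],
        epsg=32610,
        resolution=10,
        bounds_latlon=bbox,
    )

# Building the parameters needs no network access.
params = build_search_params(
    (-122.52, 37.70, -122.35, 37.83), "2023-01-01", "2023-12-31", 20
)
```

Calling build_cube() with the URL of a STAC API that hosts such a collection should return quickly, because only item metadata is fetched. Printing the result shows its shape and chunking, and pixel reads start only when you call .compute() on it or on something derived from it.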

From "Assets" to "Cubes"

When you search a STAC API, you get back a set of Items (an item collection). Each item has assets (bands like Red, Green, NIR), but across items and assets they might:

Be in different Coordinate Reference Systems (CRS).
Have different spatial resolutions (10 m vs 60 m).
Sit on pixel grids that are slightly shifted from each other.

To analyze them together (e.g., to see how vegetation changes over a year), you need to align them.
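
Before stacking, it can help to check how mixed a set of items really is. The sketch below works on plain dictionaries shaped like trimmed-down STAC items; the item ids and values are hypothetical, and it assumes the CRS is stored as "proj:epsg" in the item properties and the pixel size as "gsd" on each asset, which varies between catalogs (newer metadata may use "proj:code" instead).

```python
# Hypothetical, trimmed-down STAC items: only the fields inspected here are shown.
items = [
    {
        "id": "scene-a",
        "properties": {"proj:epsg": 32610},
        "assets": {"red": {"gsd": 10}, "coastal": {"gsd": 60}},
    },
    {
        "id": "scene-b",
        "properties": {"proj:epsg": 32611},
        "assets": {"red": {"gsd": 10}, "coastal": {"gsd": 60}},
    },
]

def summarize(items):
    """Collect the distinct CRS codes and pixel sizes found across items and assets."""
    epsg_codes = {item["properties"].get("proj:epsg") for item in items}
    resolutions = {
        asset.get("gsd") for item in items for asset in item["assets"].values()
    }
    return epsg_codes, resolutions

epsg_codes, resolutions = summarize(items)

# More than one CRS or pixel size means the rasters cannot be stacked as they are.
needs_alignment = len(epsg_codes) > 1 or len(resolutions) > 1
```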

Key Tool: stackstac

stackstac is a Python library that turns a list of STAC items into a 4D xarray.DataArray. It handles the heavy lifting of:

Reprojection: Warping images to a matching CRS.
Alignment: Snapping pixels to a consistent grid.
Laziness: Deferring work with Dask under the hood, so no pixel data is downloaded until you actually perform a calculation.
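
The lazy behaviour comes from Dask and can be seen in isolation. This sketch uses a plain dask.array rather than a stackstac cube, and the array size and chunking are arbitrary choices for illustration.

```python
import dask.array as da

# Describe a 2000 x 2000 array split into 16 chunks; nothing is allocated yet.
pixels = da.ones((2000, 2000), chunks=(500, 500))

# Still lazy: this only extends the task graph.
scaled_mean = (pixels * 2).mean()

# The work happens here, chunk by chunk.
result = scaled_mean.compute()
```

With a stackstac cube, the chunks are backed by remote image reads instead of in-memory arrays, so the .compute() step is where network I/O would occur.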

The 4 Dimensions

A typical geospatial datacube has four axes:

Time: When the image was taken.
Band: Which spectral band (Red, Blue, NIR, etc.).
Y: Latitude or Northing.
X: Longitude or Easting.

Practical Exercises

The exercise for this module builds a Sentinel-2 datacube over San Francisco using stackstac, calculates mean NDVI across a time range, and saves a visualization, covering the full STAC-to-cube-to-analysis pipeline in a single script.
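
The NDVI step of such a script can be sketched on synthetic data. This is not the exercise's actual code: the cube below uses constant made-up reflectance values and assumed band labels "red" and "nir", so that the result is easy to verify by hand.

```python
import numpy as np
import xarray as xr

# Synthetic cube with 3 dates, 2 bands, and a 2 x 2 pixel grid (constant values).
red_values = np.full((3, 2, 2), 0.1)
nir_values = np.full((3, 2, 2), 0.5)
cube = xr.DataArray(
    np.stack([red_values, nir_values], axis=1),
    dims=("time", "band", "y", "x"),
    coords={"band": ["red", "nir"]},
)

red = cube.sel(band="red")
nir = cube.sel(band="nir")

# NDVI = (NIR - Red) / (NIR + Red), computed per pixel and per date.
ndvi = (nir - red) / (nir + red)

# Collapse the time axis to get one mean NDVI image.
mean_ndvi = ndvi.mean(dim="time")
```

On a real, Dask-backed cube the same lines stay lazy until the result is computed or plotted. Saving a figure would additionally need a plotting library such as matplotlib.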

Next Steps

Now that you can work with high-level multidimensional arrays, you are ready to tackle Cloud Storage patterns, where you'll learn how to optimize access to these massive datasets directly from object storage (S3/Azure/GCS).

Techniques Learned

Tools Introduced