EO Series — End-to-End STAC to Gridded Rasters
An end-to-end Earth Observation (EO) example series built on GeoBrix's RasterX functions, Databricks built-in Spatial SQL functions, and Microsoft Planetary Computer as the STAC source.
The four main notebooks move from vector area-of-interest → STAC discovery → band download → gridded (H3) raster tables → multi-band stacking and clipping. Sentinel-2 L2A over Alaska is used as the working dataset (scoped to a single county — Ketchikan, GEOID=2130 — to fit Planetary Computer free-tier limits).
notebooks/examples/eo-series — download the folder and import config_nb.ipynb, library.py, and the four numbered notebooks into your Databricks workspace to run.
Downloads are throttled on the Planetary Computer free tier. Notebooks are written to be safely interruptible and idempotent: re-runs skip files that already exist above the 1 KB validity threshold, and a Delta MERGE-based retry path (update_assets / download_missing_assets) repairs files corrupted by throttled auth responses.
Notebooks at a glance
01 — Discover EO imagery via STAC

- Spatially-indexed STAC search: tessellate any AOI polygon to H3 res-2 cells and query Planetary Computer per cell, so results are pre-keyed to the grid you'll join on later.
- Shapefile I/O without unzipping: the shapefile_ogr reader pulls TIGER counties straight from a .zip blob in the Volume; no scratch-disk shuffling required.
- Persisted, time-travel-friendly catalog: every search lands in a timestamped cell_assets_<ts>.delta directory, giving an auditable handoff into notebook 02.
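As a rough illustration of the per-cell query, the sketch below builds the JSON body of a STAC API item search for one cell. It assumes the cell boundary is already available as a GeoJSON polygon string (for example from h3_boundaryasgeojson). The function name build_search_body, the sample cell id, the coordinates, and the date range are all hypothetical, and the notebooks themselves go through pystac_client in library.py rather than building requests by hand.

```python
import json

# Public STAC endpoint for Microsoft Planetary Computer.
STAC_URL = "https://planetarycomputer.microsoft.com/api/stac/v1"


def build_search_body(cell_geojson, start, end, collection="sentinel-2-l2a", limit=100):
    """Build the JSON body for a STAC API item search over one grid cell.

    cell_geojson is a GeoJSON geometry string; start and end are ISO 8601 dates.
    """
    return {
        "collections": [collection],
        "intersects": json.loads(cell_geojson),
        "datetime": f"{start}/{end}",
        "limit": limit,
    }


# Hypothetical cell: the id and boundary are made up for illustration.
cell_id = "example-cell"
cell_geojson = json.dumps({
    "type": "Polygon",
    "coordinates": [[[-132.0, 55.0], [-131.0, 55.0], [-131.0, 56.0],
                     [-132.0, 56.0], [-132.0, 55.0]]],
})

body = build_search_body(cell_geojson, "2021-06-01", "2021-06-30")
# Keep the cell id next to the request so results stay keyed to the grid.
request = {"cellid": cell_id, "url": f"{STAC_URL}/search", "body": body}
print(request["url"])
print(request["body"]["datetime"])
```

Carrying the cell id alongside each request is what lets the returned items be written out already keyed to the grid.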
02 — Parallel band download with idempotent retry

- Spark-driven concurrent download: a pandas_udf (download_band) fans out per-(item, band) HTTPS retrievals across the cluster, writing files into a Volume and one Delta table per band.
- Idempotent and self-healing: a 1 KB validity threshold detects throttled auth-error payloads, and a Delta MERGE retry path (update_assets / download_missing_assets) repairs corrupt files without re-downloading the whole catalog.
- Cleanly bounded scope: only the bands you ask for (B02 / B03 / B04 / B08 by default) are pulled; the same flow extends to any other Sentinel-2 band.
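The retry path can be pictured as an upsert keyed on (item_id, band). The following is a plain-Python sketch of that idea, not the update_assets / download_missing_assets code itself. The record layout reuses the out_file_sz and is_out_file_valid column names from the band tables, while repair_invalid, the fetch callable, the item_id and band keys, and the sample rows are hypothetical. Treating 1 KB as 1024 bytes is also an assumption. In the notebooks the equivalent step is a Delta MERGE.

```python
FILE_SIZE_THRESHOLD = 1024  # assumed byte value; smaller files count as error payloads


def repair_invalid(records, fetch):
    """Re-fetch only invalid rows and merge the results back by key.

    records: list of dicts with item_id, band, out_file_sz.
    fetch:   callable(item_id, band) -> new file size in bytes.
    Returns a new list of copied rows; valid rows pass through unchanged.
    """
    table = {(r["item_id"], r["band"]): dict(r) for r in records}
    for key, row in table.items():
        if row["out_file_sz"] > FILE_SIZE_THRESHOLD:
            continue  # already valid, nothing to do
        new_size = fetch(*key)
        row["out_file_sz"] = new_size
        row["is_out_file_valid"] = new_size > FILE_SIZE_THRESHOLD
    return list(table.values())


rows = [
    {"item_id": "item-a", "band": "B02", "out_file_sz": 5_000_000, "is_out_file_valid": True},
    {"item_id": "item-a", "band": "B03", "out_file_sz": 550, "is_out_file_valid": False},
]
calls = []


def fake_fetch(item_id, band):
    """Stand-in for the real download; records which keys were retried."""
    calls.append((item_id, band))
    return 4_800_000  # pretend the retry succeeded


repaired = repair_invalid(rows, fake_fetch)
print(calls)
print([r["is_out_file_valid"] for r in repaired])
```

Only the one undersized row is fetched again, which is the property that keeps a retry cheap on a throttled tier.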
03 — Tessellate rasters to H3 cells

- One-step raster ingestion: the gdal reader (and the binaryFile → rst_fromcontent pattern) materializes a typed tile column with bytes, bbox, SRID, and standardized nodata in a single pass.
- Spatial-indexed raster tables: rst_h3_tessellate shreds each Sentinel-2 scene into H3 resolution-7 cells, producing band_b0X_h3 Delta tables that join cleanly across bands and dates.
- Raster analytics from SQL/PySpark: rst_summary for per-tile stats, h3_kring + rst_merge_agg for spatial neighborhoods, and rasterio_lambda for raster-to-timeseries projection, with no driver-side rasterio loops.
04 — Band Stacking + Clipping

- Multi-band assembly from grid joins: joins band_b02_h3 / b03 / b04 / b08 on (cellid, date), then rst_frombands produces a single 4-band (R, G, B, NIR) tile per cell-date.
- Round-trip GeoTIFF writes: the gdal writer (mode("append"), option("ext", "tif")) materializes the stacked rasters back to disk in a Volume, ready for downstream tools.
- CRS-safe geometry clipping: clip cutlines built from st_envelope / st_buffer are passed as EWKB with embedded SRID, and rst_clip reprojects automatically, with no per-tile CRS bookkeeping.
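The join-then-stack step can be sketched without Spark. In Sentinel-2, B04 is red, B03 is green, B02 is blue, and B08 is near-infrared, so (R, G, B, NIR) order means (B04, B03, B02, B08). The function stack_bands, the dictionary layout, and the placeholder tile strings below are hypothetical stand-ins for the band_b0X_h3 tables and rst_frombands.

```python
STACK_ORDER = ["B04", "B03", "B02", "B08"]  # R, G, B, NIR


def stack_bands(band_tables, order=STACK_ORDER):
    """Inner-join per-band tables on (cellid, date) and stack tiles in order.

    band_tables maps band name -> {(cellid, date): tile}.
    Only keys present in every band are kept, as in an inner join.
    """
    common = set(band_tables[order[0]])
    for band in order[1:]:
        common &= set(band_tables[band])
    return {key: [band_tables[band][key] for band in order] for key in sorted(common)}


tables = {
    "B02": {(1, "2021-06-01"): "blue-1", (2, "2021-06-01"): "blue-2"},
    "B03": {(1, "2021-06-01"): "green-1", (2, "2021-06-01"): "green-2"},
    "B04": {(1, "2021-06-01"): "red-1", (2, "2021-06-01"): "red-2"},
    "B08": {(1, "2021-06-01"): "nir-1"},  # cell 2 has no NIR tile
}

stacked = stack_bands(tables)
print(stacked)
```

Cell 2 drops out because one band is missing, which mirrors how an inner join across the four band tables yields only complete 4-band stacks.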
Files
| File | Purpose |
|---|---|
| config_nb.ipynb | Shared setup (%run ./config_nb from every main notebook). Installs pip deps, imports Spark/Delta/GeoBrix, sets Unity Catalog catalog_name / schema_name, creates the /Volumes/<cat>/<schema>/data/alaska ETL tree, and defines helper functions (download_band, update_assets, download_missing_assets, finalize_tiled_band_tbl, gen_tessellate_tiled_band, viz helpers as_gdf / cells_as_gdf). |
| library.py | Python module (not a notebook) with reusable functions imported from config_nb: pystac_client access, pandas UDFs for STAC search (get_items, get_assets) and asset download (download_asset, download_asset_v2), H3 cell generation (generate_cells), and raster/rasterio plotting helpers (plot_raster, plot_file, rasterio_lambda, to_numpy_arr). |
| 01. Search STACs.ipynb | Loads the TIGER US Counties shapefile via the shapefile_ogr reader, filters to Ketchikan, tessellates into H3 resolution-2 cells, converts each cell to GeoJSON, and queries Planetary Computer for sentinel-2-l2a items intersecting each cell. Writes the resulting STAC asset metadata to a timestamped Delta directory (cell_assets_<ts>.delta). |
| 02. Download STACs.ipynb | Reads cell_assets_*.delta, consolidates to unique item_ids, and downloads GeoTIFFs for bands of interest (B02, B03, B04, B08) into /Volumes/.../alaska/<band>/ using the download_band helper. Creates one band_<band> Delta table per band with out_file_path / out_file_sz / is_out_file_valid columns. Includes download_missing_assets / update_assets flows to patch files corrupted by free-tier throttling. |
| 03. Gridded EO Data.ipynb | For each band, joins the Delta band table with the gdal reader (GTiff), materializes band_<band>_tile (adds size, bbox, srid, and standardized nodata), then tessellates each tile to H3 resolution 7 into band_<band>_h3. Demonstrates rst_summary, bounding-box reprojection, h3_kring with rst_merge_agg, and raster → timeseries projection via rasterio_lambda. |
| 04. Band Stacking + Clipping.ipynb | Joins the four band_<band>_h3 tables on (cellid, date), stacks bands in (R, G, B, NIR) order with rst_frombands into the band_stack table, writes multi-band TIFs back out via the gdal writer, and demonstrates per-tile clipping with rst_clip using a centroid-envelope buffer built from Databricks built-in ST functions. |
Prerequisites
- Databricks Runtime 17.3 LTS (Scala 2.13 / Spark 4 / Python 3.12).
- GeoBrix installed on the cluster (JAR + Python wheel). The notebooks import the Python bindings directly from databricks.labs.gbx.rasterx.
- Unity Catalog: edit config_nb.ipynb to set catalog_name and schema_name to your own locations. A Volume named data must already exist under <catalog>/<schema>. The notebooks create a schema if missing but will not create the Volume for you.
- Compute sizing (the values used for the captured runs):
  - Notebooks 01/02 (search + download): AWS m5d.xlarge, 2–16 workers auto-scaling (up to ~64 concurrent downloads).
  - Notebooks 03/04 (raster processing): AWS r6id.2xlarge, 20 workers. An x86 instance is required for the GDAL JNI natives; memory/disk-optimized variants are recommended. For a single county a much smaller cluster is sufficient.
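Because the notebooks do not create the Volume, it has to exist before the first run. The helper below only formats names: it returns a CREATE VOLUME IF NOT EXISTS statement and the matching /Volumes path. The function name volume_setup and the catalog and schema values are placeholders, and running the statement (for example via spark.sql in a notebook) requires the appropriate Unity Catalog privileges.

```python
def volume_setup(catalog_name, schema_name, volume_name="data"):
    """Return (sql, path) for the Volume the notebooks expect to find."""
    sql = f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schema_name}.{volume_name}"
    path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"
    return sql, path


# Placeholder names; substitute your own catalog and schema.
sql, path = volume_setup("my_catalog", "my_schema")
print(sql)
print(path + "/alaska")  # root of the ETL tree that config_nb creates
# In a Databricks notebook, the statement could be run with: spark.sql(sql)
```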
Run order
- Open config_nb.ipynb, set catalog_name / schema_name, and verify the Volume exists.
- Run notebooks in numeric order: 01 → 02 → 03 → 04. Each notebook starts with %run ./config_nb so the shared state is re-established every time.
Each notebook is safe to re-run — Delta tables use do_overwrite=False / do_append=False by default, and file downloads skip anything already present above library.FILE_SIZE_THRESHOLD (1 KB, used to detect Planetary Computer auth-error responses masquerading as tiny "downloads").
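The skip rule amounts to a size check before each download. The sketch below shows one way to express it. The names needs_download and ensure_file are hypothetical, the fetch callable stands in for the real HTTPS retrieval, and treating 1 KB as 1024 bytes is an assumption.

```python
import os
import tempfile

FILE_SIZE_THRESHOLD = 1024  # assumed byte value of the 1 KB threshold


def needs_download(path, threshold=FILE_SIZE_THRESHOLD):
    """True when the file is missing or too small to be a real raster."""
    return not os.path.exists(path) or os.path.getsize(path) <= threshold


def ensure_file(path, fetch):
    """Download only when needed; returns True if a download happened."""
    if not needs_download(path):
        return False
    with open(path, "wb") as fh:
        fh.write(fetch())
    return True


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "B02.tif")

# A ~550-byte error body is below the threshold, so it is replaced on re-run.
with open(target, "wb") as fh:
    fh.write(b"x" * 550)

first = ensure_file(target, lambda: b"\0" * 4096)   # replaces the tiny file
second = ensure_file(target, lambda: b"\0" * 4096)  # skipped: already valid
print(first, second)
```

Checking size rather than mere existence is what makes an interrupted or throttled run safe to repeat: a stub left behind by an auth error never counts as done.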
Data flow
TIGER shapefile (shapefile_ogr reader)
│
▼ Ketchikan polygon → H3 res-2 cells → GeoJSON
STAC search (Planetary Computer, sentinel-2-l2a)
│
▼ cell_assets_<ts>.delta (nb 01)
Per-band asset download
│
▼ band_b02, band_b03, band_b04, band_b08 (nb 02)
GDAL read + tile metadata
│
▼ band_b0X_tile (nb 03)
H3 tessellation @ resolution 7
│
▼ band_b0X_h3 (nb 03)
Join on (cellid, date) + rst_frombands
│
▼ band_stack (nb 04)
GDAL writer → stacked TIFs + rst_clip
│
▼ /Volumes/.../alaska/out/stacked-tif
Key GeoBrix / Databricks functions shown
- GeoBrix RasterX (rx.rst_*): rst_h3_tessellate, rst_h3_tessellateexplode, rst_memsize, rst_initnodata, rst_boundingbox, rst_srid, rst_tryopen, rst_summary, rst_metadata, rst_numbands, rst_frombands, rst_fromcontent, rst_merge_agg (aggregator), rst_clip, rst_isempty.
- GeoBrix readers/writers: shapefile_ogr (zipped shapefiles without unzipping), gdal (GTiff reader + writer with mode("append") and option("ext", "tif")), binaryFile → rst_fromcontent pattern.
- Databricks built-in ST / H3 (DBF.*): st_geomfromwkt, st_transform, st_buffer, st_simplify, st_astext, st_asgeojson, st_aswkb, st_asewkb, st_centroid, st_envelope, h3_tessellateaswkb, h3_boundaryasgeojson, h3_boundaryaswkt, h3_toparent, h3_kring.
Gotchas
- Antimeridian: Alaska straddles the 180° meridian, so folium renderings can show results on both sides of the map.
- SRID awareness: Sentinel-2 tiles arrive in UTM zones (e.g. 32608, 32609), not EPSG:4326; reproject bboxes before plotting on a web map.
- Free-tier auth-failure payloads: Planetary Computer returns a ~550-byte XML error body when SAS tokens expire or rate limits hit. The is_out_file_valid column uses the 1 KB FILE_SIZE_THRESHOLD to detect these and enables a Delta MERGE-based retry via update_assets / download_missing_assets.
- Shuffle partitioning: Several helpers temporarily disable spark.sql.adaptive.coalescePartitions.enabled and raise spark.sql.shuffle.partitions during download / stacking to keep parallelism high, then restore the original value.
- Prefer EWKB for rst_clip cutlines: notebook 04 uses DBF.st_asewkb(DBF.st_envelope("buffer")) so the cutline's SRID travels with the bytes into rst_clip. Plain WKB (no SRID) is assumed to already be in the raster's CRS and is not reprojected; EWKB (or EWKT) with a valid SRID triggers reprojection when it differs from the raster CRS. Use st_asewkb / st_asewkt for robust, CRS-agnostic clipping.
- Scalar booleans pass through directly: in 0.3.0, rx.rst_clip("tile", "clip_wkb", True) accepts a bare Python True for the cutToCutline flag; no F.lit(True) wrapping is needed.
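To make the WKB versus EWKB distinction concrete, the sketch below packs a point both ways and reads the SRID back. It follows the PostGIS EWKB layout, in which an SRID flag (0x20000000) is set on the geometry type word and a 4-byte SRID follows it. The helper names point_wkb and read_srid are hypothetical, and whether st_asewkb emits exactly these bytes is an assumption.

```python
import struct

SRID_FLAG = 0x20000000  # EWKB flag: a 4-byte SRID follows the type word
WKB_POINT = 1


def point_wkb(x, y, srid=None):
    """Encode a 2D point as little-endian WKB, or EWKB when srid is given."""
    if srid is None:
        return struct.pack("<BIdd", 1, WKB_POINT, x, y)
    return struct.pack("<BIIdd", 1, WKB_POINT | SRID_FLAG, srid, x, y)


def read_srid(blob):
    """Return the embedded SRID, or None for plain WKB."""
    order = "<" if blob[0] == 1 else ">"  # first byte is the byte-order marker
    (geom_type,) = struct.unpack_from(order + "I", blob, 1)
    if not geom_type & SRID_FLAG:
        return None
    (srid,) = struct.unpack_from(order + "I", blob, 5)
    return srid


plain = point_wkb(-131.6, 55.3)
tagged = point_wkb(-131.6, 55.3, srid=4326)
print(len(plain), read_srid(plain))
print(len(tagged), read_srid(tagged))
```

The plain form carries no CRS information at all, which is why a consumer has to assume it already matches the raster, while the four extra bytes in the tagged form are enough to decide whether reprojection is needed.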