Helios — Distributed Tiling to PMTiles

A four-notebook series that takes one San Francisco bounding box and turns it into self-contained PMTiles archives — one per data modality — then shows how to shard a layer into a multi-archive mosaic for web-scale delivery, using GeoBrix on Databricks.

The notebooks follow a solar site-selection narrative: identify candidate rooftops (vector, NB01), overlay an aerial basemap for visual validation (raster, NB02), and layer terrain-derived slope, aspect, and hillshade to score each candidate by sun exposure (elevation, NB03). A final notebook (NB04) re-publishes the vector layer as a distributed, sharded PMTiles mosaic (many archives plus a catalog and manifest) for web-scale delivery. The single-archive outputs from NB01 to NB03 render through the plot_pmtiles / plot_cog viewers, so the deliverable fits in one browser tab with no tile server.

This series is an on-ramp to Databricks-native spatial: NB01 uses the built-in st_area / st_centroid and h3_longlatash3 functions directly on the WKB column for roof metrics and H3 density; NB03 bins slope and aspect into H3 cells via gbx_rst_h3_rastertogridavg and joins them with native Databricks SQL to produce a solar_score — no format conversion required at any step.

View on GitHub

notebooks/examples/helios — download the folder and import config_nb.ipynb and the four numbered notebooks into your Databricks workspace to run.

Runs on the lightweight tier (Serverless) by default

The series uses the lightweight tier — pure Python/PySpark bindings (databricks.labs.gbx.pyrx, databricks.labs.gbx.pyvx) plus the geobrix[light,stac,vizx,overture] wheel, installed by config_nb — so it runs on Serverless with no JAR. Visualization helpers come from databricks.labs.gbx.vizx. To run it on the heavyweight tier instead, flip the commented option-2 in config_nb.ipynb (rasterx) and attach the GeoBrix JAR + GDAL init script to a classic x86 cluster. Spark-conf tuning is routed through a set_conf_safe() helper that no-ops on Serverless (where runtime config mutation is disallowed) and applies on classic clusters. See Execution Tiers.
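The no-op-on-Serverless behavior described above can be sketched as follows. This is a minimal illustration, not the helper that ships in config_nb: the signature (taking the session explicitly), the boolean return value, and the assumption that a disallowed mutation surfaces as an exception from spark.conf.set are all hypothetical. The _StubConf and _StubSession classes exist only so the sketch runs without a cluster.

```python
def set_conf_safe(spark, key, value):
    """Try to set a Spark conf; return True if applied, False if refused.

    Assumption: a runtime that disallows config mutation raises from
    spark.conf.set(). The real helper may detect Serverless differently.
    """
    try:
        spark.conf.set(key, value)
        return True
    except Exception:
        # Treat any refusal as "not tunable here" and carry on.
        return False


class _StubConf:
    """Hypothetical stand-in for spark.conf (not part of GeoBrix or Spark)."""

    def __init__(self, locked):
        self.locked = locked
        self.values = {}

    def set(self, key, value):
        if self.locked:
            raise RuntimeError(f"cannot modify config: {key}")
        self.values[key] = value


class _StubSession:
    """Hypothetical stand-in for a SparkSession exposing only .conf."""

    def __init__(self, locked):
        self.conf = _StubConf(locked)


classic = _StubSession(locked=False)     # behaves like a classic cluster
serverless = _StubSession(locked=True)   # behaves like Serverless

applied_classic = set_conf_safe(classic, "spark.sql.shuffle.partitions", "64")
applied_serverless = set_conf_safe(serverless, "spark.sql.shuffle.partitions", "64")
```

Returning a flag rather than raising lets the same notebook cell run unchanged on both tiers while still making it possible to log which settings were actually applied.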

PMTiles viewer — inline, no tile server

plot_pmtiles (from gbx.vizx) renders a .pmtiles archive directly in the notebook by embedding the archive as base64 in a self-contained MapLibre GL JS page (FileSource). In a Databricks notebook the embed must fit under the cell-output cap after displayHTML's ~2–3× inflation, so an archive over ~4–5 MB falls back to a static thumbnail. plot_cog renders a Cloud-Optimized GeoTIFF over a contextily basemap as a static matplotlib figure. See the PMTiles function reference for the archive aggregator and writer.
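The size arithmetic behind that fallback can be approximated with a few lines of Python. This is a rough sketch under stated assumptions, not the check plot_pmtiles performs: the function names are hypothetical, the 10 MB cap is taken as the default cell-output limit, and the inflation is modeled as one combined multiplier (2.5 here, picked from the middle of the quoted 2–3× range).

```python
MB = 1024 * 1024


def inline_budget_bytes(cell_output_cap_mb=10.0, inflation=2.5):
    """Largest archive size expected to survive the inline embed.

    Assumption: the embedded payload is roughly archive_size * inflation,
    and it must stay under the notebook cell-output cap.
    """
    return cell_output_cap_mb * MB / inflation


def fits_inline(archive_bytes, cell_output_cap_mb=10.0, inflation=2.5):
    """Return True if the archive is expected to render interactively."""
    return archive_bytes <= inline_budget_bytes(cell_output_cap_mb, inflation)


small_ok = fits_inline(3 * MB)   # 3 MB archive against a 4 MB budget
large_ok = fits_inline(6 * MB)   # 6 MB archive against a 4 MB budget
```

With the defaults the budget works out to 4 MB, and with inflation=2.0 it is 5 MB, which matches the ~4–5 MB range quoted above.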

The notebooks ship with INTERACTIVE_PLOTS = False so the committed .ipynb renders fast, GitHub-compatible static maps. Set INTERACTIVE_PLOTS = True in config_nb or in a cell right after %run ./config_nb for interactive MapLibre multi-layer maps.


Notebooks at a glance

01 — Vector Engine (MVT)

Notebook 01 — Overture buildings → MVT tiles → PMTiles archive → plot_pmtiles

  • Distributed MVT encoding: gbx_st_asmvt + gbx_st_asmvt_pyramid fan out tile generation across the cluster in parallel, producing a full zoom-range MVT pyramid without driver-side loops or single-node bottlenecks.
  • Databricks-native spatial composition: built-in st_area / st_centroid compute per-building roof area and centroid directly on the WKB column; h3_longlatash3 bins centroids into H3 cells for roof-density scoring. These native functions compose cleanly with GeoBrix MVT encoding, with no format conversion required.
  • Single-archive delivery: gbx_pmtiles_agg merges the distributed tile output into one sf_buildings.pmtiles archive on the Volume; plot_pmtiles (from gbx.vizx) renders it inline.

02 — Visual Basemap (XYZ)

Notebook 02 — NAIP aerial → Web Mercator → XYZ pyramid → PMTiles archive → plot_pmtiles

  • Distributed Web Mercator reprojection: gbx_rst_to_webmercator reprojects each NAIP scene across the cluster to the standard slippy-map CRS, scaling linearly with tile count rather than running serially on the driver.
  • Full-resolution XYZ pyramid: gbx_rst_xyzpyramid generates all zoom levels in a single distributed pass, matching the zoom range of the vector layer in NB01.
  • Raster PMTiles: gbx_pmtiles_agg writes the XYZ tile set into sf_naip.pmtiles; plot_pmtiles renders it inline. With INTERACTIVE_PLOTS = True, plot_interactive layers the aerial basemap under the NB01 building footprints in a single MapLibre map (buildings layer degrades gracefully if NB01 has not run).
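For readers new to XYZ pyramids, the sketch below shows how a longitude/latitude pair maps to a tile address at a given zoom in the standard Web Mercator slippy-map scheme. It is plain Python for illustration only and says nothing about how gbx_rst_xyzpyramid is implemented; the function name lonlat_to_tile and the sample coordinate (roughly downtown San Francisco) are assumptions introduced here.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) XYZ tile containing a WGS84 point at this zoom.

    Uses the standard Web Mercator slippy-map formula; latitudes beyond
    about +/-85.05 degrees are clamped to the first or last tile row.
    """
    n = 2 ** zoom  # tiles per axis at this zoom level
    x = math.floor((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so edge coordinates (lon == 180, extreme lat) stay in range.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


world = lonlat_to_tile(-122.4194, 37.7749, 0)      # single tile at z0
sf_z11 = lonlat_to_tile(-122.4194, 37.7749, 11)    # tile covering central SF
```

Each zoom level doubles the tile count per axis, so a tile at zoom z splits into four children at z+1, which is what makes a pyramid addressable by (z, x, y) alone.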

03 — Analytical Core (COG + STAC)

Notebook 03 — 3DEP DEM → COGs + STAC Delta → slope/aspect/hillshade → PMTiles → solar score per H3 cell

  • DemDownloader → COG + STAC Delta catalog: the gbx.sample DemDownloader stages a 3DEP DEM (windowed to the AOI), then gbx_rst_cog_convert converts it to a Cloud-Optimized GeoTIFF in a distributed pass; the resulting COG paths are written to a managed STAC Delta table for incremental updates and downstream notebook access.
  • Terrain analytics at scale: gbx_rst_slope, gbx_rst_aspect, and gbx_rst_hillshade derive terrain layers in parallel across all DEM tiles; gbx_rst_xyzpyramid + gbx_pmtiles_agg package the hillshade into sf_hillshade.pmtiles. With INTERACTIVE_PLOTS = True, plot_interactive layers the hillshade and the NB01 building footprints as PMTiles and the H3 solar_score as a grid layer, all in one MapLibre map.
  • Databricks-native solar scoring: gbx_rst_h3_rastertogridavg bins slope and aspect rasters into H3 cells; the resulting per-cell values are joined and scored with native Databricks SQL expressions to produce a solar_score column; h3_centeraswkb reconstructs geometry for map rendering, all without leaving the warehouse.

04 — Distributed Sharding & Mosaic

Notebook 04 — MVT pyramid → shard by parent tile → per-shard PMTiles archives → mosaic catalog + manifest

  • Spatial sharding into parallel work units — each pyramid tile is assigned to a coarse parent tile (z11) via shiftright(x, z-11), then groupBy(shard).agg(gbx_pmtiles_agg(...)) fans archive packing out across the cluster — one .pmtiles per shard instead of one monolithic file. The SF AOI spans four z11 parent tiles, but only the three containing buildings produce an archive (the SW tile is open water) — shards form only where there is data.
  • Shard catalog + mosaic manifest — a sf_building_shards Delta table maps each shard key to its archive path and bounds (pmtiles_info), and a mosaic.json manifest lets a client discover and assemble the shards with no tile server.
  • Web-scale delivery pattern — buffering the source query while keeping output tiles non-overlapping across shards keeps boundaries clean and avoids double-rendering; the mosaic pattern scales a single AOI into the per-file size range that object stores and CDNs serve efficiently.
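The shard assignment in the first bullet is a right shift of the tile coordinates by the zoom difference. The sketch below reproduces that arithmetic in plain Python so it can be checked without a cluster. The names shard_key and group_by_shard are hypothetical, and raising on tiles coarser than the shard zoom is an assumption, since the source does not say how NB04 treats them.

```python
from collections import defaultdict

SHARD_ZOOM = 11  # coarse parent zoom used for sharding in NB04


def shard_key(z, x, y, shard_zoom=SHARD_ZOOM):
    """Return the (zoom, x, y) of the tile's ancestor at shard_zoom."""
    if z < shard_zoom:
        # Assumption: tiles coarser than the shard zoom are out of scope here.
        raise ValueError(f"tile zoom {z} is coarser than shard zoom {shard_zoom}")
    shift = z - shard_zoom  # each zoom level halves the coordinate
    return (shard_zoom, x >> shift, y >> shift)


def group_by_shard(tiles):
    """Group (z, x, y) tiles by parent shard, like groupBy(shard) would."""
    shards = defaultdict(list)
    for z, x, y in tiles:
        shards[shard_key(z, x, y)].append((z, x, y))
    return dict(shards)


tiles = [(14, 2620, 6332), (13, 1310, 3166), (12, 656, 1583), (11, 327, 791)]
shards = group_by_shard(tiles)
```

Because every tile has exactly one ancestor at the shard zoom, the resulting groups never overlap, which is the property that keeps output tiles from being rendered twice across archives.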

Files

File | Purpose
config_nb.ipynb | Shared setup (%run ./config_nb from every main notebook). Installs the geobrix[light,stac,vizx,overture] wheel (2-step: --no-deps first, then with extras), selects the tier (option-1 pyrx/pyvx default / option-2 heavyweight), registers functions and light readers/writers, imports visualization helpers (plot_pmtiles, plot_cog, pmtiles_info), sets catalog_name / schema_name, creates the /Volumes/<cat>/<schema>/data/helios ETL tree, instantiates OvertureClient, and exposes FORCE_REBUILD and INTERACTIVE_PLOTS toggles.
01. Vector Engine (MVT).ipynb | Loads SF building footprints via OvertureClient, computes roof area and H3 roof density with Databricks built-in ST/H3 functions, encodes MVT tiles with gbx_st_asmvt + gbx_st_asmvt_pyramid, and packages the result into sf_buildings.pmtiles with gbx_pmtiles_agg.
02. Visual Basemap (XYZ).ipynb | Downloads NAIP aerial imagery for the SF AOI via NaipDownloader, reprojects to Web Mercator with gbx_rst_to_webmercator, generates an XYZ tile pyramid with gbx_rst_xyzpyramid, and writes sf_naip.pmtiles with gbx_pmtiles_agg.
03. Analytical Core (COG + STAC).ipynb | Downloads a 3DEP DEM via DemDownloader, converts it to a COG with gbx_rst_cog_convert, builds a STAC Delta catalog, derives slope/aspect/hillshade, packages sf_hillshade.pmtiles, and computes a per-H3-cell solar_score with gbx_rst_h3_rastertogridavg + native Databricks SQL.
04. Distributed Sharding & Mosaic.ipynb | Re-publishes the SF buildings vector layer as a multi-archive PMTiles mosaic: assigns each pyramid tile to a coarse parent shard (z11), packs one .pmtiles per shard with groupBy(shard).agg(gbx_pmtiles_agg), and writes a sf_building_shards Delta catalog + mosaic.json manifest (bounds via pmtiles_info) for client-side mosaic assembly.

Prerequisites

  • Databricks Runtime 17.3 LTS / 18 LTS, or Serverless (Scala 2.13 / Spark 4 / Python 3.12). The lightweight default runs on Serverless (set Environment to version 5+); the heavyweight option requires a classic x86 cluster.
  • GeoBrix 0.4.0. config_nb.ipynb %pip-installs geobrix[light,stac,vizx,overture] from a staged Volume wheel using the 2-step pattern (force-reinstall --no-deps first, then with the extras). For the heavyweight option, flip option-2 in config_nb.ipynb and attach the GeoBrix JAR + GDAL init script to the cluster.
  • Unity Catalog. Edit config_nb.ipynb to set catalog_name and schema_name. A Volume named data must already exist under <catalog>/<schema> — the notebooks create sub-directories inside it but will not create the Volume itself.
  • Network access. NB01 reads Overture Maps via OvertureClient; NB02 fetches NAIP via NaipDownloader and NB03 fetches 3DEP via DemDownloader, both from Planetary Computer STAC (online-only — no offline fallback). Classic cluster outbound internet is sufficient; Serverless has it by default.

Run order

  1. Open config_nb.ipynb, set catalog_name / schema_name, and verify the Volume exists.
  2. Run notebooks in numeric order: 01 → 02 → 03 → 04. Each notebook starts with %run ./config_nb so the shared state is re-established every time. NB04 reuses the building footprints from NB01, so run NB01 first (if overture_buildings_meta is absent, NB04 re-fetches them from Overture).

Each notebook is safe to re-run — outputs are written with skip-guards so already-built files are not re-downloaded or re-tiled. Set FORCE_REBUILD = True in a cell right after %run ./config_nb to force a full rebuild of that notebook's outputs.
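The skip-guard idea can be sketched in a few lines. This is an illustration of the pattern rather than the notebooks' actual code: build_if_needed and fake_build are hypothetical names, and the guard here checks only for file existence, whereas the real notebooks may check more than that.

```python
import tempfile
from pathlib import Path


def build_if_needed(out_path, build_fn, force_rebuild=False):
    """Run build_fn(out_path) unless the output exists; return True if built."""
    out_path = Path(out_path)
    if out_path.exists() and not force_rebuild:
        return False  # already built: skip the download / tiling work
    out_path.parent.mkdir(parents=True, exist_ok=True)
    build_fn(out_path)
    return True


calls = []


def fake_build(path):
    """Hypothetical stand-in for an expensive tiling step."""
    calls.append(path.name)
    path.write_bytes(b"tiles")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "helios" / "sf_buildings.pmtiles"
    first = build_if_needed(target, fake_build)                       # builds
    second = build_if_needed(target, fake_build)                      # skipped
    forced = build_if_needed(target, fake_build, force_rebuild=True)  # rebuilds
```

Passing the flag per call mirrors setting FORCE_REBUILD in a cell right after %run ./config_nb: the override applies to that notebook's outputs only.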


Data flow

San Francisco AOI (one bbox, reused across the series)
      │
┌─────┴───────────────┬─────────────────────────────┐
▼                     ▼                             ▼
Overture buildings    NAIP aerial (helper)          USGS 3DEP DEM (helper)
(OvertureClient)      │                             │
│                     ▼ gbx_rst_to_webmercator      ▼ gbx_rst_cog_convert
▼ gbx_st_asmvt        │                             │   → COGs + STAC Delta
  + st_asmvt_pyramid  ▼ gbx_rst_xyzpyramid          ▼ slope/aspect/hillshade
│                     │                             ▼ gbx_rst_xyzpyramid
▼ gbx_pmtiles_agg     ▼ gbx_pmtiles_agg             ▼ gbx_pmtiles_agg
sf_buildings.pmtiles  sf_naip.pmtiles               sf_hillshade.pmtiles
│                     │                             │
└─────────────────────┴──────────────┬──────────────┘
                                     ▼
                     plot_pmtiles / plot_cog (inline)
                     → solar site-selection view

Key GeoBrix / Databricks functions shown


Gotchas

  • PMTiles is driver-side only. gbx_pmtiles_agg produces the archive via a distributed reduce but the final file lands on the driver Volume path — it cannot be read back via spark.read. Use pmtiles_info / plot_pmtiles for inspection.
  • plot_pmtiles base64 size guard. The inline embed is bounded by the notebook cell-output cap (~10 MB default, 20 MB max), and displayHTML inflates the payload ~2–3× — so an archive over ~4–5 MB falls back to a static thumbnail. Scope the AOI/zoom range, use interactive_fit="downzoom", or stream from an https:// URL if the archive exceeds the ceiling.
  • Overture cloud-path vs. HTTP-href. OvertureClient first tries a direct cloud-storage path; if that is not reachable from the cluster network it falls back to an HTTP href. Classic clusters with S3 VPC endpoints reach the cloud path; Serverless always reaches the HTTP href.
  • NAIP and 3DEP network reachability. Both sources require outbound internet access via Planetary Computer and have no offline fallback — NB02 (NaipDownloader) and NB03 (DemDownloader) skip or error if the endpoint is unreachable.
  • Repartition by column on Serverless. A number-only repartition(N) is AQE-coalesced on Serverless. Always repartition by a data column (e.g., repartition(N, "tile_x")) before distributed UDF calls.
  • Wheel install is a 2-step pattern. config_nb installs the wheel with --no-deps first (to force fresh bytes), then reinstalls with geobrix[light,stac,vizx,overture] (to pull extras). A bare single-step --no-deps install drops the extras and causes ModuleNotFoundError at import.
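Since an archive cannot be read back via spark.read, a quick sanity check on the raw bytes can be useful alongside pmtiles_info. The sketch below parses a few fields from the fixed 127-byte header of a PMTiles v3 archive. It is not how pmtiles_info works: the function name read_pmtiles_header is hypothetical, and the byte offsets reflect a reading of the PMTiles v3 specification that should be verified against the spec before relying on them.

```python
import struct

# Tile-type codes as listed in the PMTiles v3 spec (assumed, verify if critical).
TILE_TYPES = {0: "unknown", 1: "mvt", 2: "png", 3: "jpeg", 4: "webp", 5: "avif"}


def read_pmtiles_header(data):
    """Parse magic, version, tile type and zoom range from header bytes."""
    if len(data) < 127:
        raise ValueError("need at least the 127-byte PMTiles header")
    if data[0:7] != b"PMTiles":
        raise ValueError("not a PMTiles archive (bad magic bytes)")
    version = data[7]
    # Bytes 99..101: tile type, min zoom, max zoom (one unsigned byte each).
    tile_type, min_zoom, max_zoom = struct.unpack_from("<BBB", data, 99)
    return {
        "version": version,
        "tile_type": TILE_TYPES.get(tile_type, "unknown"),
        "min_zoom": min_zoom,
        "max_zoom": max_zoom,
    }


# Synthetic header so the sketch runs without a real archive.
header = bytearray(127)
header[0:7] = b"PMTiles"
header[7] = 3      # spec version
header[99] = 1     # tile type: mvt
header[100] = 11   # min zoom
header[101] = 15   # max zoom
info = read_pmtiles_header(bytes(header))
```

In a notebook the same function could be fed the first 127 bytes of the file on the Volume, for example via open(path, "rb").read(127), where path is whatever location the archive was written to.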