Benchmarking

GeoBrix ships a benchmark suite that compares the two execution tiers, Heavyweight (rasterx) and Lightweight (pyrx), on the same inputs. It reports both performance (per-function timing) and output consistency (whether the two tiers produce the same result).

This page explains what the benchmark measures, how to run it on a Databricks cluster (the primary, same-hardware comparison), how to run it locally, and how to read the output. It closes with tables of representative results, covering both the pure-core (algorithm-in-isolation) view and the spark-path at-scale view from a cluster.

Runtime support vs. benchmark environment

GeoBrix is supported on both DBR 17.3 LTS and DBR 18 LTS. The per-result Environment stamps below record the specific cluster each benchmark actually ran on (DBR 17.3 LTS); they are factual records of the run, not a statement that DBR 18 is unsupported.

What it measures

Each function is timed under two independent models:

  • Pure-core — the raster operation in isolation: open a single tile, call the function, measure. It is always one tile per measurement (repeated once for each tile shape in the corpus — each tile-size / band-count / dtype / SRID combination), and it ignores --row-counts entirely. This is the fairest apples-to-apples view of the algorithm itself, with no Spark or serialization in the path.
  • Spark-path — the registered function (rst_*) applied to a Spark DataFrame of N rows (the only model that uses --row-counts). This includes the realistic per-row overhead (UDF dispatch, serialization, Python-worker round-trips for the lightweight tier), and is swept across a row ladder (e.g. 10 → 100 → 1,000 → 10,000 rows).

Both models run the function --warmup times untimed (to absorb cold caches, JIT, and worker spin-up) and then --measured times timed; the reported figure is the median over the measured passes. Locally the defaults are 2 warmup / 5 measured; on a cluster they are 1 / 3 for pure-core and 1 / 1 for spark-path (one full N-tile iteration is already substantial).
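The warmup-then-measure protocol described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own harness: `time_median` and its parameters are hypothetical names, and the workload is a stand-in for a raster call.

```python
import statistics
import time
from typing import Callable


def time_median(fn: Callable[[], object], warmup: int = 2, measured: int = 5) -> float:
    """Return the median wall-clock seconds over `measured` timed calls to `fn`."""
    # Untimed passes absorb cold caches, JIT, and worker spin-up.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(measured):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to one slow pass than the mean.
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; a real run would call a raster function here.
    elapsed = time_median(lambda: sum(range(100_000)), warmup=2, measured=5)
    print(f"median: {elapsed * 1e3:.3f} ms")
```

With the local defaults (2 warmup, 5 measured) the function is called seven times and only the last five contribute to the reported figure.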

Alongside timing, every pure-core result carries an output fingerprint (per-band statistics) so the two tiers can be checked for consistency:

  • exact — every statistic is bitwise-equal across tiers.
  • within_tol — every statistic agrees to a relative tolerance of 1e-3 or an absolute tolerance of 1e-3 (the absolute floor handles near-zero values where a relative comparison is meaningless).
  • divergent — neither tolerance is met.
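The three labels can be expressed as a small classifier. The sketch below is an assumption-laden illustration rather than the benchmark's code: the function name, the dict-of-statistics fingerprint shape, and the choice of the larger magnitude as the relative reference are all hypothetical.

```python
import struct
from typing import Dict

REL_TOL = 1e-3  # relative tolerance from the definitions above
ABS_TOL = 1e-3  # absolute floor for near-zero values


def _bits(value: float) -> bytes:
    # Pack as IEEE-754 double so the comparison is bitwise, not numeric.
    return struct.pack("<d", value)


def _close(a: float, b: float) -> bool:
    diff = abs(a - b)
    # Either tolerance is sufficient (assumption: relative to the larger magnitude).
    return diff <= ABS_TOL or diff <= REL_TOL * max(abs(a), abs(b))


def classify(hw: Dict[str, float], lw: Dict[str, float]) -> str:
    """Label two per-band statistic fingerprints as exact, within_tol, or divergent."""
    if hw.keys() != lw.keys():
        # Assumption: a missing statistic counts as a divergence.
        return "divergent"
    if all(_bits(hw[k]) == _bits(lw[k]) for k in hw):
        return "exact"
    if all(_close(hw[k], lw[k]) for k in hw):
        return "within_tol"
    return "divergent"


if __name__ == "__main__":
    print(classify({"mean": 1000.0}, {"mean": 1000.5}))  # within_tol (relative)
    print(classify({"min": 0.0}, {"min": 0.0005}))       # within_tol (absolute floor)
    print(classify({"mean": 1.0}, {"mean": 1.1}))        # divergent
```

The second call shows why the absolute floor matters: a relative comparison against a value of zero would reject any nonzero difference.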

The goal of the one-line tier swap is that results stay consistent; the benchmark is how that guarantee is verified.

note

The benchmark is run from the GeoBrix source repository (it uses the repo's gbx:bench:* commands and a job notebook). A wheel-only install does not include these tools — the cluster benchmark submits a job to your own provisioned cluster from a checkout of the repo.

Running on a cluster

Running on a provisioned cluster is the true comparison: both tiers execute on the same hardware, against the same corpus, and the full row ladder and larger tiles are within reach. Results are appended to a bench_results Delta table, and comparison.csv and summary.md are written to the configured Volume.

Prerequisites

Provision a cluster and stage the artifacts per the installation guide:

| Tier | Cluster | Artifacts |
| --- | --- | --- |
| Heavyweight (rasterx) | x86 · DBR 17.3 LTS | Init script + bundle + GeoBrix wheel + the bench tests JAR (geobrix-*-tests.jar) |
| Lightweight (pyrx) | x86 or ARM | The [light] wheel only |

Then fill in the cluster configuration file (notebooks/tests/databricks_cluster_config.env) with your cluster ID and Volume paths.

Run

# Both tiers, same cluster, full row ladder, all functions
bash scripts/commands/gbx-bench-cluster.sh \
--cluster-id <your-cluster-id> \
--run-id cluster-2026-06 \
--modes both \
--row-counts 10,100,1000,10000

Scale the run with the options below. --cluster-id and --run-id identify the run; the rest are optional. --row-counts and --functions take a comma-separated list; --modes and --set take a single value.

| Option | Purpose |
| --- | --- |
| --modes pure-core \| spark-path \| both | Timing model: pick only one mode (default both, which runs both models). |
| --row-counts 10,100,1000,10000 | Spark-path row ladder: each value is the number of distinct tiles processed in one timed iteration (the main scale dimension); one iteration is measured per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default 10,100,1000,10000. |
| --set core \| full | Which function set to benchmark: the representative core set or every benchmarked function (full). Default core. |
| --functions rst_slope,rst_ndvi | Comma-separated list restricting the run to specific functions; overrides --set. Default: unset (benchmarks the --set). |
| --warmup / --measured | Warmup and measured iteration counts. Defaults: pure-core 1 / 3, spark-path 1 / 1. |
| --lightweight-only / --heavyweight-only | Run a single tier (mutually exclusive). --lightweight-only is required on ARM clusters (heavyweight is x86-only); --heavyweight-only skips the lightweight leg. Default: both tiers run. |
| --no-wait | Submit the job without blocking on completion. Default: waits for the run to finish. |

# ARM cluster: lightweight only
bash scripts/commands/gbx-bench-cluster.sh --cluster-id <arm-cluster-id> --lightweight-only

Tile sizes, band counts, data types, and projections are set when the corpus is generated (see the local section's scale knobs); the same corpus is reused for both tiers so the comparison is fair.

note

A spark-path iteration processes max(--row-counts) distinct tiles drawn from the corpus row pool — it does not recycle a small pool to reach the row count. The largest --row-counts value must therefore be ≤ the corpus row-pool size; if it isn't, the bench refuses to run rather than silently under-fill (which would report a row count it never actually processed). Generate a larger pool or lower the row ladder.
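The refuse-to-under-fill rule amounts to a simple pre-flight check. The following sketch shows the idea under stated assumptions: `parse_row_counts` and `check_row_ladder` are hypothetical helpers, not functions from the GeoBrix repository, and the error wording is invented for illustration.

```python
from typing import List


def parse_row_counts(value: str) -> List[int]:
    """Parse a comma-separated ladder such as '10,100,1000,10000'."""
    counts = [int(part) for part in value.split(",") if part.strip()]
    if not counts or any(n <= 0 for n in counts):
        raise ValueError(f"row counts must be positive integers: {value!r}")
    return counts


def check_row_ladder(row_counts: List[int], pool_size: int) -> None:
    """Fail fast if the largest rung cannot be filled with distinct tiles."""
    largest = max(row_counts)
    if largest > pool_size:
        raise ValueError(
            f"largest row count {largest} exceeds corpus row pool {pool_size}; "
            "generate a larger pool or lower the row ladder"
        )


if __name__ == "__main__":
    ladder = parse_row_counts("10,100,1000,10000")
    check_row_ladder(ladder, pool_size=10_000)  # passes: pool covers the top rung
    try:
        check_row_ladder(ladder, pool_size=1_000)
    except ValueError as err:
        print(f"refused: {err}")
```

Failing before any timing starts is what keeps a reported row count from ever exceeding the number of tiles actually processed.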

Output

  • bench_results Delta table — every measured row (tagged with environment + run ID), so runs accumulate and can be queried/visualized over time.
  • summary.md / comparison.csv on the Volume — the human-readable speedup + consistency report (see Reading the output), also rendered inline in the run notebook.

Running locally

The local pipeline runs the heavyweight tier in the geobrix-dev Docker container and the lightweight tier in an isolated Python virtual environment, then compares them. Local runs are intentionally single-tile for pure-core (the algorithm cost in isolation); the full row ladder and at-scale spark-path numbers belong on a cluster, but can still be exercised in the local Docker environment at a modest scale.

# Full local pipeline: generate corpus -> heavyweight -> lightweight -> compare
bash scripts/commands/gbx-bench-all.sh --run-id local-1 --modes pure-core

The heavyweight and lightweight legs run one after the other (never concurrently) so they don't contend for CPU and skew each other's timings. Outputs land in test-logs/bench/<run-id>/.

Set (core vs full)

--set selects how many functions the run benchmarks:

  • --set core (the default) runs a small, representative set covering each function family — accessors, terrain, band math, warps. It's fast and is the right choice for a routine check.
  • --set full runs every benchmarked function. It takes longer but gives the complete coverage and parity picture.

note

The core here is the function set (--set) — how many functions run. Don't confuse it with pure-core, the timing model (--modes) that times one tile in isolation. They are independent: you can run --set core --modes pure-core, --set full --modes spark-path, or any other combination.

An explicit --functions list overrides --set.

# Routine check (default): the representative core set
bash scripts/commands/gbx-bench-all.sh --modes pure-core

# Complete coverage: every benchmarked function
bash scripts/commands/gbx-bench-all.sh --set full --modes pure-core

Scale and shape the corpus with the options below. Each takes a comma-separated list to sweep several values (e.g. --tile-px 256,512,1024) or a single value (e.g. --tile-px 1024); the corpus is the combination across the options you set.

| Option | Purpose |
| --- | --- |
| --tile-px 256,512,1024,2048 | Tile sizes (pixels per side); any size is accepted. Larger tiles (1024², 2048², …) make the per-tile algorithm cost dominate the fixed overhead. Default 256,512. |
| --bands 1,4 | Band counts; any positive integer count. Default 2. |
| --dtypes uint8,int16,float32 | Pixel data types; these three are the full supported set (closed; other dtypes are not generated). Default float32. |
| --srids 4326,3857,32618,27700 | Projections; the full supported set (closed): 4326 (WGS84 geographic), 3857 (WebMercator), 32618 (UTM 18N), 27700 (British National Grid); other SRIDs are not generated. Default 4326,32618. Pick at least one geographic (lat/long) and one projected (metre) CRS to exercise both. |
| --nodata-frac 0.0,0.25 | Fraction of pixels set to NoData; any value from 0.0 to 1.0 (0% to 100%; continuous, not a fixed set). Default 0.0. Pass a comma-separated list to sweep several fractions. |
| --row-counts 10,100,1000,10000 | Spark-path row ladder: each value is the number of distinct tiles processed in one timed iteration (the scale dimension); the bench measures one iteration per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default 2,4 (laptop-modest); run the full ladder on a cluster. |
| --modes pure-core \| spark-path \| both | Timing model: pick only one mode. Default both (runs both models). |
| --warmup / --measured | Untimed warmup passes / timed passes per measurement (median of the timed passes is reported). Defaults 2 / 5 locally. |
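To see how quickly the corpus grows as options are swept, the combination can be modelled as a Cartesian product. This is a sketch under assumptions: `corpus_shapes` is a hypothetical helper, and treating the NoData fraction as one more product dimension is a guess about how the generator combines the options.

```python
from itertools import product
from typing import List, Sequence, Tuple


def corpus_shapes(
    tile_px: Sequence[int],
    bands: Sequence[int],
    dtypes: Sequence[str],
    srids: Sequence[int],
    nodata_fracs: Sequence[float],
) -> List[Tuple[int, int, str, int, float]]:
    """Enumerate every (tile size, bands, dtype, SRID, NoData fraction) combination."""
    return list(product(tile_px, bands, dtypes, srids, nodata_fracs))


if __name__ == "__main__":
    # Defaults listed in the table above.
    defaults = corpus_shapes([256, 512], [2], ["float32"], [4326, 32618], [0.0])
    print(len(defaults))  # 4

    # A wider sweep: 4 sizes x 2 band counts x 3 dtypes x 4 SRIDs x 2 fractions.
    wide = corpus_shapes(
        [256, 512, 1024, 2048],
        [1, 4],
        ["uint8", "int16", "float32"],
        [4326, 3857, 32618, 27700],
        [0.0, 0.25],
    )
    print(len(wide))  # 192
```

Because pure-core takes one measurement per tile shape, a wide sweep multiplies run time accordingly; narrowing one option is usually the cheapest way to keep a local run short.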

The individual stages are also available as standalone commands — gbx:bench:gen-data, gbx:bench:heavyweight, gbx:bench:lightweight, and gbx:bench:compare — if you want to regenerate just one part of the pipeline.

Reading the output

The comparison summary.md opens with insights (biggest wins, the consistency tally) followed by a per-function table. For a spark-path run it looks like:

Consistency (7 compared cells): exact 0 - within-tol 7 - divergent 0

Tile scale: 1000 tiles/iteration (spark-path) — every tile processed each timed iteration.

| fn | hw_iter_s | lw_iter_s | hw_per_tile_s | lw_per_tile_s | speedup | consistency |
| rst_dtmfromgeoms_agg | 175.10 | 9.94 | 0.17510 | 0.00994 | 17.61 | within_tol |
| rst_slope | 7.31 | 7.57 | 0.00731 | 0.00757 | 0.97 | timing-only |
| rst_proximity | 0.52 | 8.12 | 0.00052 | 0.00812 | 0.06 | timing-only |
  • hw_iter_s / lw_iter_s are the median wall-clock of one full iteration over all N tiles (the whole distributed job). hw_per_tile_s / lw_per_tile_s are that ÷ N — the amortized per-tile cost. (A pure-core summary uses hw_ms / lw_ms and hw_mpix/s / lw_mpix/s for the single-tile algorithm cost instead.)
  • speedup is heavy / light — greater than 1 means the lightweight tier is faster.
  • consistency is the exact / within_tol / divergent label defined above (timing-only where the output can't be fingerprinted for a direct comparison — readers, metadata accessors, and most non-aggregator spark-path cells). Per-cell deltas are in comparison.csv (the max_rel_delta column).
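The derived columns follow directly from the two iteration medians. As a small worked example (the function name is hypothetical; the inputs are the rst_slope row from the sample table above):

```python
from typing import Dict


def derive_columns(hw_iter_s: float, lw_iter_s: float, n_tiles: int) -> Dict[str, float]:
    """Compute amortized per-tile cost and speedup from whole-iteration medians."""
    return {
        "hw_per_tile_s": hw_iter_s / n_tiles,
        "lw_per_tile_s": lw_iter_s / n_tiles,
        # heavy / light: above 1 means the lightweight tier is faster.
        "speedup": hw_iter_s / lw_iter_s,
    }


if __name__ == "__main__":
    row = derive_columns(hw_iter_s=7.31, lw_iter_s=7.57, n_tiles=1000)
    print(f"{row['hw_per_tile_s']:.5f} {row['lw_per_tile_s']:.5f} {row['speedup']:.2f}")
    # prints: 0.00731 0.00757 0.97
```

A speedup just under 1, as here, means the heavyweight tier was marginally faster on that function.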

A per-engine summary is also written for each tier (heavyweight.summary.md, lightweight.summary.md) with that tier's own timing and throughput in isolation.

Results

Representative results across the benchmark families, organized by function group.

Raster readers & writers

The raster reader/writer is benchmarked the same way on both tiers: read 1,000 × 1024² GeoTIFF tiles from a UC Volume (count), and write the same 1,000 tiles back out as GeoTIFF. Lightweight uses raster_gbx / gtiff_gbx (rasterio); heavyweight uses gdal / gtiff_gdal (native GDAL in the JVM). Run with gbx:bench:cluster --readers-only --spark-warmup 0 --spark-measured 1.

Environment: Databricks · DBR 17.3 LTS · 1,000 × 1024² GeoTIFF tiles · whole-job wall-clock median, one measured iteration (I/O, not JIT-sensitive — no warm-up).

| Operation | rows | light | heavy | ratio |
| --- | --- | --- | --- | --- |
| Read GeoTIFF (raster_gbx / gdal) | 1,000 | 22.7 s | 9.5 s | 2.4× (heavy faster) |
| Write GeoTIFF (gtiff_gbx / gtiff_gdal) | 1,000 | 4.7 s | 4.0 s | 1.18× (≈ parity) |

Both tiers emit the same tile schema, so they are drop-in swaps. The writer runs at parity (~1.2×): the GeoTIFF encode dominates and costs about the same whether it runs in the JVM or via rasterio. The reader trails ~2.4× — the cost is moving each decoded tile's bytes across the Spark Python DataSource boundary (JVM↔Python ser/de), not the decode itself. That boundary is the structural tax of the pure-Python tier, and it is the trade for running where the heavyweight tier can't (Serverless, ARM, no JAR / init script). At scale the read still parallelizes across the cluster (~23 ms/tile here, spread over the workers).

Tiled output (PMTiles writer)

The PMTiles writer is benchmarked separately from the rst_* functions: it is a write that packages a tile pyramid into a .pmtiles archive, not a per-tile transform. The lightweight pmtiles_gbx writer and the heavyweight pmtiles writer take the same (z, x, y, bytes) input; the benchmark writes 1,000 tiles to a single archive on each tier and verifies decoded-tile parity between the two outputs. Run it with gbx:bench:cluster --pmtiles-only --row-counts 1000.

Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic PNG tiles → one .pmtiles archive per tier · whole-job wall-clock median.

| Writer | Tier | Whole-job median (1,000 tiles) | Speedup |
| --- | --- | --- | --- |
| pmtiles_gbx | lightweight | 18.4 s | |
| pmtiles | heavyweight | 20.7 s | 1.13× (lightweight faster) |

Parity: the two archives decode to the same 1,000 (z, x, y) tiles with byte-identical tile data. This is a hard gate — the benchmark fails if the tiers diverge — so the timing above is a like-for-like comparison of identical output.
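A decoded-tile parity gate of this kind can be sketched as a comparison of two mappings from (z, x, y) to tile bytes. The helpers below are hypothetical and assume the archives have already been decoded into dictionaries; they do not read .pmtiles files themselves.

```python
from typing import Dict, List, Tuple

TileKey = Tuple[int, int, int]  # (z, x, y)


def tile_parity_problems(
    light: Dict[TileKey, bytes], heavy: Dict[TileKey, bytes]
) -> List[str]:
    """List every way the two decoded tile sets disagree (empty list means parity)."""
    problems = []
    for key in sorted(light.keys() - heavy.keys()):
        problems.append(f"{key} only in lightweight archive")
    for key in sorted(heavy.keys() - light.keys()):
        problems.append(f"{key} only in heavyweight archive")
    for key in sorted(light.keys() & heavy.keys()):
        if light[key] != heavy[key]:
            problems.append(f"{key} tile bytes differ")
    return problems


def assert_tile_parity(light: Dict[TileKey, bytes], heavy: Dict[TileKey, bytes]) -> None:
    """Hard gate: raise if the tiers diverge, so timings are never reported for unequal output."""
    problems = tile_parity_problems(light, heavy)
    if problems:
        raise AssertionError("; ".join(problems))


if __name__ == "__main__":
    light = {(0, 0, 0): b"root", (1, 0, 0): b"child"}
    heavy = dict(light)
    assert_tile_parity(light, heavy)  # passes silently
    heavy[(1, 0, 0)] = b"changed"
    print(tile_parity_problems(light, heavy))
```

Collecting all problems before raising makes a failed gate easier to diagnose than stopping at the first mismatch.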

Both writers are two-phase (executors write intermediates, the driver assembles the archive), so the benchmark stages intermediates on a shared filesystem rather than node-local disk; the lightweight writer streams its per-partition scratch and assembles on the driver, the same model the sharded/mosaic output uses at scale.

Tiled output aggregate (pmtiles_agg)

pmtiles_agg is the grouped-aggregate companion to the PMTiles writer: it folds a group of (tile, z, x, y) rows into one PMTiles archive per group key inside a groupBy(...).agg(pmtiles_agg(...)) job. It is format-agnostic — the same function archives raster or vector tiles — and is registered from both lightweight tiers (pyrx and pyvx), reusing the same archive assembler as the PMTiles writer. The benchmark folds 1,000 tiles into a single archive on each tier and verifies that every group emits a non-empty archive before timing. Run it with gbx:bench:cluster --pmtiles-agg-only --row-counts 1000.

Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic tiles → one .pmtiles archive per tier · groupBy().agg() whole-job wall-clock · 5 measured iterations, median reported.

| Aggregator | Tier | Whole-job median (1,000 tiles) | Per-tile | Speedup |
| --- | --- | --- | --- | --- |
| pmtiles_agg | lightweight | 0.74 s | 0.74 ms | ~1.07× (≈ parity) |
| pmtiles_agg | heavyweight | 0.70 s | 0.70 ms | |

The lightweight tier runs at the heavyweight median ×1.07 (0.742 s vs 0.695 s). The distributions overlap — light min 0.654 s sits below the heavyweight p90 of 0.758 s — so the gap is inside the run-to-run noise band for a grouped aggregate of this size; the two tiers are effectively at parity. Both produce spec-valid PMTiles v3 archives that decode to the same tile set. The archives are not byte-identical: the lightweight writer GZIP-compresses the internal directories while the heavyweight writer leaves them uncompressed (none) — both are spec-valid and decode identically. Decoded-tile parity (light == heavy) is checked separately by the JAR-gated test/ds/test_pmtiles_agg_parity.py. The lightweight aggregator is a grouped-aggregate Arrow UDF (pandas_udf) folding each group's tiles into an in-memory archive — the same execution shape as st_asmvt.

Vector readers

The vector readers are benchmarked on a scaled corpus: 5 × 1,000,000-feature files (≈5M polygons) per format, read from a UC Volume into a Delta table — the realistic "ingest vector files into Delta" pipeline. Each format runs light (*_gbx) vs heavy (*_ogr) as its own isolated job, one measured iteration (I/O is not JIT-sensitive). Cluster: DBR 17.3 LTS. Run with gbx:bench:cluster --vector-only --vector-scale.

| Format | rows | light (*_gbx) | heavy (*_ogr) | light edge |
| --- | --- | --- | --- | --- |
| GeoJSON: FeatureCollection (multi=false) | 5M | 29.6 s | 130.6 s | 4.4× |
| GeoJSON: GeoJSONSeq (multi=true) | 5M | 31.9 s | 113.9 s | 3.6× |
| Shapefile | 5M | 27.8 s | 26.3 s | ~par |
| GeoPackage | 5M | 23.5 s | 62.1 s | 2.6× |
| FileGDB | 1M | 13.8 s | light-only on Volumes¹ | |

The lightweight readers are Arrow-native (columnar pyogrio batches — no per-row Python) and stage GeoPackage/FileGDB to worker-local temp for random-access reads; they win or tie every format. GeoJSON is read one partition per file in both tiers (the GeoJSON driver re-parses the whole FeatureCollection on each open, so feature-offset chunking is counterproductive); heavy GeoJSON's per-feature OGR→row construction is what puts it behind.

¹ FileGDB on a Volume: the heavy OGR FileGDB reader opens a native .gdb/.gdb.zip from local/cluster storage but does not read one directly from a UC Volume (FileGDB's seek-heavy, multi-file I/O is poorly suited to object storage). The lightweight reader first stages the archive from the Volume to worker-local temp storage and reads it there. FileGDB reads are therefore light-only at Volume/directory scale (a single 1M archive is shown).

Parity: for each format both tiers ingest the same files and the row counts must match (checked inline; the run flags any divergence).

Vector writers

The single-file vector writers are lightweight-only: their OGR write paths aren't implemented in the heavyweight tier. The sharded GeoJSONL writer is available in both tiers (as the table shows). Each single-file writer writes a 14,000,000-feature Delta table to one file (the two-phase writer merges the partition fragments on the driver); the sharded writer writes the same table as one shard per partition. Run with gbx:bench:cluster --vector-only --vector-scale --vector-legs writer --writer-rows 14000000.

| Format | rows | light (*_gbx) | throughput |
| --- | --- | --- | --- |
| GeoJSON (single file) | 14M | 279.1 s | ~50k rows/s |
| Shapefile (single file) | 14M | 152.6 s | ~92k rows/s |
| GeoPackage (single file) | 14M | 155.9 s | ~90k rows/s |
| FileGDB (single file) | 14M | 368.3 s | ~38k rows/s |
| GeoJSONL (sharded): light geojsonl_gbx / heavy geojsonl | 14M | 16.4 s / 16.1 s (80 shards) | ~855k / ~870k rows/s |

The single-file writers are driver-bound: the partition fragments are merged into one file on the driver. FileGDB is the slowest (native osgeo, per-feature writes), followed by GeoJSON (text encoding). The sharded GeoJSONL writer skips the merge entirely, writing one newline-delimited shard per partition in parallel, so it is ~17× faster than the single-file GeoJSON writer at 14M (~16 s vs 279 s) and scales with partitions. It is the only vector writer available in both tiers (the others are lightweight-only), and the two tiers run at parity (16.4 s light vs 16.1 s heavy). It is the recommended writer for large or any-scale output; maxRecordsPerFile caps features per shard.
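The throughput column and the sharded-versus-single-file ratio are plain arithmetic on the reported wall-clock times. A short sketch, using the GeoJSON timings from the writer table (the helper name is hypothetical):

```python
ROWS = 14_000_000  # features written in each run


def rows_per_second(rows: int, seconds: float) -> float:
    """Whole-job throughput: rows written divided by wall-clock seconds."""
    return rows / seconds


if __name__ == "__main__":
    single_file_s = 279.1  # GeoJSON (single file), lightweight
    sharded_s = 16.4       # GeoJSONL (sharded), lightweight

    print(f"single-file: {rows_per_second(ROWS, single_file_s) / 1e3:.0f}k rows/s")
    print(f"sharded:     {rows_per_second(ROWS, sharded_s) / 1e3:.0f}k rows/s")
    print(f"sharded is {single_file_s / sharded_s:.1f}x faster")
```

These are whole-job figures, so they include the driver-side merge for the single-file writer; that merge is the cost the sharded writer avoids.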

Format capacity: the limit is the format's, not GeoBrix's. At 14M box polygons, Shapefile sits near its 2 GB .shp ceiling; the other formats carry that volume easily and are specified for far more:

| Format | Ceiling | Cause |
| --- | --- | --- |
| Shapefile | 2 GB per .shp/.dbf (~15.8M box polygons) | 32-bit record offsets |
| GeoPackage | ~17.6 TB (billions of rows) | SQLite page_size × max_page_count |
| FileGDB | 2.1 B rows / 1 TB+ per feature class | OBJECTID is signed int32 |
| GeoJSON | none (RFC 7946) | text; bounded only by disk / parse memory |
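The ~15.8M figure for Shapefile can be reproduced from the .shp record layout. The sketch below rests on assumptions: that "2 GB" means 2^31 bytes, and that each box polygon is a 2D polygon with a single closed ring of five vertices. The byte sizes follow the ESRI Shapefile polygon record layout; the helper names are hypothetical.

```python
FILE_HEADER_BYTES = 100   # fixed .shp main file header
RECORD_HEADER_BYTES = 8   # record number + content length


def polygon_record_bytes(num_parts: int, num_points: int) -> int:
    """Size in bytes of one 2D polygon record in a .shp file, header included."""
    content = (
        4                 # shape type
        + 32              # bounding box: four doubles
        + 4               # NumParts
        + 4               # NumPoints
        + 4 * num_parts   # part start indices
        + 16 * num_points # X/Y doubles per point
    )
    return RECORD_HEADER_BYTES + content


def max_records(limit_bytes: int, record_bytes: int) -> int:
    """How many fixed-size records fit under the file-size ceiling."""
    return (limit_bytes - FILE_HEADER_BYTES) // record_bytes


if __name__ == "__main__":
    box = polygon_record_bytes(num_parts=1, num_points=5)  # closed ring: 5 vertices
    print(box, max_records(2**31, box))
    # prints: 136 15790320
```

At 136 bytes per box, 14M features occupy roughly 1.9 GB of the .shp file, which is why that run sits close to the ceiling.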

Caveats

  • Absolute timings depend on the machine, the corpus, and the tier's install; treat the relative speedup and the consistency outcome as the durable signal, not the millisecond values.
  • The two tables answer different questions. Pure-core isolates the algorithm and is where the lightweight tier's biggest wins show; spark-path at scale adds the per-row UDF and serialization overhead and is the better predictor of end-to-end job cost — and the better place to decide which tier to run a given operation on.
  • Consistency is checked on output statistics with tolerance, not byte-equality: GDAL and NumPy can differ in the last bits, and neighborhood / NoData-boundary operations legitimately differ on the one-pixel kernel border or in how masked pixels propagate. The benchmark surfaces those as divergent so they're visible, not hidden.