Benchmarking
GeoBrix ships a benchmark suite that compares the two execution tiers — Heavyweight (rasterx) and Lightweight (pyrx) — on the same inputs, reporting both performance (per-function timing) and output consistency (do the two tiers produce the same result?).
This page explains what the benchmark measures, how to run it on a Databricks cluster (the primary, same-hardware comparison), how to run it locally, and how to read the output; it then gives full tables of representative results — both the pure-core (algorithm-in-isolation) view and the spark-path at scale view from a cluster.
GeoBrix is supported on both DBR 17.3 LTS and DBR 18 LTS. The per-result Environment stamps below record the specific cluster each benchmark actually ran on (DBR 17.3 LTS); they are factual records of the run, not a statement that DBR 18 is unsupported.
What it measures
Each function is timed under two independent models:
- Pure-core — the raster operation in isolation: open a single tile, call the function, measure. It is always one tile per measurement (repeated once for each tile shape in the corpus — each tile-size / band-count / dtype / SRID combination), and it ignores --row-counts entirely. This is the fairest apples-to-apples view of the algorithm itself, with no Spark or serialization in the path.
- Spark-path — the registered function (rst_*) applied to a Spark DataFrame of N rows (the only model that uses --row-counts). This includes the realistic per-row overhead (UDF dispatch, serialization, Python-worker round-trips for the lightweight tier), and is swept across a row ladder (e.g. 10 → 100 → 1,000 → 10,000 rows).
Both models run the function --warmup times untimed (to absorb cold caches, JIT, and worker spin-up) and then --measured times timed; the reported figure is the median over the measured passes. Locally the defaults are 2 warmup / 5 measured; on a cluster they are 1 / 3 for pure-core and 1 / 1 for spark-path (one full N-tile iteration is already substantial).
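The warmup-then-median scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `time_median` and the use of `time.perf_counter` are hypothetical and are not taken from the benchmark's own code.

```python
import statistics
import time


def time_median(fn, warmup=2, measured=5):
    """Run fn `warmup` times untimed, then `measured` times timed.

    Returns the median of the timed passes, in seconds.
    (Hypothetical helper; the defaults mirror the local 2 / 5 setting.)
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs cold caches, JIT, worker spin-up
    samples = []
    for _ in range(measured):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


calls = []
median_s = time_median(lambda: calls.append(1), warmup=2, measured=5)
print(f"median over 5 timed passes: {median_s:.6f} s ({len(calls)} calls in total)")
```

Reporting the median rather than the mean keeps a single slow pass (a GC pause, a noisy neighbour) from shifting the figure.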
Alongside timing, every pure-core result carries an output fingerprint (per-band statistics) so the two tiers can be checked for consistency:
- exact — every statistic is bitwise-equal across tiers.
- within_tol — every statistic agrees to a relative tolerance of 1e-3 or an absolute tolerance of 1e-3 (the absolute floor handles near-zero values where a relative comparison is meaningless).
- divergent — neither tolerance is met.
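A minimal sketch of the three-way label, assuming the fingerprint is a flat list of per-band statistics. The function name `classify` is hypothetical, and `math.isclose` stands in for whatever comparison the benchmark actually uses (it measures the relative tolerance against the larger of the two magnitudes, which may differ from the real implementation).

```python
import math
import struct


def _bits(x):
    # IEEE-754 double bytes, so the "exact" check is a true bitwise comparison
    return struct.pack("<d", float(x))


def classify(heavy_stats, light_stats, rel_tol=1e-3, abs_tol=1e-3):
    """Label two statistic lists as exact, within_tol, or divergent."""
    if len(heavy_stats) != len(light_stats):
        return "divergent"
    pairs = list(zip(heavy_stats, light_stats))
    if all(_bits(h) == _bits(l) for h, l in pairs):
        return "exact"
    # Passes when EITHER the relative or the absolute tolerance is met
    if all(math.isclose(h, l, rel_tol=rel_tol, abs_tol=abs_tol) for h, l in pairs):
        return "within_tol"
    return "divergent"


print(classify([1.0, 2.0], [1.0, 2.0]))  # exact
print(classify([1000.0], [1000.5]))      # within_tol (relative tolerance)
print(classify([0.0], [0.0005]))         # within_tol (absolute floor near zero)
print(classify([1.0], [1.1]))            # divergent
```

The third call shows why the absolute floor matters: against a reference of exactly zero, no relative tolerance could ever be satisfied.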
The goal of the one-line tier swap is that results stay consistent; the benchmark is how that guarantee is verified.
The benchmark is run from the GeoBrix source repository (it uses the repo's gbx:bench:* commands and a job notebook). A wheel-only install does not include these tools — the cluster benchmark submits a job to your own provisioned cluster from a checkout of the repo.
Running on a cluster
Running on a provisioned cluster is the true comparison: both tiers execute on the same hardware, against the same corpus, and the full row ladder and larger tiles are within reach. Results append to a bench_results Delta table and a comparison.csv / summary.md land on the configured Volume.
Prerequisites
Provision a cluster and stage the artifacts per the installation guide:
| Tier | Cluster | Artifacts |
|---|---|---|
| Heavyweight (rasterx) | x86 · DBR 17.3 LTS | Init script + bundle + GeoBrix wheel + the bench tests JAR (geobrix-*-tests.jar) |
| Lightweight (pyrx) | x86 or ARM | The [light] wheel only |
Then fill in the cluster configuration file (notebooks/tests/databricks_cluster_config.env) with your cluster ID and Volume paths.
Run
# Both tiers, same cluster, full row ladder, all functions
bash scripts/commands/gbx-bench-cluster.sh \
--cluster-id <your-cluster-id> \
--run-id cluster-2026-06 \
--modes both \
--row-counts 10,100,1000,10000
Scale the run with the options below. --cluster-id and --run-id identify the run; the rest are optional. --row-counts and --functions take a comma-separated list; --modes and --set take a single value.
| Option | Purpose |
|---|---|
| --modes pure-core \| spark-path \| both | Timing model — pick only one mode (default both, which runs both models). |
| --row-counts 10,100,1000,10000 | Spark-path row ladder — each value is the number of distinct tiles processed in one timed iteration (the main scale dimension); one iteration is measured per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default 10,100,1000,10000. |
| --set core \| full | Which function set to benchmark — the representative core set or every benchmarked function (full). Default core. |
| --functions rst_slope,rst_ndvi | Comma-separated list restricting the run to specific functions; overrides --set. Default: unset (benchmarks the --set). |
| --warmup / --measured | Warmup and measured iteration counts. Defaults: pure-core 1 / 3, spark-path 1 / 1. |
| --lightweight-only / --heavyweight-only | Run a single tier (mutually exclusive). --lightweight-only is required on ARM clusters (heavyweight is x86-only); --heavyweight-only skips the lightweight leg. Default: both tiers run. |
| --no-wait | Submit the job without blocking on completion. Default: waits for the run to finish. |
# ARM cluster: lightweight only
bash scripts/commands/gbx-bench-cluster.sh --cluster-id <arm-cluster-id> --lightweight-only
Tile sizes, band counts, data types, and projections are set when the corpus is generated (see the local section's scale knobs); the same corpus is reused for both tiers so the comparison is fair.
A spark-path iteration processes max(--row-counts) distinct tiles drawn from the corpus row pool — it does not recycle a small pool to reach the row count. The largest --row-counts value must therefore be ≤ the corpus row-pool size; if it isn't, the bench refuses to run rather than silently under-fill (which would report a row count it never actually processed). Generate a larger pool or lower the row ladder.
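The refuse-to-under-fill rule amounts to a simple guard on the parsed ladder. The sketch below is an assumption about its shape, not the benchmark's code; `check_row_ladder` and its error message are hypothetical.

```python
def check_row_ladder(row_counts, pool_size):
    """Parse a comma-separated ladder and refuse it if the pool is too small.

    Hypothetical helper: returns the rungs sorted ascending, or raises
    ValueError when the largest rung exceeds the corpus row-pool size.
    """
    ladder = sorted(int(n) for n in row_counts.split(","))
    if ladder[-1] > pool_size:
        raise ValueError(
            f"largest row count {ladder[-1]} exceeds corpus row-pool size "
            f"{pool_size}; generate a larger pool or lower the row ladder"
        )
    return ladder


print(check_row_ladder("10,100,1000", pool_size=1000))

try:
    check_row_ladder("10,100,1000,10000", pool_size=1000)
except ValueError as err:
    print(f"refused: {err}")
```

Failing up front is the safer design here: recycling a small pool would still produce a timing, but one labelled with a row count that was never processed.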
Output
- bench_results Delta table — every measured row (tagged with environment + run ID), so runs accumulate and can be queried/visualized over time.
- summary.md / comparison.csv on the Volume — the human-readable speedup + consistency report (see Reading the output), also rendered inline in the run notebook.
Running locally
The local pipeline runs the heavyweight tier in the geobrix-dev Docker container and the lightweight tier in an isolated Python virtual environment, then compares them. Local runs are intentionally single-tile for pure-core (the algorithm cost in isolation); the full row ladder and at-scale spark-path numbers belong on a cluster, but can still be exercised in the local Docker environment at a modest scale.
# Full local pipeline: generate corpus -> heavyweight -> lightweight -> compare
bash scripts/commands/gbx-bench-all.sh --run-id local-1 --modes pure-core
The heavyweight and lightweight legs run one after the other (never concurrently) so they don't contend for CPU and skew each other's timings. Outputs land in test-logs/bench/<run-id>/.
Set (core vs full)
--set selects how many functions the run benchmarks:
- --set core (the default) runs a small, representative set covering each function family — accessors, terrain, band math, warps. It's fast and is the right choice for a routine check.
- --set full runs every benchmarked function. It takes longer but gives the complete coverage and parity picture.
The core here is the function set (--set) — how many functions run. Don't confuse it with pure-core, the timing model (--modes) that times one tile in isolation. They are independent: you can run --set core --modes pure-core, --set full --modes spark-path, or any other combination.
An explicit --functions list overrides --set.
# Routine check (default): the representative core set
bash scripts/commands/gbx-bench-all.sh --modes pure-core
# Complete coverage: every benchmarked function
bash scripts/commands/gbx-bench-all.sh --set full --modes pure-core
Scale and shape the corpus with the options below. Each takes a comma-separated list to sweep several values (e.g. --tile-px 256,512,1024) or a single value (e.g. --tile-px 1024); the corpus is the combination across the options you set.
| Option | Purpose |
|---|---|
| --tile-px 256,512,1024,2048 | Tile sizes (pixels per side) — any size; larger tiles (1024², 2048², …) make the per-tile algorithm cost dominate the fixed overhead. Default 256,512. |
| --bands 1,4 | Band counts — any positive integer count. Default 2. |
| --dtypes uint8,int16,float32 | Pixel data types — these three are the full supported set (closed; other dtypes are not generated). Default float32. |
| --srids 4326,3857,32618,27700 | Projections — the full supported set (closed): 4326 (WGS84 geographic), 3857 (WebMercator), 32618 (UTM 18N), 27700 (British National Grid); other SRIDs are not generated. Default 4326,32618. Pick at least one geographic (lat/long) and one projected (metre) CRS to exercise both. |
| --nodata-frac 0.0,0.25 | Fraction of pixels set to NoData — any value in 0.0–1.0 (0% to 100%; continuous, not a fixed set). Default 0.0. Pass a comma-separated list to sweep several fractions. |
| --row-counts 10,100,1000,10000 | Spark-path row ladder — each value is the number of distinct tiles processed in one timed iteration (the scale dimension); the bench measures one iteration per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default 2,4 (laptop-modest); run the full ladder on a cluster. |
| --modes pure-core \| spark-path \| both | Timing model — pick only one mode. Default both (runs both models). |
| --warmup / --measured | Untimed warmup passes / timed passes per measurement (median of the timed passes is reported). Defaults 2 / 5 locally. |
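To see how the shape options multiply, the sketch below enumerates the tile shapes for the default values listed in the table. It assumes the corpus is the full Cartesian product of the lists, which is one reading of "the combination across the options you set"; the variable names are illustrative only.

```python
from itertools import product

# Default values from the options table above
tile_px = [256, 512]
bands = [2]
dtypes = ["float32"]
srids = [4326, 32618]
nodata_frac = [0.0]

# Assumption: one tile shape per element of the Cartesian product
shapes = list(product(tile_px, bands, dtypes, srids, nodata_frac))

for px, n_bands, dtype, srid, nodata in shapes:
    print(f"{px}x{px} px, {n_bands} band(s), {dtype}, EPSG:{srid}, nodata={nodata}")
print(f"{len(shapes)} tile shapes")
```

Under that assumption the defaults give four shapes, and each extra value in any list multiplies the count, so wide sweeps grow quickly.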
The individual stages are also available as standalone commands — gbx:bench:gen-data, gbx:bench:heavyweight, gbx:bench:lightweight, and gbx:bench:compare — if you want to regenerate just one part of the pipeline.
Reading the output
The comparison summary.md opens with insights (biggest wins, the consistency tally) followed by a per-function table. For a spark-path run it looks like:
Consistency (7 compared cells): exact 0 - within-tol 7 - divergent 0
Tile scale: 1000 tiles/iteration (spark-path) — every tile processed each timed iteration.
| fn | hw_iter_s | lw_iter_s | hw_per_tile_s | lw_per_tile_s | speedup | consistency |
| rst_dtmfromgeoms_agg | 175.10 | 9.94 | 0.17510 | 0.00994 | 17.61 | within_tol |
| rst_slope | 7.31 | 7.57 | 0.00731 | 0.00757 | 0.97 | timing-only |
| rst_proximity | 0.52 | 8.12 | 0.00052 | 0.00812 | 0.06 | timing-only |
- hw_iter_s / lw_iter_s are the median wall-clock of one full iteration over all N tiles (the whole distributed job).
- hw_per_tile_s / lw_per_tile_s are that ÷ N — the amortized per-tile cost. (A pure-core summary uses hw_ms / lw_ms and hw_mpix/s / lw_mpix/s for the single-tile algorithm cost instead.)
- speedup is heavy / light — greater than 1 means the lightweight tier is faster.
- consistency is the exact / within_tol / divergent label defined above (timing-only where the output can't be fingerprinted for a direct comparison — readers, metadata accessors, and most non-aggregator spark-path cells). Per-cell deltas are in comparison.csv (the max_rel_delta column).
A per-engine summary is also written for each tier (heavyweight.summary.md, lightweight.summary.md) with that tier's own timing and throughput in isolation.
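The derived columns follow directly from the two iteration times. The sketch below recomputes them for the three sample rows shown above; `summarize` is a hypothetical helper, and because the inputs are the rounded values printed in the summary, a recomputed figure can differ from the reported one in the last digit.

```python
def summarize(hw_iter_s, lw_iter_s, n_tiles):
    """Derive the per-tile costs and the speedup from whole-iteration times."""
    return {
        "hw_per_tile_s": hw_iter_s / n_tiles,
        "lw_per_tile_s": lw_iter_s / n_tiles,
        "speedup": hw_iter_s / lw_iter_s,  # heavy / light; > 1 means light is faster
    }


# (hw_iter_s, lw_iter_s) copied from the sample spark-path summary, N = 1000 tiles
rows = {
    "rst_dtmfromgeoms_agg": (175.10, 9.94),
    "rst_slope": (7.31, 7.57),
    "rst_proximity": (0.52, 8.12),
}

for fn, (hw, lw) in rows.items():
    s = summarize(hw, lw, n_tiles=1000)
    faster = "lightweight" if s["speedup"] > 1 else "heavyweight"
    print(f"{fn}: {s['hw_per_tile_s']:.5f} s vs {s['lw_per_tile_s']:.5f} s per tile, "
          f"speedup {s['speedup']:.2f} ({faster} faster)")
```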
Results
Representative results across the benchmark families, organized by function group.
- Readers & Writers
- Raster
- Vector
- Grid
Raster readers & writers
The raster reader/writer is benchmarked the same way on both tiers: read 1,000 ×
1024² GeoTIFF tiles from a UC Volume (count), and write the same 1,000 tiles back
out as GeoTIFF. Lightweight uses raster_gbx / gtiff_gbx (rasterio); heavyweight
uses gdal / gtiff_gdal (native GDAL in the JVM). Run with
gbx:bench:cluster --readers-only --spark-warmup 0 --spark-measured 1.
Environment: Databricks · DBR 17.3 LTS · 1,000 × 1024² GeoTIFF tiles · whole-job wall-clock median, one measured iteration (I/O, not JIT-sensitive — no warm-up).
| Operation | rows | light | heavy | ratio |
|---|---|---|---|---|
| Read GeoTIFF (raster_gbx / gdal) | 1,000 | 22.7 s | 9.5 s | 2.4× (heavy faster) |
| Write GeoTIFF (gtiff_gbx / gtiff_gdal) | 1,000 | 4.7 s | 4.0 s | 1.18× (≈ parity) |
Both tiers emit the same tile schema, so they are drop-in swaps. The writer runs at parity (~1.2×): the GeoTIFF encode dominates and costs about the same whether it runs in the JVM or via rasterio. The reader trails ~2.4× — the cost is moving each decoded tile's bytes across the Spark Python DataSource boundary (JVM↔Python ser/de), not the decode itself. That boundary is the structural tax of the pure-Python tier, and it is the trade for running where the heavyweight tier can't (Serverless, ARM, no JAR / init script). At scale the read still parallelizes across the cluster (~23 ms/tile here, spread over the workers).
Tiled output (PMTiles writer)
The PMTiles writer is benchmarked separately from the rst_* functions: it is a
write that packages a tile pyramid into a .pmtiles archive, not a per-tile
transform. The lightweight pmtiles_gbx writer and the heavyweight pmtiles
writer take the same (z, x, y, bytes) input; the benchmark writes 1,000 tiles
to a single archive on each tier and verifies decoded-tile parity between the
two outputs. Run it with gbx:bench:cluster --pmtiles-only --row-counts 1000.
Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic PNG tiles → one .pmtiles archive per tier · whole-job wall-clock median.
| Writer | Tier | Whole-job median (1,000 tiles) | Speedup |
|---|---|---|---|
| pmtiles_gbx | lightweight | 18.4 s | — |
| pmtiles | heavyweight | 20.7 s | 1.13× (lightweight faster) |
Parity: the two archives decode to the same 1,000 (z, x, y) tiles with
byte-identical tile data. This is a hard gate — the benchmark fails if the tiers
diverge — so the timing above is a like-for-like comparison of identical output.
Both writers are two-phase (executors write intermediates, the driver assembles the archive), so the benchmark stages intermediates on a shared filesystem rather than node-local disk; the lightweight writer streams its per-partition scratch and assembles on the driver, the same model the sharded/mosaic output uses at scale.
Tiled output aggregate (pmtiles_agg)
pmtiles_agg is the grouped-aggregate companion to the PMTiles writer: it
folds a group of (tile, z, x, y) rows into one PMTiles archive per group key
inside a groupBy(...).agg(pmtiles_agg(...)) job. It is format-agnostic — the same
function archives raster or vector tiles — and is registered from both lightweight
tiers (pyrx and pyvx), reusing the same archive assembler as the PMTiles writer.
The benchmark folds 1,000 tiles into a single archive on each tier and verifies
that every group emits a non-empty archive before timing. Run it with
gbx:bench:cluster --pmtiles-agg-only --row-counts 1000.
Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic tiles → one .pmtiles archive per tier · groupBy().agg() whole-job wall-clock · 5 measured iterations, median reported.
| Aggregator | Tier | Whole-job median (1,000 tiles) | Per-tile | Speedup |
|---|---|---|---|---|
| pmtiles_agg | lightweight | 0.74 s | 0.74 ms | ~1.07× (≈ parity) |
| pmtiles_agg | heavyweight | 0.70 s | 0.70 ms | — |
The lightweight median is 1.07× the heavyweight median (0.742 s vs 0.695 s).
The distributions overlap — light min 0.654 s sits below the heavyweight p90 of
0.758 s — so the gap is inside the run-to-run noise band for a grouped aggregate of
this size; the two tiers are effectively at parity. Both produce spec-valid PMTiles
v3 archives that decode to the same tile set. The archives are not
byte-identical: the lightweight writer GZIP-compresses the internal directories while
the heavyweight writer leaves them uncompressed (none) — both are spec-valid and
decode identically. Decoded-tile parity (light == heavy) is checked separately by the
JAR-gated test/ds/test_pmtiles_agg_parity.py. The lightweight aggregator is a
grouped-aggregate Arrow UDF (pandas_udf) folding each group's tiles into an
in-memory archive — the same execution shape as st_asmvt.
Vector readers
The vector readers are benchmarked on a scaled corpus: 5 × 1,000,000-feature files
(≈5M polygons) per format, read from a UC Volume into a Delta table — the realistic
"ingest vector files into Delta" pipeline. Each format runs light (*_gbx) vs heavy
(*_ogr) as its own isolated job, one measured iteration (I/O is not JIT-sensitive).
Cluster: DBR 17.3 LTS. Run with gbx:bench:cluster --vector-only --vector-scale.
| Format | rows | light (*_gbx) | heavy (*_ogr) | light edge |
|---|---|---|---|---|
| GeoJSON — FeatureCollection (multi=false) | 5M | 29.6 s | 130.6 s | 4.4× |
| GeoJSON — GeoJSONSeq (multi=true) | 5M | 31.9 s | 113.9 s | 3.6× |
| Shapefile | 5M | 27.8 s | 26.3 s | ~par |
| GeoPackage | 5M | 23.5 s | 62.1 s | 2.6× |
| FileGDB | 1M | 13.8 s | — | light-only on Volumes¹ |
The lightweight readers are Arrow-native (columnar pyogrio batches — no per-row Python) and
stage GeoPackage/FileGDB to worker-local temp for random-access reads; they win or tie
every format. GeoJSON is read one partition per file in both tiers (the GeoJSON driver
re-parses the whole FeatureCollection on each open, so feature-offset chunking is
counterproductive); heavy GeoJSON's per-feature OGR→row construction is what puts it behind.
¹ FileGDB on a Volume: the heavy OGR FileGDB reader opens a native .gdb/.gdb.zip from
local/cluster storage but does not read one directly from a UC Volume (FileGDB's seeked
multi-file I/O does not serve well over object storage); the lightweight reader stages the
archive from the Volume to worker-local temp and reads it there. So FileGDB reads are
light-only at Volume/directory scale (a single 1M archive is shown).
Parity: for each format both tiers ingest the same files and the row counts must match (checked inline; the run flags any divergence).
Vector writers
The single-file vector writers are lightweight-only — their OGR write paths aren't
implemented in the heavyweight tier. The sharded GeoJSONL writer is available in both
tiers (as the table shows). Each writes a 14,000,000-feature Delta table to a single
file (the two-phase writer merges the partition fragments on the driver). Run with
gbx:bench:cluster --vector-only --vector-scale --vector-legs writer --writer-rows 14000000.
| Format | rows | light (*_gbx) | throughput |
|---|---|---|---|
| GeoJSON (single file) | 14M | 279.1 s | ~50k rows/s |
| Shapefile (single file) | 14M | 152.6 s | ~92k rows/s |
| GeoPackage (single file) | 14M | 155.9 s | ~90k rows/s |
| FileGDB (single file) | 14M | 368.3 s | ~38k rows/s |
| GeoJSONL (sharded) — light geojsonl_gbx / heavy geojsonl | 14M | 16.4 s / 16.1 s (80 shards) | ~855k / ~870k rows/s |
The single-file writers are driver-bound — the partition fragments are merged into one
file on the driver (GeoJSON is slowest: text encoding; FileGDB next: native osgeo,
per-feature). The sharded GeoJSONL writer skips the merge entirely — one newline-delimited
shard per partition, written in parallel — so it is ~17× faster than the single-file
GeoJSON writer at 14M (~16 s vs 279 s) and scales with partitions. It is the only vector
writer available in both tiers (the others are lightweight-only), and the two tiers run at
parity (16.4 s light vs 16.1 s heavy). It's the recommended writer for large / any-scale
output; maxRecordsPerFile caps features per shard.
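The throughput column and the ~17× figure are plain arithmetic on the wall-clock times in the table. A short sketch that reproduces them (the dictionary keys are just labels chosen for this example):

```python
ROWS = 14_000_000

# Whole-job wall-clock seconds copied from the writer table above
writer_seconds = {
    "GeoJSON (single file)": 279.1,
    "Shapefile (single file)": 152.6,
    "GeoPackage (single file)": 155.9,
    "FileGDB (single file)": 368.3,
    "GeoJSONL sharded (light)": 16.4,
    "GeoJSONL sharded (heavy)": 16.1,
}

throughput = {name: ROWS / secs for name, secs in writer_seconds.items()}
for name, rows_per_s in throughput.items():
    print(f"{name}: ~{rows_per_s / 1000:.0f}k rows/s")

# How much faster the sharded writer is than the single-file GeoJSON writer
ratio = writer_seconds["GeoJSON (single file)"] / writer_seconds["GeoJSONL sharded (light)"]
print(f"sharded GeoJSONL vs single-file GeoJSON: ~{ratio:.0f}x faster")
```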
Format capacity — the limit is the format's, not GeoBrix's. 14M box polygons puts
Shapefile near its 2 GB .shp ceiling; the other formats carry it easily and are specified
far higher:
| Format | Ceiling | Cause |
|---|---|---|
| Shapefile | 2 GB per .shp/.dbf (~15.8M box polygons) | 32-bit record offsets |
| GeoPackage | ~17.6 TB (billions of rows) | SQLite page_size × max_page_count |
| FileGDB | 2.1 B rows / 1 TB+ per feature class | OBJECTID is signed int32 |
| GeoJSON | none (RFC 7946) | text; bounded only by disk / parse memory |
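As a rough cross-check of the Shapefile row, the ~15.8M figure is consistent with the on-disk size of a 2D polygon record in the published ESRI Shapefile layout. The sketch assumes each box is a single closed ring of five vertices and a 2 GiB ceiling on the .shp file; how the figure in the table was actually derived is not stated in this page.

```python
FILE_HEADER_BYTES = 100   # fixed .shp file header
RECORD_HEADER_BYTES = 8   # record number + content length (4 bytes each)


def polygon_record_bytes(num_parts, num_points):
    """Size of one 2D Polygon record in a .shp file, header included."""
    content = (
        4                   # shape type
        + 32                # bounding box: 4 doubles
        + 4                 # NumParts
        + 4                 # NumPoints
        + 4 * num_parts     # part start indices
        + 16 * num_points   # X/Y doubles per vertex
    )
    return RECORD_HEADER_BYTES + content


# Assumption: a box is one ring of 5 vertices (the first vertex repeated to close it)
box_bytes = polygon_record_bytes(num_parts=1, num_points=5)
max_records = (2**31 - FILE_HEADER_BYTES) // box_bytes

print(f"{box_bytes} bytes per box polygon")
print(f"~{max_records / 1e6:.1f}M box polygons fit under 2 GiB")
```

More complex geometries carry more vertices per record, so they reach the ceiling at a lower feature count.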
Not every function appears in both timing models — the gaps are deliberate measurement choices, not functional gaps (the lightweight tier implements all 107):
- Pure-core: 100 / 107. The 7 absent are the grouped aggregators (rst_*_agg) — a UDAF has no single-tile, single-row form to time in isolation, so they appear only in spark-path.
- Spark-path: 83 / 107. The 24 absent are pure-core-only because their call shape doesn't fit the spark-path tile-DataFrame model: geometry-input constructors (rst_rasterize, rst_gridfrompoints, rst_dtmfromgeoms — the tile DataFrame carries no geometry column), the path reader (rst_fromfile), functions needing an in-extent coordinate/geometry literal valid across the multi-CRS corpus (rst_clip, rst_sample, rst_viewshed, rst_worldtorastercoord*, rst_resample_to_res), render-engine-divergent tiles (rst_tilexyz, rst_xyzpyramid, rst_color_relief), and metadata/scalar accessors (rst_metadata, rst_type, rst_summary, …) whose spark-path cell would be an uncomparable timing-only result.
Pure-core (local, 1024²)
The pure-core table is the algorithm in isolation: open one tile, call the function, measure — no Spark, no serialization. It is the fairest view of the raw implementation. It is one snapshot — absolute timings are environment-dependent — but the relative picture and the consistency outcome are stable. Per-tile timing is shown in milliseconds; cells whose output cannot be fingerprinted for a direct comparison (readers, metadata accessors) are timed and labelled timing-only.
Environment: amd64 · Linux · GeoBrix 0.4.0 · GDAL 3.11.4 · 2-band float32 tiles at 1024² (EPSG:32618) with 2% NoData. All rows are pure-core (single tile).

Consistency: of 100 functions — 26 exact, 47 within-tolerance, 23 timing-only (no fingerprint), and 4 divergent. The four divergences are all NoData / edge handling, where the bulk of the raster matches but the boundary or masked pixels differ:
- rst_convolve — a one-pixel edge ring (GDAL applies a block-halo convolution that no single lightweight boundary mode reproduces exactly); interior pixels match.
- rst_derivedband and rst_resample — differ in how NoData is masked/propagated through the operation; valid-pixel values agree within tolerance.
- rst_contour — a vector output whose feature count differs (a segmentation artifact at NoData boundaries), not a pixel value.

These boundary behaviors are tracked for alignment to the heavyweight (GDAL) semantics; they do not affect interior/valid-pixel results.

The table is ordered by Speedup = Heavy / Light (highest first). > 1× means the lightweight tier is faster; < 1× means the heavyweight tier is faster.
| Function | Heavy (ms) | Light (ms) | Speedup | Consistency |
|---|---|---|---|---|
| rst_fromcontent | 114 | 0.0002 | 684,479× | within_tol |
| rst_threshold | 718 | 5.63 | 127× | timing-only |
| rst_index | 762 | 7.00 | 109× | within_tol |
| rst_ndwi | 747 | 7.33 | 102× | within_tol |
| rst_savi | 769 | 7.55 | 102× | within_tol |
| rst_nbr | 743 | 7.43 | 100× | within_tol |
| rst_ndvi | 768 | 7.77 | 99× | within_tol |
| rst_dtmfromgeoms | 4915 | 58.4 | 84× | within_tol |
| rst_evi | 771 | 9.33 | 83× | within_tol |
| rst_mapalgebra | 735 | 11.2 | 66× | within_tol |
| rst_separatebands | 220 | 4.53 | 49× | within_tol |
| rst_convolve | 788 | 21.4 | 37× | divergent |
| rst_initnodata | 159 | 4.32 | 37× | within_tol |
| rst_fillnodata | 417 | 12.9 | 32× | within_tol |
| rst_setsrid | 133 | 4.48 | 30× | within_tol |
| rst_band | 76.9 | 2.76 | 28× | within_tol |
| rst_maketiles | 227 | 9.78 | 23× | within_tol |
| rst_updatetype | 162 | 7.54 | 22× | within_tol |
| rst_asformat | 94.5 | 4.47 | 21× | within_tol |
| rst_frombands | 184 | 8.81 | 21× | within_tol |
| rst_derivedband | 108 | 6.15 | 18× | divergent |
| rst_retile | 280 | 17.9 | 16× | within_tol |
| rst_rasterize | 49.3 | 3.22 | 15× | within_tol |
| rst_h3_tessellate | 495 | 35.3 | 14× | exact |
| rst_tooverlappingtiles | 423 | 31.7 | 13× | within_tol |
| rst_resample | 659 | 54.8 | 12× | divergent |
| rst_resample_to_size | 65.3 | 5.83 | 11× | within_tol |
| rst_pixelcount | 47.9 | 4.53 | 11× | exact |
| rst_avg | 46.5 | 4.75 | 9.80× | within_tol |
| rst_median | 168 | 17.1 | 9.79× | within_tol |
| rst_clip | 34.3 | 3.55 | 9.66× | timing-only |
| rst_merge | 309 | 32.0 | 9.65× | within_tol |
| rst_buildoverviews | 174 | 18.1 | 9.61× | within_tol |
| rst_combineavg | 220 | 27.5 | 8.00× | within_tol |
| rst_quadbin_rastertogridmedian | 647 | 94.3 | 6.86× | within_tol |
| rst_tryopen | 1.54 | 0.230 | 6.67× | exact |
| rst_h3_rastertogridmax | 7370 | 1241 | 5.94× | exact |
| rst_h3_rastertogridavg | 7356 | 1243 | 5.92× | within_tol |
| rst_h3_rastertogridcount | 7312 | 1239 | 5.90× | within_tol |
| rst_h3_rastertogridmin | 7320 | 1244 | 5.88× | within_tol |
| rst_transform | 181 | 31.1 | 5.82× | within_tol |
| rst_h3_rastertogridmedian | 7627 | 1376 | 5.54× | exact |
| rst_tilexyz | 12.7 | 2.40 | 5.27× | timing-only |
| rst_resample_to_res | 254 | 53.9 | 4.71× | timing-only |
| rst_slope | 86.1 | 24.0 | 3.59× | within_tol |
| rst_tpi | 57.7 | 16.7 | 3.45× | within_tol |
| rst_roughness | 61.0 | 18.3 | 3.33× | within_tol |
| rst_to_webmercator | 201 | 69.0 | 2.91× | within_tol |
| rst_hillshade | 62.6 | 21.7 | 2.89× | within_tol |
| rst_tri | 58.3 | 23.4 | 2.50× | within_tol |
| rst_filter | 891 | 435 | 2.05× | within_tol |
| rst_aspect | 88.5 | 44.6 | 1.98× | within_tol |
| rst_quadbin_rastertogridmax | 152 | 79.9 | 1.90× | within_tol |
| rst_quadbin_rastertogridavg | 154 | 82.9 | 1.85× | within_tol |
| rst_quadbin_rastertogridcount | 137 | 78.2 | 1.75× | within_tol |
| rst_quadbin_rastertogridmin | 144 | 84.1 | 1.71× | within_tol |
| rst_max | 7.65 | 4.56 | 1.68× | exact |
| rst_min | 7.60 | 4.59 | 1.66× | exact |
| rst_xyzpyramid | 204 | 141 | 1.45× | timing-only |
| rst_proximity | 74.4 | 56.0 | 1.33× | within_tol |
| heavyweight faster below (< 1×) | ━━━ | ━━━ | ━━━ | ━━━ |
| rst_getsubdataset | 0.232 | 0.237 | 0.98× | timing-only |
| rst_boundingbox | 0.234 | 0.242 | 0.97× | timing-only |
| rst_color_relief | 16.3 | 22.4 | 0.73× | timing-only |
| rst_metadata | 0.145 | 0.241 | 0.60× | timing-only |
| rst_rastertoworldcoord | 0.142 | 0.249 | 0.57× | exact |
| rst_srid | 0.133 | 0.243 | 0.55× | exact |
| rst_scalex | 0.087 | 0.223 | 0.39× | exact |
| rst_gridfrompoints | 87.5 | 242 | 0.36× | within_tol |
| rst_scaley | 0.079 | 0.224 | 0.35× | exact |
| rst_pixelheight | 0.079 | 0.228 | 0.35× | exact |
| rst_type | 0.076 | 0.225 | 0.34× | timing-only |
| rst_getnodata | 0.058 | 0.223 | 0.26× | exact |
| rst_contour | 362 | 1414 | 0.26× | divergent |
| rst_isempty | 0.054 | 0.224 | 0.24× | exact |
| rst_polygonize | 152 | 640 | 0.24× | within_tol |
| rst_height | 0.051 | 0.236 | 0.22× | exact |
| rst_numbands | 0.050 | 0.232 | 0.22× | exact |
| rst_width | 0.047 | 0.252 | 0.19× | exact |
| rst_upperleftx | 0.040 | 0.226 | 0.18× | exact |
| rst_upperlefty | 0.039 | 0.224 | 0.17× | exact |
| rst_rastertoworldcoordx | 0.041 | 0.242 | 0.17× | exact |
| rst_rastertoworldcoordy | 0.038 | 0.236 | 0.16× | exact |
| rst_skewy | 0.034 | 0.224 | 0.15× | exact |
| rst_pixelwidth | 0.035 | 0.234 | 0.15× | exact |
| rst_format | 0.029 | 0.225 | 0.13× | exact |
| rst_skewx | 0.029 | 0.225 | 0.13× | exact |
| rst_rotation | 0.028 | 0.226 | 0.13× | exact |
| rst_bandmetadata | 0.025 | 0.224 | 0.11× | timing-only |
| rst_georeference | 0.024 | 0.222 | 0.11× | timing-only |
| rst_subdatasets | 0.024 | 0.244 | 0.10× | timing-only |
| rst_summary | 0.478 | 5.89 | 0.08× | timing-only |
| rst_worldtorastercoordx | 0.019 | 0.247 | 0.08× | timing-only |
| rst_worldtorastercoord | 0.015 | 0.248 | 0.06× | timing-only |
| rst_worldtorastercoordy | 0.014 | 0.242 | 0.06× | timing-only |
| rst_sample | 0.011 | 0.276 | 0.04× | timing-only |
| rst_histogram | 0.308 | 16.0 | 0.02× | timing-only |
| rst_memsize | 0.019 | 1.67 | 0.01× | timing-only |
| rst_viewshed | 16.7 | 4047 | 0.004× | timing-only |
| rst_cog_convert | —* | 342 | — | timing-only |
| rst_fromfile | —* | 0.732 | — | timing-only |
\* rst_cog_convert and rst_fromfile are not measured on the local heavyweight tier (a container-specific GDAL driver quirk returns immediately). They run normally on a cluster — see rst_cog_convert in the spark-path table below.

How to read the pure-core numbers
- Band math (rst_ndvi/ndwi/nbr/evi/savi/index/mapalgebra/threshold) shows the largest lightweight wins (roughly 65–130×). The heavyweight tier computes these by shelling out to a gdal_calc subprocess per call, which dominates its time; the lightweight tier evaluates the expression in-process with NumPy.
- Tiling and warps (rst_maketiles/retile/tooverlappingtiles/merge/transform) favor the lightweight tier by roughly 6–23×. The retile family reads only the output window per tile rather than translating the whole raster, so it stays fast even at 1024².
- Terrain (rst_slope/aspect/hillshade/tri/tpi/roughness) is a steadier ~2–3.6× lightweight win, all within tolerance — including on geographic rasters, where both tiers auto-derive the horizontal scale from the CRS.
- Discrete-grid aggregation (rst_h3_* / rst_quadbin_*) is a ~1.7–7× lightweight win and matches within tolerance.
- Trivial metadata accessors (rst_width/height/numbands and the world↔raster coordinate helpers) are microseconds on both tiers; here the JVM-native heavyweight path edges out the Python-worker call, so speedup dips below 1×. At this scale the absolute difference is a fraction of a millisecond.
- Algorithm-bound outliers — rst_viewshed, rst_contour, and rst_polygonize — are slower on the lightweight tier (the heavyweight GDAL implementations are hard to beat). rst_viewshed in particular is xrspatial/Numba-based; its first call also pays a one-time JIT compile cost (the figures above are warmed). Where the outputs are comparable, rst_polygonize is within tolerance and rst_contour diverges only in the feature count described above; these are the cases where the heavyweight tier is the better choice.
Spark-path at scale (cluster, 1000 tiles, 1024²)
Pure-core measures the algorithm; spark-path measures the job. Each function runs as a registered rst_* UDF over a DataFrame of 1,000 distinct 1024² tiles on a cluster, so the timing includes the realistic per-tile overhead — UDF dispatch, and for the lightweight tier the JVM↔Python serialization of tile bytes on every row.
Environment: Databricks · DBR 17.3 LTS · 2-band float32 tiles at 1024² (EPSG:32618) · 1,000 tiles per iteration (every tile processed each iteration, not sampled). Heavy/tile and Light/tile are the whole-job wall-clock amortized over the 1,000 tiles. (At 1,000 tiles, the per-tile figure in ms equals the whole-iteration time in seconds.)

Of the 83 functions with a spark-path cell, the lightweight tier is at least as fast in 57 and the heavyweight tier is faster in 26. Only the 7 aggregators produce a comparable fingerprint here (all within tolerance); the rest are timing-only in spark-path.
The headline: the pure-core wins do not all survive at scale. Operations that win roughly 2–7× in isolation — terrain (rst_slope/aspect/tri/tpi), the H3/QuadBin grids — collapse to parity or slightly heavyweight-favored once every tile's bytes cross the Python boundary. The serialization cost, not the compute, dominates byte-heavy operations at scale. Band math and the largest reductions still favor the lightweight tier (~2–4×) because their compute saving outweighs the boundary tax; algorithm-bound GDAL operations (rst_proximity, rst_contour, rst_polygonize, rst_gridfrompoints) favor the heavyweight tier by 5–16×.
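To make the shrinkage concrete, the sketch below puts the pure-core and spark-path speedups side by side for three functions. All figures are per-tile milliseconds copied from this page (the rst_slope spark-path pair comes from the sample summary under Reading the output); the dictionary layout is only for illustration, and values recomputed from rounded inputs can differ slightly from the tables.

```python
# (heavy_ms, light_ms) per tile, copied from the tables on this page
pure_core = {
    "rst_ndvi": (768, 7.77),
    "rst_fillnodata": (417, 12.9),
    "rst_slope": (86.1, 24.0),
}
spark_path = {
    "rst_ndvi": (22.31, 6.55),
    "rst_fillnodata": (32.71, 6.98),
    "rst_slope": (7.31, 7.57),
}


def speedup(pair):
    heavy_ms, light_ms = pair
    return heavy_ms / light_ms  # > 1 means the lightweight tier is faster


for fn in pure_core:
    pc = speedup(pure_core[fn])
    sp = speedup(spark_path[fn])
    print(f"{fn}: pure-core {pc:.1f}x -> spark-path {sp:.2f}x")
```

The lightweight per-tile times barely move between the two models, while the heavyweight times drop sharply on the cluster, which is why the ratios compress.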
Ordered by Speedup = Heavy / Light (highest first).
| Function | Heavy/tile (ms) | Light/tile (ms) | Speedup | Consistency |
|---|---|---|---|---|
| rst_dtmfromgeoms_agg | 175.10 | 9.94 | 18× | within_tol |
| rst_fillnodata | 32.71 | 6.98 | 4.69× | timing-only |
| rst_convolve | 35.98 | 7.87 | 4.57× | timing-only |
| rst_resample | 45.82 | 10.92 | 4.20× | timing-only |
| rst_h3_tessellate | 45.59 | 10.99 | 4.15× | timing-only |
| rst_evi | 24.31 | 6.19 | 3.93× | timing-only |
| rst_nbr | 22.59 | 5.96 | 3.79× | timing-only |
| rst_mapalgebra | 20.90 | 5.67 | 3.68× | timing-only |
| rst_index | 22.04 | 6.02 | 3.66× | timing-only |
| rst_ndwi | 21.98 | 6.08 | 3.62× | timing-only |
| rst_savi | 21.55 | 6.24 | 3.45× | timing-only |
| rst_ndvi | 22.31 | 6.55 | 3.40× | timing-only |
| rst_threshold | 19.28 | 6.29 | 3.07× | timing-only |
| rst_median | 9.73 | 3.24 | 3.01× | timing-only |
| rst_separatebands | 15.04 | 6.11 | 2.46× | timing-only |
| rst_transform | 18.76 | 7.84 | 2.39× | timing-only |
| rst_updatetype | 16.78 | 7.04 | 2.38× | timing-only |
| rst_quadbin_rastertogridmedian | 27.60 | 11.90 | 2.32× | timing-only |
| rst_buildoverviews | 15.60 | 7.02 | 2.22× | timing-only |
| rst_avg | 7.06 | 3.25 | 2.17× | timing-only |
| rst_initnodata | 13.39 | 6.24 | 2.15× | timing-only |
| rst_setsrid | 13.15 | 6.17 | 2.13× | timing-only |
| rst_pixelcount | 6.71 | 3.19 | 2.10× | timing-only |
| rst_to_webmercator | 20.83 | 9.91 | 2.10× | timing-only |
| rst_tooverlappingtiles | 22.57 | 10.75 | 2.10× | timing-only |
| rst_retile | 15.66 | 8.53 | 1.84× | timing-only |
| rst_filter | 51.09 | 29.78 | 1.72× | timing-only |
| rst_isempty | 4.58 | 2.71 | 1.69× | timing-only |
| rst_skewy | 4.08 | 2.43 | 1.68× | timing-only |
| rst_frombands_agg | 12.63 | 7.52 | 1.68× | within_tol |
| rst_format | 4.07 | 2.42 | 1.68× | timing-only |
| rst_combineavg_agg | 16.69 | 9.96 | 1.68× | within_tol |
| rst_skewx | 4.04 | 2.44 | 1.66× | timing-only |
| rst_rastertoworldcoordx | 4.08 | 2.48 | 1.65× | timing-only |
| rst_upperlefty | 4.15 | 2.52 | 1.64× | timing-only |
| rst_width | 5.14 | 3.14 | 1.64× | timing-only |
| rst_pixelheight | 4.23 | 2.58 | 1.64× | timing-only |
| rst_derivedband | 10.27 | 6.28 | 1.63× | timing-only |
| rst_upperleftx | 4.10 | 2.57 | 1.59× | timing-only |
| rst_scalex | 3.98 | 2.50 | 1.59× | timing-only |
| rst_rastertoworldcoordy | 4.07 | 2.57 | 1.58× | timing-only |
| rst_numbands | 4.11 | 2.61 | 1.57× | timing-only |
| rst_getnodata | 3.94 | 2.52 | 1.56× | timing-only |
| rst_rotation | 4.09 | 2.63 | 1.56× | timing-only |
| rst_pixelwidth | 4.16 | 2.73 | 1.52× | timing-only |
| rst_scaley | 4.00 | 2.63 | 1.52× | timing-only |
| rst_maketiles | 10.10 | 6.74 | 1.50× | timing-only |
| rst_quadbin_rastertogridmax | 17.02 | 11.41 | 1.49× | timing-only |
| rst_band | 8.56 | 5.81 | 1.47× | timing-only |
| rst_max | 4.51 | 3.06 | 1.47× | timing-only |
| rst_min | 4.47 | 3.04 | 1.47× | timing-only |
| rst_height | 4.56 | 3.10 | 1.47× | timing-only |
| rst_srid | 4.12 | 2.83 | 1.46× | timing-only |
| rst_merge_agg | 23.13 | 16.48 | 1.40× | within_tol |
| rst_resample_to_size | 7.30 | 5.34 | 1.37× | timing-only |
| rst_derivedband_agg | 9.43 | 7.61 | 1.24× | within_tol |
| rst_aspect | 7.66 | 7.60 | 1.01× | timing-only |
| heavyweight faster below (< 1×) | ━━━ | ━━━ | ━━━ | ━━━ |
| rst_roughness | 7.42 | 7.46 | 0.99× | timing-only |
| rst_slope | 7.31 | 7.57 | 0.97× | timing-only |
| rst_quadbin_rastertogridcount | 10.73 | 11.14 | 0.96× | timing-only |
| rst_quadbin_rastertogridmin | 11.17 | 11.84 | 0.94× | timing-only |
| rst_tpi | 6.87 | 7.40 | 0.93× | timing-only |
| rst_tri | 7.05 | 7.70 | 0.92× | timing-only |
| rst_quadbin_rastertogridavg | 10.68 | 11.92 | 0.90× | timing-only |
| rst_rastertoworldcoord | 4.17 | 4.99 | 0.84× | timing-only |
| rst_hillshade | 5.79 | 7.02 | 0.82× | timing-only |
| rst_h3_rastertogridmedian | 67.02 | 81.90 | 0.82× | timing-only |
| rst_h3_rastertogridavg | 56.72 | 77.22 | 0.73× | timing-only |
| rst_h3_rastertogridmax | 57.06 | 78.56 | 0.73× | timing-only |
| rst_tryopen | 3.99 | 5.59 | 0.71× | timing-only |
| rst_h3_rastertogridmin | 56.51 | 79.28 | 0.71× | timing-only |
| rst_h3_rastertogridcount | 55.34 | 79.39 | 0.70× | timing-only |
| rst_fromcontent | 4.03 | 5.90 | 0.68× | timing-only |
| rst_asformat | 4.00 | 6.04 | 0.66× | timing-only |
| rst_frombands | 2.31 | 4.94 | 0.47× | timing-only |
| rst_cog_convert | 4.23 | 11.13 | 0.38× | timing-only |
| rst_rasterize_agg | 2.00 | 6.16 | 0.32× | within_tol |
| rst_merge | 2.30 | 8.42 | 0.27× | timing-only |
| rst_combineavg | 1.91 | 7.73 | 0.25× | timing-only |
| rst_polygonize | 10.90 | 54.88 | 0.20× | timing-only |
| rst_contour | 13.91 | 127.62 | 0.11× | timing-only |
| rst_gridfrompoints_agg | 2.66 | 33.75 | 0.08× | within_tol |
| rst_proximity | 0.52 | 8.12 | 0.06× | timing-only |
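As a quick check on how the table's columns relate, the sketch below recomputes a per-tile figure and a speedup from whole-job timings. The helper names are illustrative and are not part of the benchmark suite; the inputs are the rst_fillnodata row above.

```python
def per_tile_ms(job_seconds: float, n_tiles: int) -> float:
    """Amortize a whole-job wall-clock (seconds) over n_tiles, in ms per tile."""
    return job_seconds * 1000.0 / n_tiles


def speedup(heavy: float, light: float) -> float:
    """Speedup = Heavy / Light; a value above 1 means the lightweight tier is faster."""
    return heavy / light


# rst_fillnodata: at 1,000 tiles a 32.71 s job is 32.71 ms per tile.
heavy_ms = per_tile_ms(32.71, 1000)
light_ms = per_tile_ms(6.98, 1000)
fillnodata_speedup = round(speedup(heavy_ms, light_ms), 2)
print(f"rst_fillnodata speedup: {fillnodata_speedup}x")
```

Because both tiers are amortized over the same tile count, the speedup is identical whether it is computed from per-tile or whole-job figures.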
Repartition strategy
The spark-path numbers above are wall-clock for a whole distributed job, so they're sensitive to how evenly the 1,000 tiles spread across the cluster's task slots. The benchmark harness tunes the partitioning so the comparison reflects the engines, not a straggler tail:
- Repartition the N-row DataFrame to ~slots × 4 partitions (oversubscribe ~4×) so a slot that finishes early picks up a pending tile instead of idling while one long tile finishes.
- Set spark.sql.shuffle.partitions to match, and disable AQE for the measured run so Adaptive Query Execution can't coalesce that repartition back toward defaultParallelism and reintroduce the idle tail.
This repartitioning is benchmark-harness tuning — it lives in the bench code that runs from a repo checkout, not in the GeoBrix product. The lightweight pyrx tier never mutates Spark configuration: it only registers UDFs and builds Column expressions, so it is safe on Serverless / Spark Connect, where runtime spark.conf.set(...) and JVM-bridge access are not permitted. On Serverless you therefore can't hand-tune partitions; you rely on the platform's managed AQE and default partitioning. Treat the tuned figures above as the controlled, like-for-like comparison number — not as what a Serverless user sets by hand.
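The tuning described above might look roughly like the following. This is a minimal sketch, not the harness code: the function names and the way task_slots is obtained are assumptions, while spark.sql.shuffle.partitions, spark.sql.adaptive.enabled, and DataFrame.repartition are standard Spark settings and APIs. It applies only to classic clusters, since runtime spark.conf.set(...) is not permitted on Serverless.

```python
def target_partitions(task_slots: int, oversubscribe: int = 4) -> int:
    """Partition count that oversubscribes the cluster's task slots (~4x by default)."""
    return max(1, task_slots * oversubscribe)


def tune_for_measured_run(spark, df, task_slots: int):
    """Hypothetical helper: spread df evenly and stop AQE from undoing the spread.

    spark is an active SparkSession and df a DataFrame; task_slots is assumed to
    be known by the caller (for example, workers x cores per worker).
    """
    n = target_partitions(task_slots)
    # Keep shuffle output aligned with the chosen partition count.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    # Disable Adaptive Query Execution so it cannot coalesce the partitions again.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    return df.repartition(n)
```

For example, a cluster with 80 task slots would be repartitioned to 320 partitions under this scheme.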
Fan-out generators (streaming UDTFs)
These functions are streaming Python UDTFs on the lightweight tier (one input tile → many output rows via LATERAL); the benchmark times the lightweight UDTF against the equivalent heavyweight Scala generator at matched fan-out and asserts that both tiers produce the same flat row count — decoded-feature parity as a hard gate.
Environment: Databricks · DBR 17.3 LTS · 20 workers · synthetic fan-out corpus · whole-job wall-clock, 1 measured iteration. Parity gate: all functions listed passed (light and heavy produced the same flat row count). Run with gbx:bench:cluster --fanout-only.
| Function | rows (fan-out) | light (s) | heavy (s) | light edge |
|---|---|---|---|---|
| rst_polygonize | 256 | 0.262 | 0.196 | heavy faster |
| rst_h3_rastertogridcount | 58,123 | 0.46 | 0.35 | heavy faster |
| rst_xyzpyramid | 78 | 0.774 | 0.674 | heavy faster |
| rst_maketiles | 16 | 0.341 | 0.254 | heavy faster |
| rst_retile | 256 | 0.528 | 0.549 | light faster |
| rst_separatebands | 64 | 0.292 | 0.372 | light faster |
| rst_tooverlappingtiles | 324 | 0.633 | 0.608 | heavy ~parity |
Across the converted fan-out functions, light is faster than the heavyweight JVM generators for rst_retile and rst_separatebands, roughly at parity for rst_tooverlappingtiles, and slower for the other four, with decoded-row parity as a hard gate for every result in the table. The streaming-UDTF model amortizes the JVM↔Python boundary across the full fan-out (the boundary is paid once per input tile, not once per output row), which is why the at-scale cost difference is smaller than for scalar transforms. See the Performance page for the streaming-UDTF architecture.
rst_h3_tessellate numbers are pending a cluster re-bench. The covering/centroid mode work has landed: light and heavy now agree on cell assignment for both modes (test-enforced cross-tier parity; see the H3 tessellation explainer), and timings will be added on the next cluster run.
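To illustrate the fan-out shape and the row-count gate described in this section, here is a minimal pure-Python sketch. It is not the pyrx implementation: RetileSketch, the list-of-lists tile, and assert_row_parity are hypothetical stand-ins. A PySpark Python UDTF follows the same pattern of an eval method that yields output rows, so the handler is entered once per input tile however many rows it emits.

```python
from typing import Iterator, List, Tuple

Tile = List[List[float]]  # hypothetical stand-in for a decoded tile's pixels


class RetileSketch:
    """One input tile in, many output rows out (the streaming fan-out shape)."""

    def eval(self, tile: Tile, size: int) -> Iterator[Tuple[int, int, Tile]]:
        rows, cols = len(tile), len(tile[0])
        for r in range(0, rows, size):
            for c in range(0, cols, size):
                # Slice out one size x size sub-tile and stream it as a row.
                sub = [row[c:c + size] for row in tile[r:r + size]]
                yield (r // size, c // size, sub)


def assert_row_parity(light_rows: int, heavy_rows: int) -> None:
    """Hard gate: both tiers must produce the same flat row count."""
    if light_rows != heavy_rows:
        raise AssertionError(f"row-count mismatch: light={light_rows}, heavy={heavy_rows}")


# An 8x8 tile cut into 2x2 sub-tiles fans out to 16 rows.
tile = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
light_out = list(RetileSketch().eval(tile, 2))
assert_row_parity(len(light_out), 16)
```

Yielding rows from a generator, instead of building a full list first, is what lets the output stream without buffering the whole fan-out in memory.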
MVT encoding (st_asmvt)
The st_asmvt aggregator encodes a group of tile-local features into one Mapbox Vector Tile (MVT)
protobuf blob. The benchmark groups 500 synthetic features into 10 tiles and times the
groupBy("z","x","y").agg(st_asmvt(geom, attrs, "layer")) whole-job wall-clock on each tier, then
decodes both tiers' tiles and asserts feature-level parity (geometry + native-typed attributes
match). Run it with gbx:bench:cluster --mvt-only.
Environment: Databricks · DBR 17.3 LTS · 20 workers · 500 features → 10 tiles · whole-job wall-clock median, one measured iteration.
| Tier | Function | Median (500 feat → 10 tiles) | Ratio |
|---|---|---|---|
| lightweight | pyvx.st_asmvt | 0.589 s | ~1.06× (≈ parity) |
| heavyweight | vectorx.st_asmvt | 0.555 s | — |
The pure-Python lightweight aggregator (a grouped-aggregate pandas UDF over mapbox-vector-tile)
runs at parity with the heavyweight JVM/OGR aggregator (~6% apart). Both encode attributes with
native protobuf value types and take features in tile-local [0, extent] coordinates, so the
decoded tiles are feature-equivalent — a hard parity gate; the benchmark fails if the tiers
diverge. Tiles compose with pmtiles_agg for end-to-end publishing.
TIN & legacy migration (st_triangulate, st_interpolateelevation*, st_legacyaswkb)
The constrained-Delaunay TIN generators (st_triangulate, st_interpolateelevationbbox,
st_interpolateelevationgeom) and the legacy-geometry migration function (st_legacyaswkb)
are benchmarked light (pyvx) vs heavy (vectorx) with decoded-output parity. Light invokes the TIN
generators as PySpark UDTFs (SQL LATERAL); heavy invokes the JVM generator expressions. Parity is
asserted on decoded output, per the established posture: triangle count + centroid match within
1e-6 for triangulation (not triangle identity — no-Steiner constrained recovery may pick different
non-constraint diagonals), per-cell surface closeness within 1e-6 for elevation interpolation, and
decoded-geometry equality for legacy migration. Run it with gbx:bench:cluster --vector-tin-only.
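The triangulation check can be sketched as follows. This is one possible reading, stated as an assumption: "centroid match" is taken to mean the area-weighted centroid of the whole mesh, which (like the triangle count) does not change when a non-constraint diagonal is flipped. The function names and the tuple-based triangle representation are illustrative only and are not the parity suite's API.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def mesh_centroid(triangles: Sequence[Triangle]) -> Point:
    """Area-weighted centroid of a 2D triangle mesh."""
    total_area = sum_x = sum_y = 0.0
    for (ax, ay), (bx, by), (px, py) in triangles:
        # Triangle area from the cross product of two edge vectors.
        area = abs((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / 2.0
        total_area += area
        sum_x += area * (ax + bx + px) / 3.0
        sum_y += area * (ay + by + py) / 3.0
    if total_area == 0.0:
        raise ValueError("mesh has zero total area")
    return (sum_x / total_area, sum_y / total_area)


def tin_parity(light: Sequence[Triangle], heavy: Sequence[Triangle], tol: float = 1e-6) -> bool:
    """Same triangle count and same mesh centroid within tol (not triangle identity)."""
    if len(light) != len(heavy):
        return False
    (lx, ly), (hx, hy) = mesh_centroid(light), mesh_centroid(heavy)
    return abs(lx - hx) <= tol and abs(ly - hy) <= tol


# The unit square split along opposite diagonals: different triangles, same mesh.
split_a = [((0, 0), (1, 0), (1, 1)), ((0, 0), (1, 1), (0, 1))]
split_b = [((0, 0), (1, 0), (0, 1)), ((1, 0), (1, 1), (0, 1))]
print(tin_parity(split_a, split_b))
```

The two splits share no triangle, yet they pass, which is the behavior the posture above asks for.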
Cluster spark-path results (DBR 17.3 LTS; median wall-clock of one full distributed iteration over N tiles; same corpus both tiers):
| Function | Tiles/iter | Heavy (s) | Light (s) | Light speedup |
|---|---|---|---|---|
| st_interpolateelevationgeom | 125 | 0.49 | 0.56 | 0.88× |
| st_legacyaswkb | 1,000 | 0.39 | 0.48 | 0.82× |
| st_interpolateelevationbbox | 245 | 0.40 | 0.53 | 0.75× |
| st_triangulate | 220 | 0.41 | 0.70 | 0.60× |
The heavyweight JVM/JTS tier is modestly faster across all four. The lightweight tier is competitive
on legacy migration and elevation interpolation (~0.75–0.88×) and ~1.7× behind on st_triangulate:
the gap is the JVM↔Python serialization of the geometry arrays crossing the UDTF boundary (the same
boundary cost seen for other byte-heavy lightweight UDTFs), not the triangulation compute itself.
Decoded-output parity holds across both tiers (enforced by the cross-tier parity test suite, above).
Quadbin (quadbin_pointascell, quadbin_polyfill, quadbin_tessellate, quadbin_cellunion_agg)
The quadbin grid functions are benchmarked light (pygx) vs heavy (gridx.quadbin) with exact-output
parity across four representative shapes: a scalar encode (quadbin_pointascell, lon/lat → cell),
a geometry → cell-array fill (quadbin_polyfill), a geometry → cell-clip struct array
(quadbin_tessellate), and a grouped aggregate (quadbin_cellunion_agg, cell ids → unioned
coverage geometry). Both tiers expose the same quadbin_* SQL names, so the light tier is
collected before the heavy tier re-registers. Parity is asserted on decoded output: exact cell-id
equality for pointascell and polyfill, exact cells with centroid match within 1e-6 for
tessellate, and decoded union-geometry equality (1e-6) per group for cellunion_agg. Run it
with gbx:bench:cluster --grid-quadbin-only.
All 10 quadbin functions were benchmarked on a cluster (DBR 17.3 LTS, x86, spark-path, 1,000
tiles/iteration, 5 measured iterations, both tiers). Exact cell-set parity holds for all 10
(test-enforced — in-cell and by the test_parity_quadbin cross-tier suite).
On timing, these are sub-millisecond cell-math operations, so the spark-path wall-clock is
dominated by per-row JVM↔Python dispatch, not by the work itself. The 5-measured-iteration medians put
the lightweight pygx tier roughly at parity with the heavyweight JVM tier across all 10
(0.89×–1.33×): the small gaps are fixed UDF-boundary/dispatch cost on top of very cheap per-row cell
math, not an algorithmic deficiency. (One honest caveat: at sub-millisecond per-tile work, these
numbers still carry some run-to-run variance, so read the relative ordering rather than the last digit.)
Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):
| Function | Heavy (s) | Light (s) | Light speedup |
|---|---|---|---|
| quadbin_resolution | 0.28 | 0.21 | 1.33× |
| quadbin_cellunion_agg | 0.38 | 0.29 | 1.32× |
| quadbin_tessellate | 0.20 | 0.18 | 1.11× |
| quadbin_polyfill | 0.24 | 0.23 | 1.06× |
| quadbin_kring | 0.23 | 0.22 | 1.05× |
| quadbin_distance | 0.19 | 0.18 | 1.04× |
| quadbin_cellunion | 0.23 | 0.22 | 1.04× |
| quadbin_aswkb | 0.15 | 0.15 | 0.98× |
| quadbin_pointascell | 0.21 | 0.22 | 0.96× |
| quadbin_centroid | 0.17 | 0.19 | 0.89× |
quadbin_cellunion_agg is a grouped aggregate over 8 groups; the other nine are 1,000-row
scalar/array operations.
The implementation is scale-aware: scalar/bounded-output ops (pointascell, resolution,
distance, aswkb, centroid, cellunion) use vectorized/batched pandas_udfs, while
array-returning ops (kring, polyfill, tessellate) use plain row-at-a-time UDFs so a single
batch never buffers many large variable-length outputs (a memory-safety choice for large
geometries/zoom levels). The lightweight pygx tier's decisive advantage here is Serverless/ARM
reach and no-JVM/JAR install, with timing competitive with the heavyweight tier.
BNG (all 23 bng_* functions)
The British National Grid functions are available in both tiers — light (pygx, a pure-Python
codec port of BNG.scala) vs heavy (gridx.bng) — with the same exact-output parity contract as
quadbin: exact cell-set equality for eastnorthasbng/pointascell/polyfill/kring,
exact cells with chip-geometry match for tessellate, and decoded chip-geometry equality
per group for the aggregates (asserted by the test_parity_bng cross-tier suite). The execution
shape mirrors quadbin (numpy-vectorized scalar pandas_udfs for cell-id math, plain UDFs for
array-returning ops, UDTFs for the explodes, grouped aggregates for the chip aggregators), so the
timing expectation is the same: sub-millisecond STRING cell-id math dominated by the per-row
JVM↔Python boundary, with WKB-geometry ops additionally carrying the geometry-bytes serialization
cost. Run the BNG legs with gbx:bench:cluster --grid-bng-only.
All 23 bng_* functions were benchmarked on a cluster (DBR 17.3 LTS, spark-path, 1,000
tiles/iteration, 5 measured iterations, both tiers). Exact cell-set / decoded-geometry parity
holds for all 23 — light output equals heavy output, enforced by the in-cell parity gate and the
test_parity_bng cross-tier suite (all 23 parity gates passed; 0 errors).
On timing, the pure-Python pygx BNG tier runs at near-parity with the heavyweight JVM tier:
across all 23 functions the lightweight tier is within ~±20% of heavy (0.82×–1.18×), all
sub-millisecond per tile at 1,000 tiles/iteration. 7/23 are light-faster (best: bng_kring at 1.18×);
16/23 are marginally heavy-faster (largest gap: bng_cellintersection at 0.82×). These deltas sit within the
run-to-run noise band for sub-second distributed jobs at this scale, so the tiers are effectively at parity. (Note that
BNG WKB carries no SRID for EPSG:27700, unlike quadbin's EWKB.)
Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):
| Function | Heavy (s) | Light (s) | Light speedup |
|---|---|---|---|
| bng_kring | 0.20 | 0.17 | 1.18× |
| bng_geomkring | 0.15 | 0.13 | 1.11× |
| bng_geomkringexplode | 0.48 | 0.44 | 1.08× |
| bng_cellunion_agg | 0.29 | 0.27 | 1.07× |
| bng_aswkt | 0.18 | 0.17 | 1.06× |
| bng_centroid | 0.14 | 0.14 | 1.06× |
| bng_kloop | 0.13 | 0.13 | 1.01× |
| bng_cellunion | 0.14 | 0.15 | 0.97× |
| bng_aswkb | 0.14 | 0.14 | 0.96× |
| bng_kloopexplode | 0.26 | 0.27 | 0.96× |
| bng_kringexplode | 0.26 | 0.27 | 0.96× |
| bng_cellarea | 0.15 | 0.16 | 0.94× |
| bng_geomkloopexplode | 0.43 | 0.47 | 0.93× |
| bng_euclideandistance | 0.14 | 0.15 | 0.91× |
| bng_geomkloop | 0.15 | 0.16 | 0.90× |
| bng_polyfill | 0.14 | 0.16 | 0.89× |
| bng_tessellateexplode | 0.42 | 0.47 | 0.89× |
| bng_cellintersection_agg | 0.25 | 0.28 | 0.89× |
| bng_eastnorthasbng | 0.14 | 0.16 | 0.88× |
| bng_tessellate | 0.14 | 0.17 | 0.86× |
| bng_pointascell | 0.15 | 0.17 | 0.86× |
| bng_distance | 0.13 | 0.16 | 0.86× |
| bng_cellintersection | 0.14 | 0.17 | 0.82× |
Custom grid (all 7 custom_* functions)
The custom equal-area grid functions are available in both tiers, light (pygx) vs heavy
(gridx.custom), with the same exact-output parity contract as quadbin and BNG: exact cell-id /
cell-set equality for pointascell/polyfill/kring, decoded geometry
match within 1e-6 for centroid/cellaswkb/cellaswkt, and an identical grid struct for
grid (asserted by the JAR-gated cross-tier parity suite and the in-cell hard-assert gates on the
cluster run). The execution shape mirrors the other grids (vectorized scalar pandas_udfs for the
cell-id math, plain UDFs for the geometry/array-returning ops), so the timing expectation is the
same: sub-millisecond cell math dominated by the per-row JVM↔Python boundary, with WKB/WKT-geometry
ops additionally carrying the geometry-bytes serialization cost. Run the custom legs with
gbx:bench:cluster --grid-custom-only.
All 7 custom_* functions were benchmarked on a cluster (DBR 17.3 LTS, x86, spark-path,
1,000 tiles/iteration, 5 measured iterations, both tiers). Exact cell-id/cell-set parity holds for
all 7 — light output equals heavy output (cell ids/sets exact; geometry within 1e-6; grid struct
identical), enforced by the JAR-gated cross-tier parity suite and the in-cell hard-assert gates on the
cluster run.
On timing, the 5-measured-iteration medians put the pure-Python pygx custom-grid tier roughly at
parity with the heavyweight Scala tier across all 7 (0.87×–1.34×): it is faster on the cheapest ops
(custom_centroid 1.34×, custom_kring 1.33×) and marginally slower on a few. Absolute times are all
sub-second at 1,000 tiles (sub-millisecond per tile), so the gaps are the fixed Python UDF-boundary /
Arrow serialization cost on top of very cheap per-row grid math, not an algorithmic deficiency.
Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):
| Function | Heavy (s) | Light (s) | Light speedup |
|---|---|---|---|
| custom_centroid | 0.19 | 0.14 | 1.34× |
| custom_kring | 0.17 | 0.13 | 1.33× |
| custom_cellaswkt | 0.18 | 0.15 | 1.20× |
| custom_grid | 0.15 | 0.15 | 0.97× |
| custom_pointascell | 0.16 | 0.18 | 0.89× |
| custom_cellaswkb | 0.14 | 0.16 | 0.89× |
| custom_polyfill | 0.13 | 0.15 | 0.87× |
Caveats
- Absolute timings depend on the machine, the corpus, and the tier's install; treat the relative speedup and the consistency outcome as the durable signal, not the millisecond values.
- The two tables answer different questions. Pure-core isolates the algorithm and is where the lightweight tier's biggest wins show; spark-path at scale adds the per-row UDF and serialization overhead and is the better predictor of end-to-end job cost — and the better place to decide which tier to run a given operation on.
- Consistency is checked on output statistics with tolerance, not byte-equality: GDAL and NumPy can differ in the last bits, and neighborhood / NoData-boundary operations legitimately differ on the one-pixel kernel border or in how masked pixels propagate. The benchmark surfaces those as divergent so they're visible, not hidden.
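The tolerance-based consistency check in the last caveat can be sketched like this. It is a hedged illustration, not the benchmark's code: the per-band statistic names, the tolerance values, and the consistency function are assumptions; only the three outcome labels (within_tol, divergent, timing-only) come from the tables on this page.

```python
import math
from typing import Dict, Optional, Sequence

BandStats = Dict[str, float]  # hypothetical fingerprint entry, e.g. {"min": ..., "mean": ...}


def consistency(
    light: Optional[Sequence[BandStats]],
    heavy: Optional[Sequence[BandStats]],
    rel_tol: float = 1e-5,   # assumed tolerance, not the suite's actual value
    abs_tol: float = 1e-6,   # assumed tolerance, not the suite's actual value
) -> str:
    """Compare two per-band fingerprints by statistics, not byte-equality."""
    if light is None or heavy is None:
        return "timing-only"  # no comparable fingerprint for this result
    if len(light) != len(heavy):
        return "divergent"
    for light_band, heavy_band in zip(light, heavy):
        if light_band.keys() != heavy_band.keys():
            return "divergent"
        for name, value in light_band.items():
            if not math.isclose(value, heavy_band[name], rel_tol=rel_tol, abs_tol=abs_tol):
                return "divergent"
    return "within_tol"


# Last-bit differences pass; a real difference in a statistic is surfaced.
print(consistency([{"mean": 1.0}], [{"mean": 1.0000001}]))
print(consistency([{"mean": 1.0}], [{"mean": 1.1}]))
```

Combining a relative and an absolute tolerance keeps the comparison meaningful for statistics near zero, where a purely relative bound would be too strict.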