Benchmarking

GeoBrix ships a benchmark suite that compares the two execution tiers — Heavyweight (rasterx) and Lightweight (pyrx) — on the same inputs, reporting both performance (per-function timing) and output consistency (do the two tiers produce the same result?).

This page explains what the benchmark measures, how to run it on a Databricks cluster (the primary, same-hardware comparison), how to run it locally, and how to read the output. It then presents full tables of representative results: the pure-core (algorithm-in-isolation) view and the spark-path at-scale view from a cluster.

Runtime support vs. benchmark environment

GeoBrix is supported on both DBR 17.3 LTS and DBR 18 LTS. The per-result Environment stamps below record the specific cluster each benchmark actually ran on (DBR 17.3 LTS); they are factual records of the run, not a statement that DBR 18 is unsupported.

What it measures

Each function is timed under two independent models:

Pure-core — the raster operation in isolation: open a single tile, call the function, measure. It is always one tile per measurement (repeated once for each tile shape in the corpus — each tile-size / band-count / dtype / SRID combination), and it ignores --row-counts entirely. This is the fairest apples-to-apples view of the algorithm itself, with no Spark or serialization in the path.
Spark-path — the registered function (rst_*) applied to a Spark DataFrame of N rows (the only model that uses --row-counts). This includes the realistic per-row overhead (UDF dispatch, serialization, Python-worker round-trips for the lightweight tier), and is swept across a row ladder (e.g. 10 → 100 → 1,000 → 10,000 rows).

Both models run the function --warmup times untimed (to absorb cold caches, JIT, and worker spin-up) and then --measured times timed; the reported figure is the median over the measured passes. Locally the defaults are 2 warmup / 5 measured; on a cluster they are 1 / 3 for pure-core and 1 / 1 for spark-path (one full N-tile iteration is already substantial).
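
The warmup-then-measure pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own harness: the function name `time_median` and its parameters are hypothetical, and only the standard library is used.

```python
import statistics
import time


def time_median(fn, warmup=2, measured=5):
    """Call fn `warmup` times untimed, then `measured` times timed.

    Returns the median wall-clock duration (seconds) of the timed passes.
    Hypothetical helper for illustration; not part of GeoBrix.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs cold caches, JIT, worker spin-up
    samples = []
    for _ in range(measured):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Usage: count the calls to confirm 2 warmup + 5 measured passes ran.
calls = []
median_s = time_median(lambda: calls.append(1), warmup=2, measured=5)
print(len(calls), median_s)
```

Reporting the median rather than the mean keeps a single slow pass (for example, a garbage-collection pause) from skewing the figure.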

Alongside timing, every pure-core result carries an output fingerprint (per-band statistics) so the two tiers can be checked for consistency:

exact — every statistic is bitwise-equal across tiers.
within_tol — every statistic agrees to a relative tolerance of 1e-3 or an absolute tolerance of 1e-3 (the absolute floor handles near-zero values where a relative comparison is meaningless).
divergent — neither tolerance is met.
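
The three labels can be expressed as a small classifier. The sketch below assumes each fingerprint is a flat dict of named float statistics (a hypothetical shape; the real fingerprint layout is not shown on this page) and uses `==` as a stand-in for bitwise equality, which differs for NaN and signed zero.

```python
import math

REL_TOL = 1e-3  # relative tolerance from the definition above
ABS_TOL = 1e-3  # absolute floor for near-zero values


def classify(heavy_stats, light_stats):
    """Label two statistic dicts as exact / within_tol / divergent.

    Illustrative only: the dict-of-floats fingerprint is an assumption.
    """
    if heavy_stats.keys() != light_stats.keys():
        return "divergent"
    pairs = [(heavy_stats[k], light_stats[k]) for k in heavy_stats]
    if all(h == l for h, l in pairs):
        return "exact"
    # math.isclose passes when EITHER the relative or the absolute bound holds.
    if all(math.isclose(h, l, rel_tol=REL_TOL, abs_tol=ABS_TOL) for h, l in pairs):
        return "within_tol"
    return "divergent"


print(classify({"mean": 100.0}, {"mean": 100.05}))  # relative bound holds
print(classify({"mean": 0.0}, {"mean": 0.0005}))    # absolute floor holds
print(classify({"mean": 100.0}, {"mean": 101.0}))   # neither holds
```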

The goal of the one-line tier swap is that results stay consistent; the benchmark is how that guarantee is verified.

note

The benchmark is run from the GeoBrix source repository (it uses the repo's gbx:bench:* commands and a job notebook). A wheel-only install does not include these tools — the cluster benchmark submits a job to your own provisioned cluster from a checkout of the repo.

Running on a cluster

Running on a provisioned cluster is the true comparison: both tiers execute on the same hardware, against the same corpus, and the full row ladder and larger tiles are within reach. Results are appended to a bench_results Delta table, and comparison.csv and summary.md land on the configured Volume.

Prerequisites

Provision a cluster and stage the artifacts per the installation guide:

Tier	Cluster	Artifacts
Heavyweight (rasterx)	x86 · DBR 17.3 LTS	Init script + bundle + GeoBrix wheel + the bench tests JAR (`geobrix-*-tests.jar`)
Lightweight (pyrx)	x86 or ARM	The `[light]` wheel only

Then fill in the cluster configuration file (notebooks/tests/databricks_cluster_config.env) with your cluster ID and Volume paths.

Run

# Both tiers, same cluster, full row ladder, all functions
bash scripts/commands/gbx-bench-cluster.sh \
  --cluster-id <your-cluster-id> \
  --run-id cluster-2026-06 \
  --modes both \
  --row-counts 10,100,1000,10000

Scale the run with the options below. --cluster-id and --run-id identify the run; the rest are optional. --row-counts and --functions take a comma-separated list; --modes and --set take a single value.

Option	Purpose
`--modes pure-core \| spark-path \| both`	Timing model — pick only one mode (default `both`, which runs both models).
`--row-counts 10,100,1000,10000`	Spark-path row ladder — each value is the number of distinct tiles processed in one timed iteration (the main scale dimension); one iteration is measured per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default `10,100,1000,10000`.
`--set core \| full`	Which function set to benchmark — the representative `core` set or every benchmarked function (`full`). Default `core`.
`--functions rst_slope,rst_ndvi`	Comma-separated list restricting the run to specific functions; overrides `--set`. Default: unset (benchmarks the `--set`).
`--warmup` / `--measured`	Warmup and measured iteration counts. Defaults: pure-core `1` / `3`, spark-path `1` / `1`.
`--lightweight-only` / `--heavyweight-only`	Run a single tier (mutually exclusive). `--lightweight-only` is required on ARM clusters (heavyweight is x86-only); `--heavyweight-only` skips the lightweight leg. Default: both tiers run.
`--no-wait`	Submit the job without blocking on completion. Default: waits for the run to finish.

# ARM cluster: lightweight only
bash scripts/commands/gbx-bench-cluster.sh --cluster-id <arm-cluster-id> --lightweight-only

Tile sizes, band counts, data types, and projections are set when the corpus is generated (see the local section's scale knobs); the same corpus is reused for both tiers so the comparison is fair.

note

A spark-path iteration processes max(--row-counts) distinct tiles drawn from the corpus row pool — it does not recycle a small pool to reach the row count. The largest --row-counts value must therefore be ≤ the corpus row-pool size; if it isn't, the bench refuses to run rather than silently under-fill (which would report a row count it never actually processed). Generate a larger pool or lower the row ladder.
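
The refuse-to-under-fill rule amounts to a simple pre-flight check. The sketch below is an assumption about its shape, not the bench's actual code; `check_row_ladder` and `pool_size` are hypothetical names.

```python
def check_row_ladder(row_counts, pool_size):
    """Return the sorted ladder, or raise if the pool cannot fill the top rung.

    Hypothetical helper: mirrors the rule that the largest row count
    must be <= the corpus row-pool size.
    """
    largest = max(row_counts)
    if largest > pool_size:
        raise ValueError(
            f"largest row count {largest} exceeds corpus row pool {pool_size}; "
            "generate a larger pool or lower the row ladder"
        )
    return sorted(row_counts)


print(check_row_ladder([10, 100, 1000], pool_size=1000))
```

Failing fast here is what keeps a reported row count honest: the alternative, recycling a small pool, would time repeated tiles rather than distinct ones.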

Output

bench_results Delta table — every measured row (tagged with environment + run ID), so runs accumulate and can be queried/visualized over time.
summary.md / comparison.csv on the Volume — the human-readable speedup + consistency report (see Reading the output), also rendered inline in the run notebook.

Running locally

The local pipeline runs the heavyweight tier in the geobrix-dev Docker container and the lightweight tier in an isolated Python virtual environment, then compares them. Local runs are intentionally single-tile for pure-core (the algorithm cost in isolation); the full row ladder and at-scale spark-path numbers belong on a cluster, but can still be exercised in the local Docker environment at a modest scale.

# Full local pipeline: generate corpus -> heavyweight -> lightweight -> compare
bash scripts/commands/gbx-bench-all.sh --run-id local-1 --modes pure-core

The heavyweight and lightweight legs run one after the other (never concurrently) so they don't contend for CPU and skew each other's timings. Outputs land in test-logs/bench/<run-id>/.

Set (core vs full)

--set selects how many functions the run benchmarks:

--set core (the default) runs a small, representative set covering each function family — accessors, terrain, band math, warps. It's fast and is the right choice for a routine check.
--set full runs every benchmarked function. It takes longer but gives the complete coverage and parity picture.

note

The core here is the function set (--set) — how many functions run. Don't confuse it with pure-core, the timing model (--modes) that times one tile in isolation. They are independent: you can run --set core --modes pure-core, --set full --modes spark-path, or any other combination.

An explicit --functions list overrides --set.

# Routine check (default): the representative core set
bash scripts/commands/gbx-bench-all.sh --modes pure-core

# Complete coverage: every benchmarked function
bash scripts/commands/gbx-bench-all.sh --set full --modes pure-core

Scale and shape the corpus with the options below. Each takes a comma-separated list to sweep several values (e.g. --tile-px 256,512,1024) or a single value (e.g. --tile-px 1024); the corpus is the combination across the options you set.

Option	Purpose
`--tile-px 256,512,1024,2048`	Tile sizes (pixels per side) — any size; larger tiles (1024², 2048², …) make the per-tile algorithm cost dominate the fixed overhead. Default `256,512`.
`--bands 1,4`	Band counts — any positive integer count. Default `2`.
`--dtypes uint8,int16,float32`	Pixel data types — these three are the full supported set (closed; other dtypes are not generated). Default `float32`.
`--srids 4326,3857,32618,27700`	Projections — the full supported set (closed): `4326` (WGS84 geographic), `3857` (WebMercator), `32618` (UTM 18N), `27700` (British National Grid); other SRIDs are not generated. Default `4326,32618`. Pick at least one geographic (lat/long) and one projected (metre) CRS to exercise both.
`--nodata-frac 0.0,0.25`	Fraction of pixels set to NoData — any value in `0.0`–`1.0` (0% to 100%; continuous, not a fixed set). Default `0.0`. Pass a comma-separated list to sweep several fractions.
`--row-counts 10,100,1000,10000`	Spark-path row ladder — each value is the number of distinct tiles processed in one timed iteration (the scale dimension); the bench measures one iteration per rung. The largest value must be ≤ the corpus row-pool size (the bench refuses to under-fill). Default `2,4` (laptop-modest); run the full ladder on a cluster.
`--modes pure-core \| spark-path \| both`	Timing model — pick only one mode. Default `both` (runs both models).
`--warmup` / `--measured`	Untimed warmup passes / timed passes per measurement (median of the timed passes is reported). Defaults `2` / `5` locally.
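
To see how the corpus options multiply, the sketch below enumerates tile shapes for the documented defaults. It assumes the generator takes the full Cartesian product of the option lists, which is one reading of "the combination across the options you set"; the dictionary keys are illustrative names, not GeoBrix identifiers.

```python
from itertools import product

# Documented local defaults for each corpus option.
corpus_options = {
    "tile_px": [256, 512],
    "bands": [2],
    "dtype": ["float32"],
    "srid": [4326, 32618],
    "nodata_frac": [0.0],
}

# One dict per tile shape, assuming a full Cartesian product (an assumption).
shapes = [
    dict(zip(corpus_options, combo))
    for combo in product(*corpus_options.values())
]

print(len(shapes))  # 2 tile sizes x 2 SRIDs = 4 shapes
for shape in shapes:
    print(shape)
```

Under that assumption, every value added to one option multiplies the number of tile shapes, and with it the pure-core run time, since pure-core repeats each measurement once per shape.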

The individual stages are also available as standalone commands — gbx:bench:gen-data, gbx:bench:heavyweight, gbx:bench:lightweight, and gbx:bench:compare — if you want to regenerate just one part of the pipeline.

Reading the output

The comparison summary.md opens with insights (biggest wins, the consistency tally) followed by a per-function table. For a spark-path run it looks like:

Consistency (7 compared cells): exact 0 - within-tol 7 - divergent 0

Tile scale: 1000 tiles/iteration (spark-path) — every tile processed each timed iteration.

| fn | hw_iter_s | lw_iter_s | hw_per_tile_s | lw_per_tile_s | speedup | consistency |
| rst_dtmfromgeoms_agg | 175.10 | 9.94 | 0.17510 | 0.00994 | 17.61 | within_tol |
| rst_slope | 7.31 | 7.57 | 0.00731 | 0.00757 | 0.97 | timing-only |
| rst_proximity | 0.52 | 8.12 | 0.00052 | 0.00812 | 0.06 | timing-only |

hw_iter_s / lw_iter_s are the median wall-clock of one full iteration over all N tiles (the whole distributed job). hw_per_tile_s / lw_per_tile_s are that ÷ N — the amortized per-tile cost. (A pure-core summary uses hw_ms / lw_ms and hw_mpix/s / lw_mpix/s for the single-tile algorithm cost instead.)
speedup is heavy / light — greater than 1 means the lightweight tier is faster.
consistency is the exact / within_tol / divergent label defined above (timing-only where the output can't be fingerprinted for a direct comparison — readers, metadata accessors, and most non-aggregator spark-path cells). Per-cell deltas are in comparison.csv (the max_rel_delta column).

A per-engine summary is also written for each tier (heavyweight.summary.md, lightweight.summary.md) with that tier's own timing and throughput in isolation.
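
The derived columns follow directly from the two iteration times. As a minimal sketch (the function name `summarize` is hypothetical), using the rst_slope and rst_proximity rows from the example table above:

```python
def summarize(hw_iter_s, lw_iter_s, n_tiles):
    """Derive per-tile cost and speedup from whole-iteration medians."""
    speedup = hw_iter_s / lw_iter_s  # heavy / light
    return {
        "hw_per_tile_s": hw_iter_s / n_tiles,
        "lw_per_tile_s": lw_iter_s / n_tiles,
        "speedup": speedup,
        "faster": "lightweight" if speedup > 1 else "heavyweight",
    }


slope = summarize(7.31, 7.57, n_tiles=1000)
proximity = summarize(0.52, 8.12, n_tiles=1000)
print(round(slope["speedup"], 2), slope["faster"])
print(round(proximity["speedup"], 2), proximity["faster"])
```

Recomputing speedup from the rounded table values can differ from the published figure in the last digit, because the report is presumably computed from unrounded timings.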

Results

Representative results across the benchmark families, organized by function group.

Raster readers & writers

The raster reader/writer is benchmarked the same way on both tiers: read 1,000 × 1024² GeoTIFF tiles from a UC Volume (count), and write the same 1,000 tiles back out as GeoTIFF. Lightweight uses raster_gbx / gtiff_gbx (rasterio); heavyweight uses gdal / gtiff_gdal (native GDAL in the JVM). Run with gbx:bench:cluster --readers-only --spark-warmup 0 --spark-measured 1.

Environment: Databricks · DBR 17.3 LTS · 1,000 × 1024² GeoTIFF tiles · whole-job wall-clock median, one measured iteration (I/O, not JIT-sensitive — no warm-up).

Operation	rows	light	heavy	ratio
Read GeoTIFF (`raster_gbx` / `gdal`)	1,000	22.7 s	9.5 s	2.4× (heavy faster)
Write GeoTIFF (`gtiff_gbx` / `gtiff_gdal`)	1,000	4.7 s	4.0 s	1.18× (≈ parity)

Both tiers emit the same tile schema, so they are drop-in swaps. The writer runs at parity (~1.2×): the GeoTIFF encode dominates and costs about the same whether it runs in the JVM or via rasterio. The lightweight reader is ~2.4× slower: the cost is moving each decoded tile's bytes across the Spark Python DataSource boundary (JVM↔Python ser/de), not the decode itself. That boundary is the structural tax of the pure-Python tier, and it is the trade-off for running where the heavyweight tier can't (Serverless, ARM, no JAR / init script). At scale the read still parallelizes across the cluster (~23 ms/tile here, spread over the workers).

Tiled output (PMTiles writer)

The PMTiles writer is benchmarked separately from the rst_* functions: it is a write that packages a tile pyramid into a .pmtiles archive, not a per-tile transform. The lightweight pmtiles_gbx writer and the heavyweight pmtiles writer take the same (z, x, y, bytes) input; the benchmark writes 1,000 tiles to a single archive on each tier and verifies decoded-tile parity between the two outputs. Run it with gbx:bench:cluster --pmtiles-only --row-counts 1000.

Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic PNG tiles → one .pmtiles archive per tier · whole-job wall-clock median.

Writer	Tier	Whole-job median (1,000 tiles)	Speedup
`pmtiles_gbx`	lightweight	18.4 s	—
`pmtiles`	heavyweight	20.7 s	1.13× (lightweight faster)

Parity: the two archives decode to the same 1,000 (z, x, y) tiles with byte-identical tile data. This is a hard gate — the benchmark fails if the tiers diverge — so the timing above is a like-for-like comparison of identical output.

Both writers are two-phase (executors write intermediates, the driver assembles the archive), so the benchmark stages intermediates on a shared filesystem rather than node-local disk; the lightweight writer streams its per-partition scratch and assembles on the driver, the same model the sharded/mosaic output uses at scale.

Tiled output aggregate (`pmtiles_agg`)

pmtiles_agg is the grouped-aggregate companion to the PMTiles writer: it folds a group of (tile, z, x, y) rows into one PMTiles archive per group key inside a groupBy(...).agg(pmtiles_agg(...)) job. It is format-agnostic — the same function archives raster or vector tiles — and is registered from both lightweight tiers (pyrx and pyvx), reusing the same archive assembler as the PMTiles writer. The benchmark folds 1,000 tiles into a single archive on each tier and verifies that every group emits a non-empty archive before timing. Run it with gbx:bench:cluster --pmtiles-agg-only --row-counts 1000.

Environment: Databricks · DBR 17.3 LTS · 1,000 synthetic tiles → one .pmtiles archive per tier · groupBy().agg() whole-job wall-clock · 5 measured iterations, median reported.

Aggregator	Tier	Whole-job median (1,000 tiles)	Per-tile	Speedup
`pmtiles_agg`	lightweight	0.74 s	0.74 ms	~1.07× (≈ parity)
`pmtiles_agg`	heavyweight	0.70 s	0.70 ms	—

The lightweight median is 1.07× the heavyweight median (0.742 s vs 0.695 s). The distributions overlap (the lightweight minimum of 0.654 s sits below the heavyweight p90 of 0.758 s), so the gap is inside the run-to-run noise band for a grouped aggregate of this size, and the two tiers are effectively at parity. Both produce spec-valid PMTiles v3 archives that decode to the same tile set, but the archives are not byte-identical: the lightweight writer GZIP-compresses the internal directories while the heavyweight writer leaves them uncompressed (none). Decoded-tile parity (light == heavy) is checked separately by the JAR-gated test/ds/test_pmtiles_agg_parity.py. The lightweight aggregator is a grouped-aggregate Arrow UDF (pandas_udf) that folds each group's tiles into an in-memory archive, the same execution shape as st_asmvt.

Vector readers

The vector readers are benchmarked on a scaled corpus: 5 × 1,000,000-feature files (≈5M polygons) per format, read from a UC Volume into a Delta table — the realistic "ingest vector files into Delta" pipeline. Each format runs light (*_gbx) vs heavy (*_ogr) as its own isolated job, one measured iteration (I/O is not JIT-sensitive). Cluster: DBR 17.3 LTS. Run with gbx:bench:cluster --vector-only --vector-scale.

Format	rows	light (`*_gbx`)	heavy (`*_ogr`)	light edge
GeoJSON — FeatureCollection (`multi=false`)	5M	29.6 s	130.6 s	4.4×
GeoJSON — GeoJSONSeq (`multi=true`)	5M	31.9 s	113.9 s	3.6×
Shapefile	5M	27.8 s	26.3 s	≈ parity
GeoPackage	5M	23.5 s	62.1 s	2.6×
FileGDB	1M	13.8 s	—	light-only on Volumes¹

The lightweight readers are Arrow-native (columnar pyogrio batches — no per-row Python) and stage GeoPackage/FileGDB to worker-local temp for random-access reads; they win or tie every format. GeoJSON is read one partition per file in both tiers (the GeoJSON driver re-parses the whole FeatureCollection on each open, so feature-offset chunking is counterproductive); heavy GeoJSON's per-feature OGR→row construction is what puts it behind.

¹ FileGDB on a Volume: the heavy OGR FileGDB reader opens a native .gdb/.gdb.zip from local/cluster storage but does not read one directly from a UC Volume (FileGDB's seek-heavy, multi-file I/O does not perform well over object storage); the lightweight reader copies it from the Volume to worker-local storage and reads it there. So FileGDB reads are light-only at Volume/directory scale (a single 1M-feature archive is shown).

Parity: for each format both tiers ingest the same files and the row counts must match (checked inline; the run flags any divergence).

Vector writers

The single-file vector writers are lightweight-only: their OGR write paths aren't implemented in the heavyweight tier. The sharded GeoJSONL writer is available in both tiers (as the table shows). Each run writes a 14,000,000-feature Delta table; the single-file writers produce one output file (the two-phase writer merges the partition fragments on the driver), while the sharded writer produces one shard per partition. Run with gbx:bench:cluster --vector-only --vector-scale --vector-legs writer --writer-rows 14000000.

Format	rows	light (`*_gbx`)	throughput
GeoJSON (single file)	14M	279.1 s	~50k rows/s
Shapefile (single file)	14M	152.6 s	~92k rows/s
GeoPackage (single file)	14M	155.9 s	~90k rows/s
FileGDB (single file)	14M	368.3 s	~38k rows/s
GeoJSONL (sharded) — light `geojsonl_gbx` / heavy `geojsonl`	14M	16.4 s / 16.1 s (80 shards)	~855k / ~870k rows/s

The single-file writers are driver-bound: the partition fragments are merged into one file on the driver. FileGDB is slowest (native osgeo, per-feature writes) and GeoJSON is next (text encoding). The sharded GeoJSONL writer skips the merge entirely (one newline-delimited shard per partition, written in parallel), so it is ~17× faster than the single-file GeoJSON writer at 14M (~16 s vs 279 s) and scales with partitions. It is the only vector writer available in both tiers (the others are lightweight-only), and the two tiers run at parity (16.4 s light vs 16.1 s heavy). It is the recommended writer for large outputs and works at any scale; maxRecordsPerFile caps features per shard.

Format capacity: the limit is the format's, not GeoBrix's. 14M box polygons put Shapefile near its 2 GB .shp ceiling; the other formats carry them easily and are specified for far larger volumes:

Format	Ceiling	Cause
Shapefile	2 GB per `.shp`/`.dbf` (~15.8M box polygons)	32-bit record offsets
GeoPackage	~17.6 TB (billions of rows)	SQLite `page_size × max_page_count`
FileGDB	2.1 B rows / 1 TB+ per feature class	OBJECTID is signed int32
GeoJSON	none (RFC 7946)	text; bounded only by disk / parse memory
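
The ~15.8M figure for Shapefile can be reproduced from the published .shp record layout. The sketch below assumes a single-ring box polygon with 5 vertices (the closing vertex repeats the first) and a ceiling of 2^31 bytes; whether the figure in the table was derived exactly this way is an assumption.

```python
SHP_FILE_HEADER_BYTES = 100   # fixed main-file header
RECORD_HEADER_BYTES = 8       # record number + content length
SHP_CEILING_BYTES = 2**31     # assumed 2 GiB ceiling


def polygon_record_bytes(num_parts, num_points):
    """Size of one .shp Polygon record, header included."""
    content = (
        4                   # shape type
        + 32                # bounding box: 4 doubles
        + 4                 # NumParts
        + 4                 # NumPoints
        + 4 * num_parts     # part start indices
        + 16 * num_points   # X, Y doubles per point
    )
    return RECORD_HEADER_BYTES + content


box_bytes = polygon_record_bytes(num_parts=1, num_points=5)
max_boxes = (SHP_CEILING_BYTES - SHP_FILE_HEADER_BYTES) // box_bytes
print(box_bytes, max_boxes)
```

Polygons with more vertices or more rings take more bytes per record, so real-world datasets reach the ceiling at lower feature counts than box polygons do.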

Mode coverage (of the 107 RasterX functions)

Not every function appears in both timing models — the gaps are deliberate measurement choices, not functional gaps (the lightweight tier implements all 107):

Pure-core: 100 / 107. The 7 absent are the grouped aggregators (rst_*_agg) — a UDAF has no single-tile, single-row form to time in isolation, so they appear only in spark-path.
Spark-path: 83 / 107. The 24 absent are pure-core-only because their call shape doesn't fit the spark-path tile-DataFrame model: geometry-input constructors (rst_rasterize, rst_gridfrompoints, rst_dtmfromgeoms — the tile DataFrame carries no geometry column), the path reader (rst_fromfile), functions needing an in-extent coordinate/geometry literal valid across the multi-CRS corpus (rst_clip, rst_sample, rst_viewshed, rst_worldtorastercoord*, rst_resample_to_res), render-engine-divergent tiles (rst_tilexyz, rst_xyzpyramid, rst_color_relief), and metadata/scalar accessors (rst_metadata, rst_type, rst_summary, …) whose spark-path cell would be an uncomparable timing-only result.

Pure-core (local, 1024²)

The pure-core table is the algorithm in isolation: open one tile, call the function, measure — no Spark, no serialization. It is the fairest view of the raw implementation. It is one snapshot — absolute timings are environment-dependent — but the relative picture and the consistency outcome are stable. Per-tile timing is shown in milliseconds; cells whose output cannot be fingerprinted for a direct comparison (readers, metadata accessors) are timed and labelled timing-only.

Environment: amd64 · Linux · GeoBrix 0.4.0 · GDAL 3.11.4 · 2-band float32 tiles at 1024² (EPSG:32618) with 2% NoData. All rows are pure-core (single tile).

Consistency: of 100 functions — 26 exact, 47 within-tolerance, 23 timing-only (no fingerprint), and 4 divergent. The four divergences are all NoData / edge handling, where the bulk of the raster matches but the boundary or masked pixels differ:

rst_convolve — a one-pixel edge ring (GDAL applies a block-halo convolution that no single lightweight boundary mode reproduces exactly); interior pixels match.

rst_derivedband and rst_resample — differ in how NoData is masked/propagated through the operation; valid-pixel values agree within tolerance.

rst_contour — a vector output whose feature count differs (a segmentation artifact at NoData boundaries), not a pixel value.

These boundary behaviors are tracked for alignment to the heavyweight (GDAL) semantics; they do not affect interior/valid-pixel results.

The table is ordered by Speedup = Heavy / Light (highest first). > 1× means the lightweight tier is faster; < 1× means the heavyweight tier is faster.

Function	Heavy (ms)	Light (ms)	Speedup	Consistency
`rst_fromcontent`	114	0.0002	684,479×	within_tol
`rst_threshold`	718	5.63	127×	timing-only
`rst_index`	762	7.00	109×	within_tol
`rst_ndwi`	747	7.33	102×	within_tol
`rst_savi`	769	7.55	102×	within_tol
`rst_nbr`	743	7.43	100×	within_tol
`rst_ndvi`	768	7.77	99×	within_tol
`rst_dtmfromgeoms`	4915	58.4	84×	within_tol
`rst_evi`	771	9.33	83×	within_tol
`rst_mapalgebra`	735	11.2	66×	within_tol
`rst_separatebands`	220	4.53	49×	within_tol
`rst_convolve`	788	21.4	37×	divergent
`rst_initnodata`	159	4.32	37×	within_tol
`rst_fillnodata`	417	12.9	32×	within_tol
`rst_setsrid`	133	4.48	30×	within_tol
`rst_band`	76.9	2.76	28×	within_tol
`rst_maketiles`	227	9.78	23×	within_tol
`rst_updatetype`	162	7.54	22×	within_tol
`rst_asformat`	94.5	4.47	21×	within_tol
`rst_frombands`	184	8.81	21×	within_tol
`rst_derivedband`	108	6.15	18×	divergent
`rst_retile`	280	17.9	16×	within_tol
`rst_rasterize`	49.3	3.22	15×	within_tol
`rst_h3_tessellate`	495	35.3	14×	exact
`rst_tooverlappingtiles`	423	31.7	13×	within_tol
`rst_resample`	659	54.8	12×	divergent
`rst_resample_to_size`	65.3	5.83	11×	within_tol
`rst_pixelcount`	47.9	4.53	11×	exact
`rst_avg`	46.5	4.75	9.80×	within_tol
`rst_median`	168	17.1	9.79×	within_tol
`rst_clip`	34.3	3.55	9.66×	timing-only
`rst_merge`	309	32.0	9.65×	within_tol
`rst_buildoverviews`	174	18.1	9.61×	within_tol
`rst_combineavg`	220	27.5	8.00×	within_tol
`rst_quadbin_rastertogridmedian`	647	94.3	6.86×	within_tol
`rst_tryopen`	1.54	0.230	6.67×	exact
`rst_h3_rastertogridmax`	7370	1241	5.94×	exact
`rst_h3_rastertogridavg`	7356	1243	5.92×	within_tol
`rst_h3_rastertogridcount`	7312	1239	5.90×	within_tol
`rst_h3_rastertogridmin`	7320	1244	5.88×	within_tol
`rst_transform`	181	31.1	5.82×	within_tol
`rst_h3_rastertogridmedian`	7627	1376	5.54×	exact
`rst_tilexyz`	12.7	2.40	5.27×	timing-only
`rst_resample_to_res`	254	53.9	4.71×	timing-only
`rst_slope`	86.1	24.0	3.59×	within_tol
`rst_tpi`	57.7	16.7	3.45×	within_tol
`rst_roughness`	61.0	18.3	3.33×	within_tol
`rst_to_webmercator`	201	69.0	2.91×	within_tol
`rst_hillshade`	62.6	21.7	2.89×	within_tol
`rst_tri`	58.3	23.4	2.50×	within_tol
`rst_filter`	891	435	2.05×	within_tol
`rst_aspect`	88.5	44.6	1.98×	within_tol
`rst_quadbin_rastertogridmax`	152	79.9	1.90×	within_tol
`rst_quadbin_rastertogridavg`	154	82.9	1.85×	within_tol
`rst_quadbin_rastertogridcount`	137	78.2	1.75×	within_tol
`rst_quadbin_rastertogridmin`	144	84.1	1.71×	within_tol
`rst_max`	7.65	4.56	1.68×	exact
`rst_min`	7.60	4.59	1.66×	exact
`rst_xyzpyramid`	204	141	1.45×	timing-only
`rst_proximity`	74.4	56.0	1.33×	within_tol
heavyweight faster below (< 1×)	━━━	━━━	━━━	━━━
`rst_getsubdataset`	0.232	0.237	0.98×	timing-only
`rst_boundingbox`	0.234	0.242	0.97×	timing-only
`rst_color_relief`	16.3	22.4	0.73×	timing-only
`rst_metadata`	0.145	0.241	0.60×	timing-only
`rst_rastertoworldcoord`	0.142	0.249	0.57×	exact
`rst_srid`	0.133	0.243	0.55×	exact
`rst_scalex`	0.087	0.223	0.39×	exact
`rst_gridfrompoints`	87.5	242	0.36×	within_tol
`rst_scaley`	0.079	0.224	0.35×	exact
`rst_pixelheight`	0.079	0.228	0.35×	exact
`rst_type`	0.076	0.225	0.34×	timing-only
`rst_getnodata`	0.058	0.223	0.26×	exact
`rst_contour`	362	1414	0.26×	divergent
`rst_isempty`	0.054	0.224	0.24×	exact
`rst_polygonize`	152	640	0.24×	within_tol
`rst_height`	0.051	0.236	0.22×	exact
`rst_numbands`	0.050	0.232	0.22×	exact
`rst_width`	0.047	0.252	0.19×	exact
`rst_upperleftx`	0.040	0.226	0.18×	exact
`rst_upperlefty`	0.039	0.224	0.17×	exact
`rst_rastertoworldcoordx`	0.041	0.242	0.17×	exact
`rst_rastertoworldcoordy`	0.038	0.236	0.16×	exact
`rst_skewy`	0.034	0.224	0.15×	exact
`rst_pixelwidth`	0.035	0.234	0.15×	exact
`rst_format`	0.029	0.225	0.13×	exact
`rst_skewx`	0.029	0.225	0.13×	exact
`rst_rotation`	0.028	0.226	0.13×	exact
`rst_bandmetadata`	0.025	0.224	0.11×	timing-only
`rst_georeference`	0.024	0.222	0.11×	timing-only
`rst_subdatasets`	0.024	0.244	0.10×	timing-only
`rst_summary`	0.478	5.89	0.08×	timing-only
`rst_worldtorastercoordx`	0.019	0.247	0.08×	timing-only
`rst_worldtorastercoord`	0.015	0.248	0.06×	timing-only
`rst_worldtorastercoordy`	0.014	0.242	0.06×	timing-only
`rst_sample`	0.011	0.276	0.04×	timing-only
`rst_histogram`	0.308	16.0	0.02×	timing-only
`rst_memsize`	0.019	1.67	0.01×	timing-only
`rst_viewshed`	16.7	4047	0.004×	timing-only
`rst_cog_convert`	—*	342	—	timing-only
`rst_fromfile`	—*	0.732	—	timing-only

* rst_cog_convert and rst_fromfile are not measured on the local heavyweight tier (a container-specific GDAL driver quirk returns immediately). They run normally on a cluster; see rst_cog_convert in the spark-path table below.

How to read the pure-core numbers

Band math (rst_ndvi/ndwi/nbr/evi/savi/index/mapalgebra/threshold) shows the largest lightweight wins (roughly 65–130×). The heavyweight tier computes these by shelling out to a gdal_calc subprocess per call, which dominates its time; the lightweight tier evaluates the expression in-process with NumPy.
Tiling and warps (rst_maketiles/retile/tooverlappingtiles/merge/transform) favor the lightweight tier by roughly 6–23×. The retile family reads only the output window per tile rather than translating the whole raster, so it stays fast even at 1024².
Terrain (rst_slope/aspect/hillshade/tri/tpi/roughness) is a steadier ~2–3.6× lightweight win, all within tolerance — including on geographic rasters, where both tiers auto-derive the horizontal scale from the CRS.
Discrete-grid aggregation (rst_h3_*/rst_quadbin_*) is a ~1.7–7× lightweight win (rst_h3_tessellate is higher, at 14×), and every cell is exact or within tolerance.
Trivial metadata accessors (rst_width/height/numbands and the world↔raster coordinate helpers) are microseconds on both tiers; here the JVM-native heavyweight path edges out the Python-worker call, so speedup dips below 1×. At this scale the absolute difference is a fraction of a millisecond.
Algorithm-bound outliers (rst_viewshed, rst_contour, and rst_polygonize) are slower on the lightweight tier (the heavyweight GDAL implementations are hard to beat). rst_viewshed in particular is xrspatial/Numba-based; its first call also pays a one-time JIT compile cost (the figures above are warmed). On consistency, rst_polygonize is within tolerance, rst_viewshed is timing-only, and rst_contour is divergent only in the feature count at NoData boundaries, as noted above. These are the cases where the heavyweight tier is the better choice.

Spark-path at scale (cluster, 1000 tiles, 1024²)

Pure-core measures the algorithm; spark-path measures the job. Each function runs as a registered rst_* UDF over a DataFrame of 1,000 distinct 1024² tiles on a cluster, so the timing includes the realistic per-tile overhead — UDF dispatch, and for the lightweight tier the JVM↔Python serialization of tile bytes on every row.

Environment: Databricks · DBR 17.3 LTS · 2-band float32 tiles at 1024² (EPSG:32618) · 1,000 tiles per iteration (every tile processed each iteration, not sampled). Heavy/tile and Light/tile are the whole-job wall-clock amortized over the 1,000 tiles. (At 1,000 tiles, the per-tile figure in ms equals the whole-iteration time in seconds.)

Of the 83 functions with a spark-path cell, the lightweight tier is at least as fast in 57 and the heavyweight tier is faster in 26. Only the 7 aggregators produce a comparable fingerprint here (all exact or within tolerance); the rest are timing-only in spark-path.

The headline: the pure-core wins do not all survive at scale. Operations that win roughly 2–6× in isolation, such as terrain (rst_slope/aspect/tri/tpi) and the H3/QuadBin grids, collapse to parity or slightly heavyweight-favored once every tile's bytes cross the Python boundary. The serialization cost, not the compute, dominates byte-heavy operations at scale. Band math and the largest reductions still favor the lightweight tier (~2–4×) because their compute saving outweighs the boundary tax; algorithm-bound GDAL operations (rst_proximity, rst_contour, rst_polygonize, rst_gridfrompoints) favor the heavyweight tier by 5–16×.

Ordered by Speedup = Heavy / Light (highest first).

Function	Heavy/tile (ms)	Light/tile (ms)	Speedup	Consistency
`rst_dtmfromgeoms_agg`	175.10	9.94	18×	within_tol
`rst_fillnodata`	32.71	6.98	4.69×	timing-only
`rst_convolve`	35.98	7.87	4.57×	timing-only
`rst_resample`	45.82	10.92	4.20×	timing-only
`rst_h3_tessellate`	45.59	10.99	4.15×	timing-only
`rst_evi`	24.31	6.19	3.93×	timing-only
`rst_nbr`	22.59	5.96	3.79×	timing-only
`rst_mapalgebra`	20.90	5.67	3.68×	timing-only
`rst_index`	22.04	6.02	3.66×	timing-only
`rst_ndwi`	21.98	6.08	3.62×	timing-only
`rst_savi`	21.55	6.24	3.45×	timing-only
`rst_ndvi`	22.31	6.55	3.40×	timing-only
`rst_threshold`	19.28	6.29	3.07×	timing-only
`rst_median`	9.73	3.24	3.01×	timing-only
`rst_separatebands`	15.04	6.11	2.46×	timing-only
`rst_transform`	18.76	7.84	2.39×	timing-only
`rst_updatetype`	16.78	7.04	2.38×	timing-only
`rst_quadbin_rastertogridmedian`	27.60	11.90	2.32×	timing-only
`rst_buildoverviews`	15.60	7.02	2.22×	timing-only
`rst_avg`	7.06	3.25	2.17×	timing-only
`rst_initnodata`	13.39	6.24	2.15×	timing-only
`rst_setsrid`	13.15	6.17	2.13×	timing-only
`rst_pixelcount`	6.71	3.19	2.10×	timing-only
`rst_to_webmercator`	20.83	9.91	2.10×	timing-only
`rst_tooverlappingtiles`	22.57	10.75	2.10×	timing-only
`rst_retile`	15.66	8.53	1.84×	timing-only
`rst_filter`	51.09	29.78	1.72×	timing-only
`rst_isempty`	4.58	2.71	1.69×	timing-only
`rst_skewy`	4.08	2.43	1.68×	timing-only
`rst_frombands_agg`	12.63	7.52	1.68×	within_tol
`rst_format`	4.07	2.42	1.68×	timing-only
`rst_combineavg_agg`	16.69	9.96	1.68×	within_tol
`rst_skewx`	4.04	2.44	1.66×	timing-only
`rst_rastertoworldcoordx`	4.08	2.48	1.65×	timing-only
`rst_upperlefty`	4.15	2.52	1.64×	timing-only
`rst_width`	5.14	3.14	1.64×	timing-only
`rst_pixelheight`	4.23	2.58	1.64×	timing-only
`rst_derivedband`	10.27	6.28	1.63×	timing-only
`rst_upperleftx`	4.10	2.57	1.59×	timing-only
`rst_scalex`	3.98	2.50	1.59×	timing-only
`rst_rastertoworldcoordy`	4.07	2.57	1.58×	timing-only
`rst_numbands`	4.11	2.61	1.57×	timing-only
`rst_getnodata`	3.94	2.52	1.56×	timing-only
`rst_rotation`	4.09	2.63	1.56×	timing-only
`rst_pixelwidth`	4.16	2.73	1.52×	timing-only
`rst_scaley`	4.00	2.63	1.52×	timing-only
`rst_maketiles`	10.10	6.74	1.50×	timing-only
`rst_quadbin_rastertogridmax`	17.02	11.41	1.49×	timing-only
`rst_band`	8.56	5.81	1.47×	timing-only
`rst_max`	4.51	3.06	1.47×	timing-only
`rst_min`	4.47	3.04	1.47×	timing-only
`rst_height`	4.56	3.10	1.47×	timing-only
`rst_srid`	4.12	2.83	1.46×	timing-only
`rst_merge_agg`	23.13	16.48	1.40×	within_tol
`rst_resample_to_size`	7.30	5.34	1.37×	timing-only
`rst_derivedband_agg`	9.43	7.61	1.24×	within_tol
`rst_aspect`	7.66	7.60	1.01×	timing-only
heavyweight faster below (< 1×)	━━━	━━━	━━━	━━━
`rst_roughness`	7.42	7.46	0.99×	timing-only
`rst_slope`	7.31	7.57	0.97×	timing-only
`rst_quadbin_rastertogridcount`	10.73	11.14	0.96×	timing-only
`rst_quadbin_rastertogridmin`	11.17	11.84	0.94×	timing-only
`rst_tpi`	6.87	7.40	0.93×	timing-only
`rst_tri`	7.05	7.70	0.92×	timing-only
`rst_quadbin_rastertogridavg`	10.68	11.92	0.90×	timing-only
`rst_rastertoworldcoord`	4.17	4.99	0.84×	timing-only
`rst_hillshade`	5.79	7.02	0.82×	timing-only
`rst_h3_rastertogridmedian`	67.02	81.90	0.82×	timing-only
`rst_h3_rastertogridavg`	56.72	77.22	0.73×	timing-only
`rst_h3_rastertogridmax`	57.06	78.56	0.73×	timing-only
`rst_tryopen`	3.99	5.59	0.71×	timing-only
`rst_h3_rastertogridmin`	56.51	79.28	0.71×	timing-only
`rst_h3_rastertogridcount`	55.34	79.39	0.70×	timing-only
`rst_fromcontent`	4.03	5.90	0.68×	timing-only
`rst_h3_rasterize_agg`¹	1.50	2.26	0.66×	exact
`rst_asformat`	4.00	6.04	0.66×	timing-only
`rst_frombands`	2.31	4.94	0.47×	timing-only
`rst_cog_convert`	4.23	11.13	0.38×	timing-only
`rst_rasterize_agg`	2.00	6.16	0.32×	within_tol
`rst_merge`	2.30	8.42	0.27×	timing-only
`rst_combineavg`	1.91	7.73	0.25×	timing-only
`rst_polygonize`	10.90	54.88	0.20×	timing-only
`rst_contour`	13.91	127.62	0.11×	timing-only
`rst_gridfrompoints_agg`	2.66	33.75	0.08×	within_tol
`rst_proximity`	0.52	8.12	0.06×	timing-only

¹ rst_h3_rasterize_agg is a cell→raster aggregator with a different workload than the 1024² rows: each of the 1,000 groups burns a fixed 331-cell H3 set (resolution 9, presence mask) onto one small 39×24 canvas, so its per-tile figure reflects that small output rather than a 1024² tile. Read the cross-tier ratio (heavyweight ~1.5× faster) and exact parity as the comparable result, not the absolute ms against the 1024² rows. Measured on DBR 18.x at the same fixed 20-worker cluster (the 1024² rows above are DBR 17.3 LTS).
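As a quick check on how the Speedup column is derived, the sketch below recomputes it from the per-tile medians for a few rows copied from the table. The helper names `speedup` and `favored_tier` are illustrative and are not part of the benchmark harness.

```python
# Per-tile medians (heavy ms, light ms) copied from a few rows of the table above.
ROWS = {
    "rst_fillnodata": (32.71, 6.98),
    "rst_aspect": (7.66, 7.60),
    "rst_slope": (7.31, 7.57),
    "rst_proximity": (0.52, 8.12),
}


def speedup(heavy_ms: float, light_ms: float) -> float:
    """Speedup = Heavy / Light; above 1.0 means the lightweight tier is faster."""
    return heavy_ms / light_ms


def favored_tier(ratio: float) -> str:
    """Hypothetical label for the tier a ratio favors."""
    if ratio > 1.0:
        return "light"
    if ratio < 1.0:
        return "heavy"
    return "tie"


# Print rows highest speedup first, matching the table's ordering.
for name, (heavy, light) in sorted(ROWS.items(), key=lambda kv: -speedup(*kv[1])):
    ratio = speedup(heavy, light)
    print(f"{name}: {ratio:.2f}x ({favored_tier(ratio)})")
```

The published ratios were presumably computed from unrounded timings, so a value recomputed from the rounded table cells can differ in the last digit.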

Repartition strategy

The spark-path numbers above are wall-clock for a whole distributed job, so they're sensitive to how evenly the 1,000 tiles spread across the cluster's task slots. The benchmark harness tunes the partitioning so the comparison reflects the engines, not a straggler tail:

Repartition the N-row DataFrame to ~slots × 4 partitions (oversubscribe ~4×) so a slot that finishes early picks up a pending tile instead of idling while one long tile finishes.
Set spark.sql.shuffle.partitions to match, and disable AQE for the measured run so Adaptive Query Execution can't coalesce that repartition back toward defaultParallelism and reintroduce the idle tail.
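The two steps above can be sketched roughly as follows. This is a minimal illustration, not the harness code: `target_partitions` and `apply_bench_partitioning` are hypothetical names, `spark` is assumed to be a classic (non-Serverless) SparkSession, `df` a DataFrame, and the slot count is assumed to be supplied by the caller. The two configuration keys are standard Spark SQL settings.

```python
from typing import Any


def target_partitions(slots: int, oversubscribe: int = 4) -> int:
    """Partition count of roughly slots x oversubscribe, never below 1."""
    return max(1, slots * oversubscribe)


def apply_bench_partitioning(spark: Any, df: Any, slots: int) -> Any:
    """Repartition df for a measured run (classic clusters only).

    Hypothetical helper: mutates session config, so it is not usable on
    Serverless / Spark Connect.
    """
    n = target_partitions(slots)
    # Keep shuffle partitions in step with the repartition target.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    # Disable AQE so it cannot coalesce the partitions back down.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    return df.repartition(n)
```

With 80 task slots, for example, this would target 320 partitions, so a slot that finishes early always has a pending tile to pick up.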

Serverless: no Spark-config tuning

This repartitioning is benchmark-harness tuning — it lives in the bench code that runs from a repo checkout, not in the GeoBrix product. The lightweight pyrx tier never mutates Spark configuration: it only registers UDFs and builds Column expressions, so it is safe on Serverless / Spark Connect, where runtime spark.conf.set(...) and JVM-bridge access are not permitted. On Serverless you therefore can't hand-tune partitions; you rely on the platform's managed AQE and default partitioning. Treat the tuned figures above as the controlled, like-for-like comparison number — not as what a Serverless user sets by hand.

Fan-out generators (streaming UDTFs)

These functions are streaming Python UDTFs on the lightweight tier (one input tile → many output rows via LATERAL); the benchmark times the lightweight UDTF against the equivalent heavyweight Scala generator at matched fan-out and asserts that both tiers produce the same flat row count — decoded-feature parity as a hard gate.

Environment: Databricks · DBR 17.3 LTS · 20 workers · synthetic fan-out corpus · whole-job wall-clock, 1 measured iteration. Parity gate: all functions listed passed — light and heavy produced the same flat row count. Run with gbx:bench:cluster --fanout-only.

Function	Rows (fan-out)	Light (s)	Heavy (s)	Faster tier
`rst_polygonize`	256	0.262	0.196	heavy faster
`rst_h3_rastertogridcount`	58,123	0.46	0.35	heavy faster
`rst_xyzpyramid`	78	0.774	0.674	heavy faster
`rst_maketiles`	16	0.341	0.254	heavy faster
`rst_retile`	256	0.528	0.549	light faster
`rst_separatebands`	64	0.292	0.372	light faster
`rst_tooverlappingtiles`	324	0.633	0.608	heavy ~parity

Across the converted fan-out functions, light is faster than the heavyweight JVM generators for rst_retile and rst_separatebands, near parity for rst_tooverlappingtiles, and slower for the other four, with decoded-row parity as a hard gate for every result in the table. The streaming-UDTF model amortizes the JVM↔Python boundary across the full fan-out (the boundary is paid once per input tile, not once per output row), which is why the at-scale cost difference is smaller than for scalar transforms. See the Performance page for the streaming-UDTF architecture.
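To make the one-tile-in, many-rows-out shape concrete, here is a minimal pure-Python sketch of a generator with the same `eval`-yields-rows structure that a PySpark Python UDTF class uses. `RetileSketch`, its window arithmetic, and `row_count_parity` are hypothetical illustrations, not the pyrx implementation, and UDTF registration is deliberately left out.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x_off, y_off, width, height)


class RetileSketch:
    """Hypothetical fan-out generator: one input tile -> many window rows."""

    def eval(self, width: int, height: int, tile: int) -> Iterator[Window]:
        # Yield one row per sub-tile window; rows are streamed rather than
        # collected into a list first.
        for y in range(0, height, tile):
            for x in range(0, width, tile):
                yield (x, y, min(tile, width - x), min(tile, height - y))


def row_count_parity(light_rows: int, heavy_rows: int) -> bool:
    """Parity gate on the flat row count produced by each tier."""
    return light_rows == heavy_rows


# A 1024x1024 tile cut into 64x64 windows fans out to 16 x 16 rows.
light_count = sum(1 for _ in RetileSketch().eval(1024, 1024, 64))
print(light_count, row_count_parity(light_count, 16 * 16))
```

Because the generator is invoked once per input tile, any fixed per-call cost is shared by every row that tile produces, which is the amortization the paragraph above describes.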

rst_h3_tessellate numbers are pending a cluster re-bench. The covering/centroid mode work has landed — light and heavy now agree on cell assignment for both modes (test-enforced cross-tier parity; see the H3 tessellation explainer) — and timings will be added on the next cluster run.

MVT encoding (`st_asmvt`)

The st_asmvt aggregator encodes a group of tile-local features into one Mapbox Vector Tile (MVT) protobuf blob. The benchmark groups 500 synthetic features into 10 tiles and times the groupBy("z","x","y").agg(st_asmvt(geom, attrs, "layer")) whole-job wall-clock on each tier, then decodes both tiers' tiles and asserts feature-level parity (geometry + native-typed attributes match). Run it with gbx:bench:cluster --mvt-only.

Environment: Databricks · DBR 17.3 LTS · 20 workers · 500 features → 10 tiles · whole-job wall-clock median, one measured iteration.

Tier	Function	Median (500 feat → 10 tiles)	Ratio
lightweight	`pyvx.st_asmvt`	0.589 s	~1.06× (≈ parity)
heavyweight	`vectorx.st_asmvt`	0.555 s	—

The pure-Python lightweight aggregator (a grouped-aggregate pandas UDF over mapbox-vector-tile) runs at parity with the heavyweight JVM/OGR aggregator (~6% apart). Both encode attributes with native protobuf value types and take features in tile-local [0, extent] coordinates, so the decoded tiles are feature-equivalent — a hard parity gate; the benchmark fails if the tiers diverge. Tiles compose with pmtiles_agg for end-to-end publishing.
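The feature-level gate can be pictured with a small sketch. It assumes each decoded tile has already been turned into a list of dictionaries with `geometry` and `properties` keys, and that both tiers emit features in the same order; that decoded shape and the function names are assumptions for illustration, not the benchmark's actual decoder output.

```python
from typing import Any, Dict, List

# Hypothetical decoded shape: {"geometry": ..., "properties": {...}}
Feature = Dict[str, Any]


def same_typed_value(a: Any, b: Any) -> bool:
    """Equal in value and in Python type, so 1 and 1.0 do not match."""
    return type(a) is type(b) and a == b


def features_match(light: List[Feature], heavy: List[Feature]) -> bool:
    """Feature-level parity: same count, geometry, and typed attributes."""
    if len(light) != len(heavy):
        return False
    for lf, hf in zip(light, heavy):
        if lf["geometry"] != hf["geometry"]:
            return False
        lp, hp = lf["properties"], hf["properties"]
        if lp.keys() != hp.keys():
            return False
        if not all(same_typed_value(lp[k], hp[k]) for k in lp):
            return False
    return True


light_tile = [{"geometry": [[10, 20]], "properties": {"id": 1, "name": "a"}}]
heavy_tile = [{"geometry": [[10, 20]], "properties": {"id": 1, "name": "a"}}]
float_tile = [{"geometry": [[10, 20]], "properties": {"id": 1.0, "name": "a"}}]
print(features_match(light_tile, heavy_tile), features_match(light_tile, float_tile))
```

The type check is the point of the example: an integer attribute that one tier encodes as a float would compare equal by value but would not count as native-typed parity.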

TIN & legacy migration (`st_triangulate`, `st_interpolateelevation*`, `st_legacyaswkb`)

The constrained-Delaunay TIN generators (st_triangulate, st_interpolateelevationbbox, st_interpolateelevationgeom) and the legacy-geometry migration function (st_legacyaswkb) are benchmarked light (pyvx) vs heavy (vectorx) with decoded-output parity. Light invokes the TIN generators as PySpark UDTFs (SQL LATERAL); heavy invokes the JVM generator expressions. Parity is asserted on decoded output, per the established posture: triangle count + centroid match within 1e-6 for triangulation (not triangle identity — no-Steiner constrained recovery may pick different non-constraint diagonals), per-cell surface closeness within 1e-6 for elevation interpolation, and decoded-geometry equality for legacy migration. Run it with gbx:bench:cluster --vector-tin-only.

Cluster spark-path results (DBR 17.3 LTS; median wall-clock of one full distributed iteration over N tiles; same corpus both tiers):

Function	Tiles/iter	Heavy (s)	Light (s)	Light speedup
`st_interpolateelevationgeom`	125	0.49	0.56	0.88×
`st_legacyaswkb`	1,000	0.39	0.48	0.82×
`st_interpolateelevationbbox`	245	0.40	0.53	0.75×
`st_triangulate`	220	0.41	0.70	0.60×

The heavyweight JVM/JTS tier is modestly faster across all four. The lightweight tier is competitive on legacy migration and elevation interpolation (~0.75–0.88×) and ~1.7× behind on st_triangulate: the gap is the JVM↔Python serialization of the geometry arrays crossing the UDTF boundary (the same boundary cost seen for other byte-heavy lightweight UDTFs), not the triangulation compute itself. Decoded-output parity holds across both tiers (enforced by the cross-tier parity test suite, above).
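The reason triangulation parity is stated as count plus centroid rather than triangle identity can be shown on a unit square, which has two valid triangulations depending on the diagonal chosen. The sketch below assumes the compared centroid is the area-weighted centroid of the whole TIN; that interpretation and all names here are assumptions for illustration, not the parity suite's code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def area_and_centroid(tri: Triangle) -> Tuple[float, Point]:
    """Unsigned area and vertex-average centroid of one triangle."""
    (ax, ay), (bx, by), (cx, cy) = tri
    area = abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2.0
    return area, ((ax + bx + cx) / 3.0, (ay + by + cy) / 3.0)


def tin_centroid(tris: Sequence[Triangle]) -> Point:
    """Area-weighted centroid over all triangles in the TIN."""
    total = sx = sy = 0.0
    for tri in tris:
        area, (cx, cy) = area_and_centroid(tri)
        total += area
        sx += area * cx
        sy += area * cy
    return (sx / total, sy / total)


def tin_parity(light: Sequence[Triangle], heavy: Sequence[Triangle], tol: float = 1e-6) -> bool:
    """Same triangle count and centroids within tol."""
    if len(light) != len(heavy):
        return False
    (lx, ly), (hx, hy) = tin_centroid(light), tin_centroid(heavy)
    return abs(lx - hx) <= tol and abs(ly - hy) <= tol


A, B, C, D = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
diag_ac = [(A, B, C), (A, C, D)]  # square split along A-C
diag_bd = [(A, B, D), (B, C, D)]  # same square split along B-D
print(diag_ac == diag_bd, tin_parity(diag_ac, diag_bd))
```

The two triangle lists differ, yet both cover the same surface, so a gate based on triangle identity would report a false mismatch where the count-and-centroid gate passes.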

Quadbin (`quadbin_pointascell`, `quadbin_polyfill`, `quadbin_tessellate`, `quadbin_cellunion_agg`)

The quadbin grid functions are benchmarked light (pygx) vs heavy (gridx.quadbin) with exact-output parity across four representative shapes: a scalar encode (quadbin_pointascell, lon/lat → cell), a geometry → cell-array fill (quadbin_polyfill), a geometry → cell-clip struct array (quadbin_tessellate), and a grouped aggregate (quadbin_cellunion_agg, cell ids → unioned coverage geometry). Both tiers expose the same quadbin_* SQL names, so the light tier is collected before the heavy tier re-registers. Parity is asserted on decoded output: exact cell-id equality for pointascell and polyfill, exact cells with centroid match within 1e-6 for tessellate, and decoded union-geometry equality (1e-6) per group for cellunion_agg. Run it with gbx:bench:cluster --grid-quadbin-only.

All 10 quadbin functions were benchmarked on a cluster (DBR 17.3 LTS, x86, spark-path, 1,000 tiles/iteration, 5 measured iterations, both tiers). Exact cell-set parity holds for all 10, enforced by the in-cell parity gate and by the test_parity_quadbin cross-tier suite.

On timing, these are sub-millisecond cell-math operations, so the spark-path wall-clock is dominated by per-row JVM↔Python dispatch, not by the work itself. The 5-measured-iteration medians put the lightweight pygx tier roughly at parity with the heavyweight JVM tier across all 10 (0.89×–1.33×): the small gaps are fixed UDF-boundary/dispatch cost on top of very cheap per-row cell math, not an algorithmic deficiency. (One honest caveat: at sub-millisecond per-tile work, these numbers still carry some run-to-run variance, so read the relative ordering rather than the last digit.)
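Reporting a median over several measured passes is what keeps a single slow run from skewing these sub-second figures. A minimal sketch of that warmup-then-measure pattern, using only the standard library, might look like the following; `timed_median` is a hypothetical helper, not the harness's own timer.

```python
import statistics
import time
from typing import Callable, List


def timed_median(fn: Callable[[], object], warmup: int = 1, measured: int = 5) -> float:
    """Run fn `warmup` times untimed, then return the median of
    `measured` timed runs, in seconds."""
    for _ in range(warmup):
        fn()  # absorbs cold caches and worker spin-up; not recorded
    samples: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Example workload standing in for one full iteration.
elapsed = timed_median(lambda: sum(range(100_000)), warmup=1, measured=5)
print(f"median: {elapsed:.6f} s")
```

With an odd number of measured passes the median is an actual observed run, so one outlier iteration changes the reported figure far less than it would change a mean.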

Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):

Function	Heavy (s)	Light (s)	Light speedup
`quadbin_resolution`	0.28	0.21	1.33×
`quadbin_cellunion_agg`	0.38	0.29	1.32×
`quadbin_tessellate`	0.20	0.18	1.11×
`quadbin_polyfill`	0.24	0.23	1.06×
`quadbin_kring`	0.23	0.22	1.05×
`quadbin_distance`	0.19	0.18	1.04×
`quadbin_cellunion`	0.23	0.22	1.04×
`quadbin_aswkb`	0.15	0.15	0.98×
`quadbin_pointascell`	0.21	0.22	0.96×
`quadbin_centroid`	0.17	0.19	0.89×

quadbin_cellunion_agg is a grouped aggregate over 8 groups; the other nine are 1,000-row scalar/array operations.

The implementation is scale-aware: scalar/bounded-output ops (pointascell, resolution, distance, aswkb, centroid, cellunion) use vectorized/batched pandas_udfs, while array-returning ops (kring, polyfill, tessellate) use plain row-at-a-time UDFs so a single batch never buffers many large variable-length outputs (a memory-safety choice for large geometries/zoom levels). The lightweight pygx tier's decisive advantage here is Serverless/ARM reach and no-JVM/JAR install, with timing competitive with the heavyweight tier.

BNG (all 23 `bng_*` functions)

The British National Grid functions are available in both tiers — light (pygx, a pure-Python codec port of BNG.scala) vs heavy (gridx.bng) — with the same exact-output parity contract as quadbin: exact cell-set equality for eastnorthasbng/pointascell/polyfill/kring, exact cells with chip-geometry match for tessellate, and decoded chip-geometry equality per group for the aggregates (asserted by the test_parity_bng cross-tier suite). The execution shape mirrors quadbin (numpy-vectorized scalar pandas_udfs for cell-id math, plain UDFs for array-returning ops, UDTFs for the explodes, grouped aggregates for the chip aggregators), so the timing expectation is the same: sub-millisecond STRING cell-id math dominated by the per-row JVM↔Python boundary, with WKB-geometry ops additionally carrying the geometry-bytes serialization cost. Run the BNG legs with gbx:bench:cluster --grid-bng-only.

All 23 bng_* functions were benchmarked on a cluster (DBR 17.3 LTS, spark-path, 1,000 tiles/iteration, 5 measured iterations, both tiers). Exact cell-set / decoded-geometry parity holds for all 23 — light output equals heavy output, enforced by the in-cell parity gate and the test_parity_bng cross-tier suite (all 23 parity gates passed; 0 errors).

On timing, the pure-Python pygx BNG tier runs at near-parity with the heavyweight JVM tier: across all 23 functions the lightweight tier is within ~±20% of heavy (0.82×–1.18×), and every function is sub-millisecond per tile at 1,000 tiles/iteration. Seven of the 23 are light-faster (best: bng_kring at 1.18×); the other 16 are marginally heavy-faster (largest gap: bng_cellintersection at 0.82×). These deltas sit within the run-to-run noise band for sub-second distributed jobs at this scale, so the two tiers are effectively at parity. (Note that BNG WKB carries no SRID for EPSG:27700, unlike quadbin's EWKB.)

Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):

Function	Heavy (s)	Light (s)	Light speedup
`bng_kring`	0.20	0.17	1.18×
`bng_geomkring`	0.15	0.13	1.11×
`bng_geomkringexplode`	0.48	0.44	1.08×
`bng_cellunion_agg`	0.29	0.27	1.07×
`bng_aswkt`	0.18	0.17	1.06×
`bng_centroid`	0.14	0.14	1.06×
`bng_kloop`	0.13	0.13	1.01×
`bng_cellunion`	0.14	0.15	0.97×
`bng_aswkb`	0.14	0.14	0.96×
`bng_kloopexplode`	0.26	0.27	0.96×
`bng_kringexplode`	0.26	0.27	0.96×
`bng_cellarea`	0.15	0.16	0.94×
`bng_geomkloopexplode`	0.43	0.47	0.93×
`bng_euclideandistance`	0.14	0.15	0.91×
`bng_geomkloop`	0.15	0.16	0.90×
`bng_polyfill`	0.14	0.16	0.89×
`bng_tessellateexplode`	0.42	0.47	0.89×
`bng_cellintersection_agg`	0.25	0.28	0.89×
`bng_eastnorthasbng`	0.14	0.16	0.88×
`bng_tessellate`	0.14	0.17	0.86×
`bng_pointascell`	0.15	0.17	0.86×
`bng_distance`	0.13	0.16	0.86×
`bng_cellintersection`	0.14	0.17	0.82×
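The summary counts quoted for BNG can be reproduced from the Light speedup column. The sketch below copies that column in table order and tallies it against a ±20% band; the `summarize` helper and the use of 0.20 as the band width are illustrative choices that mirror the prose, not harness settings.

```python
from typing import Dict, List

# "Light speedup" column of the BNG table above, in table order.
BNG_SPEEDUPS: List[float] = [
    1.18, 1.11, 1.08, 1.07, 1.06, 1.06, 1.01, 0.97, 0.96, 0.96, 0.96, 0.94,
    0.93, 0.91, 0.90, 0.89, 0.89, 0.89, 0.88, 0.86, 0.86, 0.86, 0.82,
]


def summarize(speedups: List[float], band: float = 0.20) -> Dict[str, object]:
    """Tally light-faster vs heavy-faster rows and check a parity band."""
    return {
        "total": len(speedups),
        "light_faster": sum(1 for s in speedups if s >= 1.0),
        "heavy_faster": sum(1 for s in speedups if s < 1.0),
        "min": min(speedups),
        "max": max(speedups),
        "all_within_band": all(abs(s - 1.0) <= band for s in speedups),
    }


print(summarize(BNG_SPEEDUPS))
```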

Custom grid (all 7 `custom_*` functions)

The custom equal-area grid functions are available in both tiers, light (pygx) and heavy (gridx.custom), with the same exact-output parity contract as quadbin and BNG: exact cell-id / cell-set equality for pointascell/polyfill/kring, decoded geometry match within 1e-6 for centroid/cellaswkb/cellaswkt, and an identical grid struct for grid (asserted by the JAR-gated cross-tier parity suite and the in-cell hard-assert gates on the cluster run). The execution shape mirrors the other grids (vectorized scalar pandas_udfs for the cell-id math, plain UDFs for the geometry/array-returning ops), so the timing expectation is the same: sub-millisecond cell math dominated by the per-row JVM↔Python boundary, with WKB/WKT-geometry ops additionally carrying the geometry-bytes serialization cost. Run the custom legs with gbx:bench:cluster --grid-custom-only.

All 7 custom_* functions were benchmarked on a cluster (DBR 17.3 LTS, x86, spark-path, 1,000 tiles/iteration, 5 measured iterations, both tiers). Exact cell-id/cell-set parity holds for all 7 — light output equals heavy output (cell ids/sets exact; geometry within 1e-6; grid struct identical), enforced by the JAR-gated cross-tier parity suite and the in-cell hard-assert gates on the cluster run.

On timing, the 5-measured-iteration medians put the pure-Python pygx custom-grid tier roughly at parity with the heavyweight Scala tier across all 7 (0.87×–1.34×): it is faster on the cheapest ops (custom_centroid 1.34×, custom_kring 1.33×) and marginally slower on a few. Absolute times are all sub-second at 1,000 tiles (sub-millisecond per tile), so the gaps are the fixed Python UDF-boundary / Arrow serialization cost on top of very cheap per-row grid math, not an algorithmic deficiency.

Cluster spark-path results (median wall-clock of one full distributed iteration over 1,000 tiles; speedup = heavy ÷ light, ≥1.0 → light faster):

Function	Heavy (s)	Light (s)	Light speedup
`custom_centroid`	0.19	0.14	1.34×
`custom_kring`	0.17	0.13	1.33×
`custom_cellaswkt`	0.18	0.15	1.20×
`custom_grid`	0.15	0.15	0.97×
`custom_pointascell`	0.16	0.18	0.89×
`custom_cellaswkb`	0.14	0.16	0.89×
`custom_polyfill`	0.13	0.15	0.87×

Caveats

Absolute timings depend on the machine, the corpus, and the tier's install; treat the relative speedup and the consistency outcome as the durable signal, not the millisecond values.
The two tables answer different questions. Pure-core isolates the algorithm and is where the lightweight tier's biggest wins show; spark-path at scale adds the per-row UDF and serialization overhead and is the better predictor of end-to-end job cost — and the better place to decide which tier to run a given operation on.
Consistency is checked on output statistics with tolerance, not byte-equality: GDAL and NumPy can differ in the last bits, and neighborhood / NoData-boundary operations legitimately differ on the one-pixel kernel border or in how masked pixels propagate. The benchmark surfaces those as divergent so they're visible, not hidden.
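A tolerance-based check of this kind can be sketched as follows. The outcome labels match those used in the results tables and the caveat above, but the statistics dictionary shape, the function name, and the tolerance values are assumptions for illustration rather than the benchmark's actual thresholds.

```python
import math
from typing import Dict

# Hypothetical per-output summary, e.g. {"min": ..., "max": ..., "mean": ...}
Stats = Dict[str, float]


def classify(light: Stats, heavy: Stats, rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> str:
    """Return 'exact', 'within_tol', or 'divergent' for two stat summaries."""
    if light.keys() != heavy.keys():
        return "divergent"
    if all(light[k] == heavy[k] for k in light):
        return "exact"
    if all(math.isclose(light[k], heavy[k], rel_tol=rel_tol, abs_tol=abs_tol) for k in light):
        return "within_tol"
    return "divergent"


heavy_stats = {"min": 0.0, "max": 255.0, "mean": 101.25}
print(classify(heavy_stats, dict(heavy_stats)))
print(classify({"min": 0.0, "max": 255.0, "mean": 101.25 + 1e-9}, heavy_stats))
print(classify({"min": 0.0, "max": 255.0, "mean": 104.0}, heavy_stats))
```

Comparing summary statistics rather than bytes is what lets last-bit floating-point differences land in `within_tol` while a genuine border or NoData disagreement still shows up as `divergent`.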
