Lightweight Tier: Implementation Techniques

The lightweight (pyrx / pyvx) tier is a pure-Python/PySpark implementation that installs as a wheel with no JAR, no init script, and no GDAL system install. It runs wherever Spark runs — Serverless, shared/standard clusters, ARM, and Lakeflow declarative pipelines — and covers the full raster and VectorX function set (vector-tile encoding, TIN surface modeling, and legacy-geometry migration).

The lightweight tier is seeing increasing investment precisely because of that reach. Functions that run on Serverless and ARM expand the audience that can use GeoBrix without infrastructure friction; that is the direction the product is moving, and the lightweight tier is how GeoBrix gets there now.

This page explains how the lightweight tier reaches functional and performance parity with the heavyweight JVM tier. The key is choosing the right Spark execution shape per function: fan-out generators use streaming UDTFs, aggregators use grouped-aggregate Arrow UDFs, and per-tile transforms use Arrow scalar UDFs. Underneath each shape, vectorized Python libraries (NumPy, SciPy, rasterio, rio-tiler, xarray-spatial, shapely, mapbox-vector-tile) do the compute.

For measured timing and output-consistency numbers, see Benchmarking.


Execution shapes

Streaming UDTFs

A Python UDTF (User-Defined Table Function) is a class with an eval method that yields rows. Spark calls it as a LATERAL table function: one input row fans out to any number of output rows without buffering the whole result set first.

This is the right shape for fan-out / generator operations — functions where one input tile produces a variable or large number of output rows. Buffering all output rows in a list before returning them would hold the whole result in worker memory; streaming with yield releases each row as soon as it is produced.

The lightweight UDTFs mirror the heavyweight CollectionGenerator call site: both register under the same SQL name and are invoked identically via LATERAL VIEW or the LATERAL table-function syntax.
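
The streaming shape described above can be sketched as follows. This is a minimal illustration, not GeoBrix code: the class `SplitIntoChunks`, the SQL name `demo_split_into_chunks`, and the chunking logic are all hypothetical stand-ins for a real fan-out generator. It assumes PySpark 3.5 or later, where `pyspark.sql.functions.udtf` and `spark.udtf.register` are available.

```python
from typing import Iterator, Tuple


class SplitIntoChunks:
    """Hypothetical fan-out generator: one input blob -> many (index, chunk) rows."""

    def eval(self, blob: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
        for index, start in enumerate(range(0, len(blob), chunk_size)):
            # yield hands each row to Spark as soon as it exists;
            # nothing is accumulated in a list on the worker.
            yield index, blob[start:start + chunk_size]


def register_split_into_chunks(spark) -> None:
    """Register the class as a SQL table function (assumes PySpark 3.5+)."""
    # Imported lazily so the class above can be exercised without Spark.
    from pyspark.sql.functions import udtf

    spark.udtf.register(
        "demo_split_into_chunks",  # hypothetical SQL name
        udtf(SplitIntoChunks, returnType="chunk_index: int, chunk: binary"),
    )
```

Once registered, a function like this would be called the same way as the GeoBrix generators, for example `SELECT * FROM blobs, LATERAL demo_split_into_chunks(payload, 1024)`, where `blobs` and `payload` are placeholder names.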

Grouped-aggregate UDFs

A grouped-aggregate pandas_udf operates on one group at a time: Spark shuffles all rows for a key to one worker, then calls the UDF once with a pd.Series of all values in the group, returning a single scalar result. This is the natural shape for tile / feature aggregators — functions that reduce many rows to one output per key (one merged raster, one MVT tile blob).

Arrow-backed via pandas_udf(BinaryType()), these aggregators sit inside a standard .groupBy(...).agg(...) call: the same API the heavyweight Scala aggregators (TypedImperativeAggregate) use, so the Python call pattern is identical.
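
A rough sketch of the grouped-aggregate shape, under stated assumptions: `pack_group` is a hypothetical stand-in for a real reduction (it simply length-prefixes and concatenates the group's blobs), and `make_pack_agg` is an illustrative helper name. The `pd.Series -> bytes` type hints are what select the grouped-aggregate variant of `pandas_udf`.

```python
import struct
from typing import Iterable, Optional


def pack_group(blobs: Iterable[Optional[bytes]]) -> bytes:
    """Fold a group's blobs into one length-prefixed byte string (toy reduction)."""
    out = bytearray()
    for blob in blobs:
        if blob is None:
            continue  # skip null rows in the group
        out += struct.pack(">I", len(blob))  # 4-byte big-endian length prefix
        out += blob
    return bytes(out)


def make_pack_agg():
    """Build the grouped-aggregate UDF (assumes PySpark with pandas and PyArrow)."""
    # Imported lazily so pack_group can be exercised without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import BinaryType

    @pandas_udf(BinaryType())
    def pack_agg(blobs: pd.Series) -> bytes:
        # Called once per group with every value for that key.
        return pack_group(blobs)

    return pack_agg
```

With a DataFrame that has placeholder columns `key` and `tile`, the call site would look like `df.groupBy("key").agg(make_pack_agg()("tile").alias("packed"))`.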

Arrow scalar UDFs

A vectorized Arrow scalar pandas_udf receives each Arrow batch of rows as a pd.Series: Spark calls the function once per batch, and the function returns a pd.Series of the same length. The JVM–Python boundary is crossed once per batch, not once per row, which is the primary serialization saving over a plain @udf.

This is the right shape for per-row (1→1) tile and geometry transforms: operations that produce one output tile or scalar per input tile, where the compute dominates the per-row overhead.
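
As an illustration of the 1→1 batch shape, here is a minimal sketch. The transform itself (`invert_tile`, which flips every byte) is a hypothetical placeholder for a real per-tile operation, and `make_invert_udf` is an illustrative helper name; only the `pd.Series -> pd.Series` wrapper reflects the execution shape described above.

```python
from typing import Iterable, List, Optional


def invert_tile(tile: Optional[bytes]) -> Optional[bytes]:
    """Toy per-tile transform: flip every byte; null in, null out."""
    return None if tile is None else bytes(255 - b for b in tile)


def invert_batch(batch: Iterable[Optional[bytes]]) -> List[Optional[bytes]]:
    """One output per input, in the same order as the incoming batch."""
    return [invert_tile(tile) for tile in batch]


def make_invert_udf():
    """Build the scalar pandas_udf (assumes PySpark with pandas and PyArrow)."""
    # Imported lazily so the batch logic can be exercised without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import BinaryType

    @pandas_udf(BinaryType())
    def invert(tiles: pd.Series) -> pd.Series:
        # Invoked once per Arrow batch, not once per row.
        return pd.Series(invert_batch(tiles))

    return invert
```

A call such as `df.select(make_invert_udf()("tile"))` (with `tile` as a placeholder column name) would then pay the JVM–Python crossing once per batch.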

Note: Functions that return MapType columns use a plain @f.udf for SQL registration (Arrow does not support MapType in all pandas_udf builds); the Python Column API routes through the pandas_udf path for all tile-returning operations.

Vectorized cores

The compute underneath each UDF shape comes from best-in-class Python libraries rather than partial reimplementations:

| Library | Role |
| --- | --- |
| rasterio | Tile open/decode, clip, warp, resample, CoG conversion, band I/O |
| NumPy | Band math (spectral indices, map algebra, thresholding), array ops |
| SciPy | Focal filtering (rst_filter, rst_convolve); Delaunay triangulation + barycentric interpolation for the TIN functions (st_triangulate, st_interpolateelevation*) |
| rio-tiler | XYZ / web-tile output (rst_tilexyz, rst_xyzpyramid) |
| xarray-spatial | Terrain analysis (slope, hillshade, aspect, tri, tpi, roughness, viewshed) |
| shapely | WKB geometry handling (clip, polygonize, rasterize, grid aggregation); TIN geometry parse and legacy-struct decode to WKB (st_legacyaswkb) |
| pyogrio | Vector reader I/O (Arrow-native columnar batches, OGR-free) |
| mapbox-vector-tile | MVT protobuf encode / decode (st_asmvt, st_asmvt_pyramid) |

These modules live in python/geobrix/src/databricks/labs/gbx/pyrx/core/ and the corresponding pyvx/ modules, and are called directly by the UDF harness — no subprocess, no native bridge.


Function classification

The tabs below enumerate which functions use each execution shape. The classification is derived from the registered implementations in python/geobrix/src/databricks/labs/gbx/pyrx/functions.py and pyvx/functions.py.

The functions listed below are the streaming UDTFs: Python UDTFs registered via spark.udtf.register(...). Each yields rows incrementally, so the output is never fully buffered on the worker.

RasterX (pyrx)

| SQL name | Python name | Output per input tile |
| --- | --- | --- |
| gbx_rst_polygonize | rst_polygonize | One row per contiguous polygon in the raster |
| gbx_rst_separatebands | rst_separatebands | One row per band in the multi-band tile |
| gbx_rst_retile | rst_retile | One row per retiled region |
| gbx_rst_tooverlappingtiles | rst_tooverlappingtiles | One row per overlapping tile |
| gbx_rst_maketiles | rst_maketiles | One row per subdivided tile |
| gbx_rst_h3_tessellate | rst_h3_tessellate | One row per H3 cell covering the tile |
| gbx_rst_xyzpyramid | rst_xyzpyramid | One row per intersecting XYZ tile across the zoom range |
| gbx_rst_h3_rastertogridavg | rst_h3_rastertogridavg | One row per H3 cell covering the tile |
| gbx_rst_h3_rastertogridcount | rst_h3_rastertogridcount | One row per H3 cell |
| gbx_rst_h3_rastertogridmax | rst_h3_rastertogridmax | One row per H3 cell |
| gbx_rst_h3_rastertogridmin | rst_h3_rastertogridmin | One row per H3 cell |
| gbx_rst_h3_rastertogridmedian | rst_h3_rastertogridmedian | One row per H3 cell |
| gbx_rst_quadbin_rastertogridavg | rst_quadbin_rastertogridavg | One row per QuadBin cell covering the tile |
| gbx_rst_quadbin_rastertogridcount | rst_quadbin_rastertogridcount | One row per QuadBin cell |
| gbx_rst_quadbin_rastertogridmax | rst_quadbin_rastertogridmax | One row per QuadBin cell |
| gbx_rst_quadbin_rastertogridmin | rst_quadbin_rastertogridmin | One row per QuadBin cell |
| gbx_rst_quadbin_rastertogridmedian | rst_quadbin_rastertogridmedian | One row per QuadBin cell |

VectorX (pyvx)

| SQL name | Python name | Output per input group |
| --- | --- | --- |
| gbx_st_asmvt_pyramid | (SQL LATERAL only) | One MVT tile row per zoom-level tile in the pyramid |
| gbx_st_triangulate | (SQL LATERAL only) | One row per triangle in the constrained-Delaunay tessellation |
| gbx_st_interpolateelevationbbox | (SQL LATERAL only) | One elevation point per in-hull grid cell over a bounding box |
| gbx_st_interpolateelevationgeom | (SQL LATERAL only) | One elevation point per in-hull grid cell over a geometry |

GridX (pygx) — BNG

| SQL name | Python name | Output per input row |
| --- | --- | --- |
| gbx_bng_kringexplode | bng_kringexplode | One row per k-ring cell |
| gbx_bng_kloopexplode | bng_kloopexplode | One row per k-loop cell |
| gbx_bng_geomkringexplode | bng_geomkringexplode | One row per geometry k-ring cell |
| gbx_bng_geomkloopexplode | bng_geomkloopexplode | One row per geometry k-loop cell |
| gbx_bng_tessellateexplode | bng_tessellateexplode | One row per tessellated cell (cell ID + chip) |

SQL invocation (LATERAL table function):

SELECT *
FROM your_table,
LATERAL gbx_rst_h3_rastertogridavg(tile, 8); -- expands to one row per H3 cell at resolution 8

Streaming vs consolidated returns

Not every function that returns a collection is a streaming UDTF. The choice follows directly from where the cost is:

Stream (UDTF) when one input row fans out to many output rows and the consumer needs them as individual rows. A consolidated ARRAY<...> return there carries a triple cost: buffer the whole fan-out in worker memory before returning it, serialize the nested array across the JVM↔Python boundary, then explode it again downstream. For large fan-out this is the bottleneck — and it scales with fan-out size.

The clearest example is rst_separatebands on hyperspectral or multispectral rasters. Each band-tile carries its own raster bytes. A consolidated return buffers all band-tiles per input row: on a 200-band AVIRIS scene that is 200 × tile-bytes held simultaneously in one worker, with proportional ser/de cost at the JVM boundary. The streaming UDTF yields one band-tile at a time — O(1) worker memory regardless of band count. The same logic applies to fine-resolution tessellation (rst_h3_tessellate at high H3 resolution), deep XYZ pyramids (rst_xyzpyramid across many zoom levels), and retiling (rst_retile, rst_tooverlappingtiles, rst_maketiles) where output tile count scales with input raster size.

Consolidate when the collection is the actual answer: aggregations (*_agg) produce one result per group — the consolidation IS the reduction. The only consolidated returns in GeoBrix's lightweight tier are the *_agg reductions; every fan-out generator is a streaming UDTF.
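
The memory argument for rst_separatebands can be made concrete with a toy accounting model. Everything here is illustrative: the byte-slicing "band split" is a hypothetical stand-in (a real raster is not split this way), and `peak_output_bytes` models how much output must be alive at once rather than measuring a real Spark worker.

```python
from typing import Iterator, List, Union


def separate_bands_consolidated(raster: bytes, n_bands: int) -> List[bytes]:
    """Consolidated return: every band payload exists before anything is returned."""
    band_len = len(raster) // n_bands
    return [raster[i * band_len:(i + 1) * band_len] for i in range(n_bands)]


def separate_bands_streaming(raster: bytes, n_bands: int) -> Iterator[bytes]:
    """Streaming return: one band payload exists at a time."""
    band_len = len(raster) // n_bands
    for i in range(n_bands):
        yield raster[i * band_len:(i + 1) * band_len]


def peak_output_bytes(result: Union[List[bytes], Iterator[bytes]]) -> int:
    """Output bytes that must be alive at once: the whole list, or one yielded item."""
    if isinstance(result, list):
        return sum(len(band) for band in result)
    return max((len(band) for band in result), default=0)


# Hypothetical 200-band scene with 1 KiB per band.
raster = bytes(200 * 1024)
consolidated_peak = peak_output_bytes(separate_bands_consolidated(raster, 200))
streaming_peak = peak_output_bytes(separate_bands_streaming(raster, 200))
```

Under this model the consolidated path holds 200 × 1,024 = 204,800 output bytes at once, while the streaming path holds 1,024, independent of band count.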


Where the lightweight tier wins — and where it doesn't

The Benchmarking page has the full per-function table; this section gives the pattern.

Big wins (10–100×+ pure-core, sustained at scale): Band math — rst_ndvi, rst_index, rst_mapalgebra, rst_threshold, and the other spectral-index functions — show the largest lightweight advantages. The heavyweight tier shells out to a gdal_calc subprocess per call; the lightweight tier evaluates the expression in-process with NumPy. That subprocess overhead dominates the heavyweight timing, so the lightweight wins here by roughly two orders of magnitude in isolation.
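
To show what "evaluates the expression in-process with NumPy" means in practice, here is a minimal NDVI sketch. It is an assumption-laden illustration, not the rst_ndvi implementation: the function name `ndvi`, its signature, and the NaN NoData convention are all hypothetical, and it takes plain arrays rather than GeoBrix tiles.

```python
import numpy as np


def ndvi(red: np.ndarray, nir: np.ndarray, nodata: float = np.nan) -> np.ndarray:
    """Compute (nir - red) / (nir + red) per pixel, writing nodata where the sum is zero."""
    red = red.astype("float64")
    nir = nir.astype("float64")
    denom = nir + red
    # Pre-fill with nodata; np.divide only overwrites pixels where denom != 0.
    out = np.full(denom.shape, nodata, dtype="float64")
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


# Tiny hypothetical 2x2 bands.
red_band = np.array([[1, 0], [2, 5]])
nir_band = np.array([[3, 0], [2, 15]])
result = ndvi(red_band, nir_band)
```

The whole band is processed by a handful of array operations in the same Python process, with no subprocess launch or temporary file per call.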

Moderate wins (2–20×, some erosion at scale): Tiling, warps, terrain, and the aggregators win by 2–20× in pure-core. At Spark-path scale, byte-heavy operations (those that move large tile buffers over the JVM–Python boundary per row) see some of this win erode. Band math and the largest reductions (rst_dtmfromgeoms_agg at 18× on a cluster) sustain their advantage because compute savings outweigh boundary cost.

Near-parity: The gbx_st_asmvt MVT aggregator runs within ~6% of the heavyweight JVM/OGR aggregator on a cluster. The gbx_pmtiles_agg archive aggregator is a format-agnostic grouped aggregate that folds a group's tiles into one PMTiles archive per key; it is registered from both pyrx and pyvx and reuses the lightweight PMTiles writer's archive assembler (the pmtiles package). On a cluster it runs at roughly parity with the heavyweight Scala encoder: the lightweight median is ~1.07× the heavyweight median (0.742 s vs 0.695 s for 1,000 tiles → one archive), with overlapping spreads, so the gap is within run-to-run noise.

The VectorX TIN and legacy-migration functions sit just behind the JVM/JTS tier on the cluster spark-path: legacy decoding (gbx_st_legacyaswkb) and elevation interpolation (gbx_st_interpolateelevationbbox, gbx_st_interpolateelevationgeom) run within ~0.75–0.88× of heavy, while gbx_st_triangulate runs ~1.7× slower. That gap comes from JVM↔Python serialization of the geometry arrays crossing the UDTF boundary (the same boundary cost seen for other byte-heavy lightweight UDTFs), not from the triangulation compute itself; decoded-output parity holds across both tiers.

Small metadata accessors (rst_width, rst_srid, …) are sub-millisecond on both tiers; the JVM-native path can edge them out by a fraction of a millisecond, which is irrelevant at scale but visible in pure-core.

Heavyweight still faster: Algorithm-bound GDAL operations — rst_viewshed (xrspatial/Numba), rst_proximity, rst_contour, rst_polygonize — run slower on the lightweight tier because no Python library matches the GDAL implementation. These are correct and production-usable; the heavyweight tier is simply the better choice when those specific operations dominate the workload.

GridX quadbin (pygx) — heavyweight faster, by the per-row UDF tax: The quadbin functions are sub-millisecond cell math, so the lightweight spark-path timing is set by per-row JVM↔Python dispatch and cluster run-to-run variance, not by the work. At that scale precise per-function speedups don't reproduce — the same unchanged function varied ~±40% between repeated 5-iteration cluster runs — so the honest summary is that the lightweight tier sits roughly at parity with the heavyweight JVM tier across the quadbin surface (sometimes faster, sometimes modestly slower, within that variance), with exact cell-set parity for all 10. The numpy-vectorized pointascell/resolution pandas_udfs closed what had been a large per-row-dispatch gap; the array-returning ops use plain UDFs for scale safety (see the module note above). The lightweight tier's decisive advantage is Serverless/ARM reach and no-JVM/JAR install at competitive timing. See Benchmarking — Grid for details.

GridX BNG (pygx) — same execution shape as quadbin, benchmarked at near-parity: The lightweight BNG tier is a pure-Python codec port of BNG.scala plus shapely for geometry. It carries exact cell-set parity with the heavyweight tier — the cross-tier parity suite asserts identical cell IDs for eastnorthasbng/pointascell/kring/polyfill/tessellate and identical chip geometry for the aggregates. The performance shape mirrors quadbin: STRING cell-id math is sub-millisecond, so the spark-path timing is dominated by the per-row JVM↔Python boundary rather than the work; the numpy-vectorized scalar pandas_udfs amortize the Arrow transfer, while the WKB-geometry ops (aswkb, tessellate, the chip aggregates) additionally carry the geometry-bytes serialization cost across that boundary. The lightweight tier's decisive advantage is again Serverless/ARM reach and no-JVM/JAR install. All 23 gbx_bng_* functions were benchmarked on a cluster (spark-path, 1,000 tiles/iteration, both tiers) with exact cell-set / decoded-geometry parity across all 23: the lightweight tier runs within ~±20% of heavy (0.82×–1.18×), all sub-millisecond per tile — effectively at parity, within the run-to-run noise band at this scale. See Benchmarking — Grid for the full per-function table.

For output consistency: the two tiers agree within tolerance on the large majority of functions. The known divergences (rst_convolve, rst_resample, rst_derivedband, rst_contour) are at NoData/edge boundaries; interior pixel values agree. See Benchmarking — Results for the per-function consistency labels.