
Beta Release Notes

Current version: 0.4.0

The changes on this page are relative to 0.1.0 (and earlier).

This page tracks API and naming changes since the GeoBrix project started. After the project is approved, formal release notes will take over; until then, use this as the single place to look up what changed and why.


What's new in v0.4.0

In-flight beta release. Per-version highlights; full migration tables are in the per-component sections below.

  • Lightweight execution tier (pyrx, pygx, pyvx). A pure-Python implementation of the GeoBrix API that needs no JAR and no init script, and runs on serverless compute, standard (shared) clusters, Lakeflow declarative pipelines, and ARM. It keeps the same function names and the same gbx_* SQL names after register, so switching tiers is a one-line import change. RasterX (pyrx, on rasterio) implements every rst_* function; GridX (pygx) covers quadbin, BNG, and custom grids; VectorX (pyvx) covers MVT, TIN surface modeling, and legacy-geometry migration. With this release GridX and VectorX are fully available in both tiers, so the lightweight tier reaches 1:1 parity with the heavyweight one across all three packages. See Choosing an Execution Tier.
    • Serverless support is verified and documented. geobrix[light] installs and runs on Databricks Serverless (environment v5), standard (shared) clusters, and ARM. Install with the quoted PEP 508 named form — %pip install "geobrix[light] @ file:///Volumes/.../geobrix-0.4.0-py3-none-any.whl" — not the path-with-extra form ('…whl[light]'), which fails on Serverless because %pip writes the surrounding quotes into the requirement and pip reads [light] as part of the filename. mapbox-vector-tile is pinned to 2.1.x so its protobuf dependency stays <6 (Spark Connect compatibility on Serverless), and idna is pinned <3.8 to avoid a core-package-change notice. See Installation.
    • Geometry inputs accept WKB, EWKB, WKT, and EWKT consistently. Every geometry-accepting function in both tiers now decodes all four encodings through a single shared decoder. Previously some lightweight functions accepted only WKB.
    • Geometry×raster operations align to the raster CRS and handle non-overlap gracefully. gbx_rst_clip, gbx_rst_sample, and gbx_rst_viewshed reproject the input geometry from its SRID to the raster's CRS (matching the heavyweight GDAL behavior), so a geometry in a different CRS clips/samples the correct region. A geometry that does not overlap the raster now returns null / empty instead of raising an error.
  • Raster reader default changed to no-split (sizeInMB = -1, behavior change since v0.3.0). The gdal / gtiff_gdal (heavyweight) and raster_gbx / gtiff_gbx (lightweight) readers now default sizeInMB to -1 — one whole-image tile per file — instead of auto-splitting large rasters at 16 MB. Set a positive sizeInMB to opt back into tiling for parallel processing of large files. See Raster Readers.
  • Lightweight raster writer and source-column parity. The lightweight gtiff_gbx writer accepts the nameCol option for deterministic output filenames, matching the heavyweight GDAL writer. The lightweight raster reader's source column is now dbfs:-scheme-qualified to match binaryFile and the heavyweight reader, so DataFrames join cleanly across tiers; lightweight file operations strip the scheme internally.
  • gbx_rst_fromfile is lightweight-tier only — registered into SQL on the heavyweight tier when geobrix[light] is present. On Databricks the executor JVM cannot read a Unity Catalog Volume (/Volumes/...) FUSE path — the UC credential is held only by Spark's user-scoped Python worker — so rst_fromfile is implemented solely in the lightweight tier (a pyrx Python loader) and has no heavyweight Scala expression. With geobrix[light] installed it is callable from Python (rx.rst_fromfile) and from SQL (gbx_rst_fromfile) regardless of tier: the heavyweight package's register(spark) specially registers the SQL name as the Python UDF. Without [light] it is not registered and the Python binding raises with guidance. For a tier-agnostic path on any compute, read the bytes with spark.read.format("binaryFile") and build the tile with gbx_rst_fromcontent. See Raster Functions § Constructors.
  • Vector tile encoding (gbx_st_asmvt). First VectorX expression-level function — aggregates features into MVT protobuf bytes for slippy-map publishing. See VectorX § Vector tile output.
  • Vector tile pyramid (gbx_st_asmvt_pyramid). Generator function: emits one row per (z, x, y) tile that input geometries intersect, encoded as MVT bytes. Composes with gbx_pmtiles_agg for end-to-end vector publishing pipelines. Builds on gbx_st_asmvt and shares the same web-mercator tile math as gbx_rst_xyzpyramid. See VectorX § Vector tile output.
  • Quadbin grid math (10 functions). New gridx/quadbin subpackage adds CARTO quadbin v0 support — gbx_quadbin_pointascell, gbx_quadbin_aswkb, gbx_quadbin_centroid, gbx_quadbin_resolution, gbx_quadbin_polyfill, gbx_quadbin_kring, gbx_quadbin_tessellate, gbx_quadbin_cellunion, gbx_quadbin_cellunion_agg, gbx_quadbin_distance. Cell IDs are 64-bit Long; coordinates are EPSG:4326 lon/lat; output geometry is EWKB SRID=4326. Cell encoding matches the CARTO quadbin-py reference implementation (cross-checked at 5 reference points). See GridX § Quadbin.
  • PMTiles output (gbx_pmtiles_agg UDAF + .write.format("pmtiles") DataSource). Native Scala PMTiles v3 encoder packages raster (PNG/JPG/WebP) or vector (MVT) tile pyramids into a single deployable blob. Aggregator path for tilesets that fit in a Spark cell (~100 MiB tile payload / 2 GiB cell limit); DataSource for larger pyramids streamed to a file via a partitioned commit protocol. Container is content-agnostic — tile bytes pass through verbatim, no GDAL/OGR dependency. Auto-detects tile type from magic bytes (PNG / JPEG / WebP / otherwise MVT). Read is not yet supported; spark.read.format("pmtiles") raises a friendly error pointing at the JS / Python pmtiles clients. The gbx_pmtiles_agg aggregate is available in both the heavyweight and lightweight tiers; the .write.format("pmtiles") DataSource (for larger streamed pyramids) remains heavyweight-only. See PMTiles.
  • Concurrent-safe lightweight writers. The lightweight vector and PMTiles writers now isolate their two-phase staging per write: concurrent jobs — or multiple users — writing to the same output location can no longer see or overwrite one another's in-progress data, and scratch left behind by an interrupted job is reclaimed automatically on a later write to the same location. The PMTiles writer previously staged into a fixed shared directory (a concurrency hazard) and now uses a unique hidden namespace per write. See Writers.
  • Raster→quadbin aggregators (5 functions). gbx_rst_quadbin_rastertogrid{avg,count,max,min,median} extend the H3 aggregation pattern to CARTO quadbin v0 cells. Natural fit for raster heatmaps that render in slippy-map viewers — cells align with the same XYZ pyramid that PMTiles / MVT readers consume. Resolution capped at z=20. See Raster Functions.
  • Web-mercator XYZ tile output (3 functions). gbx_rst_to_webmercator reprojects a raster to EPSG:3857 (default bilinear); gbx_rst_tilexyz(tile, z, x, y, [format, size, resampling]) renders a single XYZ tile to PNG / JPEG / WEBP bytes (returns BinaryType; out-of-extent tiles get a transparent PNG, not null); gbx_rst_xyzpyramid(tile, min_z, max_z, ...) is a generator that explodes one raster into one row per intersecting (z, x, y) tile across a zoom range. max_z capped at 20; total tile-count across zoom range capped at 10^6. Foundation for the PMTiles publishing pipeline. See Raster Functions.
  • Vector↔raster bridge (gbx_rst_rasterize, gbx_rst_polygonize). Two reciprocal RasterX functions that span GeoBrix's vector and raster worlds. gbx_rst_rasterize(geom_wkb, value, xmin, ymin, xmax, ymax, width_px, height_px, srid) burns a vector geometry into a fresh GTiff-backed raster tile at the given extent / resolution (pixels inside the geometry carry value, pixels outside are NoData = -9999.0). gbx_rst_polygonize(tile, [band, [connectedness]]) extracts ARRAY<struct(geom_wkb BINARY, value DOUBLE)> from tile — one feature per contiguous value region, NoData pixels excluded. The pair composes: polygonize(rasterize(geom, v, ...)) returns at least one feature with value v covering approximately the same area as the input geom, with edges quantized to the pixel grid. See Raster Functions § Vector bridge.
  • Terrain analysis (7 functions). gbx_rst_slope, gbx_rst_aspect, gbx_rst_hillshade, gbx_rst_tri, gbx_rst_tpi, gbx_rst_roughness, gbx_rst_color_relief — all thin wrappers over gdal.DEMProcessing. Each takes a single-band DEM tile and returns a derived tile (Float32 for slope/aspect/TRI/TPI/roughness, Byte for hillshade, RGB(A) Byte for color_relief). Defaults mirror the gdaldem CLI (hillshade NW sun at 315° azimuth, 45° altitude; slope in degrees). Foundation for terrain-derived workflows — solar exposure, viewshed pre-processing, watershed and runoff analysis, road grading. See Raster Functions § Terrain.
  • Slope and hillshade auto-scale from the raster CRS (breaking default on geographic rasters). gbx_rst_slope and gbx_rst_hillshade (and the lightweight prx.rst_slope / prx.rst_hillshade) now derive the horizontal scale from the raster's coordinate reference system by default, matching GDAL gdaldem. On geographic (lat/long, e.g. EPSG:4326) rasters the scale is computed from latitude (degree→metre), so a global or geographic DEM produces correct, non-saturated slope and shading without any extra argument; on projected (metre) rasters output is unchanged. Previously these two ran unscaled on geographic input, which over-steepened and saturated the result. This changes the default output for geographic rasters to the GDAL-consistent value. To pin a specific scale, pass it explicitly — gbx_rst_slope(tile, 'degrees', 111120) for a degree grid, or prx.rst_slope(tile, xscale=..., yscale=...) / prx.rst_hillshade(tile, xscale=..., yscale=...). gbx_rst_aspect is a direction and is unaffected. See Raster Functions § Terrain.
  • Spectral indices (5 functions). gbx_rst_evi, gbx_rst_savi, gbx_rst_ndwi, gbx_rst_nbr, plus a generic gbx_rst_index(tile, formula_name, band_map) — all compositions over gbx_rst_mapalgebra. Each takes user-supplied 1-based band indices, builds a per-pixel formula string, and dispatches to gdal_calc; output is a single-band Float32 GTiff sized to the input extent. The generic dispatcher ships built-in NDVI, GNDVI, MSAVI, red-edge NDVI, NDMI, and NDSI formulae and is the entry point users should reach for first for any named multi-band index; the four specialized expressions surface EVI / SAVI / NDWI / NBR with their canonical coefficient defaults (EVI: L=1.0, C1=6.0, C2=7.5, G=2.5 per MODIS; SAVI: L=0.5) so vegetation, water and burn-severity workflows compose without a hand-written formula string. See Raster Functions § Spectral indices.
  • Resample and IDW interpolation (5 functions). Three resample wrappers (gbx_rst_resample by multiplicative factor, gbx_rst_resample_to_size to explicit pixel dims, gbx_rst_resample_to_res to explicit ground resolution) all delegate to gdal.Warp with -tr / -ts plus -r <algorithm>. Two IDW functions — gbx_rst_gridfrompoints (arrays in one row) and its UDAF counterpart gbx_rst_gridfrompoints_agg (one point per row) — both delegate to gdal.Grid with the invdist:power=<p>:max_points=<m> algorithm and produce a single-band Float64 GTiff tile of the requested extent / size / SRID. Algorithm names match the gdalwarp -r set (near, bilinear, cubic, cubicspline, lanczos, average, mode, max, min, med, q1, q3); IDW defaults are power=2.0, max_pts=12, NoData -9999.0. See Raster Functions.
  • Pixel ops + extraction (7 functions). gbx_rst_fillnodata (fill NoData holes via inverse-distance from valid neighbors), gbx_rst_sample(tile, geom) (per-band pixel values at a geometry), gbx_rst_setsrid (stamp an EPSG code without reprojecting), gbx_rst_histogram (per-band bucket counts via band.GetHistogram), gbx_rst_threshold(tile, op, value) (binarize 0/1 via map-algebra), gbx_rst_buildoverviews(tile, levels, [resampling]) (add pyramid overview levels), and gbx_rst_band(tile, bandIndex) (extract a single band). Common per-pixel and per-tile operations missing from v0.3.0; each is a thin wrapper over the matching GDAL primitive. See Raster Functions.
  • Analysis (4 functions). gbx_rst_cog_convert(tile, [compression, [blocksize, [overview_resampling]]]) rewrites a tile as a Cloud Optimized GeoTIFF via gdal.Translate -of COG (HTTP-range-friendly serving from object storage). gbx_rst_proximity(tile, [target_values, [distunits, [max_distance]]]) computes a Float32 distance raster via gdal.ComputeProximity: the distance to the nearest non-NoData (or matching target_values) source pixel, in CRS units or pixels. gbx_rst_contour(tile, levels, [interval, [base, [attr_field]]]) extracts contour LineStrings via gdal.ContourGenerateEx, returning ARRAY<struct(geom_wkb BINARY, value DOUBLE)>; pass non-empty levels for fixed values, or array() plus a positive interval for equal-step contours. gbx_rst_viewshed(tile, observer_geom, observer_height, [target_height, [max_distance]]) computes a binary visibility mask (Byte raster, 255 visible / 0 invisible) from a DEM and an observer POINT via gdal.ViewshedGenerate. See Raster Functions.
  • TIN DTM rasters (2 functions). gbx_rst_dtmfromgeoms (array of Z-valued points and optional breaklines in one row) and gbx_rst_dtmfromgeoms_agg (streaming — one point per row, grouped by extent). Both build a constrained-Delaunay TIN and rasterize it to a Float64 GTiff DTM over a bbox at a pixel grid; cells outside the triangulated hull get NoData. Useful for deriving a continuous elevation surface from scattered survey points or LiDAR mass points. See Raster Functions § Constructors.
  • VectorX TIN surface modeling (3 functions). gbx_st_triangulate (emit one triangle polygon per row from a constrained-Delaunay TIN), gbx_st_interpolateelevationbbox (sample the TIN on a pixel grid over an explicit bounding box), and gbx_st_interpolateelevationgeom (sample on a grid anchored to a geometry's bounding box with explicit cell sizes) — all generators returning WKB geometries. Useful for exposing the raw triangulation and interpolated elevation points for vector-side workflows. See VectorX § Triangulation and elevation.
  • Streaming aggregators (3 functions). gbx_rst_rasterize_agg (burn geom/value pairs into one tile per group), gbx_rst_frombands_agg (collect ordered per-band tiles into one multi-band tile per group), and gbx_quadbin_cellunion_agg (dissolve a column of quadbin cell IDs into one MultiPolygon per group). Group-by / UDAF forms that stream rows instead of requiring a pre-collected array, suited for large partitions. See Raster Functions § Aggregators and GridX § Quadbin.
  • H3 cell rasterizer (gbx_rst_h3_rasterize_agg, gbx_h3_cell_bbox). gbx_rst_h3_rasterize_agg is a grouped aggregator (both tiers) that burns a set of H3 cells — one row per cell with an optional value — into a single GTiff-encoded raster tile per group, using pixel-centroid assignment. It is the inverse of the gbx_rst_h3_rastertogrid* family: where those extract per-cell statistics from an existing raster, this one synthesizes a raster from H3-indexed values. Extent and grid dimensions are either supplied explicitly or derived automatically from the cell set. gbx_h3_cell_bbox is a scalar function that returns a STRUCT<xmin DOUBLE, ymin DOUBLE, xmax DOUBLE, ymax DOUBLE> bounding box for a single H3 cell in the requested EPSG, optionally expanded by a k-ring pad. The lightweight Python API also ships rst_h3_gridspec, a helper that derives the canonical raster extent and pixel grid from a collection of H3 cells at a given resolution — useful for computing consistent grid parameters before calling the aggregator. See Raster Functions § H3 grid.
  • Custom grids (7 functions). gbx_custom_grid (define a user-specified regular grid from extent + resolution + SRID), gbx_custom_pointascell, gbx_custom_cellaswkb, gbx_custom_cellaswkt, gbx_custom_centroid, gbx_custom_polyfill, gbx_custom_kring. Index and tessellate against an arbitrary projected grid (for example a national or project-specific tiling) when H3, BNG, or quadbin cells do not match the required cell geometry. Available in both the heavyweight and lightweight (pygx) tiers, with exact cross-tier cell-ID and cell-set parity. See GridX § Custom Grid Functions.
  • gbx_rst_initnodata now works on multi-band rasters (behavior change since v0.3.0). Initializing NoData on a raster with more than one band previously raised an error; only single-band rasters were supported. gbx_rst_initnodata now initializes the NoData value correctly across all bands of a multi-band raster. Output for single-band rasters is unchanged. See Raster Functions.
  • gbx_rst_derivedband / gbx_rst_derivedband_agg return a single derived band for multi-band inputs (behavior change since v0.3.0). On a multi-band input, these functions previously returned one derived band per input band (an N-band output). They now apply the pixel function across all bands and return a single-band Float64 result, matching the documented single-band contract. Output for single-band inputs is unchanged. See Raster Functions.
  • gbx_bng_geomkring / gbx_bng_geomkloop accept string resolutions (consistency fix). These two functions now accept BNG string resolution keys (for example '1km', '100m') in addition to integer indices, matching gbx_bng_pointascell and the lightweight tier. Integer-index behavior is unchanged. See GridX § BNG.
  • Lightweight grouped aggregators return BINARY where the heavyweight tier returns a struct. For grouped aggregators whose heavyweight form returns a tile or chip struct (the rst_*_agg family, gbx_bng_cellunion_agg / gbx_bng_cellintersection_agg, and gbx_quadbin_cellunion_agg), the lightweight SQL form returns the serialized BINARY payload instead — a PySpark limitation (a grouped pandas_udf cannot return a struct type). Re-wrap the result with the matching scalar constructor to recover the struct. The Python DataFrame and Scala APIs are unaffected. See the per-function notes in Raster Functions and GridX.
  • gbx_custom_pointascell rejects a non-finite Y coordinate (fix). A NaN northing was previously not validated (a duplicate easting check), so it was only incidentally rejected with a misleading out-of-bounds message. Both tiers now reject a NaN Y with a clear error, matching the X-coordinate guard.
  • Lightweight STAC client (databricks.labs.gbx.stac.StacClient). A Serverless-safe client for distributed SpatioTemporal Asset Catalog (STAC) workflows — search (fan an area-of-interest DataFrame out across a catalog, one row per item/asset), download (resilient, validated asset fetch: re-signs each attempt, read-validates the bytes, retries with backoff, and skips already-valid files), and repair (re-download only the invalid rows via a Delta MERGE). Catalog-agnostic with pluggable signing, defaulting to Microsoft Planetary Computer. Ships behind the opt-in geobrix[light,stac] extra (adds pystac-client, planetary-computer) and imports cleanly on Serverless environment v5. See STAC Client.
  • AOI-driven sample downloaders (databricks.labs.gbx.sample). Three helpers stage open geospatial data to a Unity Catalog Volume with a shared discover → download → read shape, distributed and Serverless-safe: OvertureClient (Overture Maps buildings / places, via the Overture STAC catalog + overturemaps CLI), NaipDownloader (NAIP aerial imagery), and DemDownloader (USGS 3DEP elevation). NaipDownloader and DemDownloader wrap StacClient on Microsoft Planetary Computer and window each asset to the AOI on read. See Overture, NAIP, and 3DEP.
  • Visualization helpers (databricks.labs.gbx.vizx). A tier-agnostic, opt-in (geobrix[vizx]) module for inspecting GeoBrix outputs in a notebook. plot_raster / plot_file render a tile or file (auto-decimate, percentile-stretch, single-band viridis or multi-band RGB), and accept composite="depth" to render a multi-band presence stack as a per-pixel coverage-depth gradient (bright where many bands cover a pixel) instead of a mostly-black RGB. plot_mask_layers overlays several single-band mask tiles on one axes — each a solid colour with a legend — for multi-threshold coverage views. as_gdf / cells_as_gdf / grid_as_gdf adapt Spark DataFrames (geometry rows, H3 cell ids with an optional dissolve_by, or a rst_h3_gridspec grid struct) to GeoPandas for .plot() / .explore() maps. plot_static renders Spark- or GeoPandas-derived geometries (or DGGS cells) as a GitHub-renderable matplotlib figure over a basemap, and plot_interactive is its interactive twin — a folium pan/zoom map that automatically falls back to a raster image overlay at scale (where a bare .explore() would hang) and renders inline in Databricks via displayHTML. Single-band presence masks (constant value) now render as a solid footprint over a light background rather than a blank plot. See Visualization.
  • Inline PMTiles + COG viewers (plot_pmtiles, plot_cog, pmtiles_info). plot_pmtiles renders a PMTiles archive (raster or vector, auto-detected from the header) directly in a notebook — a self-contained MapLibre GL JS + pmtiles.js page with the archive base64-embedded as an in-browser FileSource, so there is no tile server; it falls back to a static image when the archive exceeds the notebook cell-output ceiling (~4–5 MB after displayHTML inflation), or drops the densest zooms with interactive_fit="downzoom". plot_cog renders a Cloud-Optimized GeoTIFF over a contextily basemap; pmtiles_info reports an archive's header (tile type, zoom range, bounds). See PMTiles viewers.
  • Example notebooks default to the lightweight tier. The EO Series and xView walkthroughs now run on the lightweight API (pyrx / pygx / pyvx plus the gbx_* DataSource readers and writers) by default, so they execute on Databricks Serverless (environment v5) with no JAR and no init script; each notebook calls out the one-line import to switch back to the heavyweight tier. The EO Series uses the new StacClient for its Planetary Computer search, download, and repair steps. See EO Series and xView.
  • H3 cell rasterize example notebook. A complete polygon → H3 polyfill → per-band rasterize → multi-band stack walkthrough on a San Francisco Bay Area DEM, treating elevation isobands as a stand-in for signal-strength coverage tiers (a telco coverage-analysis pattern). Exercises rst_h3_gridspec, rst_h3_rasterize_agg, and rst_frombands_agg, materializes the per-band tiles into a session-scoped temp table, and uses the gbx.vizx helpers (plot_mask_layers, plot_raster(composite="depth")) to inspect the result. See H3 Rasterize.
  • Helios distributed-tiling notebook series. A four-notebook solar site-selection walkthrough over one San Francisco AOI: building footprints → vector PMTiles (NB01), a NAIP aerial basemap → raster PMTiles (NB02), 3DEP terrain → COG catalog + hillshade PMTiles + a per-H3-cell solar score (NB03), and a distributed sharded PMTiles mosaic with a mosaic.json manifest for client-side assembly (NB04). Runs on the lightweight tier / Serverless with no JAR, dogfooding gbx_st_asmvt_pyramid, gbx_rst_xyzpyramid, gbx_pmtiles_agg, the sample downloaders, and the gbx.vizx PMTiles viewers. See Helios.
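
The gbx_st_asmvt_pyramid and gbx_rst_xyzpyramid items above both rely on web-mercator (z, x, y) tile math, with max_z capped at 20 and the total tile count capped at 10^6. The following is a minimal sketch of that arithmetic using the standard slippy-map formula. It is not the GeoBrix implementation: the names lonlat_to_tile and count_tiles are hypothetical, and the way the caps are enforced here (raising ValueError) is an assumption made for illustration.

```python
import math

MAX_Z = 20          # zoom cap stated in the notes above
MAX_TILES = 10**6   # total tile-count cap stated in the notes above
MERCATOR_MAX_LAT = 85.0511287798  # latitude limit of the web-mercator square


def lonlat_to_tile(lon, lat, z):
    """Return the (x, y) slippy-map tile index containing a lon/lat point."""
    n = 1 << z  # tiles per axis at zoom z
    lat = max(min(lat, MERCATOR_MAX_LAT), -MERCATOR_MAX_LAT)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lon=180 or the southern limit stays inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def count_tiles(bbox, min_z, max_z):
    """Count tiles intersecting a (xmin, ymin, xmax, ymax) lon/lat bbox."""
    if max_z > MAX_Z:
        raise ValueError(f"max_z {max_z} exceeds cap {MAX_Z}")
    xmin, ymin, xmax, ymax = bbox
    total = 0
    for z in range(min_z, max_z + 1):
        x0, y0 = lonlat_to_tile(xmin, ymax, z)  # top-left tile
        x1, y1 = lonlat_to_tile(xmax, ymin, z)  # bottom-right tile
        total += (x1 - x0 + 1) * (y1 - y0 + 1)
    if total > MAX_TILES:
        raise ValueError(f"{total} tiles exceeds cap {MAX_TILES}")
    return total
```

Estimating the tile count this way before running a pyramid generator makes it easier to pick a zoom range that stays under the cap.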

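The PMTiles item above states that the tile type is auto-detected from magic bytes (PNG / JPEG / WebP / otherwise MVT). A minimal sketch of that kind of check follows. The function name sniff_tile_type is hypothetical and this is not the Scala encoder's code; only the well-known file signatures are assumed.

```python
def sniff_tile_type(data: bytes) -> str:
    """Classify tile bytes by file signature, defaulting to MVT."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "png"
    if data.startswith(b"\xff\xd8\xff"):       # JPEG start-of-image marker
        return "jpeg"
    # WebP is a RIFF container: "RIFF", 4-byte size, then "WEBP".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return "mvt"  # anything else is treated as vector-tile protobuf
```

Because MVT is the fallback rather than a positive match, arbitrary non-image bytes are classified as vector tiles in this sketch.
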
What's new in v0.3.0

Released 2026-05-26. Per-version highlights; full migration tables are in the per-component sections below.

  • rst_clip CRS axis-order fix (all-black clips). GDAL 3+ defaults EPSG-imported SpatialReferences to authority-compliant axis order (lat/lon for EPSG:4326), which silently swapped axes against JTS/Databricks WKT/WKB cutlines so the clip missed the raster entirely. The reprojection now clones the source/destination SpatialReferences and forces OAMS_TRADITIONAL_GIS_ORDER before the OGR transform; caller-owned SpatialReferences are not mutated.
  • EWKT / EWKB support for rst_clip. JTS.fromWKT / JTS.fromWKB auto-detect EWKT/EWKB; new JTS.toEWKT / JTS.toEWKB helpers emit SRID-preserving forms. rst_clip reprojects the cutline when its SRID differs from the raster CRS, and falls back to the raster's CRS (Mosaic-compatible) when the SRID is 0 / unresolvable.
  • rst_transform rejects invalid SRIDs. targetSrid <= 0 and unresolvable EPSG codes now surface a clear error via tile metadata error_message instead of returning a raster with an uninitialized CRS.
  • /vsimem/ path-handling hardening. rst_memsize / rst_unlink / GDAL writer in-memory byte fetch now use startsWith("/vsimem/") (not contains) and null-check GetMemFileBuffer, so datasets whose description embeds the substring (e.g. NetCDF subdataset selectors) aren't mis-routed through the in-memory branch.
  • tile.raster bytes are always self-contained (no VRT payloads). Three RasterX operations — MergeRasters (gbx_rst_merge, gbx_rst_merge_agg), MergeBands (gbx_rst_frombands), and PixelCombineRasters (gbx_rst_derivedband, gbx_rst_derivedband_agg, gbx_rst_combineavg, gbx_rst_combineavg_agg) — used to return tiles whose metadata("driver") claimed VRT even though the on-disk file was a materialized GTiff. That mis-tag propagated through RasterDriver.writeToBytes (which keys both the tempfile extension AND the -of flag in the inner gdal_translate call off metadata.driver), causing the serialized tile.raster payload to be VRT XML referencing a /vsimem/ tempfile only reachable on the producing executor. Single-node testing passed by accident; multi-executor clusters hit file not found when the VRT was opened elsewhere. Fix: GDALTranslate.executeTranslate now records the output dataset's driver in its returned metadata (not the input's), and RasterDriver.writeToBytes defensively coerces VRT to GTiff on serialization + sniffs the result to refuse shipping VRT bytes. Regression coverage in RST_NoVrtPayloadTest.
  • PixelCombineRasters pixel function now actually fires (combineavg / derivedband were silently returning one of the inputs). gbx_rst_combineavg, gbx_rst_combineavg_agg, gbx_rst_derivedband, and gbx_rst_derivedband_agg build a multi-source VRT, inject a <PixelFunctionLanguage>Python</...> band, and re-open it for gdal_translate. The previous implementation re-opened the VRT before mutating the XML file, so the in-memory Dataset handle never saw the pixel function; gdal.Translate then fell back to a default multi-source mosaic (last-source-wins per pixel). On co-extensive inputs (e.g. a monthly EO time-series), the output silently equaled one of the inputs. Which input won was non-deterministic per partition in a distributed setting, producing a visible patchwork of tiles from different years on multi-executor clusters. Fix: PixelCombineRasters.combine now injects the pixel function before the VRT is re-opened, and pre-creates the per-JVM NodeFilePathUtil.rootPath staging dir itself (previously only ClipToGeom did, so combineavg would fail with file not found if it was the first op to hit a fresh JVM). Regression coverage: RST_AggregationsTest "CombineAvg actually averages pixel values" (two constant rasters 50 + 100 → output 75).
  • gbx_rst_merge_agg overlap winner is now deterministic. When merging tiles whose extents overlap, the mosaic is last-wins, so the result depends on the order tiles are folded. The aggregator previously ordered tiles by their GDAL dataset description to make that order stable, but for the in-memory (BinaryType) tiles a groupBy().agg() produces, the description is a per-open /vsimem/<uuid> path — so the fold order, and therefore the overlap winner, varied from run to run. The aggregator now orders tiles by their raw serialized content (the GTiff bytes each tile carries) — a total order intrinsic to the tile with no ties for distinct content and no random per-open component — so one tile reliably wins the overlap regardless of fold order, and the result is identical across the heavyweight and lightweight tiers (both sort on the identical bytes). This also fixes overlapping tiles that share the same geotransform origin, which an origin-based key could not separate. Non-overlapping mosaics are unaffected. Regression coverage: RST_AggEvalTest deterministic same-origin and offset merge cases.
  • Friendly error on ARRAY<tile>-function misuse. Calling gbx_rst_combineavg, gbx_rst_merge, gbx_rst_frombands, or gbx_rst_mapalgebra on a single tile column (instead of an ARRAY<tile> like collect_list(tile)) used to surface as a raw ClassCastException: StructType cannot be cast to ArrayType from inside Catalyst analysis — untraceable from a notebook. The four expressions now route through RST_ExpressionUtil.arrayOfTileRasterType, which raises a clean IllegalArgumentException naming the function, the actual type received, and (where applicable) the aggregator companion the user likely wanted, e.g. gbx_rst_combineavg expects ARRAY<tile> (e.g. collect_list(tile) or array(t1, t2, ...)), but received STRUCT<...>. To aggregate the column across rows, use gbx_rst_combineavg_agg(tile).
  • Docs: GDAL_VRT_ENABLE_PYTHON for custom GDAL code paths. Built-in combineavg / derivedband calls auto-enable VRT Python via the in-process GDALManager.withVrtPython bracket — no cluster config needed. The RasterX Function Reference § VRT Python pixel functions section documents how to enable the same evaluation in your own GDAL calls (Python gdal.SetConfigOption, cluster spark.executorEnv, or the JVM withVrtPython helper) and points to the TRUSTED_MODULES variant for less-trusted VRT sources. A cross-reference is added in Security § 6 explaining why GeoBrix ships the option NO by default.
  • gbx_rst_derivedband / gbx_rst_derivedband_agg numerical-correctness regression coverage. These functions share the PixelCombineRasters code path with combineavg, so they were silently no-opping in the same way (returning one of the inputs unchanged on co-extensive stacks). The ordering fix above repairs both call sites, but the existing tests only checked that the result wasn't null — they would have passed either way. This release adds explicit pixel-value assertions: RST_AggregationsTest covers the in-process RST_DerivedBand path with a doubling pyfunc and a 3-input numpy-mean pyfunc, and RST_AggEvalTest covers the Spark-aggregation rst_derivedband_agg path end-to-end (three constant-Byte tiles 10/20/30 with a "mean × 2" pyfunc must yield 40 across the result tile). Two previously-passing tests used def myfunc(x): return x * 2 — an invalid VRT pixel-function signature — and were updated to the canonical (in_ar, out_ar, xoff, yoff, xsize, ysize, raster_xsize, raster_ysize, buf_radius, gt, **kwargs) shape; they only "passed" before because the pyfunc never actually ran.
  • gbx_rst_combineavg / gbx_rst_combineavg_agg math corrected (NoData, valid zeros, rounding). With the pixel function now firing (see the PixelCombineRasters item above), several latent bugs in the average kernel surfaced and are fixed in this release. The pyfunc used to sum every source value blindly, including each band's NoData sentinel (e.g. 255 on Byte EO products), and counted only strictly-positive cells in the divisor (np.sum(stacked > 0, axis=0)), which (a) inflated the numerator with NoData and (b) wrongly excluded valid 0 measurements from the divisor. It also used np.divide(..., casting='unsafe'), which truncates rather than rounds when casting back to an integer output dtype (Byte / UInt16), producing a systematic low bias on integer EO stacks. Now the kernel reads each source band's declared NoData (via BandAccessors.getNoDataValue, baked into the pyfunc source as a literal list at VRT-write time), masks NoData cells out of both sum and divisor, includes valid 0s, uses float64 internally, and rounds to nearest even before the unsafe cast when the output dtype is integer. The bogus np.clip(out_ar, stacked.min(), stacked.max(), ...) (the bounds were contaminated by NoData sentinels) is removed. When at least one input declares NoData, that value is also stamped on the output band so downstream GetNoDataValue reports all-NoData pixels. Regression coverage in RST_AggregationsTest: "excludes declared NoData from both sum and divisor", "counts valid 0 cells in the divisor", "rounds (not truncates) when casting to integer output".
  • Scalar args without f.lit(...). Python wrappers auto-wrap bool / int / float / bytes; Scala adds typed overloads. SQL was already natively typed. String literals must still be wrapped in f.lit(...), per pyspark's column-ref convention. Details and migration examples are in Scalar values vs lit(...) wrapping.
  • Example notebooks — EO Series, xView, and enablement diagrams. New end-to-end walkthroughs under docs/examples/ covering EO time-series, xView object-detection rasters, and RasterX architecture diagrams.
  • Supply-chain hardening (lockdown). Jobs pinned to the Databricks-hardened runner group (org-level allowlist, ephemeral VMs, constrained secret access); every Maven dependency, transitive dep, plugin, and plugin dependency is PGP-verified against .maven-keys.list before any compile or test execution; pip and Maven routed through JFrog with OIDC; init script + pinned package versions vetted; new Security page in the docs.
  • Pre-built, hash-verified GDAL bundle. The GDAL native install path is now a CI-built tarball (geobrix-gdal-artifacts-v<version>-noble.tar.gz + matching .sha256 sidecar, attached to each release alongside a versioned geobrix-gdal-init.sh). Cluster start drops from ~15 minutes (legacy PPA dance per boot) to ~30–90 seconds (verify sidecar → extract → dpkg -i). Trust chain is now four layers: CI-side GPG fingerprint pin → per-file SHA256SUMS inside the tarball → outer .sha256 sidecar in the staging Volume → the Volume's write ACL. The legacy on-cluster path is preserved as scripts/geobrix-gdal-init-ppa.sh for bundle bootstrapping. Bundle is amd64 / x86_64 only (Intel or AMD CPUs); ARM-based instance types — AWS Graviton, Ampere, Apple Silicon — are not supported. See Installation and the rationale on the Security page.
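The corrected averaging rules for gbx_rst_combineavg listed above can be illustrated for a single pixel in plain Python. This is only a sketch of the described behavior, not the shipped kernel (which, per the notes, is a numpy-based VRT pixel function); the function name nodata_aware_mean and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def nodata_aware_mean(
    values: Sequence[float],
    nodata: Sequence[Optional[float]],
    integer_output: bool,
    out_nodata: Optional[float] = None,
) -> Optional[float]:
    """Average one pixel across co-registered source bands (hypothetical helper).

    values[i] is the pixel value from source band i; nodata[i] is that band's
    declared NoData value, or None if the band declares none.
    """
    # Keep every value that is not its own band's NoData sentinel.
    # Valid zeros are kept: 0 is a measurement, not a missing value.
    valid = [float(v) for v, nd in zip(values, nodata) if nd is None or v != nd]

    if not valid:
        # Every source was NoData here, so the output pixel is NoData too.
        return out_nodata

    # Python floats are 64-bit, mirroring the float64 intermediate.
    mean = sum(valid) / len(valid)

    # Round half to even before the integer cast instead of truncating.
    return float(round(mean)) if integer_output else mean
```

With a Byte stack whose NoData is 255, nodata_aware_mean([10, 255, 20], [255, 255, 255], True, 255) averages only 10 and 20, and nodata_aware_mean([0, 10], [255, 255], False) counts the valid 0 in the divisor.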
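The "verify sidecar" step of the GDAL bundle install can be pictured with Python's standard library. This is a sketch and not the contents of geobrix-gdal-init.sh; the helper name verify_sidecar is hypothetical, and it assumes the .sha256 sidecar follows the sha256sum convention of putting the hex digest first on the line.

```python
import hashlib
from pathlib import Path


def verify_sidecar(artifact: Path, sidecar: Path) -> bool:
    """Return True when the artifact's SHA-256 matches the sidecar digest."""
    tokens = sidecar.read_text().split()
    if not tokens:
        # An empty sidecar can never vouch for the artifact.
        return False
    expected = tokens[0].lower()

    digest = hashlib.sha256()
    with artifact.open("rb") as fh:
        # Hash in 1 MiB chunks so large tarballs are not loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)

    return digest.hexdigest() == expected
```

A caller would extract the tarball only when this check returns True; the per-file SHA256SUMS inside the tarball is a separate, inner layer that this sketch does not cover.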

Conventions:

  • Baseline: Name or behavior before the change (what to search for in old code or docs).
  • Current: Name or behavior after the change (what to use now).
  • Notes: Short reason (e.g. standardize across languages, underscore standardization, _geometry → _geom).

General

Baseline | Current | Notes
Python import geobrix.* | databricks.labs.gbx.* | Match Scala package and published artifact; avoid namespace clashes.
Extra underscores in function names (multi-word parts spelled with _) | Single underscore between prefix and compound (e.g. rst_pixelwidth, gbx_bng_cellarea) | Underscore standardization: one leading prefix, then one compound word; no _ inside the operation name.
Non-Column value args required f.lit(...) / lit(...) wrapping (e.g. rst_clip(tile, geom, f.lit(True)), bng_pointascell(pt, f.lit(1))) | Plain Python/Scala non-string scalars accepted directly (e.g. rst_clip(tile, geom, True), rst_transform(tile, 4326), bng_pointascell(pt, 1)) | Matches Mosaic/DBR built-in ergonomics for booleans/numerics. Python wrappers auto-wrap bool/int/float/bytes via f.lit; Scala adds typed overloads. Strings still follow pyspark's column-ref convention: rx.rst_width("tile") is still f.col("tile"); wrap in f.lit(...) for string literals (e.g. driver=f.lit("GTiff")).

All specific function renames from that standardization are listed in the component tables below.


RasterX

Baseline | Current | Notes
(GDAL reader output column) path | source | Docs/tests aligned to GDAL reader output column name.
rst_band_metadata / gbx_rst_band_metadata | rst_bandmetadata / gbx_rst_bandmetadata | Underscore standardization.
rst_bounding_box / gbx_rst_bounding_box | rst_boundingbox / gbx_rst_boundingbox | Underscore standardization.
rst_pixel_width / gbx_rst_pixel_width | rst_pixelwidth / gbx_rst_pixelwidth | Underscore standardization.
rst_pixel_height / gbx_rst_pixel_height | rst_pixelheight / gbx_rst_pixelheight | Underscore standardization.
rst_num_bands / gbx_rst_num_bands | rst_numbands / gbx_rst_numbands | Underscore standardization.
rst_pixel_count / gbx_rst_pixel_count | rst_pixelcount / gbx_rst_pixelcount | Underscore standardization.
rst_scale_x / gbx_rst_scale_x | rst_scalex / gbx_rst_scalex | Underscore standardization.
rst_scale_y / gbx_rst_scale_y | rst_scaley / gbx_rst_scaley | Underscore standardization.
rst_upper_left_x / gbx_rst_upper_left_x | rst_upperleftx / gbx_rst_upperleftx | Underscore standardization.
rst_upper_left_y / gbx_rst_upper_left_y | rst_upperlefty / gbx_rst_upperlefty | Underscore standardization.
rst_geo_reference / gbx_rst_geo_reference | rst_georeference / gbx_rst_georeference | Underscore standardization.
rst_get_nodata / gbx_rst_get_nodata | rst_getnodata / gbx_rst_getnodata | Underscore standardization.
rst_get_subdataset / gbx_rst_get_subdataset | rst_getsubdataset / gbx_rst_getsubdataset | Underscore standardization.
rst_mem_size / gbx_rst_mem_size | rst_memsize / gbx_rst_memsize | Underscore standardization.
rst_sub_datasets / gbx_rst_sub_datasets | rst_subdatasets / gbx_rst_subdatasets | Underscore standardization.
rst_combine_avg_agg / gbx_rst_combine_avg_agg | rst_combineavg_agg / gbx_rst_combineavg_agg | Underscore standardization.
rst_derived_band_agg / gbx_rst_derived_band_agg | rst_derivedband_agg / gbx_rst_derivedband_agg | Underscore standardization.
rst_from_content / gbx_rst_from_content | rst_fromcontent / gbx_rst_fromcontent | Underscore standardization.
rst_from_file / gbx_rst_from_file | rst_fromfile / gbx_rst_fromfile | Underscore standardization.
rst_from_bands / gbx_rst_from_bands | rst_frombands / gbx_rst_frombands | Underscore standardization.
rst_make_tiles / gbx_rst_make_tiles | rst_maketiles / gbx_rst_maketiles | Underscore standardization.
rst_re_tile / gbx_rst_re_tile | rst_retile / gbx_rst_retile | Underscore standardization.
rst_separate_bands / gbx_rst_separate_bands | rst_separatebands / gbx_rst_separatebands | Underscore standardization.
rst_to_overlapping_tiles / gbx_rst_to_overlapping_tiles | rst_tooverlappingtiles / gbx_rst_tooverlappingtiles | Underscore standardization.
rst_init_nodata / gbx_rst_init_nodata | rst_initnodata / gbx_rst_initnodata | Underscore standardization.
rst_is_empty / gbx_rst_is_empty | rst_isempty / gbx_rst_isempty | Underscore standardization.
rst_map_algebra / gbx_rst_map_algebra | rst_mapalgebra / gbx_rst_mapalgebra | Underscore standardization.
rst_raster_to_world_coord / gbx_rst_raster_to_world_coord (and X/Y variants) | rst_rastertoworldcoord / gbx_rst_rastertoworldcoord (and X/Y) | Underscore standardization.
rst_world_to_raster_coord / gbx_rst_world_to_raster_coord (and X/Y variants) | rst_worldtorastercoord / gbx_rst_worldtorastercoord (and X/Y) | Underscore standardization.
rst_as_format / gbx_rst_as_format | rst_asformat / gbx_rst_asformat | Underscore standardization.
rst_combine_avg / gbx_rst_combine_avg | rst_combineavg / gbx_rst_combineavg | Underscore standardization.
rst_h3_raster_to_grid_avg (and Count/Max/Min/Median) | rst_h3_rastertogridavg (and Count/Max/Min/Median) | Underscore standardization.
rst_bandmetadata(tile) (single arg) | rst_bandmetadata(tile, band) | Required band parameter added; use e.g. rst_bandmetadata("tile", f.lit(1)).
rst_fromfile raster field was StringType (path) with metadata.size = -1 | rst_fromfile raster field is BinaryType (file bytes) with real metadata.size | rst_fromfile now reads the file into the tile, so tiles are self-contained and downstream ops (e.g. rst_clip) no longer produce orphan temp paths. Matches rst_fromcontent and the GDAL reader.
Default output compression was ZSTD (TIFF tag 50000) | Default output compression is DEFLATE (baseline TIFF) | ZSTD output was not decodable by Java ImageIO and broke the Databricks image preview after operators like rst_clip. DEFLATE is universally previewable and (with PREDICTOR=2/3) still compresses well. Override per-call via tile metadata compression key.
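When migrating, the Baseline-to-Current renames above can be applied mechanically to source files or SQL strings. The sketch below uses only a small subset of the RasterX rows; the RENAMES dict and the migrate_names helper are hypothetical and not part of GeoBrix.

```python
import re

# Subset of the RasterX rename table (baseline -> current).
RENAMES = {
    "rst_band_metadata": "rst_bandmetadata",
    "rst_bounding_box": "rst_boundingbox",
    "rst_pixel_width": "rst_pixelwidth",
    "rst_combine_avg": "rst_combineavg",
    "rst_combine_avg_agg": "rst_combineavg_agg",
    "rst_re_tile": "rst_retile",
}

# An optional gbx_ prefix (SQL names) followed by one baseline name.
# Longest names are listed first so rst_combine_avg_agg is tried before
# rst_combine_avg; the lookarounds reject matches inside longer identifiers.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])(gbx_)?("
    + "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True))
    + r")(?![A-Za-z0-9_])"
)


def migrate_names(source: str) -> str:
    """Rewrite baseline function names to their current spelling."""
    return _PATTERN.sub(
        lambda m: (m.group(1) or "") + RENAMES[m.group(2)], source
    )
```

For example, migrate_names('rx.rst_pixel_width("tile")') yields 'rx.rst_pixelwidth("tile")', while an unrelated identifier such as my_rst_re_tile_helper is left alone. Behavior changes in the table (the added band argument, the rst_fromfile field type, the compression default) still need manual review.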

GridX (BNG)

Baseline | Current | Notes
bng_eastnortasbng (Python) / gbx_bng_eastnortasbng (SQL) | bng_eastnorthasbng / gbx_bng_eastnorthasbng | Standardize across languages (Python had typo; Scala already eastnorth).
bng_cell_area / gbx_bng_cell_area | bng_cellarea / gbx_bng_cellarea | Underscore standardization.
bng_cell_intersection / gbx_bng_cell_intersection | bng_cellintersection / gbx_bng_cellintersection | Underscore standardization.
bng_cell_union / gbx_bng_cell_union | bng_cellunion / gbx_bng_cellunion | Underscore standardization.
bng_euclidean_distance / gbx_bng_euclidean_distance | bng_euclideandistance / gbx_bng_euclideandistance | Underscore standardization.
bng_point_as_bng / gbx_bng_point_as_bng | bng_pointascell / gbx_bng_pointascell | Underscore standardization; renamed for clarity: point → cell (not "point as BNG").
bng_cell_intersection_agg / gbx_bng_cell_intersection_agg | bng_cellintersection_agg / gbx_bng_cellintersection_agg | Underscore standardization.
bng_cell_union_agg / gbx_bng_cell_union_agg | bng_cellunion_agg / gbx_bng_cellunion_agg | Underscore standardization.
bng_geometry_kring / gbx_bng_geometry_kring | bng_geomkring / gbx_bng_geomkring | _geometry → _geom in name.
bng_geometry_kloop / gbx_bng_geometry_kloop | bng_geomkloop / gbx_bng_geomkloop | _geometry → _geom in name.
bng_geometry_kring_explode / gbx_bng_geometry_kring_explode | bng_geomkringexplode / gbx_bng_geomkringexplode | _geometry → _geom + underscore standardization.
bng_geometry_kloop_explode / gbx_bng_geometry_kloop_explode | bng_geomkloopexplode / gbx_bng_geomkloopexplode | _geometry → _geom + underscore standardization.
bng_k_ring / gbx_bng_k_ring | bng_kring / gbx_bng_kring | Underscore standardization.
bng_k_loop / gbx_bng_k_loop | bng_kloop / gbx_bng_kloop | Underscore standardization.
bng_k_ring_explode / gbx_bng_k_ring_explode | bng_kringexplode / gbx_bng_kringexplode | Underscore standardization.
bng_k_loop_explode / gbx_bng_k_loop_explode | bng_kloopexplode / gbx_bng_kloopexplode | Underscore standardization.
bng_tessellate_explode / gbx_bng_tessellate_explode | bng_tessellateexplode / gbx_bng_tessellateexplode | Underscore standardization.

VectorX

Baseline | Current | Notes
(Schema/column) _geometry | _geom | Standardize geometry column suffix across readers and examples.
st_legacy_as_wkb / gbx_st_legacy_as_wkb | st_legacyaswkb / gbx_st_legacyaswkb | Underscore standardization.

Readers

Baseline | Current | Notes
shapefile | shapefile_ogr | Reader namespace: format + engine to avoid conflicts with other Spark extensions.
geojson | geojson_ogr | Same.
ogr_gpkg | gpkg_ogr | Same; consistent format_engine order.
file_gdb | file_gdb_ogr | Same.
(none) | gtiff_gdal | New reader: named GDAL reader for GeoTIFF; use instead of gdal with option("driver", "GTiff").
Info: Reader renames above are planned for 0.2.0. Beta (0.1.x) may still expose the baseline names in some contexts.


Scalar values vs lit(...) wrapping

Previously, every non-Column argument had to be wrapped in f.lit(...) (Python) or lit(...) (Scala). That was a regression from Mosaic/DBR built-ins, where booleans and numerics can be passed as plain values. In 0.3.0, plain scalars are accepted across Python, Scala, and SQL bindings.

Python — wrappers accept Column or scalar (bool/int/float/bytes); non-string scalars are auto-wrapped with f.lit(...). Strings still follow pyspark's column-reference convention (bare string ≈ f.col(name)); wrap in f.lit("...") to pass a string literal.

# ✅ Before 0.3.0 — required f.lit for every value
rx.rst_clip("tile", "geom", f.lit(True))
rx.rst_transform("tile", f.lit(4326))
bx.bng_pointascell("pt", f.lit(1))
bx.bng_pointascell("pt", f.lit("1km"))

# ✅ 0.3.0 — scalars accepted directly
rx.rst_clip("tile", "geom", True)
rx.rst_transform("tile", 4326)
bx.bng_pointascell("pt", 1)
bx.bng_pointascell("pt", f.lit("1km")) # string literal — still wrap in f.lit
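The wrapping rule described above amounts to a small dispatch on argument type. The sketch below is not GeoBrix's implementation: Lit, Col, and to_column are hypothetical stand-ins for f.lit, f.col, and whatever the wrappers do internally, so that the example runs without Spark.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Lit:
    """Stand-in for a literal column, as produced by f.lit(value)."""
    value: Any


@dataclass(frozen=True)
class Col:
    """Stand-in for a column reference, as produced by f.col(name)."""
    name: str


def to_column(arg: Any):
    """Normalize a wrapper argument the way the notes describe."""
    if isinstance(arg, (Lit, Col)):
        # Already a column expression: pass through unchanged.
        return arg
    if isinstance(arg, str):
        # Bare strings keep pyspark's convention: a column name.
        return Col(arg)
    if isinstance(arg, (bool, int, float, bytes)):
        # Non-string scalars are wrapped as literals automatically.
        return Lit(arg)
    raise TypeError(f"unsupported argument type: {type(arg).__name__}")
```

Under this rule to_column("pt") is a column reference and to_column(1) is a literal, which is why the string resolution "1km" still has to arrive already wrapped as a literal.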

Scala — typed overloads added for Boolean / Int / Double / String value parameters. Column args (e.g. geometry, tile) still take Column.

// ✅ 0.3.0 — scalar overloads resolve without lit(...)
rst_clip(col("tile"), col("geom"), cutlineAllTouched = true)
rst_transform(col("tile"), 4326)
bng_pointascell(col("pt"), 1)
bng_pointascell(col("pt"), "1km")

SQL — values are already natively accepted by Spark SQL; no change needed:

SELECT gbx_rst_clip(tile, geom, true) FROM ...;
SELECT gbx_bng_pointascell(pt, 1) FROM ...;
SELECT gbx_bng_pointascell(pt, '1km') FROM ...;

When you still need f.lit(...) in Python:

  • String literals: rx.rst_fromfile(f.lit("/path/to.tif"), f.lit("GTiff")) — a bare string is treated as a column reference.
  • Nulls / explicit typing: e.g. f.lit(None).cast("double").

How to use this page

  • Migrating code: Search for the baseline name in your code or config; replace with Current and apply any behavior notes.
  • Docs or tests: After a change, add one row here so future readers know what changed and why.
  • After approval: Move content into formal release notes (e.g. per-version sections) and keep this page for historical beta-only changes, or retire it.

Notable improvements and fixes

  • Python package rename: Imports changed from geobrix.* to databricks.labs.gbx.* to align with Scala and the published artifact; update all import statements and environment references.
  • Init script / NumPy: Init script updated to install NumPy 2.x so GDAL Python array operations execute correctly; fixes runtime failures in gbx_rst_mapalgebra and gbx_rst_ndvi when used with array-based paths.
  • Error handling: Functions that previously threw exceptions during execution now surface errors more clearly (e.g. return null or a controlled default with error messages captured) instead of failing with opaque stack traces.
  • RasterX rst_bandmetadata: A required band argument was added; call as rst_bandmetadata(tile, band) in Python, Scala, and SQL (e.g. rst_bandmetadata("tile", f.lit(1)) in Python).
  • GDAL reader column: Raster DataFrames from the GDAL reader use the column name source (not path) for the file path; update any code or docs that assumed path.
  • BNG aggregators (bng_cellunion_agg, bng_cellintersection_agg): Fixed a bug where aggregation buffers were shared across partitions (and across tests in the same JVM), causing incorrect core flags when running full test suites or with multiple partitions. Each partition now gets a fresh buffer. Chip fields are resolved by type/name in the union aggregator for robustness to struct field order. Test expectation corrected for “all core chips” intersection: result is now correctly documented as core=true (whole cell).
  • rst_clip axis-order fix for EPSG-imported CRS (fixes all-black clips): When the clip geometry's CRS was set via an EPSG code (plain rst_transform-style input, EWKT SRID=4326;..., or EWKB with SRID), GDAL 3+ defaults that SpatialReference to authority-compliant axis order — for EPSG:4326 that means (latitude, longitude). JTS / Databricks / most GIS tooling emit WKT/WKB coordinates in traditional (x, y) = (lon, lat) order, so the reprojection inside rst_clip was silently swapping the axes (e.g. -80 14 interpreted as lat=-80, near the south pole) and the cutline missed the raster entirely, producing all-black output. OSRTransformGeometry.transform now clones both source and destination SpatialReferences and forces OAMS_TRADITIONAL_GIS_ORDER on the clones before running the OGR transform, so JTS-origin WKB is interpreted correctly. Caller-owned SpatialReferences are not mutated.
  • EWKT / EWKB support for raster clip (CRS mismatch handling): rst_clip now accepts EWKT (SRID=<epsg>;<WKT>) and EWKB (PostGIS extended WKB) in addition to plain WKT/WKB. Semantics:
    • Plain WKT / WKB (no SRID): the geometry is assumed to already be in the raster's CRS; no reprojection is performed.
    • EWKT / EWKB (SRID set and resolvable via EPSG): the geometry's CRS is used and, if it differs from the raster's CRS, the cutline is reprojected before clipping.
    • If the SRID is 0 or not a valid EPSG code, the code falls back to the raster's CRS (same as the plain case). This restores Mosaic-compatible behavior and no longer silently produces an empty/black clip when a caller forgets to set the SRID. JTS.fromWKT / JTS.fromWKB now auto-detect EWKT/EWKB; new JTS.toEWKT / JTS.toEWKB helpers emit SRID-preserving forms. Plain toWKT / toWKB output is unchanged (OGC, no SRID).
  • rst_transform invalid SRID: rst_transform(tile, targetSrid) now rejects targetSrid <= 0 and EPSG codes that GDAL cannot resolve with a clear error (surfaced in tile metadata error_message) instead of returning a raster with an uninitialized CRS.
  • /vsimem/ path handling hardening: rst_memsize / rst_unlink and the GDAL writer's in-memory byte fetch now use startsWith("/vsimem/") (not contains) and null-check GetMemFileBuffer, so datasets whose description happens to embed the substring (e.g. NetCDF subdataset selectors) are no longer mis-routed through the in-memory branch.
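The CRS-resolution rules for rst_clip listed above (plain WKT, EWKT with a usable SRID, SRID 0 or unresolvable) can be sketched for the text encodings as follows. The function resolve_cutline_srid, the known_epsg collection standing in for an EPSG lookup, and the regular expression are illustrative assumptions, not the GeoBrix code.

```python
import re
from typing import Container, Tuple

# EWKT is "SRID=<code>;<WKT>"; anything else is treated as plain WKT.
_EWKT = re.compile(r"^\s*SRID=(-?\d+)\s*;\s*(.+)$", re.IGNORECASE | re.DOTALL)


def resolve_cutline_srid(
    text: str, raster_srid: int, known_epsg: Container[int]
) -> Tuple[int, str, bool]:
    """Return (srid_to_use, wkt_body, needs_reprojection)."""
    match = _EWKT.match(text)
    if match is None:
        # Plain WKT: assumed to be in the raster's CRS already.
        return raster_srid, text.strip(), False

    srid, wkt = int(match.group(1)), match.group(2).strip()
    if srid <= 0 or srid not in known_epsg:
        # SRID 0 or unresolvable: fall back to the raster's CRS.
        return raster_srid, wkt, False

    # Valid SRID: reproject only when it differs from the raster's CRS.
    return srid, wkt, srid != raster_srid
```

For a raster in EPSG:32617, "SRID=4326;POINT (-80 14)" resolves to SRID 4326 with reprojection required, while "POINT (1 2)" and "SRID=0;POINT (1 2)" both resolve to the raster's own SRID with no reprojection.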
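The difference between the prefix check and the old substring check for /vsimem/ paths is easy to see in a few lines. The helper name is_vsimem_path and the sample dataset descriptions are made up for illustration.

```python
from typing import Optional


def is_vsimem_path(description: Optional[str]) -> bool:
    """True only when the dataset description is itself an in-memory path."""
    # A prefix test: the substring may legitimately occur later in a
    # description (for example inside a subdataset selector).
    return description is not None and description.startswith("/vsimem/")


in_memory = "/vsimem/tile_1.tif"
subdataset = 'NETCDF:"/Volumes/data/vsimem/x.nc":band'

# The old substring test would have matched both descriptions.
assert "/vsimem/" in in_memory and "/vsimem/" in subdataset
```

Only the first description passes is_vsimem_path, so the second is no longer routed through the in-memory branch.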