
Readers Overview

GeoBrix provides Spark readers for geospatial file formats.

The lightweight tier ships native Python DataSource V2 readers — no JAR, no init script.

Why this scales beyond a single node

These are Spark DataSource V2 readers, not single-node rasterio/pyogrio wrappers. The work is partitioned and read in parallel across the cluster: vector readers slice features by chunkSize, and raster readers can split large files by sizeInMB. The result is a distributed DataFrame ready for joins and aggregations, with no driver-side collect. A single-node pyogrio.read_* or rasterio.open call reads one file sequentially on one machine; these readers fan the same work out across executors and scale past a single machine's memory.
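To make the chunkSize idea concrete, here is a minimal sketch of how a feature count could be divided into independent read ranges, one per Spark partition. The helper plan_chunks is hypothetical and is not GeoBrix's actual partition planner; it only illustrates the arithmetic of slicing features by a chunk size.

```python
from typing import List, Tuple


def plan_chunks(feature_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split feature_count features into (offset, length) ranges.

    Hypothetical helper: each range holds at most chunk_size features and
    could be read by a separate executor task.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset, min(chunk_size, feature_count - offset))
        for offset in range(0, feature_count, chunk_size)
    ]


# 25,000 features with a chunk size of 10,000 yield three ranges.
chunks = plan_chunks(feature_count=25_000, chunk_size=10_000)
print(chunks)  # [(0, 10000), (10000, 10000), (20000, 5000)]
```

Ranges of this shape could map onto pyogrio's skip_features and max_features arguments, though whether GeoBrix reads its chunks that way is an assumption here.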

Register first

Unlike the heavyweight readers (auto-discovered from the JAR), the lightweight Python DataSources are not auto-registered — call register(spark) once per session before using any *_gbx format:

from databricks.labs.gbx.ds.register import register
register(spark)

To register only the formats this session uses, pass only= (by format name, with or without the _gbx suffix):

register(spark, only=["raster_gbx", "gtiff_gbx"])

An unrecognized format raises ValueError.
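The only= behaviour described above (names accepted with or without the _gbx suffix, ValueError on anything unrecognized) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: normalize_only and KNOWN_FORMATS are hypothetical names, and the format list is copied from the Available Readers table on this page.

```python
from typing import Iterable, List

# Format names as listed in the Available Readers table (assumed complete).
KNOWN_FORMATS = (
    "raster_gbx",
    "gtiff_gbx",
    "vector_gbx",
    "shapefile_gbx",
    "geojson_gbx",
    "gpkg_gbx",
    "file_gdb_gbx",
)


def normalize_only(only: Iterable[str]) -> List[str]:
    """Resolve user-supplied names to full *_gbx format names.

    Hypothetical helper: accepts names with or without the _gbx suffix,
    drops duplicates, and raises ValueError for unknown formats.
    """
    resolved: List[str] = []
    for name in only:
        full = name if name.endswith("_gbx") else f"{name}_gbx"
        if full not in KNOWN_FORMATS:
            raise ValueError(f"unrecognized format: {name!r}")
        if full not in resolved:
            resolved.append(full)
    return resolved


print(normalize_only(["raster", "gtiff_gbx"]))  # ['raster_gbx', 'gtiff_gbx']
```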

Available Readers

| Reader | Format Name | Description |
| --- | --- | --- |
| Raster Reader | raster_gbx | Pure-Python catch-all raster reader (no JAR; DataSource V2) |
| GeoTIFF Reader | gtiff_gbx | Pure-Python GeoTIFF reader (preset driver="GTiff") |
| Vector Reader | vector_gbx | Pure-Python catch-all vector reader (pyogrio; same OGR schema) |
| Shapefile Reader | shapefile_gbx | Pure-Python Shapefile reader (preset OGR driver) |
| GeoJSON Reader | geojson_gbx | Pure-Python GeoJSON reader (preset OGR driver) |
| GeoPackage Reader | gpkg_gbx | Pure-Python GeoPackage reader (preset OGR driver) |
| GeoDatabase Reader | file_gdb_gbx | Pure-Python File Geodatabase reader (preset OGR driver) |

See the Raster Reader page for full raster usage/options, and the vector reader pages for the OGR (geom_0, geom_0_srid, geom_0_srid_proj, …attributes) schema.
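Once register(spark) has been called, a format name from the table is used like any other Spark reader format. The sketch below wraps the standard spark.read.format(...).load(...) chain in a small helper. The helper read_gbx and the example path are hypothetical, and passing chunkSize as a reader option of that exact name is an assumption; check the individual reader pages for the supported options.

```python
from typing import Any, Optional


def read_gbx(
    spark: Any,
    path: str,
    fmt: str = "shapefile_gbx",
    chunk_size: Optional[int] = None,
) -> Any:
    """Load path with a registered *_gbx format and return the DataFrame.

    Hypothetical wrapper: assumes register(spark) was already called in
    this session, and that chunkSize is accepted as a reader option.
    """
    reader = spark.read.format(fmt)
    if chunk_size is not None:
        # Assumed option name, taken from the chunkSize mention on this page.
        reader = reader.option("chunkSize", chunk_size)
    return reader.load(path)


# Usage in a live session (the path is a placeholder):
#   df = read_gbx(spark, "/data/parcels.shp", chunk_size=10_000)
#   df.select("geom_0", "geom_0_srid").show(5)
```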

Benchmarks

Each *_gbx lightweight reader corresponds to a *_ogr (or gdal/gtiff_gdal) heavyweight counterpart. The generic vector_gbx catch-all reader pairs with the heavyweight ogr generic OGR reader; named readers such as shapefile_gbx pair with shapefile_ogr, and so on.

Both tiers are benchmarked on the same cluster, same source file, same row counts. Parity (row-count equality between the two tiers) is a hard gate — a mismatch fails the run immediately. Per-format timing results are published on the Benchmarking page; the vector-reader comparison table is in the Results — vector readers section (timing figures will be filled in once a controlled cluster run is completed).
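The parity gate amounts to a row-count equality check that aborts the run on mismatch. A minimal sketch, assuming the two counts have already been obtained (for example from DataFrame.count() on each tier); ParityError and check_parity are hypothetical names, not part of the benchmark harness.

```python
class ParityError(RuntimeError):
    """Raised when the two tiers disagree on row count (hypothetical)."""


def check_parity(fmt: str, lightweight_rows: int, heavyweight_rows: int) -> None:
    """Fail immediately if the lightweight and heavyweight counts differ."""
    if lightweight_rows != heavyweight_rows:
        raise ParityError(
            f"{fmt}: lightweight read {lightweight_rows} rows, "
            f"heavyweight read {heavyweight_rows} rows"
        )


# Equal counts pass silently; timing would only be recorded after this gate.
check_parity("shapefile_gbx", lightweight_rows=1_000, heavyweight_rows=1_000)
```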

For the full methodology, raster-function results, and the PMTiles writer comparison, see the Benchmarking page.