Readers Overview

GeoBrix provides Spark readers for geospatial file formats.

Lightweight (pyrx)
Heavyweight (rasterx)

The lightweight tier ships native Python DataSource V2 readers — no JAR, no init script.

Why this scales beyond a single node

These are Spark DataSource V2 readers, not single-node rasterio/pyogrio wrappers. Work is partitioned and read in parallel across the cluster: vector readers slice features by chunkSize, and raster readers can split large files by sizeInMB. The result is a distributed DataFrame ready for joins and aggregations, with no driver-side collect. By contrast, a single-node pyogrio.read_* or rasterio.open call reads one file sequentially on one machine; these readers fan the same work out across executors and scale past a single machine's memory.
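To make the chunkSize slicing concrete, here is a minimal sketch of how a feature count could be divided into independent read ranges. The function name `plan_feature_chunks` is hypothetical and the half-open range layout is an assumption for illustration; it is not taken from the GeoBrix implementation.

```python
from typing import List, Tuple


def plan_feature_chunks(total_features: int, chunk_size: int = 10000) -> List[Tuple[int, int]]:
    """Split [0, total_features) into half-open (start, end) ranges of at most chunk_size.

    Hypothetical helper: illustrates the idea of slicing features by chunkSize,
    not the actual GeoBrix partition planner.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_features))
        for start in range(0, total_features, chunk_size)
    ]


# 25,000 features with a chunk size of 10000 yield three slices,
# each of which could be read by a different executor.
chunks = plan_feature_chunks(25_000)
print(chunks)  # [(0, 10000), (10000, 20000), (20000, 25000)]
```

Each range is independent of the others, which is what allows the reads to run in parallel rather than sequentially on one machine.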

Unlike the heavyweight readers (auto-discovered from the JAR), the lightweight Python DataSources are not auto-registered — call register(spark) once per session before using any *_gbx format:

from databricks.labs.gbx.ds.register import register
register(spark)

To register only the formats this session uses, pass only= (by format name, with or without the _gbx suffix):

register(spark, only=["raster_gbx", "gtiff_gbx"])

An unrecognized format raises ValueError.
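The name handling described for only= can be pictured with the sketch below. `normalize_only` and `KNOWN_FORMATS` are hypothetical names invented for this illustration; the format list is copied from the Available Readers table on this page, and the real register function may resolve names differently.

```python
from typing import Iterable, List

# Lightweight format names as listed in the Available Readers table.
KNOWN_FORMATS = (
    "raster_gbx", "gtiff_gbx", "vector_gbx", "shapefile_gbx",
    "geojson_gbx", "gpkg_gbx", "file_gdb_gbx",
)
SUFFIX = "_gbx"


def normalize_only(only: Iterable[str]) -> List[str]:
    """Hypothetical helper: accept 'gtiff' or 'gtiff_gbx', reject unknown names."""
    resolved: List[str] = []
    for name in only:
        # Append the suffix when the caller left it off.
        full = name if name.endswith(SUFFIX) else name + SUFFIX
        if full not in KNOWN_FORMATS:
            raise ValueError(f"Unrecognized format: {name!r}")
        if full not in resolved:  # keep order, drop repeats
            resolved.append(full)
    return resolved


print(normalize_only(["raster", "gtiff_gbx"]))  # ['raster_gbx', 'gtiff_gbx']
```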

Available Readers

Reader	Format Name	Description
Raster Reader	`raster_gbx`	Pure-Python catch-all raster reader (no JAR; DataSource V2)
GeoTIFF Reader	`gtiff_gbx`	Pure-Python GeoTIFF reader (preset `driver="GTiff"`)
Vector Reader	`vector_gbx`	Pure-Python catch-all vector reader (pyogrio; same OGR schema)
Shapefile Reader	`shapefile_gbx`	Pure-Python Shapefile reader (preset OGR driver)
GeoJSON Reader	`geojson_gbx`	Pure-Python GeoJSON reader (preset OGR driver)
GeoPackage Reader	`gpkg_gbx`	Pure-Python GeoPackage reader (preset OGR driver)
GeoDatabase Reader	`file_gdb_gbx`	Pure-Python File Geodatabase reader (preset OGR driver)

See the Raster Reader page for full raster usage/options, and the vector reader pages for the OGR (geom_0, geom_0_srid, geom_0_srid_proj, …attributes) schema.

Benchmarks

Each *_gbx lightweight reader corresponds to a *_ogr (or gdal/gtiff_gdal) heavyweight counterpart. The generic vector_gbx catch-all reader pairs with the heavyweight ogr generic OGR reader; named readers such as shapefile_gbx pair with shapefile_ogr, and so on.

Both tiers are benchmarked on the same cluster, same source file, same row counts. Parity (row-count equality between the two tiers) is a hard gate — a mismatch fails the run immediately. Per-format timing results are published on the Benchmarking page; the vector-reader comparison table is in the Results — vector readers section (timing figures will be filled in once a controlled cluster run is completed).

For the full methodology, raster-function results, and the PMTiles writer comparison, see the Benchmarking page.
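The parity gate described above amounts to a strict equality check on row counts before any timing is reported. The sketch below shows one way to express it; `ParityError` and `check_parity` are hypothetical names, and in a real run the two counts would presumably come from calling count() on the DataFrames produced by each tier.

```python
class ParityError(AssertionError):
    """Raised when the two tiers return different row counts (hypothetical name)."""


def check_parity(fmt_light: str, rows_light: int, fmt_heavy: str, rows_heavy: int) -> int:
    """Fail immediately on a row-count mismatch; otherwise return the shared count."""
    if rows_light != rows_heavy:
        raise ParityError(
            f"{fmt_light} returned {rows_light} rows but "
            f"{fmt_heavy} returned {rows_heavy}"
        )
    return rows_light


# Matching counts pass the gate and the benchmark can continue.
print(check_parity("shapefile_gbx", 1200, "shapefile_ogr", 1200))  # 1200
```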

Heavyweight readers are implemented as Spark DataSource V2 connectors backed by GDAL (raster) and OGR (vector). They are registered automatically when the GeoBrix JAR is on the classpath — no additional configuration needed.

Available Readers

Raster Readers (GDAL-based)

Reader	Format Name	Description
Raster Reader	`gdal` / `gtiff_gdal`	GDAL-backed raster readers (generic and GeoTIFF named)

Vector Readers (OGR-based)

Reader	Format Name	Description
Vector Reader	`ogr`	Generic reader for any OGR-supported vector format
Shapefile Reader	`shapefile_ogr`	Named reader for ESRI Shapefiles
GeoJSON Reader	`geojson_ogr`	Named reader for GeoJSON/GeoJSONSeq
GeoPackage Reader	`gpkg_ogr`	Named reader for GeoPackage
FileGDB Reader	`file_gdb_ogr`	Named reader for ESRI File Geodatabase

Basic Usage

All readers follow the same pattern:

# Python
df = spark.read.format("<reader_name>").load("/path/to/file")

# Scala
val df = spark.read.format("<reader_name>").load("/path/to/file")

-- SQL
SELECT * FROM <reader_name>.`/path/to/file`;

Examples

Raster (GeoTIFF):

df = spark.read.format("gtiff_gdal").load("/path/to/raster.tif")

Vector (Shapefile):

df = spark.read.format("shapefile_ogr").load("/path/to/data.shp")

Generic (any format):

# Generic GDAL for rasters
df = spark.read.format("gdal").option("driver", "NetCDF").load("/path/to/data.nc")

# Generic OGR for vectors
df = spark.read.format("ogr").option("driverName", "KML").load("/path/to/data.kml")

Output Schemas

Raster Output

Produces the standard tile format used by GeoBrix APIs.

root
 |-- tile: struct
     |-- cellid: bigint (grid cell ID, nullable)
     |-- raster: binary (raster file content)
     |-- metadata: map<string,string> (driver, extension, etc.)

See Tile Structure for detailed field descriptions.

Vector Output

Attribute columns vary with the format and driver conventions; the geometry columns follow this pattern:

root
 |-- geom_0: binary (geometry in WKB format)
 |-- geom_0_srid: integer (spatial reference ID)
 |-- geom_0_srid_proj: string (projection definition)
 |-- <attribute_columns>: various (feature attributes)
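Because geom_0 holds standard WKB, its header can be inspected with nothing more than the standard library. The sketch below decodes the byte-order flag, the geometry type code, and, for a plain 2D point, the coordinates. The function names are hypothetical, and the sample bytes are built by hand rather than read from a GeoBrix DataFrame.

```python
import struct
from typing import Tuple


def read_wkb_header(wkb: bytes) -> Tuple[str, int]:
    """Return (struct byte-order prefix, geometry type code) from a WKB value."""
    if len(wkb) < 5 or wkb[0] not in (0, 1):
        raise ValueError("not a valid WKB header")
    order = "<" if wkb[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack(order + "I", wkb[1:5])
    return order, geom_type


def decode_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB Point (type code 1) into (x, y)."""
    order, geom_type = read_wkb_header(wkb)
    if geom_type != 1:
        raise ValueError(f"expected a 2D Point, got type code {geom_type}")
    x, y = struct.unpack(order + "dd", wkb[5:21])
    return x, y


# Hand-built little-endian WKB point standing in for one geom_0 value.
sample = struct.pack("<BIdd", 1, 1, -122.4, 37.8)
print(decode_point(sample))  # (-122.4, 37.8)
```

For real workloads a geometry library such as Shapely (`shapely.wkb.loads`) handles every geometry type; the manual decoding here only shows what the binary column contains.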

Path Types

All readers support:

Single file: /path/to/file.tif
Directory: /path/to/directory/
Wildcard: /path/to/*.tif
Cloud storage: s3://bucket/path, abfss://..., gs://bucket/path
Unity Catalog Volumes: /Volumes/catalog/schema/volume/path

Common Options

Raster Options

Option	Default	Description
`driver`	Auto-detect	GDAL driver name (e.g., "GTiff", "NetCDF")
`sizeInMB`	`"-1"`	Values `<= 0` (the default) disable splitting (one tile per file); a positive value, in MB, splits large files
`filterRegex`	`".*"`	Filter files by regex pattern
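Before launching a read, it can help to check a filterRegex pattern against a handful of known paths. The sketch below does this with Python's re module. It assumes the pattern must match the whole path, which this page does not confirm; the heavyweight readers run on the JVM, so their regex dialect and matching rules may differ. `preview_filter` and the sample paths are hypothetical.

```python
import re
from typing import Iterable, List


def preview_filter(paths: Iterable[str], filter_regex: str = ".*") -> List[str]:
    """Return the paths that fully match filter_regex (assumed matching rule)."""
    pattern = re.compile(filter_regex)
    return [p for p in paths if pattern.fullmatch(p)]


candidates = [
    "/data/scene_2024_01.tif",
    "/data/scene_2023_12.tif",
    "/data/scene_2024_01.tif.aux.xml",
]
# The sidecar .aux.xml file is excluded because the pattern must end in ".tif".
print(preview_filter(candidates, r".*_2024.*\.tif"))  # ['/data/scene_2024_01.tif']
```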

Vector Options

Option	Default	Description
`driverName`	Auto-detect	OGR driver name (e.g., "ESRI Shapefile")
`chunkSize`	`"10000"`	Records per chunk for parallel reading
`layerName`	`""`	Layer name for multi-layer formats
`layerN`	`"0"`	Layer index for multi-layer formats
`asWKB`	`"true"`	Output geometry as WKB (binary) vs WKT (text)

Reader Types Explained

Generic Readers (gdal, ogr):

Work with any format supported by GDAL/OGR
May need the driver (gdal) or driverName (ogr) option for formats that are not auto-detected
Use when format doesn't have a named reader

Named Readers (gtiff_gdal, shapefile_ogr, etc.):

Preset the driver option for common formats
Cleaner syntax, no driver option needed
Recommended for supported formats

Example:

# Named reader (cleaner)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")

# Generic reader (more verbose, same result)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")
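One way to picture a named reader is as a generic reader plus a preset option. The sketch below records that relationship in a lookup table. `NAMED_PRESETS` and `expand_named_reader` are hypothetical; the gtiff_gdal entry mirrors the example above, while the shapefile_ogr entry assumes the preset is the "ESRI Shapefile" driver name shown in the Vector Options table.

```python
from typing import Dict, Tuple

# named reader -> (generic reader, preset options); illustrative only.
NAMED_PRESETS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "gtiff_gdal": ("gdal", {"driver": "GTiff"}),
    "shapefile_ogr": ("ogr", {"driverName": "ESRI Shapefile"}),  # assumed preset
}


def expand_named_reader(name: str) -> Tuple[str, Dict[str, str]]:
    """Return the generic format and options equivalent to a named reader.

    Names without a known preset are returned unchanged with no options.
    """
    if name in NAMED_PRESETS:
        generic, options = NAMED_PRESETS[name]
        return generic, dict(options)  # copy so callers cannot mutate the table
    return name, {}


print(expand_named_reader("gtiff_gdal"))  # ('gdal', {'driver': 'GTiff'})
```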

Performance Tips

Use appropriate split size for rasters:

df = spark.read.format("gdal").option("sizeInMB", "32").load("/path")

Use chunk size for vectors:

df = spark.read.format("ogr").option("chunkSize", "50000").load("/path")

Filter files with regex:

df = spark.read.format("gdal").option("filterRegex", ".*_2024.*\\.tif").load("/path")

Write data after read to a table (avoid repeat loading from file):

# saveAsTable returns None, so the result is not assigned to df
spark.read.format("shapefile_ogr").load("/path").write.saveAsTable(...)

Reader Types Correspondence

Each lightweight *_gbx reader maps directly to a heavyweight counterpart: gtiff_gbx ↔ gtiff_gdal, shapefile_gbx ↔ shapefile_ogr, geojson_gbx ↔ geojson_ogr, gpkg_gbx ↔ gpkg_ogr, file_gdb_gbx ↔ file_gdb_ogr. The generic catch-all pair is vector_gbx (lightweight) ↔ ogr (heavyweight), used for any OGR-supported format without a named reader.
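The pairs listed above can be captured in a small two-way lookup, which is handy when switching a pipeline from one tier to the other. The names `LIGHT_TO_HEAVY` and `counterpart` are hypothetical; only the pairs stated explicitly in this section are included, so raster_gbx is left out.

```python
# Lightweight -> heavyweight pairs exactly as listed in this section.
LIGHT_TO_HEAVY = {
    "gtiff_gbx": "gtiff_gdal",
    "shapefile_gbx": "shapefile_ogr",
    "geojson_gbx": "geojson_ogr",
    "gpkg_gbx": "gpkg_ogr",
    "file_gdb_gbx": "file_gdb_ogr",
    "vector_gbx": "ogr",
}
HEAVY_TO_LIGHT = {heavy: light for light, heavy in LIGHT_TO_HEAVY.items()}


def counterpart(fmt: str) -> str:
    """Return the other tier's format name for a listed reader."""
    if fmt in LIGHT_TO_HEAVY:
        return LIGHT_TO_HEAVY[fmt]
    if fmt in HEAVY_TO_LIGHT:
        return HEAVY_TO_LIGHT[fmt]
    raise KeyError(f"no listed counterpart for {fmt!r}")


print(counterpart("shapefile_gbx"))  # shapefile_ogr
print(counterpart("ogr"))            # vector_gbx
```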

Benchmarks

Each heavyweight reader is benchmarked against its *_gbx lightweight counterpart on the same cluster, same source file, same row counts. Parity (row-count equality between the two tiers) is a hard gate — a mismatch fails the run immediately. Per-format timing results are published on the Benchmarking page; the vector-reader comparison table is in the Results — vector readers section (timing figures will be filled in once a controlled cluster run is completed).

For the full methodology, raster-function results, and the PMTiles writer comparison, see the Benchmarking page.

Next Steps

Raster Reader - Generic and GeoTIFF raster readers
GeoTIFF Reader - Named GeoTIFF reader
Vector Reader - Generic vector reader
Shapefile Reader - Named Shapefile reader
GeoJSON Reader - Named GeoJSON reader
GeoPackage Reader - Named GeoPackage reader
FileGDB Reader - Named File Geodatabase reader
