Readers Overview
GeoBrix provides Spark readers for geospatial file formats.
- Lightweight (pyrx)
- Heavyweight (rasterx)
The lightweight tier ships native Python DataSource V2 readers — no JAR, no init script.
These are Spark DataSource V2 readers, not single-node rasterio/pyogrio wrappers. The
work is partitioned and read in parallel across the cluster — vector readers slice
features by chunkSize, raster readers can split large files by sizeInMB — and the result is
a distributed DataFrame ready for joins and aggregations with no driver-side collect. A
single-node pyogrio.read_* or rasterio.open reads one file sequentially on one machine;
these readers fan the same work across executors and scale past a single machine's memory.
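To make the chunkSize idea concrete, here is a minimal sketch of how a feature count could be sliced into independent ranges that separate tasks read in parallel. The function name `chunk_ranges` is hypothetical, and the way the readers actually assign ranges to Spark partitions is not documented on this page, so treat this as an illustration of the concept only.

```python
def chunk_ranges(feature_count, chunk_size=10000):
    """Yield half-open (start, end) feature ranges of at most chunk_size features.

    Hypothetical helper: it only illustrates chunked slicing and is not
    GeoBrix's partitioning code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, feature_count, chunk_size):
        # The last range is clipped to the total feature count.
        yield start, min(start + chunk_size, feature_count)


ranges = list(chunk_ranges(25000, chunk_size=10000))
print(ranges)
```

Each range can be read without knowledge of the others, which is what lets the work spread across executors.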
Unlike the heavyweight readers (auto-discovered from the JAR), the lightweight Python
DataSources are not auto-registered — call register(spark) once per session before
using any *_gbx format:
from databricks.labs.gbx.ds.register import register
register(spark)
To register only the formats this session uses, pass only= (by format name, with or without the _gbx suffix):
register(spark, only=["raster_gbx", "gtiff_gbx"])
An unrecognized format raises ValueError.
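The name handling described for only= can be sketched in plain Python. This is a guess at the behavior based on the text above (suffix optional, unknown names rejected), not GeoBrix's implementation; `KNOWN_FORMATS` and `normalize_formats` are hypothetical names, and the format list is taken from the Available Readers table below.

```python
# Hypothetical sketch of the described name handling; NOT the library's code.
KNOWN_FORMATS = {
    "raster_gbx", "gtiff_gbx", "vector_gbx", "shapefile_gbx",
    "geojson_gbx", "gpkg_gbx", "file_gdb_gbx",
}


def normalize_formats(only):
    """Return canonical *_gbx names, accepting names with or without the suffix."""
    resolved = []
    for name in only:
        canonical = name if name.endswith("_gbx") else f"{name}_gbx"
        if canonical not in KNOWN_FORMATS:
            # Mirrors the documented behavior: unknown formats raise ValueError.
            raise ValueError(f"Unrecognized format: {name!r}")
        resolved.append(canonical)
    return resolved


print(normalize_formats(["raster", "gtiff_gbx"]))
```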
Available Readers
| Reader | Format Name | Description |
|---|---|---|
| Raster Reader | raster_gbx | Pure-Python catch-all raster reader (no JAR; DataSource V2) |
| GeoTIFF Reader | gtiff_gbx | Pure-Python GeoTIFF reader (preset driver="GTiff") |
| Vector Reader | vector_gbx | Pure-Python catch-all vector reader (pyogrio; same OGR schema) |
| Shapefile Reader | shapefile_gbx | Pure-Python Shapefile reader (preset OGR driver) |
| GeoJSON Reader | geojson_gbx | Pure-Python GeoJSON reader (preset OGR driver) |
| GeoPackage Reader | gpkg_gbx | Pure-Python GeoPackage reader (preset OGR driver) |
| GeoDatabase Reader | file_gdb_gbx | Pure-Python File Geodatabase reader (preset OGR driver) |
See the Raster Reader page for full raster usage/options, and the vector
reader pages for the OGR (geom_0, geom_0_srid, geom_0_srid_proj, …attributes) schema.
Benchmarks
Each *_gbx lightweight reader corresponds to a *_ogr (or gdal/gtiff_gdal) heavyweight
counterpart. The generic vector_gbx catch-all reader pairs with the heavyweight ogr generic
OGR reader; named readers such as shapefile_gbx pair with shapefile_ogr, and so on.
Both tiers are benchmarked on the same cluster, same source file, same row counts. Parity (row-count equality between the two tiers) is a hard gate — a mismatch fails the run immediately. Per-format timing results are published on the Benchmarking page; the vector-reader comparison table is in the Results — vector readers section (timing figures will be filled in once a controlled cluster run is completed).
For the full methodology, raster-function results, and the PMTiles writer comparison, see the Benchmarking page.
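The parity gate amounts to a strict equality check that aborts on the first mismatch. A minimal sketch follows, assuming the two row counts have already been obtained (for example from `df.count()` on each tier); `check_parity` is a hypothetical name and the actual benchmark harness may look quite different.

```python
def check_parity(lightweight_rows, heavyweight_rows, fmt):
    """Fail immediately if the two tiers disagree on row count.

    Hypothetical helper illustrating the hard gate described above.
    """
    if lightweight_rows != heavyweight_rows:
        raise AssertionError(
            f"{fmt}: row-count mismatch "
            f"(lightweight={lightweight_rows}, heavyweight={heavyweight_rows})"
        )
    return lightweight_rows


# Matching counts pass through; a mismatch would raise before any timing is recorded.
print(check_parity(1200, 1200, "shapefile"))
```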
Heavyweight readers are implemented as Spark DataSource V2 connectors backed by GDAL (raster) and OGR (vector). They are registered automatically when the GeoBrix JAR is on the classpath — no additional configuration needed.
Available Readers
Raster Readers (GDAL-based)
| Reader | Format Name | Description |
|---|---|---|
| Raster Reader | gdal / gtiff_gdal | GDAL-backed raster readers (generic and GeoTIFF named) |
Vector Readers (OGR-based)
| Reader | Format Name | Description |
|---|---|---|
| Vector Reader | ogr | Generic reader for any OGR-supported vector format |
| Shapefile Reader | shapefile_ogr | Named reader for ESRI Shapefiles |
| GeoJSON Reader | geojson_ogr | Named reader for GeoJSON/GeoJSONSeq |
| GeoPackage Reader | gpkg_ogr | Named reader for GeoPackage |
| FileGDB Reader | file_gdb_ogr | Named reader for ESRI File Geodatabase |
Basic Usage
All readers follow the same pattern:
# Python
df = spark.read.format("<reader_name>").load("/path/to/file")
# Scala
val df = spark.read.format("<reader_name>").load("/path/to/file")
-- SQL
SELECT * FROM <reader_name>.`/path/to/file`;
Examples
Raster (GeoTIFF):
df = spark.read.format("gtiff_gdal").load("/path/to/raster.tif")
Vector (Shapefile):
df = spark.read.format("shapefile_ogr").load("/path/to/data.shp")
Generic (any format):
# Generic GDAL for rasters
df = spark.read.format("gdal").option("driver", "NetCDF").load("/path/to/data.nc")
# Generic OGR for vectors
df = spark.read.format("ogr").option("driverName", "KML").load("/path/to/data.kml")
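Because every reader shares the format/option/load pattern, it can be wrapped in one small helper. The sketch below is a convenience wrapper, not part of GeoBrix; `read_geo` is a hypothetical name. It only uses the standard `spark.read.format(...).option(...).load(...)` chain and passes option values as strings, matching how they appear in the option tables on this page.

```python
def read_geo(spark, reader_name, path, **options):
    """Build a reader for reader_name, apply options, and load path.

    Hypothetical wrapper around the standard DataFrameReader chain.
    """
    reader = spark.read.format(reader_name)
    for key, value in options.items():
        # Option values are shown as strings on this page, so stringify here.
        reader = reader.option(key, str(value))
    return reader.load(path)
```

With an active session this could be called as `read_geo(spark, "ogr", "/path/to/data.kml", driverName="KML")`, which mirrors the generic OGR example above.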
Output Schemas
Raster Output
Produces the standard tile format used by GeoBrix APIs.
root
|-- tile: struct
|    |-- cellid: bigint (grid cell ID, nullable)
|    |-- raster: binary (raster file content)
|    |-- metadata: map<string,string> (driver, extension, etc.)
See Tile Structure for detailed field descriptions.
Vector Output
Output varies with the format and driver conventions.
root
|-- geom_0: binary (geometry in WKB format)
|-- geom_0_srid: integer (spatial reference ID)
|-- geom_0_srid_proj: string (projection definition)
|-- <attribute_columns>: various (feature attributes)
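To show what the WKB bytes in geom_0 contain, the sketch below decodes a 2D point using only the standard library. It handles plain WKB points only (byte-order flag, geometry type 1, two doubles); other geometry types and extended variants are out of scope. In practice a geometry library would normally do this. `parse_wkb_point` is a hypothetical name, and the sample bytes are built by hand rather than read from a real file.

```python
import struct


def parse_wkb_point(wkb):
    """Decode a 2D WKB point into (x, y). Illustrative; points only."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"Not a 2D point (geometry type {geom_type})")
    # Coordinates follow the 5-byte header as two 8-byte doubles.
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y


# Hand-built little-endian WKB for POINT(1 2).
sample = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
print(parse_wkb_point(sample))
```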
Path Types
All readers support:
- Single file: /path/to/file.tif
- Directory: /path/to/directory/
- Wildcard: /path/to/*.tif
- Cloud storage: s3://bucket/path, abfss://..., gs://bucket/path
- Unity Catalog Volumes: /Volumes/catalog/schema/volume/path
Common Options
Raster Options
| Option | Default | Description |
|---|---|---|
| driver | Auto-detect | GDAL driver name (e.g., "GTiff", "NetCDF") |
| sizeInMB | "-1" | Default (<= 0) = no split (one tile per file); positive MB value splits large files |
| filterRegex | ".*" | Filter files by regex pattern |
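A filterRegex pattern can be checked against sample paths before it is handed to a reader. The sketch below uses Python's `re` module and assumes the pattern must match the whole path; whether the readers match the full path or only the file name, and which regex flavor they use, is not stated on this page, so these are assumptions. `filter_paths` and the sample paths are hypothetical.

```python
import re


def filter_paths(paths, filter_regex=".*"):
    """Keep the paths that fully match filter_regex (assumed semantics)."""
    pattern = re.compile(filter_regex)
    return [p for p in paths if pattern.fullmatch(p)]


candidates = [
    "/data/scene_2023_a.tif",
    "/data/scene_2024_a.tif",
    "/data/scene_2024_b.tiff",
    "/data/notes_2024.txt",
]
print(filter_paths(candidates, r".*_2024.*\.tif"))
```

Under full-match semantics, the `.tiff` file is excluded because the pattern requires the path to end in `.tif`.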
Vector Options
| Option | Default | Description |
|---|---|---|
| driverName | Auto-detect | OGR driver name (e.g., "ESRI Shapefile") |
| chunkSize | "10000" | Records per chunk for parallel reading |
| layerName | "" | Layer name for multi-layer formats |
| layerN | "0" | Layer index for multi-layer formats |
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text) |
Reader Types Explained
Generic Readers (gdal, ogr):
- Work with any format supported by GDAL/OGR
- Require the driver or driverName option for non-standard formats
- Use when the format doesn't have a named reader
Named Readers (gtiff_gdal, shapefile_ogr, etc.):
- Preset the driver option for common formats
- Cleaner syntax, no driver option needed
- Recommended for supported formats
Example:
# Named reader (cleaner)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")
# Generic reader (more verbose, same result)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")
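The relationship between a named reader and its generic equivalent can be written down as a lookup. The sketch below covers only two readers: the "GTiff" preset is taken from the example above, while "ESRI Shapefile" is the OGR driver name quoted in the options table and is assumed, not confirmed, to be what shapefile_ogr presets. `NAMED_PRESETS` and `to_generic` are hypothetical names.

```python
# Illustrative presets only; the shapefile entry is an assumption.
NAMED_PRESETS = {
    "gtiff_gdal": ("gdal", "driver", "GTiff"),
    "shapefile_ogr": ("ogr", "driverName", "ESRI Shapefile"),
}


def to_generic(reader_name):
    """Return (generic_format, options) equivalent to a named reader."""
    generic_format, option_key, option_value = NAMED_PRESETS[reader_name]
    return generic_format, {option_key: option_value}


print(to_generic("gtiff_gdal"))
```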
Performance Tips
- Use an appropriate split size for rasters:
  df = spark.read.format("gdal").option("sizeInMB", "32").load("/path")
- Set the chunk size for vectors:
  df = spark.read.format("ogr").option("chunkSize", "50000").load("/path")
- Filter files with a regex:
  df = spark.read.format("gdal").option("filterRegex", ".*_2024.*\\.tif").load("/path")
- Write the data to a table after reading, to avoid reloading from file:
  spark.read.format("shapefile_ogr").load("/path").write.saveAsTable(...)
Reader Types Correspondence
Each lightweight *_gbx reader maps directly to a heavyweight counterpart: gtiff_gbx ↔
gtiff_gdal, shapefile_gbx ↔ shapefile_ogr, geojson_gbx ↔ geojson_ogr, gpkg_gbx
↔ gpkg_ogr, file_gdb_gbx ↔ file_gdb_ogr. The generic catch-all pair is vector_gbx
(lightweight) ↔ ogr (heavyweight), used for any OGR-supported format without a named reader.
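The pairs listed above can be captured in a two-way lookup, which is handy when switching a pipeline between tiers. The mapping contains only the pairs named in this section; the generic raster pair (raster_gbx and gdal) is left out because it is not listed here explicitly. `counterpart` is a hypothetical helper name.

```python
# Pairs exactly as listed in this section.
LIGHT_TO_HEAVY = {
    "gtiff_gbx": "gtiff_gdal",
    "shapefile_gbx": "shapefile_ogr",
    "geojson_gbx": "geojson_ogr",
    "gpkg_gbx": "gpkg_ogr",
    "file_gdb_gbx": "file_gdb_ogr",
    "vector_gbx": "ogr",
}
HEAVY_TO_LIGHT = {heavy: light for light, heavy in LIGHT_TO_HEAVY.items()}


def counterpart(format_name):
    """Return the reader in the other tier for a given format name."""
    if format_name in LIGHT_TO_HEAVY:
        return LIGHT_TO_HEAVY[format_name]
    if format_name in HEAVY_TO_LIGHT:
        return HEAVY_TO_LIGHT[format_name]
    raise KeyError(f"No known counterpart for {format_name!r}")


print(counterpart("shapefile_gbx"), counterpart("ogr"))
```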
Benchmarks
Each heavyweight reader is benchmarked against its *_gbx lightweight counterpart on the same
cluster, same source file, same row counts. Parity (row-count equality between the two tiers)
is a hard gate — a mismatch fails the run immediately. Per-format timing results are published
on the Benchmarking page; the vector-reader comparison table is in the
Results — vector readers section (timing figures will be filled in once a controlled
cluster run is completed).
For the full methodology, raster-function results, and the PMTiles writer comparison, see the Benchmarking page.
Next Steps
- Raster Reader - Generic and GeoTIFF raster readers
- GeoTIFF Reader - Named GeoTIFF reader
- Vector Reader - Generic vector reader
- Shapefile Reader - Named Shapefile reader
- GeoJSON Reader - Named GeoJSON reader
- GeoPackage Reader - Named GeoPackage reader
- FileGDB Reader - Named File Geodatabase reader