
Vector Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, but your compute usually decides the tier: the lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is available it uses native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.

Available Formats

Both tiers read any format that has an OGR/GDAL vector driver, including:

  • ESRI Shapefile (.shp), GeoJSON (.geojson, .json), GeoPackage (.gpkg), File Geodatabase (.gdb)
  • KML (.kml), GML (.gml), CSV with geometry (.csv), PostgreSQL/PostGIS, and 80+ more.

Format availability

Driver coverage varies by environment — some formats need extra GDAL drivers/packages installed.

Options

Both tiers (lightweight vector_gbx, heavyweight ogr) take the same options, with the per-tier differences noted in the table below; the named format readers preset driverName.

| Option | Default | Description |
|---|---|---|
| driverName | required on vector_gbx; auto-detected from the extension on ogr; preset on named readers | OGR driver name (e.g. GPKG, ESRI Shapefile, GeoJSON); forces a specific driver regardless of the file extension. |
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text). |
| chunkSize | "10000" | Records per read batch (in-memory batching on the single per-file read, not partition splitting). |
| layerName | "" | Layer name for multi-layer formats (overrides the layer index). |
| layerNumber / layerN | "0" | Layer index for multi-layer formats (0-based): layerNumber (lightweight) / layerN (heavyweight). |
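
To make the asWKB choice concrete, the sketch below encodes the same point both ways using only the Python standard library. It illustrates the two standard encodings rather than the reader's internal code: the helper names point_wkb and point_wkt are hypothetical, and it assumes a 2D point in little-endian WKB.

```python
import struct


def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1 byte for byte order (1 = little-endian),
    # a uint32 geometry type (1 = Point), then two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


def point_wkt(x: float, y: float) -> str:
    """Encode the same point as WKT text (hypothetical helper)."""
    return f"POINT ({x} {y})"


wkb = point_wkb(-73.98, 40.75)
wkt = point_wkt(-73.98, 40.75)

print(len(wkb), wkb.hex())  # binary form, shown as hex for readability
print(wkt)                  # text form
```

The binary form is compact and round-trips coordinates exactly, while the text form is easier to inspect by eye; which one suits a pipeline depends on what consumes the geometry column downstream.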

Example — forcing the driver explicitly:

# Explicit driver (sample-data Volumes path)
df = spark.read.format("ogr") \
    .option("driverName", "GeoJSON") \
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()

Example output
+--------------------+-----------+-----+
|geom_0 |geom_0_srid|... |
+--------------------+-----------+-----+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+-----+
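
Since driverName is required on vector_gbx but auto-detected on ogr, one option is to derive it from the file extension before calling the reader. The sketch below is a hypothetical helper, not part of GeoBrix: the guess_driver name and the mapping are assumptions for illustration, covering only a few common OGR driver short names.

```python
from pathlib import PurePosixPath

# Extension -> OGR driver short name. Illustrative subset only;
# this is NOT GeoBrix's own detection table.
_DRIVER_BY_EXT = {
    ".shp": "ESRI Shapefile",
    ".geojson": "GeoJSON",
    ".json": "GeoJSON",
    ".gpkg": "GPKG",
    ".gdb": "OpenFileGDB",
    ".kml": "KML",
    ".gml": "GML",
    ".csv": "CSV",
}


def guess_driver(path: str) -> str:
    """Return an OGR driver name for a path, based on its extension (hypothetical helper)."""
    ext = PurePosixPath(path).suffix.lower()
    try:
        return _DRIVER_BY_EXT[ext]
    except KeyError:
        # Fail loudly rather than guess: the caller should set driverName by hand.
        raise ValueError(
            f"no driver mapping for extension {ext!r}; set driverName explicitly"
        ) from None


print(guess_driver(
    "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
))
```

The returned name could then be passed as .option("driverName", guess_driver(path)). Extensions are ambiguous in practice (a .json file is not necessarily GeoJSON), so an explicit driverName remains the safer choice when the format is known.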

vector_gbx is the lightweight catch-all vector reader (pyogrio-backed, no JAR). It reads any OGR-supported format and emits the same schema as the heavyweight ogr reader.

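# Lightweight generic vector reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register
register(spark)
# SAMPLE: path to a vector file, here the sample GeoJSON used in the ogr example
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
df = (spark.read.format("vector_gbx")
      .option("driverName", "GeoJSON")  # the options table lists driverName as required on vector_gbx
      .load(SAMPLE))  # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()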

vector_gbx is available from Python and SQL; it has no Scala binding.

Typical pipeline: ingest into a table

A common pattern is to land vector files in a table for downstream analytics. On Databricks, a managed table is a Delta table:

df = (spark.read.format("vector_gbx")
      .option("driverName", "GeoJSON")  # pass any OGR driver name
      .load("/Volumes/main/geo/raw/"))  # a folder of files
df.write.mode("overwrite").saveAsTable("main.geo.features")  # Delta table on Databricks

Reading a folder distributes the files across the cluster, one partition per file, so ingest throughput scales with the number of files. By contrast, a single-node pyogrio.read_* call parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
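
The difference between partitioning (one partition per file) and chunkSize (in-memory batching within a single file read) can be pictured with a small model. This is a conceptual sketch in plain Python, not Spark or GeoBrix code; plan_partitions and read_in_chunks are hypothetical names, and the records are made up.

```python
from typing import Iterator, List, Sequence


def plan_partitions(files: Sequence[str]) -> List[List[str]]:
    """Model of one partition per file: each partition holds exactly one path."""
    return [[f] for f in files]


def read_in_chunks(records: Sequence[dict], chunk_size: int = 10000) -> Iterator[List[dict]]:
    """Model of chunkSize: yield successive in-memory batches from ONE file's records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield list(records[start:start + chunk_size])


# Three files -> three partitions, regardless of how large each file is.
files = ["a.geojson", "b.geojson", "c.geojson"]
partitions = plan_partitions(files)
print(len(partitions))

# Within a single file, records are batched; this does not add partitions.
fake_records = [{"id": i} for i in range(25)]
batch_sizes = [len(batch) for batch in read_in_chunks(fake_records, chunk_size=10)]
print(batch_sizes)
```

Under this model, parallelism comes from the number of files, while chunkSize only bounds how many records are held per batch during each file's read. One very large file would still be read by a single task.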

Next Steps