Vector Reader
Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema; see Choosing an Execution Tier.
Because the lightweight (*_gbx) and heavyweight readers emit the same schema, your
compute usually decides the tier. The lightweight tier needs no JAR or init script
and is the only option on Serverless, standard (shared), and ARM clusters. The
heavyweight tier requires a classic x86 cluster with the JAR and the GDAL init script
installed; where it is available, it uses native GDAL on the JVM and tends to pull
ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
Available Formats
Both tiers read any OGR/GDAL vector driver, including:
- ESRI Shapefile (.shp), GeoJSON (.geojson, .json), GeoPackage (.gpkg), File Geodatabase (.gdb)
- KML (.kml), GML (.gml), CSV with geometry (.csv), PostgreSQL/PostGIS, and 80+ more.
Driver coverage varies by environment — some formats need extra GDAL drivers/packages installed.
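Since the Options table lists driverName as required on vector_gbx, a small lookup from file extension to OGR driver short name can save some typing. The following is a minimal sketch, not part of GeoBrix: `EXT_TO_DRIVER` and `driver_for_path` are hypothetical names, the table covers only a handful of the formats listed above, and whether a given driver is actually installed still depends on your environment.

```python
from pathlib import PurePosixPath

# Hypothetical lookup: file extension -> OGR driver short name.
EXT_TO_DRIVER = {
    ".shp": "ESRI Shapefile",
    ".geojson": "GeoJSON",
    ".json": "GeoJSON",
    ".gpkg": "GPKG",
    ".gdb": "OpenFileGDB",
    ".kml": "KML",
    ".gml": "GML",
    ".csv": "CSV",
}

def driver_for_path(path: str) -> str:
    """Return the OGR driver short name for a path, or raise ValueError."""
    suffixes = [s.lower() for s in PurePosixPath(path.rstrip("/")).suffixes]
    # Zipped inputs such as nyc_subway.shp.zip: look at the inner extension.
    if suffixes and suffixes[-1] == ".zip":
        suffixes = suffixes[:-1]
    ext = suffixes[-1] if suffixes else ""
    if ext not in EXT_TO_DRIVER:
        raise ValueError(f"no driver mapped for extension {ext!r}")
    return EXT_TO_DRIVER[ext]
```

The returned string can then be passed as `.option("driverName", driver_for_path(path))` when reading a single file.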
Options
Both tiers (lightweight vector_gbx, heavyweight ogr) take the same options; the named format readers preset driverName.
| Option | Default | Description |
|---|---|---|
| driverName | required on vector_gbx; auto-detected from the extension on ogr; preset on named readers | OGR driver name (e.g. GPKG, ESRI Shapefile, GeoJSON); forces a specific driver regardless of the file extension. |
| asWKB | "true" | Output geometry as WKB (binary) rather than WKT (text). |
| chunkSize | "10000" | Records per read batch. This is in-memory batching within the single per-file read, not partition splitting. |
| layerName | "" | Layer name for multi-layer formats (overrides the layer index). |
| layerNumber / layerN | "0" | 0-based layer index for multi-layer formats: layerNumber on the lightweight tier, layerN on the heavyweight tier. |
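The one naming difference between the tiers (layerNumber vs layerN) is easy to trip over when the same job has to run on both. The helper below is a sketch under stated assumptions: `vector_reader_options` is a hypothetical function, not a GeoBrix API, and it only encodes the defaults and rules given in the table above.

```python
def vector_reader_options(tier, driver_name=None, layer_name="",
                          layer_index=0, as_wkb=True, chunk_size=10000):
    """Build the option dict for format("vector_gbx") or format("ogr")."""
    if tier not in ("vector_gbx", "ogr"):
        raise ValueError(f"unknown tier: {tier!r}")
    if tier == "vector_gbx" and not driver_name:
        raise ValueError("driverName is required on vector_gbx")

    # Spark reader options are passed as strings.
    opts = {
        "asWKB": "true" if as_wkb else "false",
        "chunkSize": str(chunk_size),
    }
    if driver_name:
        opts["driverName"] = driver_name
    if layer_name:
        # layerName overrides the layer index, so the index is omitted.
        opts["layerName"] = layer_name
    else:
        key = "layerNumber" if tier == "vector_gbx" else "layerN"
        opts[key] = str(layer_index)
    return opts

# Usage sketch (requires a Spark session with the chosen tier available):
# opts = vector_reader_options("vector_gbx", driver_name="GPKG", layer_index=1)
# df = spark.read.format("vector_gbx").options(**opts).load(path)
```

Keeping the tier-specific key in one place means the calling code can switch between vector_gbx and ogr by changing a single argument.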
Example — forcing the driver explicitly:
# Explicit driver (sample-data Volumes path)
df = spark.read.format("ogr") \
.option("driverName", "GeoJSON") \
.load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()
+--------------------+-----------+-----+
|geom_0 |geom_0_srid|... |
+--------------------+-----------+-----+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+-----+
- Lightweight · vector_gbx
- Heavyweight · ogr
vector_gbx is the lightweight catch-all vector reader (pyogrio-backed, no JAR). It reads any OGR-supported format and emits the same schema as the heavyweight ogr reader.
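# Lightweight generic vector reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register
register(spark)
# Sample-data Volumes path
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
df = (spark.read.format("vector_gbx")
      .option("driverName", "GeoJSON")  # driverName is required on vector_gbx
      .load(SAMPLE))  # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()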
vector_gbx has Python and SQL bindings; there is no Scala binding.
Typical pipeline: ingest into a table
The common pattern is to land vector files in a table for downstream analytics — on Databricks a managed table is Delta:
df = (spark.read.format("vector_gbx")
.option("driverName", "GeoJSON") # pass any OGR driver name
.load("/Volumes/main/geo/raw/")) # a folder of files
df.write.mode("overwrite").saveAsTable("main.geo.features") # Delta table on Databricks
Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
The ogr reader is the heavyweight generic vector reader: it reads vector data formats through the OGR library and is the base reader behind the heavyweight named vector readers in GeoBrix.
Format Name
ogr
Overview
The OGR reader is a generic vector data reader that can handle any format supported by OGR/GDAL. While GeoBrix provides named readers for common formats (Shapefile, GeoJSON, GeoPackage, etc.), you can use the OGR reader directly for any available format.
Basic Usage
Python
# OGR reader (sample-data Volumes path)
df = spark.read.format("ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()
+--------------------+-----------+-----+
|geom_0 |geom_0_srid|... |
+--------------------+-----------+-----+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+-----+
Scala
// OGR reader (sample-data Volumes path)
val df = spark.read.format("ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()
+--------------------+-----------+-----+
|geom_0 |geom_0_srid|... |
+--------------------+-----------+-----+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+-----+
SQL
-- Read with OGR in SQL (sample-data Volumes path)
SELECT * FROM ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson`;
+--------------------+-----------+-----+
|geom_0 |geom_0_srid|... |
+--------------------+-----------+-----+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+-----+
Output Schema
root
|-- geom_0: binary (geometry in WKB format)
|-- geom_0_srid: integer (spatial reference ID)
|-- geom_0_srid_proj: string (projection definition)
|-- <attribute_1>: <type> (feature attributes...)
|-- <attribute_2>: <type>
|-- ...
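To make the geom_0 column less opaque, it can help to look at the WKB header directly. The sketch below uses only the Python standard library and assumes ISO WKB with the default asWKB setting; `wkb_geometry_type` and `WKB_TYPES` are hypothetical names for illustration, and the sample point is built by hand rather than read from a file.

```python
import struct

# Base geometry type codes from the WKB specification.
WKB_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}

def wkb_geometry_type(wkb: bytes) -> str:
    """Return the geometry type name from a WKB header (ISO WKB assumed)."""
    if len(wkb) < 5:
        raise ValueError("WKB must be at least 5 bytes")
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type as an unsigned 32-bit integer.
    (code,) = struct.unpack_from(endian + "I", wkb, 1)
    # ISO WKB adds 1000/2000/3000 for Z/M/ZM variants; strip that offset.
    return WKB_TYPES.get(code % 1000, f"Unknown({code})")

# A little-endian 2D point built by hand for the demonstration.
point = struct.pack("<BIdd", 1, 1, -73.97, 40.78)
print(wkb_geometry_type(point))  # Point
```

On a real DataFrame the same function could be applied to a collected value such as `df.first()["geom_0"]`, since binary columns arrive in Python as byte sequences.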
Databricks Integration
The ogr reader and the named vector readers output geometry as WKB. To use it with Databricks spatial functions, convert it to the GEOMETRY type with st_geomfromwkb. The examples below use the Shapefile reader and the sample-data Volumes path; the same pattern applies to any OGR-based reader.
Convert to GEOMETRY
# Convert WKB to Databricks GEOMETRY type
from pyspark.sql.functions import expr
df = spark.read.format("shapefile_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")
df_with_geom = df.select("*", expr("st_geomfromwkb(geom_0)").alias("geometry"))
SQL Example
-- Read shapefile and convert to GEOMETRY in SQL
CREATE OR REPLACE TEMP VIEW stations AS
SELECT *, st_geomfromwkb(geom_0) as geometry
FROM shapefile_ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip`;
SELECT name, geometry FROM stations LIMIT 10;
Named Readers vs OGR
For common formats, GeoBrix provides named readers for convenience (sample-data Volumes path):
# Named reader (recommended for common formats)
df = spark.read.format("shapefile_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")
# OGR with explicit driver (same result)
df = spark.read.format("ogr").option("driverName", "ESRI Shapefile").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")
When to use each:
- Named readers (shapefile, geojson, ogr_gpkg, file_gdb): Better for common formats, cleaner syntax
- OGR: Useful for less common formats or when you need OGR-specific options