Vector Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, so your compute usually decides the tier. The lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster with the JAR and the GDAL init script; where it is available, it runs native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
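The tier rule above can be written as a small decision function. This is a minimal sketch: the compute labels ("serverless", "standard", "arm", "classic-x86") and the name choose_reader are hypothetical and not part of the GeoBrix API.

```python
# Hypothetical helper: labels and function name are illustrative only.
LIGHTWEIGHT_ONLY = {"serverless", "standard", "arm"}


def choose_reader(compute: str, jar_installed: bool = False) -> str:
    """Return the generic vector reader format name for a compute type."""
    kind = compute.strip().lower()
    if kind in LIGHTWEIGHT_ONLY:
        # Only the lightweight tier runs on these compute types.
        return "vector_gbx"
    if kind == "classic-x86" and jar_installed:
        # Heavyweight tier: needs the JAR and the GDAL init script.
        return "ogr"
    # Classic cluster without the JAR, or an unknown label:
    # fall back to the tier that needs no setup.
    return "vector_gbx"


print(choose_reader("serverless"))                       # vector_gbx
print(choose_reader("classic-x86", jar_installed=True))  # ogr
```

Since both tiers emit the same schema, downstream code does not need to change when the returned format name does.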

Available Formats

Both tiers read any OGR/GDAL vector driver, including:

ESRI Shapefile (.shp), GeoJSON (.geojson, .json), GeoPackage (.gpkg), File Geodatabase (.gdb)
KML (.kml), GML (.gml), CSV with geometry (.csv), PostgreSQL/PostGIS, and 80+ more.
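When a driver has to be named explicitly, the extensions listed above can be mapped to OGR driver short names. The sketch below is an assumption-laden illustration: guess_driver and DRIVER_BY_EXTENSION are hypothetical names, the mapping uses the standard GDAL short names (OpenFileGDB is GDAL's built-in File Geodatabase driver), and whether a given driver is installed still depends on the environment.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extension -> OGR driver short name (illustrative subset, not exhaustive).
DRIVER_BY_EXTENSION = {
    ".shp": "ESRI Shapefile",
    ".geojson": "GeoJSON",
    ".json": "GeoJSON",
    ".gpkg": "GPKG",
    ".gdb": "OpenFileGDB",
    ".kml": "KML",
    ".gml": "GML",
    ".csv": "CSV",
}


def guess_driver(path: str) -> Optional[str]:
    """Return a likely OGR driver name for `path`, or None if unknown."""
    # Strip a trailing slash so directory-style paths (e.g. *.gdb/) still match.
    suffix = PurePosixPath(path.rstrip("/")).suffix.lower()
    return DRIVER_BY_EXTENSION.get(suffix)


print(guess_driver("/Volumes/main/geo/raw/parcels.gpkg"))  # GPKG
print(guess_driver("nyc_subway.shp.zip"))                  # None
```

Zipped inputs return None here because only the final suffix is inspected; in that case the driver has to be chosen by hand.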

Format availability

Driver coverage varies by environment — some formats need extra GDAL drivers/packages installed.

Options

Both tiers (lightweight vector_gbx, heavyweight ogr) take the same options, apart from the name of the layer-index option (layerNumber vs layerN); the named format readers preset driverName.

Option	Default	Description
`driverName`	required on `vector_gbx`; auto-detected from the extension on `ogr`; preset on named readers	OGR driver name (e.g. `GPKG`, `ESRI Shapefile`, `GeoJSON`) — forces a specific driver regardless of the file extension.
`asWKB`	`"true"`	Geometry output encoding: `"true"` emits WKB (binary), `"false"` emits WKT (text).
`chunkSize`	`"10000"`	Records per read batch (in-memory batching on the single per-file read — not partition splitting).
`layerName`	`""`	Layer name for multi-layer formats (overrides the layer index).
`layerNumber` / `layerN`	`"0"`	Layer index for multi-layer formats (0-based) — `layerNumber` (lightweight) / `layerN` (heavyweight).

Example — forcing the driver explicitly:

# Explicit driver (sample-data Volumes path)
df = spark.read.format("ogr") \
    .option("driverName", "GeoJSON") \
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()

Example output
+--------------------+-----------+-----+
|geom_0              |geom_0_srid|...  |
+--------------------+-----------+-----+
|[BINARY]            |4326       |...  |
|...                 |...        |...  |
+--------------------+-----------+-----+
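The option table above spells the layer index differently per tier and marks driverName as required on vector_gbx. A small wrapper can hide both differences. This is a sketch under those stated rules: reader_options and read_vector are hypothetical helpers, not GeoBrix functions, and only the option names come from the table.

```python
def reader_options(fmt, driver=None, layer_index=None, layer_name=None,
                   as_wkb=True, chunk_size=10000):
    """Build the string-valued option dict for the 'vector_gbx' or 'ogr' reader."""
    if fmt not in ("vector_gbx", "ogr"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if fmt == "vector_gbx" and not driver:
        raise ValueError("driverName is required on vector_gbx")

    opts = {"asWKB": "true" if as_wkb else "false",
            "chunkSize": str(chunk_size)}
    if driver:
        opts["driverName"] = driver
    if layer_name:
        # layerName overrides the layer index, so the index is not sent.
        opts["layerName"] = layer_name
    elif layer_index is not None:
        key = "layerNumber" if fmt == "vector_gbx" else "layerN"
        opts[key] = str(layer_index)
    return opts


def read_vector(spark, fmt, path, **kwargs):
    """Read `path` with the chosen tier; `spark` is an active SparkSession."""
    return spark.read.format(fmt).options(**reader_options(fmt, **kwargs)).load(path)


print(reader_options("ogr", layer_index=1))
# {'asWKB': 'true', 'chunkSize': '10000', 'layerN': '1'}
```

For example, read_vector(spark, "vector_gbx", path, driver="GPKG", layer_index=2) would send layerNumber, while the same call with "ogr" would send layerN.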

Lightweight · vector_gbx

vector_gbx is the lightweight catch-all vector reader (pyogrio-backed, no JAR). It reads any OGR-supported format and emits the same schema as the heavyweight ogr reader.

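# Lightweight generic vector reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register

register(spark)

# Sample-data Volumes path (same file as the ogr examples)
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
df = (spark.read.format("vector_gbx")
      .option("driverName", "GeoJSON")   # driverName is required on vector_gbx
      .load(SAMPLE))                     # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()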

vector_gbx has Python and SQL bindings only; Scala is not supported.

Typical pipeline: ingest into a table

The common pattern is to land vector files in a table for downstream analytics — on Databricks a managed table is Delta:

df = (spark.read.format("vector_gbx")
      .option("driverName", "GeoJSON")           # pass any OGR driver name
      .load("/Volumes/main/geo/raw/"))            # a folder of files
df.write.mode("overwrite").saveAsTable("main.geo.features")  # Delta table on Databricks

Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
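Because a folder read yields one partition per file, counting the files gives a rough estimate of read parallelism before the job starts. This is a minimal sketch, assuming the Volume is visible as a local path on the driver and that only top-level files with the given suffix are picked up; expected_partitions is a hypothetical helper, not a GeoBrix function.

```python
from pathlib import Path


def expected_partitions(folder: str, suffix: str = ".geojson") -> int:
    """Count top-level files in `folder` whose name ends with `suffix`."""
    wanted = suffix.lower()
    return sum(
        1
        for entry in Path(folder).iterdir()
        if entry.is_file() and entry.name.lower().endswith(wanted)
    )


# Usage (path taken from the ingest example above):
# n = expected_partitions("/Volumes/main/geo/raw/")
# print(f"expect about {n} read tasks")
```

A folder holding a single large file therefore reads on one task regardless of cluster size, which is worth checking before sizing the cluster.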

Heavyweight · ogr

The OGR reader provides generic support for reading vector data formats through the OGR library. This is the base reader that powers all vector format readers in GeoBrix.

Format Name

ogr

Overview

The OGR reader is a generic vector data reader that can handle any format supported by OGR/GDAL. While GeoBrix provides named readers for common formats (Shapefile, GeoJSON, GeoPackage, etc.), you can use the OGR reader directly for any available format.

Basic Usage

Python

# OGR reader (sample-data Volumes path)
df = spark.read.format("ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()

Example output
+--------------------+-----------+-----+
|geom_0              |geom_0_srid|...  |
+--------------------+-----------+-----+
|[BINARY]            |4326       |...  |
|...                 |...        |...  |
+--------------------+-----------+-----+

Scala

val df = spark.read.format("ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")

Example output
+--------------------+-----------+-----+
|geom_0              |geom_0_srid|...  |
+--------------------+-----------+-----+
|[BINARY]            |4326       |...  |
|...                 |...        |...  |
+--------------------+-----------+-----+

SQL

-- Read with OGR in SQL (sample-data Volumes path)
SELECT * FROM ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson`;

Example output
+--------------------+-----------+-----+
|geom_0              |geom_0_srid|...  |
+--------------------+-----------+-----+
|[BINARY]            |4326       |...  |
|...                 |...        |...  |
+--------------------+-----------+-----+

Output Schema

root
 |-- geom_0: binary (geometry in WKB format)
 |-- geom_0_srid: integer (spatial reference ID)
 |-- geom_0_srid_proj: string (projection definition)
 |-- <attribute_1>: <type> (feature attributes...)
 |-- <attribute_2>: <type>
 |-- ...
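The geom_0 column holds raw WKB bytes, so the geometry type of a row can be checked without any spatial library by reading the WKB header. The sketch below assumes standard (OGC/ISO) WKB, where the first byte is the byte order and the next four bytes are the type code; wkb_geometry_type is a hypothetical helper, and extended (EWKB) flag bits are not handled.

```python
import struct

# Base geometry type codes defined by the OGC/ISO WKB encoding.
WKB_TYPES = {
    1: "Point", 2: "LineString", 3: "Polygon", 4: "MultiPoint",
    5: "MultiLineString", 6: "MultiPolygon", 7: "GeometryCollection",
}


def wkb_geometry_type(wkb: bytes) -> str:
    """Return the geometry type name encoded in a WKB header."""
    if len(wkb) < 5:
        raise ValueError("WKB value is too short to contain a header")
    # Byte 0: 1 = little-endian, 0 = big-endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    (code,) = struct.unpack_from(byte_order + "I", wkb, 1)
    # ISO WKB adds 1000 / 2000 / 3000 to the base code for Z / M / ZM.
    return WKB_TYPES.get(code % 1000, f"Unknown({code})")


# A little-endian 2D point built by hand for illustration.
point = struct.pack("<BIdd", 1, 1, -73.97, 40.78)
print(wkb_geometry_type(point))  # Point
```

On a collected row, the same call would be wkb_geometry_type(bytes(row["geom_0"])).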

Databricks Integration

The OGR reader and the named vector readers output geometry as WKB. To use it with Databricks spatial functions, convert it to the GEOMETRY type. The example below uses the Shapefile reader and a sample-data Volumes path; the same pattern applies to any OGR-based reader.

Requires Databricks Runtime

These examples use st_geomfromwkb to convert GeoBrix WKB to Databricks GEOMETRY type.

Convert to GEOMETRY

# Convert WKB to Databricks GEOMETRY type
from pyspark.sql.functions import expr

df = spark.read.format("shapefile_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")
# Keep every original column and add a GEOMETRY column built from the WKB bytes
df_with_geom = df.select("*", expr("st_geomfromwkb(geom_0)").alias("geometry"))

SQL Example

-- Read shapefile and convert to GEOMETRY in SQL
CREATE OR REPLACE TEMP VIEW stations AS
SELECT *, st_geomfromwkb(geom_0) as geometry
FROM shapefile_ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip`;

SELECT name, geometry FROM stations LIMIT 10;

Named Readers vs OGR

For common formats, GeoBrix provides named readers for convenience (sample-data Volumes path):

# Named reader (recommended for common formats)
df = spark.read.format("shapefile_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")
# OGR with explicit driver (same result)
df = spark.read.format("ogr").option("driverName", "ESRI Shapefile").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/subway/nyc_subway.shp.zip")

When to use each:

Named readers (e.g. shapefile_ogr): cleaner syntax for common formats such as Shapefile, GeoJSON, GeoPackage, and File Geodatabase, because driverName is preset
OGR: useful for less common formats, or when you need to set the driver yourself
