GeoJSON Reader
Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema; see Choosing an Execution Tier.
Because the lightweight (*_gbx) and heavyweight readers emit the same schema, your
compute usually decides the tier. The lightweight tier needs no JAR or init script
and is the only option on Serverless, standard (shared), and ARM clusters. The
heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is
available, it uses native GDAL on the JVM and tends to pull ahead on large workloads. See
the Benchmarking page for light-vs-heavy timings and methodology.
Options
Lightweight (geojson_gbx)
The driver (GeoJSON) is preset; there is no multi switch — geojson_gbx reads both GeoJSON and GeoJSONSeq files.
| Option | Default | Description |
|---|---|---|
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text). |
| chunkSize | "10000" | Arrow batch size for the per-file read; controls in-memory batching, not partition splitting. |
| layerNumber | "0" | Layer index to read (0-based). |
| layerName | "" | Layer name to read; takes precedence over layerNumber when set. |
Heavyweight (geojson_ogr)
| Option | Default | Description |
|---|---|---|
| multi | "true" | "true" uses the GeoJSONSeq driver (newline-delimited, better for large files); "false" uses the GeoJSON driver (standard FeatureCollection). |
| chunkSize | "10000" | Number of records per chunk for parallel reading. |
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text). |
All other OGR reader options (driverName, layerN, layerName, …) are also available.
Example — reading newline-delimited GeoJSONSeq (the multi default):
# Read GeoJSONSeq (newline-delimited, sample-data path)
df = spark.read.format("geojson_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojsonl")
# Or explicitly: .option("multi", "true")
df.show()
+--------------------+-----------+---------+
|geom_0 |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+---------+
- Lightweight · geojson_gbx
- Heavyweight · geojson_ogr
geojson_gbx is the lightweight GeoJSON reader (pyogrio-backed, no JAR). It reads GeoJSON and GeoJSONSeq and emits the same schema as the heavyweight geojson_ogr reader.
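# Lightweight GeoJSON reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register
register(spark)
# SAMPLE points at the sample-data Volumes path used elsewhere on this page
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
df = spark.read.format("geojson_gbx").load(SAMPLE) # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()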
geojson_gbx has Python and SQL bindings; it has no Scala binding.
Typical pipeline: ingest into a table
The common pattern is to land GeoJSON files in a table for downstream analytics — on Databricks a managed table is Delta:
df = spark.read.format("geojson_gbx").load("/Volumes/main/geo/raw/") # a folder of .geojson files
df.write.mode("overwrite").saveAsTable("main.geo.boroughs") # Delta table on Databricks
Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
The heavyweight GeoJSON reader reads both GeoJSON and GeoJSONSeq (newline-delimited GeoJSON) files.
Format Name
geojson_ogr
Overview
This is a named OGR Reader that selects the GeoJSON or GeoJSONSeq driver according to the multi option.
Supported Formats
- GeoJSON (.geojson, .json) - Standard GeoJSON FeatureCollection
- GeoJSONSeq (.geojsonl, .geojsons) - Newline-delimited GeoJSON (default)
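Because multi defaults to "true", a standard .geojson file needs the option set explicitly. One way to avoid mismatches is to derive the value from the file extension. The sketch below does that using only the extensions listed above; multi_option_for is a hypothetical helper for illustration, not part of GeoBrix.

```python
from pathlib import PurePosixPath

# Extension groups copied from the Supported Formats list.
_SEQ_EXTENSIONS = {".geojsonl", ".geojsons"}  # GeoJSONSeq -> multi="true"
_STD_EXTENSIONS = {".geojson", ".json"}       # GeoJSON    -> multi="false"


def multi_option_for(path: str) -> str:
    """Return the string to pass as the `multi` option for a single file.

    Hypothetical helper (an assumption for illustration, not a GeoBrix API).
    """
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in _SEQ_EXTENSIONS:
        return "true"
    if suffix in _STD_EXTENSIONS:
        return "false"
    # Folders and unknown extensions are rejected rather than guessed.
    raise ValueError(f"unrecognized GeoJSON extension: {suffix!r}")
```

The result would then be passed along as .option("multi", multi_option_for(path)) when reading with geojson_ogr.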
Basic Usage
Python
# Read standard GeoJSON (sample-data Volumes path)
df = spark.read.format("geojson_ogr") \
    .option("multi", "false") \
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()
+--------------------+-----------+---------+
|geom_0 |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY] |4326 |Manhattan|
|... |... |... |
+--------------------+-----------+---------+
Scala
val df = spark.read.format("geojson_ogr")
  .option("multi", "false")
  .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()
+--------------------+-----------+---------+
|geom_0 |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY] |4326 |Manhattan|
|... |... |... |
+--------------------+-----------+---------+
SQL
-- Read GeoJSON in SQL (sample-data Volumes path)
SELECT * FROM geojson_ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson`;
+--------------------+-----------+---------+
|geom_0 |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY] |4326 |... |
|... |... |... |
+--------------------+-----------+---------+
Output Schema
The output maintains attribute columns and adds three columns for geometry:
root
|-- geom_0: binary (geometry in WKB format)
|-- geom_0_srid: integer (spatial reference ID)
|-- geom_0_srid_proj: string (projection definition)
|-- <properties>: various types (GeoJSON properties)
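To make the geom_0 column less opaque, here is a minimal sketch of how a WKB Point is laid out and decoded. It assumes geom_0 holds standard (ISO/OGC) WKB for a 2D Point, which this page does not spell out; decode_wkb_point is a hypothetical helper, and a geometry library such as Shapely is the more practical choice for anything beyond inspection.

```python
import struct
from typing import Tuple


def decode_wkb_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB Point into (x, y).

    Hypothetical helper; assumes standard WKB, not an extended variant.
    """
    byte_order = wkb[0]  # 1 = little-endian (NDR), 0 = big-endian (XDR)
    if byte_order not in (0, 1):
        raise ValueError(f"invalid WKB byte-order flag: {byte_order}")
    endian = "<" if byte_order == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold the x and y coordinates as two 8-byte doubles.
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y


# Build a little-endian WKB Point by hand to exercise the decoder.
sample_wkb = struct.pack("<BIdd", 1, 1, -73.97, 40.78)
point = decode_wkb_point(sample_wkb)
```

Here sample_wkb is 21 bytes long: one byte-order byte, a 4-byte type code, and two 8-byte coordinates.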
GeoJSON vs GeoJSONSeq
Standard GeoJSON (FeatureCollection)
Format:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": { "type": "Point", "coordinates": [0, 0] },
"properties": { "name": "Feature 1" }
}
]
}
When to use: Small to medium files, API responses, standard GeoJSON files
Read with: option("multi", "false")
GeoJSONSeq (Newline-Delimited)
Format:
{"type":"Feature","geometry":{"type":"Point","coordinates":[0,0]},"properties":{"name":"Feature 1"}}
{"type":"Feature","geometry":{"type":"Point","coordinates":[1,1]},"properties":{"name":"Feature 2"}}
When to use: Large files, streaming data, parallel processing, better Spark performance
Read with: option("multi", "true") (default) or omit the option
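The two layouts carry the same features, so a FeatureCollection can be rewritten as newline-delimited GeoJSON before ingest. The sketch below shows the idea with the standard library only; feature_collection_to_geojsonseq is a hypothetical helper, and it loads the whole collection into memory, so it suits small inputs rather than the large files that motivate GeoJSONSeq.

```python
import json


def feature_collection_to_geojsonseq(collection: dict) -> str:
    """Serialize each Feature of a FeatureCollection on its own line.

    Hypothetical helper for illustration, not a GeoBrix API.
    """
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    # Compact separators keep every feature on exactly one line.
    lines = [
        json.dumps(feature, separators=(",", ":"))
        for feature in collection.get("features", [])
    ]
    return "\n".join(lines) + ("\n" if lines else "")


collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [0, 0]},
            "properties": {"name": "Feature 1"},
        }
    ],
}
geojsonseq_text = feature_collection_to_geojsonseq(collection)
```

Writing geojsonseq_text to a .geojsonl file yields input that matches the multi default of "true".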