
GeoJSON Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, so your compute usually decides the tier. The lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster with the JAR and the GDAL init script; where it is available, it runs native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
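The tradeoff above can be read as a small decision rule. The sketch below is only an illustration of that rule: the function name, the `compute` labels, and the `gdal_installed` and `large_workload` flags are hypothetical and are not part of GeoBrix.

```python
def choose_geojson_reader(compute: str,
                          arch: str = "x86_64",
                          gdal_installed: bool = False,
                          large_workload: bool = False) -> str:
    """Suggest a reader format name from the compute type.

    `compute` is a hypothetical label: "serverless", "standard", or "classic".
    `gdal_installed` stands for "the JAR and the GDAL init script are present".
    """
    heavy_possible = (
        compute == "classic"
        and arch in ("x86_64", "amd64")
        and gdal_installed
    )
    # The heavyweight tier is only worth picking where it can run at all
    # and the workload is large enough for native GDAL to pay off.
    if heavy_possible and large_workload:
        return "geojson_ogr"
    return "geojson_gbx"


print(choose_geojson_reader("serverless"))
print(choose_geojson_reader("classic", gdal_installed=True, large_workload=True))
```

Because both tiers emit the same schema, switching the returned format name should not require changes to downstream code; the Benchmarking page remains the reference for where the crossover actually falls.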

Options

Lightweight (geojson_gbx)

The driver (GeoJSON) is preset; there is no multi switch — geojson_gbx reads both GeoJSON and GeoJSONSeq files.

| Option | Default | Description |
|---|---|---|
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text). |
| chunkSize | "10000" | Arrow batch size for the per-file read; controls in-memory batching, not partition splitting. |
| layerNumber | "0" | Layer index to read (0-based). |
| layerName | "" | Layer name to read; takes precedence over layerNumber when set. |
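Since the defaults in the table are strings, it can help to build the option map in one place. The helper below is a minimal sketch, not part of GeoBrix: `gbx_reader_options` and its parameter names are hypothetical, and only the option keys come from the table.

```python
def gbx_reader_options(as_wkb: bool = True,
                       chunk_size: int = 10_000,
                       layer_number: int = 0,
                       layer_name: str = "") -> dict:
    """Build a string-valued option map for the geojson_gbx reader (sketch)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    opts = {
        "asWKB": "true" if as_wkb else "false",
        "chunkSize": str(chunk_size),
    }
    # layerName takes precedence over layerNumber when set, so send only one.
    if layer_name:
        opts["layerName"] = layer_name
    else:
        opts["layerNumber"] = str(layer_number)
    return opts


opts = gbx_reader_options(as_wkb=False, chunk_size=5000)
print(opts)

# On a cluster where the reader is registered, the map could be passed with
# PySpark's DataFrameReader.options (path is a placeholder):
# df = spark.read.format("geojson_gbx").options(**opts).load(path)
```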

Heavyweight (geojson_ogr)

| Option | Default | Description |
|---|---|---|
| multi | "true" | "true" uses the GeoJSONSeq driver (newline-delimited, better for large files); "false" uses the GeoJSON driver (standard FeatureCollection). |
| chunkSize | "10000" | Number of records per chunk for parallel reading. |
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text). |

All other OGR reader options (driverName, layerN, layerName, …) are also available.
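To make the multi switch concrete, the following standard-library sketch writes the same two features in both layouts. The feature values are made up for illustration and are not taken from the sample data.

```python
import json

# Two illustrative point features (made-up values).
features = [
    {"type": "Feature",
     "properties": {"name": "A"},
     "geometry": {"type": "Point", "coordinates": [-73.97, 40.78]}},
    {"type": "Feature",
     "properties": {"name": "B"},
     "geometry": {"type": "Point", "coordinates": [-73.95, 40.65]}},
]

# Standard GeoJSON: one JSON document wrapping every feature (multi = "false").
feature_collection = json.dumps({"type": "FeatureCollection", "features": features})

# GeoJSONSeq: one feature per line (multi = "true").
geojson_seq = "\n".join(json.dumps(f) for f in features)

# The newline-delimited form can be parsed one line at a time, without
# loading the whole file as a single JSON document.
for line in geojson_seq.splitlines():
    print(json.loads(line)["properties"]["name"])
```

The line-at-a-time property is what makes the newline-delimited layout the better fit for large files.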

Example — reading newline-delimited GeoJSONSeq (the multi default):

# Read GeoJSONSeq (newline-delimited, sample-data path)
df = spark.read.format("geojson_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojsonl")
# Or explicitly: .option("multi", "true")
df.show()
Example output
+--------------------+-----------+---------+
|geom_0              |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY]            |4326       |...      |
|...                 |...        |...      |
+--------------------+-----------+---------+
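With asWKB left at "true", geom_0 holds WKB bytes rather than readable text. The sketch below shows what such a value looks like for the simplest case, a 2D point, using only the standard library. It is an illustration of the WKB layout, not of how either reader encodes geometry internally; real data such as borough polygons would normally be parsed with a geometry library, and the helper names here are hypothetical.

```python
import struct


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, type, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_point_wkb(wkb: bytes) -> tuple:
    """Decode a 2D point from WKB; raises ValueError for other geometry types."""
    endian = "<" if wkb[0] == 1 else ">"  # 1 = little endian, 0 = big endian
    (geom_type,) = struct.unpack(endian + "I", wkb[1:5])
    if geom_type != 1:  # 1 is the WKB code for Point
        raise ValueError(f"not a WKB point (type code {geom_type})")
    return struct.unpack(endian + "dd", wkb[5:21])


wkb = encode_point_wkb(-73.97, 40.78)
x, y = decode_point_wkb(wkb)
print(len(wkb), "bytes ->", f"POINT ({x} {y})")  # binary form vs WKT text form
```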

geojson_gbx is the lightweight GeoJSON reader (pyogrio-backed, no JAR). It reads GeoJSON and GeoJSONSeq and emits the same schema as the heavyweight geojson_ogr reader.

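# Lightweight GeoJSON reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register

register(spark)  # makes the geojson_gbx format available to spark.read
# SAMPLE is a placeholder for the path to a GeoJSON or GeoJSONSeq file
df = spark.read.format("geojson_gbx").load(SAMPLE)  # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()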

It is the lightweight counterpart of the heavyweight geojson_ogr reader and is available from Python and SQL, but not from Scala.

Typical pipeline: ingest into a table

The common pattern is to land GeoJSON files in a table for downstream analytics — on Databricks a managed table is Delta:

df = spark.read.format("geojson_gbx").load("/Volumes/main/geo/raw/")  # a folder of .geojson files
df.write.mode("overwrite").saveAsTable("main.geo.boroughs") # Delta table on Databricks

Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.
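Because each file becomes one partition, the number of files in the folder sets how many read tasks there are to spread across the cluster. A quick count before ingest can show whether a folder holds enough files to keep the cluster busy. The helper below is a rough sketch under stated assumptions: `count_geojson_files` is hypothetical, the suffix list is a guess at what the folder contains, and it looks only at the top level of the folder.

```python
from pathlib import Path
import tempfile


def count_geojson_files(folder, suffixes=(".geojson", ".geojsonl")) -> int:
    """Count top-level files in `folder` whose suffix looks like GeoJSON."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Demonstration on a throwaway folder; on Databricks the argument would be
# a Volumes path such as the one used in the ingest example.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.geojson", "b.geojson", "notes.txt"):
        Path(tmp, name).write_text("{}")
    n_files = count_geojson_files(tmp)
    print(f"{n_files} GeoJSON files, so about {n_files} read partitions")
```

A folder with one very large file would read as a single partition under this model, so splitting it into several files before ingest may help; treat that as a hypothesis to check against the Benchmarking figures.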

Next Steps