GeoJSON Reader

Both tiers produce the same (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, so the choice usually comes down to your compute. The lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster with the JAR and the GDAL init script; where it is available, it runs native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.

Options

Lightweight (`geojson_gbx`)

The driver (GeoJSON) is preset; there is no multi switch — geojson_gbx reads both GeoJSON and GeoJSONSeq files.

Option	Default	Description
`asWKB`	`"true"`	Output geometry as WKB (binary) vs WKT (text).
`chunkSize`	`"10000"`	Arrow batch size for the per-file read; controls in-memory batching, not partition splitting.
`layerNumber`	`"0"`	Layer index to read (0-based).
`layerName`	`""`	Layer name to read; takes precedence over `layerNumber` when set.
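
To make the `asWKB` choice concrete, the sketch below encodes one point both ways using only the Python standard library. It illustrates the two encodings and is not the reader's own code path; `point_wkb` and `point_wkt` are hypothetical helper names introduced here.

```python
import struct


def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1-byte byte-order flag (1 = little-endian),
    # uint32 geometry type (1 = Point), then x and y as 64-bit floats.
    return struct.pack("<BIdd", 1, 1, x, y)


def point_wkt(x: float, y: float) -> str:
    """Encode the same point as WKT text (hypothetical helper)."""
    return f"POINT ({x:g} {y:g})"


wkb = point_wkb(1.0, 2.0)
wkt = point_wkt(1.0, 2.0)
print(len(wkb), wkb.hex())  # compact binary form
print(wkt)                  # human-readable text form
```

WKB is the more compact form for storage and downstream geometry functions; WKT is easier to inspect by eye.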

Heavyweight (`geojson_ogr`)

Option	Default	Description
`multi`	`"true"`	`"true"` uses the GeoJSONSeq driver (newline-delimited, better for large files); `"false"` uses the GeoJSON driver (standard FeatureCollection).
`chunkSize`	`"10000"`	Number of records per chunk for parallel reading.
`asWKB`	`"true"`	Output geometry as WKB (binary) vs WKT (text).

All other OGR reader options (driverName, layerN, layerName, …) are also available.
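
The `chunkSize` description can be pictured with a small batching sketch. How geojson_ogr schedules its chunks internally is not described on this page, so the code below only illustrates the arithmetic of splitting records into fixed-size chunks; `chunked` is a hypothetical helper and the record count is made up for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(records: Iterable[dict], chunk_size: int = 10000) -> Iterator[List[dict]]:
    """Yield successive lists of at most chunk_size records (hypothetical helper)."""
    it = iter(records)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            return
        yield batch


# 25,000 made-up records split with the default chunk size of 10,000.
features = [{"id": i} for i in range(25000)]
sizes = [len(batch) for batch in chunked(features, chunk_size=10000)]
print(sizes)
```

A smaller chunk size gives more, smaller units of work; a larger one gives fewer, bigger units.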

Example — reading newline-delimited GeoJSONSeq (the multi default):

# Read GeoJSONSeq (newline-delimited, sample-data path)
df = spark.read.format("geojson_ogr").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojsonl")
# Or explicitly: .option("multi", "true")
df.show()

Example output
+--------------------+-----------+---------+
|geom_0              |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY]            |4326       |...      |
|...                 |...        |...      |
+--------------------+-----------+---------+

geojson_gbx is the lightweight GeoJSON reader (pyogrio-backed, no JAR). It reads GeoJSON and GeoJSONSeq and emits the same schema as the heavyweight geojson_ogr reader.

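# Lightweight GeoJSON reader (pyogrio; no JAR)
from databricks.labs.gbx.ds.register import register

register(spark)

# Sample-data path used elsewhere on this page; point SAMPLE at your own file or folder
SAMPLE = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson"
df = spark.read.format("geojson_gbx").load(SAMPLE)   # (attrs..., geom_0, geom_0_srid, geom_0_srid_proj)
df.show()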

geojson_gbx has Python and SQL bindings; there is no Scala binding.

Typical pipeline: ingest into a table

The common pattern is to land GeoJSON files in a table for downstream analytics — on Databricks a managed table is Delta:

df = spark.read.format("geojson_gbx").load("/Volumes/main/geo/raw/")  # a folder of .geojson files
df.write.mode("overwrite").saveAsTable("main.geo.boroughs")            # Delta table on Databricks

Reading a folder fans the files across the cluster (one partition per file), so ingest scales with the data — unlike a single-node pyogrio.read_* that parses one file on one machine. See Benchmarking for light-vs-heavy ingest figures.

The GeoJSON reader reads both GeoJSON and GeoJSONSeq (newline-delimited GeoJSON) files.

Format Name

geojson_ogr

Overview

This is a named OGR Reader that selects the GeoJSON or GeoJSONSeq driver according to the multi option.

Supported Formats

GeoJSON (.geojson, .json) - Standard GeoJSON FeatureCollection
GeoJSONSeq (.geojsonl, .geojsons) - Newline-delimited GeoJSON (default)
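
Because `multi` defaults to `"true"`, a standard `.geojson` file needs the option set explicitly. One way to avoid forgetting it is to derive the value from the file extension, as in the sketch below. The extension sets come from the list above; `multi_option_for` and `read_geojson` are hypothetical helpers, not part of the library, and `spark` is assumed to be an active SparkSession with geojson_ogr available.

```python
from pathlib import PurePosixPath

# Extensions taken from the Supported Formats list.
GEOJSONSEQ_EXTENSIONS = {".geojsonl", ".geojsons"}
GEOJSON_EXTENSIONS = {".geojson", ".json"}


def multi_option_for(path: str) -> str:
    """Return the `multi` option value for a path (hypothetical helper)."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in GEOJSONSEQ_EXTENSIONS:
        return "true"   # newline-delimited: GeoJSONSeq driver
    if suffix in GEOJSON_EXTENSIONS:
        return "false"  # FeatureCollection: GeoJSON driver
    raise ValueError(f"unrecognized GeoJSON extension: {suffix!r}")


def read_geojson(spark, path: str):
    """Read one file with geojson_ogr, choosing `multi` from the extension."""
    return (
        spark.read.format("geojson_ogr")
        .option("multi", multi_option_for(path))
        .load(path)
    )


print(multi_option_for("nyc_boroughs.geojsonl"))
print(multi_option_for("nyc_boroughs.geojson"))
```

The helper deliberately raises on unknown extensions rather than guessing a driver.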

Basic Usage

Python

# Read standard GeoJSON (sample-data Volumes path)
df = spark.read.format("geojson_ogr") \
    .option("multi", "false") \
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()

Example output
+--------------------+-----------+---------+
|geom_0              |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY]            |4326       |Manhattan|
|...                 |...        |...      |
+--------------------+-----------+---------+

Scala

// Read standard GeoJSON (sample-data Volumes path)
val df = spark.read.format("geojson_ogr")
  .option("multi", "false")
  .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson")
df.show()

Example output
+--------------------+-----------+---------+
|geom_0              |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY]            |4326       |Manhattan|
|...                 |...        |...      |
+--------------------+-----------+---------+

SQL

-- Read GeoJSON in SQL (sample-data Volumes path)
SELECT * FROM geojson_ogr.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson`;

Example output
+--------------------+-----------+---------+
|geom_0              |geom_0_srid|BoroName |
+--------------------+-----------+---------+
|[BINARY]            |4326       |...      |
|...                 |...        |...      |
+--------------------+-----------+---------+

Output Schema

The output keeps the attribute columns and adds three geometry columns:

root
 |-- geom_0: binary (geometry in WKB format)
 |-- geom_0_srid: integer (spatial reference ID)
 |-- geom_0_srid_proj: string (projection definition)
 |-- <properties>: various types (GeoJSON properties)
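
Downstream code often needs to treat the geometry columns and the attribute columns separately. A minimal sketch, assuming the three geometry column names shown in the schema; `split_columns` is a hypothetical helper, and the column list stands in for `df.columns`.

```python
from typing import List, Tuple

# Geometry column names as listed in the output schema.
GEOMETRY_COLUMNS = ("geom_0", "geom_0_srid", "geom_0_srid_proj")


def split_columns(columns: List[str]) -> Tuple[List[str], List[str]]:
    """Split column names into (geometry, attributes), preserving order."""
    geometry = [c for c in columns if c in GEOMETRY_COLUMNS]
    attributes = [c for c in columns if c not in GEOMETRY_COLUMNS]
    return geometry, attributes


# Stand-in for df.columns on the sample boroughs data.
geometry, attributes = split_columns(
    ["geom_0", "geom_0_srid", "geom_0_srid_proj", "BoroName"]
)
print(geometry)
print(attributes)
```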

GeoJSON vs GeoJSONSeq

Standard GeoJSON (FeatureCollection)

Format:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": { "type": "Point", "coordinates": [0, 0] },
      "properties": { "name": "Feature 1" }
    }
  ]
}

When to use: Small to medium files, API responses, standard GeoJSON files

Read with: option("multi", "false")

GeoJSONSeq (Newline-Delimited)

Format:

{"type":"Feature","geometry":{"type":"Point","coordinates":[0,0]},"properties":{"name":"Feature 1"}}
{"type":"Feature","geometry":{"type":"Point","coordinates":[1,1]},"properties":{"name":"Feature 2"}}

When to use: Large files, streaming data, parallel processing, better Spark performance

Read with: option("multi", "true") (default) or omit the option
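
If a source only provides a FeatureCollection but the newline-delimited form is preferred, the file can be converted before ingest. The sketch below does this with the standard library `json` module, writing one feature per line as in the format shown above. It loads the whole document into memory, so it suits small to medium inputs; `feature_collection_to_geojsonseq` is a hypothetical helper name.

```python
import json


def feature_collection_to_geojsonseq(text: str) -> str:
    """Convert FeatureCollection text to newline-delimited features."""
    collection = json.loads(text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    # One compact JSON object per line, each terminated by a newline.
    return "".join(
        json.dumps(feature, separators=(",", ":")) + "\n"
        for feature in collection["features"]
    )


sample = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0, 0]},
     "properties": {"name": "Feature 1"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [1, 1]},
     "properties": {"name": "Feature 2"}}
  ]
}
"""

ndjson = feature_collection_to_geojsonseq(sample)
print(ndjson, end="")
```

The result can be saved with a `.geojsonl` extension and read with the default `multi` setting.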
