GeoJSONL Writer (multi-file)

Emit a directory of newline-delimited GeoJSONL shards (OGR driver GeoJSONSeq, one Feature per line) — one shard per partition, no driver merge. Unlike the single-file GeoJSON writer — which merges every partition into one FeatureCollection file on the driver — the GeoJSONL writer's shards are the dataset, so write throughput scales with partitions. An optional maxRecordsPerFile splits a partition into several shards.

Both tiers take the same input shape and produce a directory of part-<uuid>.geojsonl shards that round-trips with the GeoJSON reader's directory mode:

  • Lightweight geojsonl_gbx — pure-Python, Serverless-safe (pyogrio); read back via geojson_gbx with option("multi","true").
  • Heavyweight geojsonl — DataSource V2 writer on native GDAL/OGR (JVM); read back via geojson_ogr with option("multi","true").

Which tier?

The lightweight (*_gbx) writer needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight geojsonl writer requires a classic x86 cluster with the GeoBrix JAR + GDAL init script; where available it encodes each shard with native GDAL/OGR on the JVM. Your compute usually decides the tier. See the Writers Overview.
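As a rough illustration, the tier choice can be written as a small lookup. `pick_formats` and its parameters are hypothetical names invented for this sketch, not part of GeoBrix; only the format names come from this page. Preferring the heavyweight tier whenever it is available is also an assumption, since the lightweight tier needs no extra setup and may be a reasonable choice there too.

```python
def pick_formats(classic_x86_cluster: bool, has_jar_and_gdal: bool) -> tuple[str, str]:
    """Return (write_format, read_format) for the given compute.

    Hypothetical helper: the arguments are assumptions about how you
    describe your cluster, not values GeoBrix exposes.
    """
    # Heavyweight tier: classic x86 cluster with the GeoBrix JAR + GDAL init script.
    if classic_x86_cluster and has_jar_and_gdal:
        return ("geojsonl", "geojson_ogr")
    # Serverless, standard (shared), and ARM clusters use the lightweight tier.
    return ("geojsonl_gbx", "geojson_gbx")


write_fmt, read_fmt = pick_formats(classic_x86_cluster=False, has_jar_and_gdal=False)
```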

Register first

Call register(spark) once before using any *_gbx format (see the Writers Overview).

Input geometry format

The vector writers take a geometry column encoded as WKB (binary) or WKT (text) — the same encodings the readers emit. The SRID/CRS is taken from the companion *_srid / *_srid_proj columns, so plain WKB or WKT is all that is needed. Extended forms (EWKB, EWKT) and GeoJSON-encoded geometry are not accepted as writer input.
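If your WKB comes from a system that may emit EWKB (PostGIS, for example), a pre-check along these lines can catch it before the write. This is a minimal sketch, assuming the common PostGIS convention in which an embedded SRID is flagged by bit `0x20000000` of the geometry-type word. `has_embedded_srid` is a hypothetical helper, and it does not detect the EWKB Z/M flags.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # PostGIS EWKB sets this bit when an SRID is embedded


def has_embedded_srid(wkb: bytes) -> bool:
    """Return True if the WKB header carries the EWKB SRID flag."""
    if len(wkb) < 5:
        raise ValueError("too short to be WKB")
    # Byte 0 is the byte order: 1 = little-endian, 0 = big-endian.
    fmt = "<I" if wkb[0] == 1 else ">I"
    (geom_type,) = struct.unpack_from(fmt, wkb, 1)
    return bool(geom_type & EWKB_SRID_FLAG)


# POINT(1 2) as plain little-endian WKB, and the same point as EWKB with SRID 4326.
plain_point = bytes.fromhex("0101000000" "000000000000f03f" "0000000000000040")
ewkb_point = bytes.fromhex("0101000020" "e6100000" "000000000000f03f" "0000000000000040")
```

Here `has_embedded_srid(plain_point)` is false and `has_embedded_srid(ewkb_point)` is true, so the second value would need converting to plain WKB first.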

Writing from a Databricks GEOMETRY / GEOGRAPHY column? Convert it to a supported interchange format first with an ST export function: use ST_AsBinary(geom) for WKB (recommended) or ST_AsText(geom) for WKT.
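One way to apply that conversion is to build the SQL expressions once and pass them to `selectExpr`. The sketch below only assembles the expression strings, so it runs without a cluster. `writer_shape_exprs`, the source column name `geom`, and the attribute names are hypothetical; `ST_AsBinary` and the `geom_0` / `geom_0_srid` naming come from this page.

```python
def writer_shape_exprs(geom_col: str, srid: str, attrs: list[str]) -> list[str]:
    """Build SQL expressions that coerce a GEOMETRY column to the writer shape.

    Hypothetical helper for illustration only.
    """
    return [
        f"ST_AsBinary({geom_col}) AS geom_0",  # GEOMETRY -> plain WKB
        f"'{srid}' AS geom_0_srid",            # CRS authority code as a string
        *attrs,                                # passed through as feature attributes
    ]


exprs = writer_shape_exprs("geom", "4326", ["name", "population"])
# On a cluster (not run here):
#   df.selectExpr(*exprs).write.format("geojsonl_gbx").mode("overwrite").save(path)
```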

Options

Option | Default | Behavior
maxRecordsPerFile | unset (one shard per partition) | If set, split each partition into multiple shards of at most this many features.
mode | overwrite | overwrite only; append is rejected. The target directory is cleared once before the shards land.
geometryType | inferred from the data | Override the OGR geometry type.
layerName | driver default | Output layer name where supported.

Schema

Input schema — a geometry column plus its SRID companion:

root
|-- <geom>: binary (WKB) or string (WKT) # the geometry
|-- <geom>_srid: string # REQUIRED — CRS authority code, e.g. "4326" ("0" if unknown)
|-- <geom>_srid_proj: string # optional — PROJ4 string, used as a CRS fallback when srid is "0"
|-- ...any other columns # written as feature attributes

The writer locates the geometry as the column X that has a companion X_srid column (geom_0 by convention — what the readers emit). The _srid / _srid_proj columns are consumed for the CRS and are not written as fields. Every other column is written as a feature attribute.
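The lookup rule can be mirrored in a few lines of Python to check a DataFrame's column list before writing. This is a sketch of the rule as described above, not the writer's actual code. `find_geometry_column` and `attribute_columns` are hypothetical helpers, and raising an error when zero or several columns match is an assumption of this sketch.

```python
def find_geometry_column(columns: list[str]) -> str:
    """Return the column X for which a companion X_srid column exists."""
    names = set(columns)
    candidates = [c for c in columns if f"{c}_srid" in names]
    if len(candidates) != 1:
        # Assumption: treat ambiguity as an error; the real writer may differ.
        raise ValueError(f"expected exactly one geometry column, found {candidates}")
    return candidates[0]


def attribute_columns(columns: list[str]) -> list[str]:
    """Return the columns that would be written as feature attributes."""
    geom = find_geometry_column(columns)
    consumed = {geom, f"{geom}_srid", f"{geom}_srid_proj"}
    return [c for c in columns if c not in consumed]


cols = ["geom_0", "geom_0_srid", "geom_0_srid_proj", "name", "population"]
geom_col = find_geometry_column(cols)
attrs = attribute_columns(cols)
```

In practice you would pass `df.columns` instead of the hard-coded list.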

Output: a directory (out/) containing one part-<uuid>.geojsonl shard per non-empty partition (plus an advisory _SUCCESS marker). Each shard is a sequence of newline-delimited GeoJSON Feature objects; attributes are written as feature properties.
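To make the output layout concrete, the sketch below builds a stand-in shard by hand and parses the directory with the standard library alone. The file name `part-0000.geojsonl` and the sample features are made up for illustration; real shards are named `part-<uuid>.geojsonl` and are produced by the writer. `read_shards` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def read_shards(directory: str) -> list[dict]:
    """Parse every part-*.geojsonl shard in a directory into Feature dicts."""
    features = []
    for shard in sorted(Path(directory).glob("part-*.geojsonl")):
        with shard.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines
                    features.append(json.loads(line))
    return features


# Stand-in data: two Features, one per line, with attributes under "properties".
sample = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
     "properties": {"name": "a", "population": 10}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [3.0, 4.0]},
     "properties": {"name": "b", "population": 20}},
]
out_dir = tempfile.mkdtemp()
Path(out_dir, "part-0000.geojsonl").write_text(
    "\n".join(json.dumps(f) for f in sample) + "\n", encoding="utf-8"
)
Path(out_dir, "_SUCCESS").touch()  # the marker is ignored by the glob above

features = read_shards(out_dir)
```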

Write it

Coerce your DataFrame/table to the writer shape, then write:

from pyspark.sql import functions as F

df.select(
    F.col("my_geom_wkb").alias("geom_0"),    # geometry as WKB (or WKT)
    F.lit("4326").alias("geom_0_srid"),      # CRS authority code
    F.lit("").alias("geom_0_srid_proj"),     # optional PROJ4 fallback
    "name", "population",                    # -> written as feature attributes
).write.format("geojsonl_gbx").mode("overwrite").save("/Volumes/cat/sch/vol/out")

Read it back with the GeoJSON reader's directory mode — multi=true enumerates the .geojsonl shards and parses each as a GeoJSONSeq sequence:

back = spark.read.format("geojson_gbx").option("multi", "true").load("/Volumes/cat/sch/vol/out")

How it scales

The single-file GeoJSON writer writes one merged file: each partition writes a fragment, then the driver concatenates all fragments into a single FeatureCollection. That driver-side merge is sequential — fine for "one file for the table", but a single-node bottleneck at scale.

The GeoJSONL writer writes one shard per partition with no driver merge. Each executor encodes its partition to a worker-local GeoJSONL shard and sequentially copies it into the output directory (rename-free, so safe on FUSE-mounted Unity Catalog Volumes / DBFS). Because GeoJSONL is splittable and concatenable, the directory of shards is the dataset — there is nothing to assemble on the driver, so write throughput scales with the number of partitions. Use df.repartition(n) to set shard granularity, or maxRecordsPerFile to cap features per shard (splitting a large partition into several shards). This holds for both tiers — the lightweight writer parallelizes per partition with pyogrio, the heavyweight writer with native GDAL/OGR on the JVM.
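The "concatenable" property is easy to demonstrate: if a downstream tool does need a single file, joining the shards line by line yields valid GeoJSONL with no re-encoding. The sketch below uses hand-made stand-in shards in a temporary directory; `concat_shards` and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def concat_shards(directory: str, target: str) -> int:
    """Stream all part-*.geojsonl shards into one file; return the shard count."""
    shard_count = 0
    with open(target, "wb") as out:
        for shard in sorted(Path(directory).glob("part-*.geojsonl")):
            with shard.open("rb") as fh:
                for line in fh:
                    # Guard against a shard whose last line lacks a newline.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
            shard_count += 1
    return shard_count


def feature(i: int) -> str:
    """Serialize a tiny stand-in Feature with an id attribute."""
    return json.dumps({
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [float(i), 0.0]},
        "properties": {"id": i},
    })


tmp = Path(tempfile.mkdtemp())
(tmp / "part-a.geojsonl").write_text(feature(1) + "\n" + feature(2) + "\n")
(tmp / "part-b.geojsonl").write_text(feature(3))  # no trailing newline
merged = tmp / "merged.geojsonl"

shard_count = concat_shards(str(tmp), str(merged))
merged_lines = merged.read_text().splitlines()
```

The merge streams line by line, so memory use stays flat regardless of shard size.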

Choose the single-file GeoJSON writer when you want one file; choose GeoJSONL for large, streaming, or highly parallel writes.

Typical pipeline: export a table to a sharded directory

Export a table of vector data to a directory of GeoJSONL shards — repartition to control write parallelism (one shard per partition):

Step | Code (lightweight)
1. Load a table | df = spark.table("main.geo.parcels")
2. Set parallelism | df = df.repartition(64)
3. Write sharded | df.write.format("geojsonl_gbx").mode("overwrite").save("/Volumes/main/geo/exports/parcels")
4. Read back | spark.read.format("geojson_gbx").option("multi","true").load("/Volumes/main/geo/exports/parcels")

df = spark.table("main.geo.parcels")          # vector data in a (Delta) table
df.repartition(64).write.format("geojsonl_gbx").mode("overwrite").save(
    "/Volumes/main/geo/exports/parcels"
)
# -> a directory of up to 64 part-<uuid>.geojsonl shards (one per non-empty partition),
#    written in parallel, no driver merge

On a classic x86 cluster, swap the format names for the heavyweight tier (geojsonl to write, geojson_ogr with multi=true to read back).

To cap features per shard regardless of partition size:

df.write.format("geojsonl_gbx").mode("overwrite").option(
    "maxRecordsPerFile", "50000"
).save("/Volumes/main/geo/exports/parcels")
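To estimate how many shards a write will produce, the rule can be approximated as below. This is a back-of-the-envelope sketch, assuming each non-empty partition is split into consecutive chunks of at most `maxRecordsPerFile` features; `expected_shards` is a hypothetical helper, and the writer's actual splitting may differ in detail.

```python
import math
from typing import Optional


def expected_shards(partition_sizes: list[int], max_records: Optional[int] = None) -> int:
    """Estimate the shard count from per-partition feature counts."""
    total = 0
    for n in partition_sizes:
        if n == 0:
            continue  # empty partitions produce no shard
        # Without a cap: one shard per partition; with a cap: ceil(n / cap) shards.
        total += 1 if max_records is None else math.ceil(n / max_records)
    return total


sizes = [120_000, 30_000, 0]
uncapped = expected_shards(sizes)
capped = expected_shards(sizes, max_records=50_000)
```

With these made-up sizes, the uncapped write yields 2 shards, and the 50,000-feature cap yields 4 (three for the first partition, one for the second).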

Next Steps