GeoJSON Writer

geojson_gbx — pyogrio-backed, pure-Python DataSource V2 writer with the OGR driver preset to GeoJSON. It round-trips with the matching geojson_gbx reader; the schema (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) is shared across all lightweight vector readers and writers.

Single-file write is lightweight-only

This format's single-file writer is lightweight-only: its GDAL/OGR write path isn't implemented in the heavyweight tier. For large output, or output at any scale, use the sharded GeoJSONL writer (geojsonl_gbx lightweight / geojsonl heavyweight), which is available in both tiers.

Register first

Call register(spark) once before using any *_gbx format (see the Writers Overview).

Input geometry format

The vector writers take a geometry column encoded as WKB (binary) or WKT (text) — the same encodings the *_gbx readers emit (controlled by the reader's asWKB option). The SRID/CRS is taken from the companion *_srid / *_srid_proj columns, so plain WKB or WKT is all that is needed. Extended forms (EWKB, EWKT) and GeoJSON-encoded geometry are not accepted as writer input.

Writing from a Databricks GEOMETRY / GEOGRAPHY column? Convert it to a supported interchange format first with an ST export function: use ST_AsBinary(geom) for WKB (recommended) or ST_AsText(geom) for WKT. Avoid ST_AsEWKB, ST_AsEWKT, and ST_AsGeoJSON, because the encodings they produce are not accepted as writer input.
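To make the plain-versus-extended distinction concrete, here is a minimal pre-flight check, assuming the PostGIS EWKB convention in which an embedded SRID is signalled by the 0x20000000 bit of the geometry type word, and the EWKT convention of a leading `SRID=<code>;` prefix. The helper names are hypothetical and not part of the library; the writer's own validation may differ.

```python
import struct

# PostGIS-style EWKB sets this bit in the 32-bit geometry type word
# when an SRID is embedded right after the header.
EWKB_SRID_FLAG = 0x20000000


def wkb_has_embedded_srid(blob: bytes) -> bool:
    """Return True if the WKB header carries the EWKB SRID flag."""
    if len(blob) < 5:
        raise ValueError("too short to be WKB")
    byte_order = blob[0]
    if byte_order not in (0, 1):
        raise ValueError("invalid WKB byte-order marker")
    # 1 = little-endian (NDR), 0 = big-endian (XDR)
    fmt = "<I" if byte_order == 1 else ">I"
    (type_word,) = struct.unpack_from(fmt, blob, 1)
    return bool(type_word & EWKB_SRID_FLAG)


def wkt_has_srid_prefix(text: str) -> bool:
    """Return True for EWKT, which prefixes the geometry with 'SRID=<code>;'."""
    return text.lstrip().upper().startswith("SRID=")
```

Running a check like this on a sample of rows before the write can turn an opaque writer failure into an early, readable error. If it flags a value, re-export the column with ST_AsBinary or ST_AsText and carry the SRID in the companion column instead.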

Benchmark & tradeoff

The lightweight (*_gbx) writers need no JAR or init script and are the only option on Serverless, standard (shared), and ARM clusters. The heavyweight raster/PMTiles writers require a classic x86 cluster (JAR + GDAL init script); where available they use native GDAL on the JVM. In practice, your compute type usually decides the tier first, and data scale second. See the Benchmarking page for timings and methodology.

Options

| Option | Default | Behavior |
| --- | --- | --- |
| driverName | GeoJSON (preset) | OGR driver. Preset by this named writer; override only if needed. |
| mode | overwrite | overwrite only; append is rejected. |
| geometryType | inferred from the data | Override the OGR geometry type. |
| layerName | driver default | Output layer name where supported. |
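As a sketch of how these options might be passed, the hypothetical helper below chains them through the standard Spark DataFrameWriter `option` call. The option names come from the table above; the helper itself and any values you pass (for example an OGR type name such as "MultiPolygon") are assumptions for illustration.

```python
def write_geojson(df, path, geometry_type=None, layer_name=None):
    """Write `df` with the geojson_gbx writer, setting only the options given.

    `df` is expected to be a Spark DataFrame already in the writer's input
    schema; this helper is illustrative and not part of the library.
    """
    # overwrite is the only supported mode; append is rejected by the writer
    writer = df.write.format("geojson_gbx").mode("overwrite")
    if geometry_type is not None:
        writer = writer.option("geometryType", geometry_type)
    if layer_name is not None:
        writer = writer.option("layerName", layer_name)
    writer.save(path)
```

Leaving both arguments unset keeps the defaults from the table: the geometry type is inferred from the data and the layer name falls back to the driver default.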

Schema

Input schema — a geometry column plus its SRID companion:

root
|-- <geom>: binary (WKB) or string (WKT) # the geometry
|-- <geom>_srid: string # REQUIRED — CRS authority code, e.g. "4326" ("0" if unknown)
|-- <geom>_srid_proj: string # optional — PROJ4 string, used as a CRS fallback when srid is "0"
|-- ...any other columns # written as feature attributes

The writer locates the geometry as the column X that has a companion X_srid column, so the geometry column may be named anything (geom_0 by convention — what the *_gbx readers emit). <geom>_srid is required: it identifies the geometry column and supplies the CRS. The _srid / _srid_proj columns are consumed for the CRS and are not written as fields. Every other column is written as a feature attribute.
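The discovery rule described above can be sketched in a few lines of plain Python. This is only an illustration of the stated rule (a column X is a geometry column when X_srid also exists), not the writer's actual implementation; the function name is hypothetical.

```python
def split_columns(columns):
    """Classify column names as geometry, CRS companions, or attributes.

    Applies the documented rule: a column X counts as a geometry column
    when a companion column named X_srid is also present.
    """
    names = set(columns)
    geometry = [c for c in columns if f"{c}_srid" in names]
    companions = set()
    for g in geometry:
        companions.add(f"{g}_srid")
        if f"{g}_srid_proj" in names:  # the PROJ4 fallback is optional
            companions.add(f"{g}_srid_proj")
    attributes = [c for c in columns if c not in companions and c not in geometry]
    return geometry, sorted(companions), attributes
```

For a DataFrame, `split_columns(df.columns)` would show which columns the rule treats as geometry and which would end up as feature attributes, a quick sanity check before writing.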

Coerce your DataFrame/table to this shape before writing:

from pyspark.sql import functions as F

df.select(
    F.col("my_geom_wkb").alias("geom_0"),      # geometry as WKB (or WKT)
    F.lit("4326").alias("geom_0_srid"),        # CRS authority code
    F.lit("").alias("geom_0_srid_proj"),       # optional PROJ4 fallback
    "name", "population",                      # -> written as feature attributes
).write.format("geojson_gbx").mode("overwrite").save("/Volumes/cat/sch/vol/out")

From a Databricks GEOMETRY / GEOGRAPHY column. Databricks native spatial types are not a writer input format directly — export them to WKB plus an SRID first. Use ST_AsWKB (the equivalent of ST_AsBinary) for the geometry column and ST_SRID for the CRS, casting the SRID to a string:

df.selectExpr(
    "ST_AsWKB(my_geom) AS geom_0",                       # GEOMETRY/GEOGRAPHY -> WKB
    "CAST(ST_SRID(my_geom) AS STRING) AS geom_0_srid",   # SRID -> string
    "'' AS geom_0_srid_proj",
    "name", "population",                                # -> feature attributes
).write.format("geojson_gbx").mode("overwrite").save("/Volumes/cat/sch/vol/out")

See Databricks Spatial for the full ST function reference.

The geometry must be WKB or WKT (see the input-geometry note above; convert a Databricks GEOMETRY with ST_AsBinary). The written file round-trips with the matching *_gbx reader — reading it back yields (…attributes, geom_0, geom_0_srid, geom_0_srid_proj).

Output: a single GeoJSON .geojson text file (a FeatureCollection); attributes are written as feature properties.
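Because the output is an ordinary GeoJSON FeatureCollection, it can be inspected with nothing but the standard library. The sketch below assumes the file is reachable through a local path (for example a FUSE-mounted Volume path) and is small enough to load into memory; the function name is hypothetical.

```python
import json
from collections import Counter


def summarize_feature_collection(path):
    """Load a GeoJSON file and report feature count, geometry types, and property names."""
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    if doc.get("type") != "FeatureCollection":
        raise ValueError(f"expected a FeatureCollection, got {doc.get('type')!r}")
    features = doc.get("features", [])
    # A feature's geometry may be null in GeoJSON, hence the `or {}` guard
    geometry_types = Counter((f.get("geometry") or {}).get("type") for f in features)
    property_names = sorted({k for f in features for k in (f.get("properties") or {})})
    return {
        "count": len(features),
        "geometry_types": dict(geometry_types),
        "properties": property_names,
    }
```

The property names in the summary should match the non-geometry, non-SRID columns of the DataFrame that was written.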

How it scales

Unlike a single-node pyogrio.write_* call that serializes one file on one machine, the geojson_gbx writer runs as a Spark DataSource V2 two-phase write: each partition is written concurrently by its executor to a scratch fragment, then the driver merges the fragments into one output file. The merge is sequential and rename-free, so it is safe on FUSE-mounted cloud storage (Unity Catalog Volumes, DBFS). Repartition the input to control write parallelism.
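The two-phase pattern can be illustrated with a toy, single-machine version in plain Python. This is a sketch of the general idea only: threads stand in for executors, the one-feature-per-line fragment format is an assumption, and none of it reflects the library's internal code.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_fragment(scratch_dir, part_id, features):
    """Phase 1: each task writes its own partition to a private scratch fragment."""
    frag = scratch_dir / f"part-{part_id:05d}.jsonl"
    with frag.open("w", encoding="utf-8") as fh:
        for feature in features:
            fh.write(json.dumps(feature) + "\n")
    return frag


def merge_fragments(fragments, out_path):
    """Phase 2: stream fragments, in order, straight into the final file (no rename)."""
    with out_path.open("w", encoding="utf-8") as out:
        out.write('{"type": "FeatureCollection", "features": [\n')
        first = True
        for frag in sorted(fragments):
            with frag.open(encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    if not first:
                        out.write(",\n")
                    out.write(line)
                    first = False
        out.write("\n]}\n")


def two_phase_write(partitions, out_path):
    """Write partitions concurrently, then merge them sequentially into one file."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        with ThreadPoolExecutor() as pool:
            fragments = list(
                pool.map(lambda item: write_fragment(scratch, item[0], item[1]),
                         enumerate(partitions))
            )
        merge_fragments(fragments, Path(out_path))
```

The parallel phase scales with the number of partitions, while the merge is a single sequential pass. That is why repartitioning the input changes write parallelism but not the final single-file result.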

Example

# Lightweight GeoJSON writer (pyogrio; OGR driver preset to "GeoJSON")
from databricks.labs.gbx.ds.register import register

register(spark)  # once per session, before using any *_gbx format

# SAMPLE_DATA_BASE is the sample-data root, defined outside this snippet
src = f"{SAMPLE_DATA_BASE}/nyc/boroughs/nyc_boroughs.geojson"
df = spark.read.format("geojson_gbx").load(src)

out = "/tmp/nyc_boroughs.geojson"  # a single named output file
df.write.format("geojson_gbx").mode("overwrite").save(out)

# Round-trip check: reading the file back yields the same number of features
back = spark.read.format("geojson_gbx").load(out)
assert back.count() == df.count()

Typical pipeline: export a table to a single file

Export a table of vector data to one GeoJSON file — no coalesce needed (the writer merges partitions into a single file on commit):

df = spark.table("main.geo.boroughs")  # vector data in a (Delta) table
df.write.format("geojson_gbx").mode("overwrite").save("/Volumes/main/geo/exports/boroughs.geojson")

See Benchmarking for light-vs-heavy export figures.