GeoJSON Writer
geojson_gbx is a pyogrio-backed, pure-Python DataSource V2 writer with the OGR
driver preset to GeoJSON. It round-trips with the matching geojson_gbx reader;
the schema (attrs..., geom_0, geom_0_srid, geom_0_srid_proj) is shared across all
lightweight vector readers and writers.
The single-file GeoJSON writer exists only in the lightweight tier; the heavyweight
tier does not implement a GDAL/OGR write path for it. For output at any scale,
including large datasets, use the sharded GeoJSONL writer (geojsonl_gbx in the
lightweight tier, geojsonl in the heavyweight tier), which is available in both tiers.
Call register(spark) once before using any *_gbx format (see the
Writers Overview).
The vector writers take a geometry column encoded as WKB (binary) or WKT (text) —
the same encodings the *_gbx readers emit (controlled by the reader's asWKB option). The
SRID/CRS is taken from the companion *_srid / *_srid_proj columns, so plain WKB or WKT is
all that is needed. Extended forms (EWKB, EWKT) and GeoJSON-encoded geometry are not
accepted as writer input.
Writing from a Databricks GEOMETRY / GEOGRAPHY column? Convert it to a supported interchange
format first with an ST export function: use ST_AsBinary(geom) for WKB (recommended) or
ST_AsText(geom) for WKT. Avoid ST_GeomAsEWKB, ST_AsEWKT, and ST_AsGeoJSON; those encodings
are not accepted as input.
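The input rules above can be checked on a sample value before a write. The sketch below is a hypothetical helper (classify_geometry_encoding is not part of geojson_gbx) and rests on two assumptions: EWKB is recognised only by the PostGIS SRID flag bit in the type word, and EWKT only by its leading "SRID=" prefix. It is a rough pre-flight check, not a validator.

```python
import struct

# Assumption: PostGIS-style EWKB sets this bit in the 32-bit type word
# when an SRID is embedded in the payload.
EWKB_SRID_FLAG = 0x20000000


def classify_geometry_encoding(value):
    """Best-effort guess: 'WKB', 'EWKB', 'WKT', 'EWKT', 'GeoJSON', or 'unknown'."""
    if isinstance(value, (bytes, bytearray)):
        # Byte 0 is the byte-order marker: 0 = big-endian, 1 = little-endian.
        if len(value) < 5 or value[0] not in (0, 1):
            return "unknown"
        fmt = "<I" if value[0] == 1 else ">I"
        (type_word,) = struct.unpack_from(fmt, value, 1)
        return "EWKB" if type_word & EWKB_SRID_FLAG else "WKB"
    if isinstance(value, str):
        text = value.lstrip()
        if text.upper().startswith("SRID="):
            return "EWKT"
        if text.startswith("{"):
            return "GeoJSON"
        # WKT starts with a geometry keyword such as POINT or POLYGON.
        return "WKT" if text[:1].isalpha() else "unknown"
    return "unknown"
```

Only values classified as WKB or WKT match what the writer accepts; anything else would need converting first, for example with ST_AsBinary as described above.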
The lightweight (*_gbx) writers need no JAR or init script and are the only option
on Serverless, standard (shared), and ARM clusters. The heavyweight raster/PMTiles writers
require a classic x86 cluster (JAR + GDAL init script) and, where available, use native
GDAL on the JVM. In practice your compute type usually decides the tier first, and data
scale decides second. See the Benchmarking page for timings and methodology.
Options
| Option | Default | Behavior |
|---|---|---|
| driverName | GeoJSON (preset) | OGR driver. Preset by this named writer; override only if needed. |
| mode | overwrite | overwrite only; append is rejected. |
| geometryType | inferred from the data | Override the OGR geometry type. |
| layerName | driver default | Output layer name where supported. |
Schema
Input schema — a geometry column plus its SRID companion:
root
 |-- <geom>: binary (WKB) or string (WKT)   # the geometry
 |-- <geom>_srid: string                    # REQUIRED: CRS authority code, e.g. "4326" ("0" if unknown)
 |-- <geom>_srid_proj: string               # optional: PROJ4 string, used as a CRS fallback when srid is "0"
 |-- ...any other columns                   # written as feature attributes
The writer locates the geometry as the column X that has a companion X_srid column, so the geometry column may be named anything (geom_0 by convention — what the *_gbx readers emit). <geom>_srid is required: it identifies the geometry column and supplies the CRS. The _srid / _srid_proj columns are consumed for the CRS and are not written as fields. Every other column is written as a feature attribute.
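The companion-column rule can be sketched in plain Python. This is an illustration of the rule as described, not the writer's actual code: split_writer_columns is a hypothetical name, and taking the first match when several columns have an X_srid companion is an assumption (the behaviour for multiple geometry columns is not stated here).

```python
def find_geometry_columns(columns):
    """Return every column X for which a companion 'X_srid' column exists."""
    names = set(columns)
    return [c for c in columns if f"{c}_srid" in names]


def split_writer_columns(columns):
    """Split column names into (geometry column, attribute columns).

    Assumption: the first column with an X_srid companion is the geometry.
    """
    geoms = find_geometry_columns(columns)
    if not geoms:
        raise ValueError("no geometry column: expected a column X with a companion X_srid")
    geom = geoms[0]
    # The companions are consumed for the CRS and are not written as fields.
    companions = {f"{geom}_srid", f"{geom}_srid_proj"}
    attrs = [c for c in columns if c != geom and c not in companions]
    return geom, attrs
```

Calling split_writer_columns(df.columns) on a DataFrame is one way to preview which columns would end up as feature attributes.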
Coerce your DataFrame/table to this shape before writing:
from pyspark.sql import functions as F

df.select(
    F.col("my_geom_wkb").alias("geom_0"),   # geometry as WKB (or WKT)
    F.lit("4326").alias("geom_0_srid"),     # CRS authority code
    F.lit("").alias("geom_0_srid_proj"),    # optional PROJ4 fallback
    "name", "population",                   # -> written as feature attributes
).write.format("geojson_gbx").mode("overwrite").save("/Volumes/cat/sch/vol/out")
From a Databricks GEOMETRY / GEOGRAPHY column. Databricks native spatial types are not
a writer input format directly — export them to WKB plus an SRID first. Use ST_AsWKB (the
equivalent of ST_AsBinary) for the geometry column and ST_SRID for the CRS, casting the
SRID to a string:
df.selectExpr(
    "ST_AsWKB(my_geom) AS geom_0",                      # GEOMETRY/GEOGRAPHY -> WKB
    "CAST(ST_SRID(my_geom) AS STRING) AS geom_0_srid",  # SRID -> string
    "'' AS geom_0_srid_proj",
    "name", "population",                               # -> feature attributes
).write.format("geojson_gbx").mode("overwrite").save("/Volumes/cat/sch/vol/out")
See Databricks Spatial for the full ST function reference.
The geometry must be WKB or WKT (see the input-geometry note above; convert a Databricks GEOMETRY with ST_AsBinary). The written file round-trips with the matching *_gbx reader — reading it back yields (…attributes, geom_0, geom_0_srid, geom_0_srid_proj).
Output: a single GeoJSON .geojson text file (a FeatureCollection); attributes are written as feature properties.
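Because the output is an ordinary FeatureCollection, it can be sanity-checked with the standard library alone. The helper below is a hypothetical sketch (summarize_feature_collection is not part of the package); it assumes the file fits in memory and takes the file's text rather than a path.

```python
import json


def summarize_feature_collection(text):
    """Return the feature count and the sorted property names found in a FeatureCollection."""
    doc = json.loads(text)
    if doc.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = doc.get("features", [])
    # A feature's "properties" member may be null, so fall back to an empty dict.
    names = sorted({key for feat in features for key in (feat.get("properties") or {})})
    return {"feature_count": len(features), "property_names": names}
```

For an exported file, pass the result of reading it, for example open(path, encoding="utf-8").read(); the property names should match the non-geometry columns of the DataFrame that was written.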
How it scales
Unlike a single-node pyogrio.write_* call that serializes one file on one machine, the
geojson_gbx writer runs as a Spark DataSource V2 two-phase write: each partition is
written concurrently by its executor to a scratch fragment, then the driver merges the
fragments into one output file. The merge is sequential and rename-free, so it is safe on
FUSE-mounted cloud storage (Unity Catalog Volumes, DBFS). Repartition the input to control
write parallelism.
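The two-phase pattern can be illustrated outside Spark. The sketch below is a stand-in, not the geojson_gbx implementation: threads play the role of executors, the newline-delimited fragment format is an assumption, and all function names are hypothetical. It shows the shape of the idea only: concurrent fragment writes, then one sequential merge that writes the final file directly and never renames anything.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_fragment(scratch_dir, index, features):
    """Phase 1: one task writes its partition as one JSON feature per line."""
    path = os.path.join(scratch_dir, f"part-{index:05d}.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for feature in features:
            fh.write(json.dumps(feature) + "\n")
    return path


def merge_fragments(fragment_paths, out_path):
    """Phase 2: stream fragments in order into one FeatureCollection, with no rename step."""
    first = True
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('{"type": "FeatureCollection", "features": [')
        for path in fragment_paths:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    out.write(("" if first else ",") + "\n" + line)
                    first = False
        out.write("\n]}\n")


def two_phase_write(partitions, out_path):
    """Write each partition concurrently, then merge sequentially into out_path."""
    with tempfile.TemporaryDirectory() as scratch:
        with ThreadPoolExecutor() as pool:
            # pool.map preserves input order, so fragments merge in partition order.
            paths = list(pool.map(
                lambda item: write_fragment(scratch, item[0], item[1]),
                enumerate(partitions),
            ))
        merge_fragments(paths, out_path)
```

In this sketch the number of partitions sets the phase 1 parallelism while the merge stays single-threaded, which mirrors why repartitioning the input changes write parallelism but not the final merge.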
Example
# Lightweight GeoJSON writer (pyogrio; OGR driver preset to "GeoJSON")
from databricks.labs.gbx.ds.register import register
register(spark)
src = f"{SAMPLE_DATA_BASE}/nyc/boroughs/nyc_boroughs.geojson"
df = spark.read.format("geojson_gbx").load(src)
out = "/tmp/nyc_boroughs.geojson" # a single named output file
df.write.format("geojson_gbx").mode("overwrite").save(out)
back = spark.read.format("geojson_gbx").load(out)
assert back.count() == df.count()
Typical pipeline: export a table to a single file
Export a table of vector data to one GeoJSON file — no coalesce needed (the writer merges partitions into a single file on commit):
df = spark.table("main.geo.boroughs") # vector data in a (Delta) table
df.write.format("geojson_gbx").mode("overwrite").save("/Volumes/main/geo/exports/boroughs.geojson")
Partitions are written concurrently and merged on commit, as described under How it scales. See Benchmarking for lightweight-versus-heavyweight export figures.