Writers Overview

GeoBrix provides Spark writers for geospatial file formats.

Named Vector Formats

The vector readers and writers share a small column contract for geometry and its coordinate reference system (CRS). A vector frame carries one geometry column plus up to two CRS-companion columns:

<geom>            binary (WKB)  or  string (WKT)   # the geometry
<geom>_srid       string  — REQUIRED              # CRS authority code, e.g. "4326" ("0" if unknown)
<geom>_srid_proj  string  — optional              # PROJ4 string; CRS fallback when srid is "0"

The geometry encoding is read from the column's type — binary is WKB, string is WKT. The CRS is EPSG:<srid> when <geom>_srid is set, falling back to the PROJ4 string in <geom>_srid_proj when the srid is "0". Every other column is written as an attribute; <geom>_srid / <geom>_srid_proj are CRS metadata and are not written as fields.
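
The resolution order described above can be sketched in plain Python. This is only an illustration of the column contract, not GeoBrix's implementation: `geometry_encoding` and `resolve_crs` are hypothetical helper names, and returning `None` when neither CRS column is usable is an assumption.

```python
def geometry_encoding(column_type):
    """Map the geometry column's type name to its encoding (binary -> WKB, string -> WKT)."""
    encodings = {"binary": "WKB", "string": "WKT"}
    if column_type not in encodings:
        raise ValueError(f"geometry column must be binary or string, got {column_type!r}")
    return encodings[column_type]


def resolve_crs(srid, srid_proj=None):
    """Prefer EPSG:<srid>; fall back to the PROJ4 string when srid is "0"."""
    if srid and srid != "0":
        return f"EPSG:{srid}"
    # Assumption: with no usable CRS information, return None instead of raising.
    return srid_proj or None
```

For example, `resolve_crs("4326")` yields `"EPSG:4326"`, while `resolve_crs("0", "+proj=longlat +datum=WGS84 +no_defs")` yields the PROJ4 string unchanged.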

What each format expects when written differs — how (and whether) the geometry is named, field-name and type limits, and how the CRS is stored:

Format	Geometry on output	Field-name / type limits	CRS
Shapefile	The shape record — no named geometry field; one geometry type per file.	Attribute names truncated to 10 characters (`.dbf`); limited types. Output is a sidecar set (`.shp/.shx/.dbf/.prj/.cpg`).	`.prj` sidecar.
GeoJSON / GeoJSONL	Structural GeoJSON `geometry` member — no named field.	Properties keep full names; JSON types (string/number/bool/null).	WGS84 (EPSG:4326) per RFC 7946; a `crs` member otherwise.
GeoPackage	A named geometry column, default `geom`.	Full field names; rich types (int64, real, text, blob, datetime).	Stored per geometry column (`gpkg_spatial_ref_sys`).
FileGDB	A named geometry field, default `SHAPE`.	Esri limits (≤ 64 chars; reserved names like `OBJECTID` disallowed).	Per feature class.

What `.save(path)` produces

The format (driver) is chosen by the writer name, not the path extension. Some formats write a single file and others write a directory, so name the path accordingly:

Writer	`.save(path)` writes	Recommended `path`
`gpkg_gbx`	a single file	`…/merged.gpkg`
`geojson_gbx`	a single file	`…/merged.geojson`
`shapefile_gbx`	a directory holding the `.shp/.shx/.dbf/.prj/.cpg` bundle; or a single `.shp.zip` when `zip=true`	`…/roads` or `…/roads.shp.zip`
`geojsonl_gbx`	a directory of `part-*.geojsonl` shards	`…/edges`
`file_gdb_gbx`	a directory `…/name.gdb`	`…/roads.gdb`

The extension is cosmetic for writing (the driver decides the format), but include it on single-file outputs so the result is recognizable and portable.

Output naming for single-file writers

The single-file vector writers (gpkg_gbx, geojson_gbx, shapefile_gbx with zip=true, and file_gdb_gbx) apply an adaptive naming contract that resolves the final output path from the .save(path) argument and an optional fileName option. Extension auto-completion handles partial names so you never produce a misnamed file.

Resolution rules (evaluated in order):

Case	Condition	Resolved output
1	`.option("fileName", name)` given	`path` is treated as the parent directory (created if missing); output = `path/complete(name)`
2	No `fileName`; `path` is an existing directory	output = `path/complete(basename(path))` — named after the directory, written under it
3	No `fileName`; `path` does not yet exist (stem/file path)	output = `complete(path)`; parent directory created if missing

Extension auto-completion (complete(name)) appends only the missing parts of the canonical extension for the format — so roads → roads.shp.zip, roads.shp → roads.shp.zip, roads.shp.zip → unchanged. If name ends with a different recognized geo extension (for example, .gpkg passed to the Shapefile writer), the writer raises a clear error rather than silently producing a doubly-suffixed name.
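
A minimal sketch of the auto-completion rule, assuming the canonical extension is passed in explicitly. The two-argument signature, the `RECOGNIZED_GEO_EXTS` set, and the error message are all hypothetical; the writers' real helper may differ.

```python
import os

# Assumption: an illustrative set of "recognized geo extensions".
RECOGNIZED_GEO_EXTS = {".gpkg", ".geojson", ".geojsonl", ".shp", ".gdb"}


def complete(name, canonical_ext):
    """Append only the missing parts of canonical_ext (for example ".shp.zip") to name."""
    parts = ["." + piece for piece in canonical_ext.split(".") if piece]
    lowered = name.lower()
    # Find the longest leading run of the canonical parts that name already ends with.
    for kept in range(len(parts), 0, -1):
        if lowered.endswith("".join(parts[:kept])):
            return name + "".join(parts[kept:])
    # No canonical part present: reject a different geo extension, else append everything.
    foreign = os.path.splitext(lowered)[1]
    if foreign in RECOGNIZED_GEO_EXTS:
        raise ValueError(f"{name!r} ends with {foreign}, expected {canonical_ext}")
    return name + canonical_ext
```

With `canonical_ext=".shp.zip"`, the inputs `roads`, `roads.shp`, and `roads.shp.zip` all resolve to `roads.shp.zip`, and `roads.gpkg` raises `ValueError`.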

Examples:

# Case 1 — explicit fileName, path is the parent dir (created as needed)
df.write.format("gpkg_gbx").option("fileName", "districts").mode("overwrite") \
    .save("/Volumes/main/geo/exports")
# -> /Volumes/main/geo/exports/districts.gpkg

# Case 2 — path is an existing directory; output named after the dir, under it
df.write.format("geojson_gbx").mode("overwrite") \
    .save("/Volumes/main/geo/exports")
# -> /Volumes/main/geo/exports/exports.geojson  (if exports/ already exists)

# Case 3 — stem path; extension completed automatically
df.write.format("shapefile_gbx").option("zip", "true").mode("overwrite") \
    .save("/Volumes/main/geo/exports/roads")
# -> /Volumes/main/geo/exports/roads.shp.zip
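
The three resolution cases can also be sketched as one pure function. This is a hedged illustration, not the writers' code: `resolve_output` and `gpkg_complete` are hypothetical names, the directory check is injected so the logic can be tried without touching storage, and the sketch does not create any directories.

```python
import os
import posixpath


def gpkg_complete(name):
    # Stand-in completer for illustration: append ".gpkg" when it is missing.
    return name if name.endswith(".gpkg") else name + ".gpkg"


def resolve_output(path, complete, file_name=None, is_dir=os.path.isdir):
    """Resolve the final output path from .save(path) and an optional fileName."""
    if file_name is not None:
        # Case 1: path is the parent directory, fileName names the output.
        return posixpath.join(path, complete(file_name))
    if is_dir(path):
        # Case 2: existing directory, so the output is named after it and written under it.
        base = posixpath.basename(path.rstrip("/"))
        return posixpath.join(path, complete(base))
    # Case 3: stem or file path, so only the extension is completed.
    return complete(path)
```

Injecting `is_dir` keeps the rule order visible: the `fileName` option wins, then the existing-directory check, then the stem fallback.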

The nameCol option on the sharded and raster writers (geojsonl_gbx, raster_gbx, gtiff_gbx) is a separate, per-row concept and is not affected by this contract.

Choosing a writer for large datasets

The single-file writers (geojson_gbx, shapefile_gbx, gpkg_gbx) assemble the output on the driver, but they stream the partitions through one pass with bounded memory (the whole dataset is never held at once), so they scale much further than a naive single-node merge. They are still single-node writes, so:

shapefile_gbx — streamed, but the format hard-caps each .shp/.dbf at 2 GB (the .dbf is one fixed-width record per row), so very large attribute tables eventually hit that ceiling.
geojson_gbx / gpkg_gbx — streamed, single file, no hard size cap; bounded by the single-node write throughput.
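
To judge whether a dataset is likely to reach the Shapefile ceiling, the .dbf size can be estimated from the row count and the fixed field widths. The sketch below assumes the standard dBASE layout (a 32-byte header, a 32-byte descriptor per field, and a 1-byte deletion flag per record) and treats 2 GB as 2 × 1024³ bytes; the function names are hypothetical and the result is an approximation.

```python
DBF_CAP_BYTES = 2 * 1024**3  # assumed reading of the 2 GB per-file ceiling


def estimate_dbf_bytes(row_count, field_widths):
    """Approximate .dbf size for row_count rows of fixed-width fields."""
    header = 32 + 32 * len(field_widths) + 1  # file header + field descriptors + terminator
    record = 1 + sum(field_widths)            # deletion flag + fixed-width field bytes
    return header + row_count * record + 1    # + end-of-file marker


def fits_in_shapefile(row_count, field_widths):
    """True when the estimated .dbf stays under the assumed cap."""
    return estimate_dbf_bytes(row_count, field_widths) < DBF_CAP_BYTES
```

For instance, ten fields of 254 bytes each give a 2,541-byte record, so ten million rows would need roughly 25 GB and would not fit in one Shapefile.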

For the largest datasets, prefer one of these two writers: the first spreads work across partitions, and the second is the best fit when a single file is required.

geojsonl_gbx — one shard per partition (no driver assembly at all); throughput scales with partitions. The most scalable option.
gpkg_gbx — a single file with no 2 GB cap, a good single-file choice when one file is required.

(file_gdb_gbx uses the native GDAL path and currently assembles the output in memory, which makes it a heavy-tier writer; on Serverless, prefer the writers above.)

Large single-file writes may need a larger cluster

The single-file writers assemble the output in Spark's Python worker processes, so a very large result (for example, a Shapefile whose .dbf runs to ~1 GB+) can exceed the worker memory on Serverless and fail mid-write — seen as a CancelledKeyException (the driver is lost) or a "Python worker exited unexpectedly" crash. If you hit that:

Run on a classic cluster with a larger node — the proven path for the largest single-file writes.
Or avoid single-node assembly entirely: use geojsonl_gbx (shards per partition) or gpkg_gbx, which scale on Serverless without this ceiling.

Note: high-memory Serverless ("High: 32 GB") raises the notebook REPL memory but not the Spark Python-worker memory the writer assembles in, so it does not extend this ceiling — prefer a classic cluster or the sharded writers. The write itself is correct at scale (it completes on a classic cluster); this is a memory ceiling, not a writer limit.

The sections below cover the Lightweight (pyrx) tier first, then the Heavyweight (rasterx) tier.

The lightweight tier ships native Python DataSource V2 writers — no JAR, no init script.

Why this scales beyond a single node

These are Spark DataSource V2 writers, not single-node pyogrio/rasterio wrappers. Each partition is written concurrently by its executor, then the driver merges the parts into the final output (a two-phase write). The PMTiles writer goes further with distributed spatial sharding — partitioning tiles into bounded per-shard archives written in parallel and then cataloging them — which scales horizontally instead of building one memory-bound archive on a single node. Merges are sequential and rename-free, so they are safe on FUSE-mounted Unity Catalog Volumes / DBFS.

The two-phase staging is isolated per write: each write stages its parts under a unique, hidden scratch namespace, so concurrent jobs — or multiple users — writing to the same output location never see or overwrite one another's in-progress data. Scratch left behind by an interrupted job is reclaimed automatically on a later write to the same location.
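
The staging pattern can be illustrated with a small local-filesystem sketch. It is not GeoBrix's implementation: parts are written in a plain loop instead of concurrently by executors, and `two_phase_write`, the `.staging-` prefix, and the line-per-row part format are assumptions made for the example.

```python
import os
import shutil
import tempfile
import uuid


def two_phase_write(partitions, out_path):
    """Stage one part per partition, then merge the parts sequentially into out_path."""
    parent = os.path.dirname(os.path.abspath(out_path))
    os.makedirs(parent, exist_ok=True)
    # Phase 1: stage parts under a unique, hidden scratch directory for this write.
    scratch = os.path.join(parent, f".staging-{uuid.uuid4().hex}")
    os.makedirs(scratch)
    try:
        parts = []
        for index, rows in enumerate(partitions):
            part = os.path.join(scratch, f"part-{index:05d}")
            with open(part, "w", encoding="utf-8") as handle:
                handle.writelines(f"{row}\n" for row in rows)
            parts.append(part)
        # Phase 2: merge by sequential copy, never by rename.
        with open(out_path, "wb") as merged:
            for part in parts:
                with open(part, "rb") as handle:
                    shutil.copyfileobj(handle, merged)
    finally:
        # Always reclaim this write's scratch space.
        shutil.rmtree(scratch, ignore_errors=True)
    return out_path


def demo():
    """Write two partitions and return the merged text plus any leftover scratch dirs."""
    with tempfile.TemporaryDirectory() as root:
        out = two_phase_write([["a", "b"], ["c"]], os.path.join(root, "merged.txt"))
        with open(out, encoding="utf-8") as handle:
            text = handle.read()
        leftovers = [n for n in os.listdir(root) if n.startswith(".staging-")]
    return text, leftovers
```

Because each call uses its own uniquely named scratch directory, two writes to the same parent never share staged parts.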

Unlike the heavyweight writers (auto-discovered from the JAR), the lightweight Python DataSources are not auto-registered — call register(spark) once per session before using any *_gbx format:

from databricks.labs.gbx.ds.register import register
register(spark)

To register only the formats this session uses, pass only= (by format name, with or without the _gbx suffix):

register(spark, only=["raster_gbx", "geojson_gbx"])

An unrecognized format raises ValueError.
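
The name handling described for `only=` can be mimicked as follows. This is a hypothetical stand-in, not the library's code: `normalize_only` is an invented name, and the format set is copied from the lightweight Available Writers table on this page.

```python
# Lightweight format names as listed in the Available Writers table.
LIGHTWEIGHT_FORMATS = {
    "raster_gbx", "gtiff_gbx", "pmtiles_gbx", "vector_gbx", "shapefile_gbx",
    "geojson_gbx", "geojsonl_gbx", "gpkg_gbx", "file_gdb_gbx",
}


def normalize_only(names):
    """Accept names with or without the _gbx suffix; reject unknown formats."""
    resolved = []
    for name in names:
        full = name if name.endswith("_gbx") else f"{name}_gbx"
        if full not in LIGHTWEIGHT_FORMATS:
            raise ValueError(f"unrecognized format: {name!r}")
        resolved.append(full)
    return resolved
```

Under these assumptions, `normalize_only(["raster", "geojson_gbx"])` returns both names in their suffixed form, and a name such as `"parquet"` raises `ValueError`.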

Available Writers

Writer	Format Name	Description
Raster Writer	`raster_gbx`	Pure-Python catch-all raster writer (no JAR; DataSource V2)
GeoTIFF Writer	`gtiff_gbx`	Pure-Python GeoTIFF writer (driver forced to GTiff)
PMTiles Writer	`pmtiles_gbx`	Package a tile pyramid into spatially-sharded PMTiles archives + a catalog.
Vector Writer	`vector_gbx`	Pure-Python generic vector writer (pyogrio); any OGR-supported driver.
Shapefile Writer	`shapefile_gbx`	Pure-Python Shapefile writer (OGR driver: `ESRI Shapefile`).
GeoJSON Writer	`geojson_gbx`	Pure-Python GeoJSON writer (OGR driver: `GeoJSON`) — single merged file.
GeoJSONL Writer	`geojsonl_gbx` / `geojsonl_ogr`	Multi-file newline-delimited GeoJSONL — one shard per partition, no driver merge. Available in both tiers.
GeoPackage Writer	`gpkg_gbx`	Pure-Python GeoPackage writer (OGR driver: `GPKG`).
GeoDatabase Writer	`file_gdb_gbx`	Hybrid File Geodatabase writer (OGR driver: `OpenFileGDB`) — requires the native GDAL libraries (e.g. as provided by the heavyweight tier), not pure-Python.

See the Raster Writer page for full usage, options, and the nameCol / ext controls.

Benchmarks

Each *_gbx lightweight writer is benchmarked against its heavyweight counterpart on the same cluster, same corpus, and same row counts, with the median of measured iterations reported. Parity is a hard gate — the two tiers' outputs must decode to the same records (and, where applicable, byte-identical tile data) or the run fails immediately.

Per-format timing results for the vector writers (GeoJSON, Shapefile, GeoPackage, GeoJSONL) and the PMTiles writer are published on the Benchmarking page, alongside the full methodology and raster-function results.

Heavyweight writers are implemented as Spark DataSource V2 connectors backed by GDAL (raster) and a native Scala PMTiles encoder. They are registered automatically when the GeoBrix JAR is on the classpath — no additional configuration needed. For vector output, the heavyweight tier provides the sharded geojsonl_ogr writer; other vector output flows through the lightweight writers, Spark's built-in writers, or the product's native geospatial writers.

Available Writers

Writer	Format Name	Description
Raster Writer	`gdal`	Emits each row's tile using the GDAL driver recorded in the tile's metadata.
GeoTIFF Writer	`gtiff_gdal`	Named writer for GeoTIFF files (preset `driver="GTiff"`).
PMTiles Writer	`pmtiles`	Streams a `(z, x, y, bytes)` tile set into a single PMTiles v3 archive file.
GeoJSONL Writer	`geojsonl_ogr`	Multi-file newline-delimited GeoJSONL — one shard per partition. The heavyweight tier's first vector writer; the lightweight peer is `geojsonl_gbx`.

At a Glance

GDAL writer:

Input schema: exactly (source: string, tile: struct) — the reader's default schema. Don't .select() or add columns.
Mode: .mode("append") only.
Format on disk: comes from the driver stored in each tile's metadata. ext controls the filename suffix only.
Target directory: must already exist; the writer does not create Volume roots.

PMTiles writer:

Input schema: exactly (z: int, x: int, y: int, bytes: binary).
Mode: .mode("overwrite") is required; default ErrorIfExists is rejected upstream by Spark.
Output path: the final .pmtiles file, not a directory. Read support is not implemented in 0.4.0.
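
The schema and mode constraints above lend themselves to a pre-flight check before calling `.save()`. The sketch below is a hypothetical helper working on plain `(name, type)` pairs; it assumes that "exactly" also means column order matters, which the writers may or may not enforce.

```python
# Expected input contract per heavyweight writer, taken from the lists above.
EXPECTED = {
    "gdal": {
        "columns": [("source", "string"), ("tile", "struct")],
        "mode": "append",
    },
    "pmtiles": {
        "columns": [("z", "int"), ("x", "int"), ("y", "int"), ("bytes", "binary")],
        "mode": "overwrite",
    },
}


def preflight(fmt, columns, mode):
    """Return a list of problems; an empty list means the write looks well-formed."""
    spec = EXPECTED[fmt]
    problems = []
    if list(columns) != spec["columns"]:
        problems.append(f"{fmt}: expected columns {spec['columns']}, got {list(columns)}")
    if mode != spec["mode"]:
        problems.append(f"{fmt}: expected mode {spec['mode']!r}, got {mode!r}")
    return problems
```

On a real DataFrame the pairs could be derived from `df.schema`, mapping each field to its name and simple type string.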

Heavyweight vector writing

The heavyweight tier writes raster (GDAL), PMTiles, and — its first vector writer — the sharded GeoJSONL writer (geojsonl_ogr). The single-file vector formats (GeoJSON, Shapefile, GeoPackage, File Geodatabase) are lightweight-only to write — those OGR write paths aren't implemented in the heavyweight tier. For single-file vector output, use the lightweight vector writers: Vector (vector_gbx), Shapefile (shapefile_gbx), GeoJSON (geojson_gbx), GeoPackage (gpkg_gbx), or GeoDatabase (file_gdb_gbx).

Hybrid writer: GeoDatabase (file_gdb_gbx)

The GeoDatabase writer (file_gdb_gbx) is a hybrid: it is registered in the lightweight tier but, unlike the other *_gbx vector writers, it is not pure-Python — it drives the native GDAL libraries (osgeo) to write the File Geodatabase. It therefore runs only where those natives are present, such as on a cluster with the heavyweight tier (the GeoBrix JAR) installed. On a runtime without native GDAL it raises a clear error at write time; use another single-file vector writer (GeoPackage, GeoJSON, Shapefile) there.
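
One way to choose a fallback ahead of time is to check whether the `osgeo` bindings are importable. This is a sketch under assumptions: finding the Python module does not guarantee that the OpenFileGDB driver is usable, the helper names are hypothetical, and GeoPackage is picked as the fallback only because it is one of the alternatives named above.

```python
import importlib.util


def has_native_gdal():
    """True when the osgeo Python bindings can be located on this runtime."""
    return importlib.util.find_spec("osgeo") is not None


def pick_single_file_writer(gdal_available=None):
    """Prefer file_gdb_gbx when native GDAL is present, else fall back to gpkg_gbx."""
    if gdal_available is None:
        gdal_available = has_native_gdal()
    return "file_gdb_gbx" if gdal_available else "gpkg_gbx"
```

The returned name can then be passed to `df.write.format(...)`, so the same notebook runs on runtimes with and without the native libraries.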

Benchmarks

Each heavyweight writer is benchmarked against its *_gbx lightweight counterpart on the same cluster, same corpus, and same row counts, with the median of measured iterations reported. Parity (matching output records, and byte-identical tile data where applicable) is a hard gate — a mismatch fails the run immediately.

Per-format timing results — the PMTiles writer and the sharded GeoJSONL writer — are published on the Benchmarking page, alongside the full methodology and raster-function results.

Next Steps

Raster Writer — full details, options, and examples for raster output.
GeoTIFF Writer — full details, options, and examples for GeoTIFF output.
PMTiles Writer — full details, options, and examples for tile-pyramid output.
Readers Overview — the corresponding read paths.
Helios notebooks — worked end-to-end example using the PMTiles writer to package raster and terrain pyramids over a San Francisco AOI.
