Raster Writer
Write raster tiles from the shared (source, tile) schema. The lightweight
raster_gbx writer (rasterio-backed, JAR-free, supports overwrite) and the
heavyweight GDAL-backed gdal writer (append-only) take the same schema — see
Choosing an Execution Tier.
The lightweight (*_gbx) writers need no JAR or init script and are the only option
on Serverless, standard (shared), and ARM clusters. The heavyweight raster/PMTiles writers
require a classic x86 cluster (JAR + GDAL init script); where available they use native
GDAL on the JVM. Your compute therefore usually decides the tier first; data scale is the second consideration. See the
Benchmarking page for timings and methodology.
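The compute rule above can be written down as a small decision helper. This is only a sketch: `available_writer_formats` and its boolean parameters are hypothetical names chosen for illustration, not part of GeoBrix, and the function encodes nothing beyond the constraints stated in this section.

```python
def available_writer_formats(serverless: bool, shared_access: bool, arm: bool,
                             jar_and_gdal_init_installed: bool) -> list:
    """List the generic raster writer formats usable on a given compute setup.

    Hypothetical helper: the parameter names are assumptions for illustration.
    """
    # The lightweight writer needs no JAR or init script, so it is always listed.
    formats = ["raster_gbx"]
    # The heavyweight writer needs a classic x86 cluster with the JAR and the
    # GDAL init script; Serverless, standard (shared), and ARM rule it out.
    classic_x86 = not (serverless or shared_access or arm)
    if classic_x86 and jar_and_gdal_init_installed:
        formats.append("gdal")
    return formats
```

When both formats come back, the choice between them is the data-scale question covered by the Benchmarking page.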
Schema
Input schema — exactly (source, tile):
root
|-- source: string
|-- tile: struct
| |-- cellid: bigint
| |-- raster: binary
| |-- metadata: map<string,string>
This is the exact schema the raster readers emit. The writer requires these two columns and nothing else — extra OR missing columns both raise an error (there is no implicit projection). The on-disk format comes from tile.metadata (the GDAL driver/extension recorded at read time); the ext option controls only the filename suffix.
Typically you write a reader's output unchanged. To control output filenames, overwrite the source value in place and point the nameCol option at it; do not leave an added column in the DataFrame (that breaks the exact-schema check):
# (source, tile) straight from a reader -> write as-is
df.write.format("gtiff_gbx").mode("append").save("/Volumes/cat/sch/vol/out")
# name outputs from an existing column: overwrite `source`, keep only
# (source, tile), then point nameCol at `source`
(df.withColumn("source", df["scene_id"])
 .select("source", "tile")
 .write.format("gtiff_gbx").mode("append")
 .option("nameCol", "source")
 .save("/Volumes/cat/sch/vol/out"))
Output: one raster file per input row, written under the target directory; the file format/extension is whatever tile.metadata records (e.g. GeoTIFF for gtiff_gbx).
Options
Both tiers require the exact (source, tile) schema and take the same two writer options. The on-disk driver, compression, and block layout come from tile.metadata, not from writer options (see Output encoding).
| Option | Default | Description |
|---|---|---|
| ext | "tif" | Filename suffix appended to each written file. Does not change the on-disk format; that is driven by tile.metadata["driver"]. |
| nameCol | unset | Name of an existing string column whose value becomes the output filename (without extension). Must be one of the two table columns; in practice, overwrite source (see Filename Control). When unset, an opaque unique name is used. |
Lightweight (raster_gbx)
Supports .mode("overwrite"). Passes whole-file GeoTIFF tiles through verbatim; re-encodes non-GTiff tiles (e.g. COG) via rasterio.
Heavyweight (gdal)
Append-only — .mode("overwrite") is not supported. Re-encodes every tile through native GDAL (decoded pixels are identical to the lightweight output).
- Lightweight · raster_gbx
- Heavyweight · gdal
Pure-Python/PySpark raster writer — the lightweight tier's drop-in for the
GDAL-backed gdal writer. Requires the exact (source, tile) schema,
the same as the heavy writer. Writer options are path / nameCol / ext; the
on-disk format and compression come from tile.metadata, not writer options.
Write raster tiles
# Catch-all lightweight writer (output driver from tile.metadata; default GTiff)
from databricks.labs.gbx.ds.register import register
register(spark)
df = spark.read.format("raster_gbx").load("{SAMPLE_RASTER_PATH}")
df.write.format("raster_gbx").mode("overwrite").save(OUT_DIR)
A whole-file GeoTIFF tile is written through verbatim (no re-encode); a tile whose
metadata["driver"] is non-GTiff (e.g. COG) is re-encoded via rasterio.
For an explicit GeoTIFF-forced writer, see the Lightweight GeoTIFF Writer.
Like the lightweight readers, the lightweight writers are not auto-registered
(Python Data Source V2 has no classpath auto-discovery, unlike the JAR-based
heavyweight writers). Call register(spark) first, as shown above. See
Lightweight Raster Readers → Register for the
full explanation.
Control filenames (nameCol)
# Control output filenames: overwrite 'source', set nameCol
from pyspark.sql.functions import concat, lit, monotonically_increasing_id
(df.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
.write.format("gtiff_gbx").mode("overwrite")
.option("nameCol", "source").option("ext", "tif").save(OUT_DIR))
Output format & compression
# On-disk format/compression come from tile.metadata, NOT writer options
# driver/format -> output driver (default GTiff; GTiff = passed through verbatim)
# compression/blocksize/zlevel/zstd_level -> applied when re-encoding (non-GTiff)
# Change them via upstream transforms, then write.
| Option | Default | Description |
|---|---|---|
| nameCol | unset | Existing string column whose value is the output filename (overwrite source). When unset, an opaque unique name is used. |
| ext | "tif" | Filename suffix. Does not change the on-disk format. |
The driver / compression / blocksize / zlevel / zstd_level are read from
tile.metadata (same as the heavy gdal writer).
The lightweight writer passes whole-file GeoTIFF tiles through verbatim, while the
heavyweight writer re-encodes every tile (decoded pixels are identical either way). The
heavyweight gdal writer is also append-only, whereas the lightweight writer supports
overwrite. See Choosing an Execution Tier and the
Benchmarking page for light-vs-heavy timings.
The raster_gbx writer is the lightweight counterpart of the heavyweight gdal writer. It supports Python and SQL bindings (not Scala).
The GDAL writer emits each row's tile to a raster file under a target directory. It is the counterpart to the GDAL Reader.
Format Name
gdal
How the Output Format Is Chosen
The GDAL driver comes from the tile, not from the writer. Each tile carries its originating driver (e.g. GTiff, COG, HFA) in its metadata — that driver is what GDAL uses to serialize the tile to disk. The ext option only controls the filename suffix.
| Source of output format | Value |
|---|---|
| Raster format on disk | tile.metadata["driver"] — set when the tile was read or produced. |
| Filename suffix | ext option (default "tif"). |
That means:
- A tile read as GeoTIFF will be written back as GeoTIFF regardless of what ext says. Setting ext = "jp2" on a GeoTIFF tile gives you a file named *.jp2 whose bytes are still a GeoTIFF, which is confusing and not what you want.
- Pick ext to match the driver carried in your tiles. For a read-GeoTIFF → write pipeline, the default ext = "tif" is correct.
- To change the on-disk format, change the driver in the tile (via upstream transforms), not via a writer option.
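Because the writer does not reconcile ext with the tile's driver, a pre-flight check can catch a mismatched suffix before any files are written. The sketch below is illustrative only: `DRIVER_EXTENSIONS` and `ext_matches_driver` are hypothetical names, and the mapping covers just a few drivers with their conventional suffixes.

```python
# Conventional file suffixes for a few GDAL drivers (illustrative subset only).
DRIVER_EXTENSIONS = {
    "GTiff": {"tif", "tiff"},
    "COG": {"tif", "tiff"},
    "HFA": {"img"},
}


def ext_matches_driver(driver: str, ext: str = "tif") -> bool:
    """Return True if `ext` is a conventional suffix for `driver`.

    Hypothetical helper; raises KeyError for drivers missing from the table
    so that unknown drivers are noticed rather than silently accepted.
    """
    normalized = ext.lower().lstrip(".")
    return normalized in DRIVER_EXTENSIONS[driver]
```

For example, `ext_matches_driver("GTiff", "jp2")` is false, which flags the confusing case described in the list above.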
Basic Usage
Python
# Read, (optionally transform), then write back as raster files.
# Keep the reader's full schema (source, tile): the writer looks up both by name.
(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.write
.format("gdal")
.mode("append") # required -- other modes are not supported
.option("ext", "tif") # file extension (default: 'tif')
.save("{OUTPUT_DIR}")
)
(no DataFrame is returned by .save(); list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...
Scala
// Read, (optionally transform), then write back as raster files.
// Keep the reader's full schema (source, tile): the writer looks up both by name.
spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .write
  .format("gdal")
  .mode("append") // required -- other modes are not supported
  .option("ext", "tif") // file extension (default: 'tif')
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
(.save() returns Unit; list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...
Required Conventions
1. Input schema must be exactly (source, tile)
The writer's table schema is fixed and Spark V2 enforces it by arity and name. The simplest way to satisfy this is to pass the reader's schema through untouched:
spark.read.format("gdal").load(IN).write.format("gdal")... # ✅ (source, tile) preserved
df.withColumn("source", expr("...")) # ✅ overwrite source in place
.write.format("gdal")...
df.select("tile").write.format("gdal")... # ❌ missing 'source'
df.withColumn("extra", lit(1)).write.format("gdal")... # ❌ 3 columns, table has 2
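The same rule can be checked on the driver before the write is attempted, which gives a clearer message than a failure inside the write. This is a minimal sketch, not GeoBrix code: `check_writer_columns` is a hypothetical helper, and it compares column names only (it does not inspect the tile struct's fields or column order).

```python
REQUIRED_COLUMNS = ("source", "tile")


def check_writer_columns(columns) -> None:
    """Raise ValueError unless `columns` holds exactly the two writer columns.

    Hypothetical pre-flight check; pass it a list of column names.
    """
    cols = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in cols]
    extra = [c for c in cols if c not in REQUIRED_COLUMNS]
    # The length test also rejects duplicated column names.
    if missing or extra or len(cols) != len(REQUIRED_COLUMNS):
        raise ValueError(
            f"writer needs exactly {REQUIRED_COLUMNS}; "
            f"missing={missing}, extra={extra}, got={cols}"
        )
```

With a PySpark DataFrame, the call would be `check_writer_columns(df.columns)` just before `.write`.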
2. Only .mode("append") is supported
overwrite, errorIfExists, and ignore are not implemented. Clear the target directory yourself (dbutils.fs.rm(OUT_DIR, recurse=True)) when you want overwrite semantics.
3. Target directory must already exist
The writer writes files into the directory you point at — it does not create Volume roots:
OUT_DIR = "/Volumes/main/default/test-data/geobrix-examples/out/writer-docs-example"
dbutils.fs.mkdirs(OUT_DIR) # creates the directory and any missing parents; safe on Volumes
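Conventions 2 and 3 combine naturally into one "reset" step when overwrite semantics are wanted with the append-only writer. The sketch below uses only the Python standard library and assumes the target is reachable through the local filesystem (an assumption; where it is not, use the dbutils.fs calls shown above). `reset_output_dir` is a hypothetical helper name, and it deletes everything under the path it is given.

```python
import shutil
from pathlib import Path


def reset_output_dir(path: str) -> Path:
    """Delete `path` if it exists, then recreate it as an empty directory.

    Hypothetical helper emulating overwrite for an append-only writer.
    Destructive: everything under `path` is removed.
    """
    out = Path(path)
    if out.exists():
        shutil.rmtree(out)
    # parents=True also creates missing intermediate directories.
    out.mkdir(parents=True)
    return out
```

Calling it immediately before the `.mode("append")` write leaves the directory present and empty, so a re-run does not accumulate duplicate files.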
Output encoding (from tile metadata)
The output driver, compression, and block layout are read from
tile.metadata, not from writer options. They are set when the tile is read or
produced (e.g. via RST_AsFormat); the writer honors them on serialization.
| Metadata key | Default | Effect |
|---|---|---|
| driver / format | GTiff | GDAL output driver. |
| compression | DEFLATE | DEFLATE / ZSTD / LZW / … creation compression. |
| blocksize | 512 | Tile/block size in pixels (floored to a multiple of 16, clamped to the raster size). |
| zlevel | 6 | DEFLATE level. |
| zstd_level | 9 | ZSTD level. |
# Output encoding is read from tile.metadata, not writer options:
# format/driver (default GTiff), compression (DEFLATE), blocksize (512),
# zlevel (6), zstd_level (9). Set them upstream (e.g. RST_AsFormat), then write.
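To make the defaults and the blocksize rule concrete, here is a sketch of how the effective settings could be derived from a tile's metadata map. It is one reading of the table, not the writer's actual code: `resolve_encoding` is a hypothetical function, the precedence of driver over format is an assumption, and so is the order of operations (floor to a multiple of 16 first, then clamp to the smaller raster dimension).

```python
ENCODING_DEFAULTS = {
    "driver": "GTiff",
    "compression": "DEFLATE",
    "blocksize": 512,
    "zlevel": 6,
    "zstd_level": 9,
}


def resolve_encoding(metadata: dict, width: int, height: int) -> dict:
    """Apply the documented defaults to a tile.metadata-style string map.

    Hypothetical sketch. Assumptions: 'driver' wins over 'format' when both
    are set; blocksize is floored to a multiple of 16 and then clamped to the
    smaller raster dimension.
    """
    driver = metadata.get("driver") or metadata.get("format") or ENCODING_DEFAULTS["driver"]
    # Metadata values are strings (map<string,string>), so convert numerics.
    blocksize = int(metadata.get("blocksize", ENCODING_DEFAULTS["blocksize"]))
    blocksize = (blocksize // 16) * 16
    blocksize = min(blocksize, width, height)
    return {
        "driver": driver,
        "compression": metadata.get("compression", ENCODING_DEFAULTS["compression"]),
        "blocksize": blocksize,
        "zlevel": int(metadata.get("zlevel", ENCODING_DEFAULTS["zlevel"])),
        "zstd_level": int(metadata.get("zstd_level", ENCODING_DEFAULTS["zstd_level"])),
    }
```

Under these assumptions, a requested blocksize of 500 becomes 496, and a 512 block on a 256-pixel-wide raster becomes 256.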
Named GeoTIFF writer (gtiff_gdal)
gtiff_gdal is the gdal writer with the GeoTIFF driver preset — use it to make
GeoTIFF output explicit.
# Named GeoTIFF writer (gtiff_gdal = gdal writer with driver preset)
spark.read.format("gtiff_gdal").load(SAMPLE_RASTER_PATH) \
.write.format("gtiff_gdal").mode("append").option("ext", "tif").save(OUT_DIR)
Filename Control
Default filenames look like 946817315_0_0.tif — collision-resistant within a single write, but with two rough edges:
- Re-running duplicates files. .mode("append") can't clean up.
- Names aren't traceable back to source rows.
To control names, overwrite the source column with your desired prefix and set nameCol = "source". (The schema is fixed at (source, tile), so you can't add a separate name column.)
# Overwrite the reader's 'source' column with your desired filename prefix,
# then point nameCol at it. The writer needs the fixed (source, tile) schema,
# so replacing an existing column is the only way to inject a name.
from pyspark.sql.functions import monotonically_increasing_id, concat, lit
(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
.write
.format("gdal")
.mode("append")
.option("nameCol", "source") # 'source' now carries the filename
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)
// Overwrite the reader's 'source' column with your desired filename prefix,
// then point nameCol at it. The writer needs the fixed (source, tile) schema,
// so replacing an existing column is the only way to inject a name.
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
  .write
  .format("gdal")
  .mode("append")
  .option("nameCol", "source") // 'source' now carries the filename
  .option("ext", "tif")
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
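Names built from monotonically_increasing_id are unique but still not traceable to their inputs. One alternative is to derive the name from the original source path. The plain-Python sketch below shows that derivation plus a duplicate check; `traceable_name` and `duplicate_names` are hypothetical helpers, and because this page does not say how the writer treats repeated names, checking for collisions up front is the cautious choice.

```python
from collections import Counter
from pathlib import PurePosixPath


def traceable_name(source_path: str) -> str:
    """Return the file stem of a source path (final component, last suffix dropped)."""
    return PurePosixPath(source_path).stem


def duplicate_names(source_paths) -> list:
    """Return the sorted stems that occur more than once across `source_paths`.

    Hypothetical check: rows sharing a source file (or files with the same
    name in different folders) would map to the same output name.
    """
    counts = Counter(traceable_name(p) for p in source_paths)
    return sorted(name for name, n in counts.items() if n > 1)
```

If `duplicate_names` returns anything, append a distinguishing suffix (for instance the row's cell id) before overwriting source and setting nameCol.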
Materialization Pattern
Spark is lazy: every .display(), .count(), and downstream transform re-runs the plan unless the intermediate data is materialized. For multi-step raster pipelines, write intermediate stages to a Volume and read them back for the next step:
# Performance pattern: materialize intermediate results to avoid
# repeating expensive transforms. Spark is lazy; each .display()/.count()
# re-runs the plan unless the source is already materialized.
import databricks.labs.gbx.rasterx as rx
rx.register(spark)
stacked_df = (
spark.read.format("gtiff_gdal").load("{SAMPLE_RASTER_PATH}")
# ... add transforms here (reproject, retile, etc.) ...
)
# Materialize to a Volume directory before follow-on work
(
stacked_df
.filter(rx.rst_tryopen("tile")) # skip invalid tiles before writing
.write
.format("gdal")
.mode("append")
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)
# Follow-on steps read the materialized output -- fast, no recompute
next_df = spark.read.format("gtiff_gdal").load("{OUTPUT_DIR}")
Tile Payloads vs Local Paths
After a write round-trip, raster expressions should consume the binary payload inside the tile struct, not any /tmp/... paths that may appear in tile metadata. Those local paths refer to transient staging files on whichever executor produced the tile and will not exist elsewhere. GeoBrix expressions (rst_*) use the binary payload by default; hand-rolled UDFs should too.
Viewing Output
Written rasters are standard GDAL-compatible files:
- QGIS / ArcGIS / any GDAL-aware viewer opens them directly.
- gdalinfo <file> (via the GDAL CLI) prints metadata, geotransform, and band statistics.
- Read them back through GeoBrix with spark.read.format("gtiff_gdal").load(OUT_DIR), which is covered by the round-trip docs test.
Next Steps
- GDAL Reader — The corresponding read path.
- Raster Functions — Transforms to run before the write.