GDAL Writer
The GDAL writer emits each row's tile to a raster file under a target directory. It is the counterpart to the GDAL Reader.
Format Name
gdal
How the Output Format Is Chosen
The GDAL driver comes from the tile, not from the writer. Each tile carries its originating driver (e.g. GTiff, COG, HFA) in its metadata — that driver is what GDAL uses to serialize the tile to disk. The ext option only controls the filename suffix.
| Source of output format | Value |
|---|---|
| Raster format on disk | tile.metadata["driver"] — set when the tile was read or produced. |
| Filename suffix | ext option (default "tif"). |
That means:
- A tile read as GeoTIFF will be written back as GeoTIFF regardless of what ext says. Setting ext = "jp2" on a GeoTIFF tile gives you a file named *.jp2 whose bytes are still a GeoTIFF, which is confusing and not what you want.
- Pick ext to match the driver carried in your tiles. For a read-GeoTIFF → write pipeline, the default ext = "tif" is correct.
- To change the on-disk format, change the driver in the tile (via upstream transforms), not via a writer option.
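One way to keep ext aligned with the driver is a small lookup table. The sketch below is illustrative only: the helper names ext_for_driver and ext_matches_driver and the mapping itself are assumptions, not part of the writer, and the driver string is assumed to be the GDAL short name stored in tile.metadata["driver"].

```python
# Hypothetical helpers (not part of the GDAL writer): pick a filename suffix
# that matches a GDAL driver short name. The mapping lists a few common
# drivers for illustration; extend it for the formats in your own tiles.
DRIVER_TO_EXT = {
    "GTiff": "tif",
    "COG": "tif",
    "HFA": "img",
    "PNG": "png",
    "JP2OpenJPEG": "jp2",
}


def ext_for_driver(driver: str, default: str = "tif") -> str:
    """Return the conventional suffix for a driver, or `default` if unknown."""
    return DRIVER_TO_EXT.get(driver, default)


def ext_matches_driver(driver: str, ext: str) -> bool:
    """True only when `ext` is the conventional suffix for a known driver."""
    expected = DRIVER_TO_EXT.get(driver)
    return expected is not None and expected == ext.lstrip(".").lower()
```

A check such as ext_matches_driver("GTiff", "jp2") returning False is a cheap guard to run before the write, so a mismatched suffix is caught before any files land on disk.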
Basic Usage
Python
# Read, (optionally transform), then write back as raster files.
# Keep the reader's full schema (source, tile): the writer looks up both by name.
(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.write
.format("gdal")
.mode("append") # required -- other modes are not supported
.option("ext", "tif") # file extension (default: 'tif')
.save("{OUTPUT_DIR}")
)
(no DataFrame is returned by .save(); list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...
Scala
// Read, (optionally transform), then write back as raster files.
// Keep the reader's full schema (source, tile): the writer looks up both by name.
spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .write
  .format("gdal")
  .mode("append") // required -- other modes are not supported
  .option("ext", "tif") // file extension (default: 'tif')
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
(.save() returns Unit; list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...
Required Conventions
1. Input schema must be exactly (source, tile)
The writer's table schema is fixed, and Spark's DataSource V2 write path enforces it by column count (arity) and name. The simplest way to satisfy this is to pass the reader's schema through untouched:
spark.read.format("gdal").load(IN).write.format("gdal")... # ✅ (source, tile) preserved
df.withColumn("source", expr("...")) # ✅ overwrite source in place
.write.format("gdal")...
df.select("tile").write.format("gdal")... # ❌ missing 'source'
df.withColumn("extra", lit(1)).write.format("gdal")... # ❌ 3 columns, table has 2
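A pre-flight check can surface a schema mismatch with a clearer message than the one Spark raises at write time. This is a minimal sketch: check_writer_columns is a hypothetical helper, and it takes a plain list of column names (for example df.columns). Whether column order matters is not stated here, so the sketch checks names and count only.

```python
# Hypothetical pre-flight check (not part of the writer): verify that a
# DataFrame's column names are exactly the two the writer expects.
EXPECTED_COLUMNS = ("source", "tile")


def check_writer_columns(columns) -> None:
    """Raise ValueError unless `columns` is exactly source and tile."""
    cols = list(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in cols]
    extra = [c for c in cols if c not in EXPECTED_COLUMNS]
    if missing or extra or len(cols) != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"GDAL writer expects columns {EXPECTED_COLUMNS}; "
            f"missing={missing}, extra={extra}, got={cols}"
        )
```

Calling check_writer_columns(df.columns) just before .write fails fast on the two ❌ cases above: a dropped source column or an added third column.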
2. Only .mode("append") is supported
overwrite, errorIfExists, and ignore are not implemented. Clear the target directory yourself (dbutils.fs.rm(OUT_DIR, recurse=True)) when you want overwrite semantics.
3. Target directory must already exist
The writer writes files into the directory you point at — it does not create Volume roots:
OUT_DIR = "/Volumes/main/default/test-data/geobrix-examples/out/writer-docs-example"
dbutils.fs.mkdirs(OUT_DIR) # safe on Volumes (parent dirs are ok)
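Conventions 2 and 3 combine into a single "reset" step when you want overwrite semantics: remove the old output, then recreate the empty directory. The sketch below uses only the Python standard library and assumes the target is reachable as a local filesystem path. reset_output_dir is a hypothetical helper. The dbutils.fs.rm and dbutils.fs.mkdirs calls shown above serve the same purpose on Databricks.

```python
import os
import shutil


def reset_output_dir(path: str) -> None:
    """Delete `path` and everything in it if present, then recreate it empty.

    Hypothetical helper for emulating overwrite semantics before an
    append-only write. Destructive: double-check the path before calling.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
    # exist_ok avoids an error if another process recreated the directory.
    os.makedirs(path, exist_ok=True)
```

Run it once on the driver before the write. Running it per partition or per task would delete files that other tasks have already written.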
Options
| Option | Default | Description |
|---|---|---|
| ext | "tif" | Filename suffix appended to each written file. Does not change the on-disk format; that is driven by tile.metadata["driver"]. |
| nameCol | unset | Name of an existing string column whose value becomes the output filename (without extension). Must be one of the two table columns; in practice, overwrite source (see below). When unset, names are MurmurHash3(tile)_pid_tid. |
Filename Control
Default filenames look like 946817315_0_0.tif — collision-resistant within a single write, but with two rough edges:
- Re-running duplicates files. .mode("append") can't clean up.
- Names aren't traceable back to source rows.
To control names, overwrite the source column with your desired prefix and set nameCol = "source". (The schema is fixed at (source, tile), so you can't add a separate name column.)
# Overwrite the reader's 'source' column with your desired filename prefix,
# then point nameCol at it. The writer needs the fixed (source, tile) schema,
# so replacing an existing column is the only way to inject a name.
from pyspark.sql.functions import monotonically_increasing_id, concat, lit
(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
.write
.format("gdal")
.mode("append")
.option("nameCol", "source") # 'source' now carries the filename
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)
// Overwrite the reader's 'source' column with your desired filename prefix,
// then point nameCol at it. The writer needs the fixed (source, tile) schema,
// so replacing an existing column is the only way to inject a name.
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
  .write
  .format("gdal")
  .mode("append")
  .option("nameCol", "source") // 'source' now carries the filename
  .option("ext", "tif")
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
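The examples above use a bare counter, which fixes uniqueness but not traceability. A possible refinement is to derive the prefix from the original source value before overwriting it. The sketch below is pure Python and rests on an assumption: that source holds the originating file path. name_from_source is a hypothetical helper, not a GeoBrix function.

```python
import re
from pathlib import PurePosixPath


def name_from_source(source_path: str, index: int) -> str:
    """Build a filename stem (no extension) that traces back to its source.

    Hypothetical helper: takes the source file's base name, replaces
    characters that are awkward in filenames, and appends an index so
    several tiles from one source do not collide.
    """
    stem = PurePosixPath(source_path).stem  # drops directories and the last suffix
    safe = re.sub(r"[^A-Za-z0-9_-]", "_", stem)
    return f"{safe}_{index}"
```

In a pipeline, the same logic would need to run per row, for example as a UDF or an equivalent column expression, with the result written into source before setting nameCol = "source".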
Materialization Pattern
Spark is lazy: every action, such as .display() or .count(), re-runs the plan, including all upstream transforms, unless the intermediate data is materialized. For multi-step raster pipelines, write intermediate stages to a Volume and read them back for the next step:
# Performance pattern: materialize intermediate results to avoid
# repeating expensive transforms. Spark is lazy; each .display()/.count()
# re-runs the plan unless the source is already materialized.
import databricks.labs.gbx.rasterx as rx
rx.register(spark)
stacked_df = (
spark.read.format("gtiff_gdal").load("{SAMPLE_RASTER_PATH}")
# ... add transforms here (reproject, retile, etc.) ...
)
# Materialize to a Volume directory before follow-on work
(
stacked_df
.filter(rx.rst_tryopen("tile")) # skip invalid tiles before writing
.write
.format("gdal")
.mode("append")
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)
# Follow-on steps read the materialized output -- fast, no recompute
next_df = spark.read.format("gtiff_gdal").load("{OUTPUT_DIR}")
Tile Payloads vs Local Paths
After a write round-trip, raster expressions should consume the binary payload inside the tile struct, not any /tmp/... paths that may appear in tile metadata. Those local paths refer to transient staging files on whichever executor produced the tile and will not exist elsewhere. GeoBrix expressions (rst_*) use the binary payload by default; hand-rolled UDFs should too.
Viewing Output
Written rasters are standard GDAL-compatible files:
- QGIS / ArcGIS / any GDAL-aware viewer opens them directly.
- gdalinfo <file> (via the GDAL CLI) prints metadata and the geotransform; add -stats to compute band statistics.
- Read them back through GeoBrix with spark.read.format("gtiff_gdal").load(OUT_DIR), which is covered by the round-trip docs test.
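Before opening files in a viewer, a quick listing confirms that the write produced what you expect. This is a minimal standard-library sketch that assumes the output directory is reachable as a local filesystem path. list_written is a hypothetical helper.

```python
import os


def list_written(out_dir: str, ext: str = "tif") -> list:
    """Return the sorted filenames in `out_dir` that end with the given suffix."""
    suffix = "." + ext.lstrip(".")
    return sorted(name for name in os.listdir(out_dir) if name.endswith(suffix))
```

Comparing len(list_written(OUT_DIR)) with the number of rows you wrote is a simple way to spot leftovers from an earlier run, since append mode never removes existing files.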
Next Steps
- GDAL Reader — The corresponding read path.
- RasterX Functions — Transforms to run before the write.