
GDAL Writer

The GDAL writer emits each row's tile to a raster file under a target directory. It is the counterpart to the GDAL Reader.

Format Name

gdal

How the Output Format Is Chosen

The GDAL driver comes from the tile, not from the writer. Each tile carries its originating driver (e.g. GTiff, COG, HFA) in its metadata — that driver is what GDAL uses to serialize the tile to disk. The ext option only controls the filename suffix.

| Source of output format | Value |
|---|---|
| Raster format on disk | tile.metadata["driver"], set when the tile was read or produced. |
| Filename suffix | ext option (default "tif"). |

That means:

  • A tile read as GeoTIFF will be written back as GeoTIFF regardless of what ext says. Setting ext = "jp2" on a GeoTIFF tile produces a file named *.jp2 whose bytes are still a GeoTIFF, which misleads anything that trusts the suffix.
  • Pick ext to match the driver carried in your tiles. For a read-GeoTIFF → write pipeline, the default ext = "tif" is correct.
  • To change the on-disk format, change the driver in the tile (via upstream transforms), not via a writer option.
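One way to keep the suffix consistent with the driver is to derive ext from the driver name rather than hard-coding it. The following is a minimal sketch: DRIVER_TO_EXT and ext_for_driver are hypothetical names, not part of GeoBrix or GDAL, and the mapping covers only a handful of common GDAL driver short names with their conventional extensions.

```python
# Hypothetical helper: not part of GeoBrix or GDAL.
# Maps a few GDAL driver short names to their conventional file extensions.
DRIVER_TO_EXT = {
    "GTiff": "tif",
    "COG": "tif",          # Cloud Optimized GeoTIFF is still a TIFF file
    "HFA": "img",          # Erdas Imagine
    "PNG": "png",
    "JP2OpenJPEG": "jp2",
}


def ext_for_driver(driver: str) -> str:
    """Return the conventional suffix for a driver name.

    Raises ValueError for drivers missing from the table, so an unknown
    driver never silently gets a mismatched suffix.
    """
    try:
        return DRIVER_TO_EXT[driver]
    except KeyError:
        raise ValueError(f"no extension registered for driver {driver!r}") from None


print(ext_for_driver("GTiff"))  # tif
print(ext_for_driver("HFA"))    # img
```

The result would then be passed to the writer as .option("ext", ext_for_driver(driver)), where driver is whatever value your tiles carry in tile.metadata["driver"].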

Basic Usage

Python

# Read, (optionally transform), then write back as raster files.
# Keep the reader's full schema (source, tile): the writer looks up both by name.
(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.write
.format("gdal")
.mode("append") # required -- other modes are not supported
.option("ext", "tif") # file extension (default: 'tif')
.save("{OUTPUT_DIR}")
)
Example output
(no DataFrame is returned by .save(); list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...

Scala

// Read, (optionally transform), then write back as raster files.
// Keep the reader's full schema (source, tile): the writer looks up both by name.
spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .write
  .format("gdal")
  .mode("append") // required -- other modes are not supported
  .option("ext", "tif") // file extension (default: 'tif')
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
Example output
(.save() returns Unit; list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...

Required Conventions

1. Input schema must be exactly (source, tile)

The writer's table schema is fixed, and Spark's DataSource V2 API enforces it by column count and column name. The simplest way to satisfy this is to pass the reader's schema through untouched:

spark.read.format("gdal").load(IN).write.format("gdal")...   # ✅ (source, tile) preserved
df.withColumn("source", expr("...")).write.format("gdal")...   # ✅ overwrite source in place
df.select("tile").write.format("gdal")... # ❌ missing 'source'
df.withColumn("extra", lit(1)).write.format("gdal")... # ❌ 3 columns, table has 2
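A schema mismatch otherwise surfaces only when the write is planned, so a small pre-flight check can fail earlier with a clearer message. This is a sketch only: check_writer_columns is a hypothetical helper, and it assumes that column names and count are what matter (column order is not checked, since the text above mentions only arity and name).

```python
# Hypothetical pre-flight check: not part of GeoBrix.
EXPECTED_COLUMNS = ("source", "tile")  # the writer's fixed schema, per the text above


def check_writer_columns(columns) -> None:
    """Raise ValueError unless `columns` is exactly the pair (source, tile)."""
    cols = list(columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in cols]
    extra = [c for c in cols if c not in EXPECTED_COLUMNS]
    if missing or extra or len(cols) != len(EXPECTED_COLUMNS):
        raise ValueError(
            f"gdal writer expects columns {EXPECTED_COLUMNS}, got {tuple(cols)} "
            f"(missing={missing}, extra={extra})"
        )


check_writer_columns(["source", "tile"])  # passes silently
```

In a pipeline this would be called as check_writer_columns(df.columns) just before .write.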

2. Only .mode("append") is supported

overwrite, errorIfExists, and ignore are not implemented. Clear the target directory yourself (dbutils.fs.rm(OUT_DIR, recurse=True)) when you want overwrite semantics.

3. Target directory must already exist

The writer writes files into the directory you point at — it does not create Volume roots:

OUT_DIR = "/Volumes/main/default/test-data/geobrix-examples/out/writer-docs-example"
dbutils.fs.mkdirs(OUT_DIR) # works on Volumes; missing parent directories are created too
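Conventions 2 and 3 combine into a single "reset the output directory" step when overwrite semantics are wanted. The sketch below uses only the Python standard library and assumes the target is reachable as an ordinary filesystem path; reset_output_dir is a hypothetical helper, and on Databricks the dbutils.fs.rm and dbutils.fs.mkdirs calls shown above play the same role.

```python
import shutil
from pathlib import Path


def reset_output_dir(out_dir: str) -> Path:
    """Delete `out_dir` and everything in it if present, then recreate it empty.

    Destructive: only point this at a directory dedicated to writer output.
    """
    path = Path(out_dir)
    if path.exists():
        shutil.rmtree(path)    # emulate 'overwrite' by clearing previous files
    path.mkdir(parents=True)   # the writer needs the directory to exist
    return path
```

Calling reset_output_dir(OUT_DIR) immediately before the .mode("append") write gives the effect of an overwrite without relying on a mode the writer does not implement.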

Options

| Option | Default | Description |
|---|---|---|
| ext | "tif" | Filename suffix appended to each written file. Does not change the on-disk format; that is driven by tile.metadata["driver"]. |
| nameCol | unset | Name of an existing string column whose value becomes the output filename (without extension). Must be one of the two table columns; in practice, overwrite source (see below). When unset, names are MurmurHash3(tile)_pid_tid. |

Filename Control

Default filenames look like 946817315_0_0.tif — collision-resistant within a single write, but with two rough edges:

  • Re-running duplicates files. .mode("append") can't clean up.
  • Names aren't traceable back to source rows.

To control names, overwrite the source column with your desired prefix and set nameCol = "source". (The schema is fixed at (source, tile), so you can't add a separate name column.)

Python

# Overwrite the reader's 'source' column with your desired filename prefix,
# then point nameCol at it. The writer needs the fixed (source, tile) schema,
# so replacing an existing column is the only way to inject a name.
from pyspark.sql.functions import monotonically_increasing_id, concat, lit

(
spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
.write
.format("gdal")
.mode("append")
.option("nameCol", "source") # 'source' now carries the filename
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)

Scala

// Overwrite the reader's 'source' column with your desired filename prefix,
// then point nameCol at it. The writer needs the fixed (source, tile) schema,
// so replacing an existing column is the only way to inject a name.
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
  .withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
  .write
  .format("gdal")
  .mode("append")
  .option("nameCol", "source") // 'source' now carries the filename
  .option("ext", "tif")
  .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
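Names built from monotonically_increasing_id() are unique but still say nothing about where a tile came from. If traceability matters, one option is to build the prefix from the original source path plus an index. The sketch below shows only the naming logic in plain Python; name_from_source is a hypothetical helper, and it assumes the source column holds the input file path.

```python
from pathlib import PurePosixPath


def name_from_source(source_path: str, index: int) -> str:
    """Build a filename prefix from the source file's stem plus a zero-padded index.

    The index keeps names distinct when several tiles share one source file.
    """
    stem = PurePosixPath(source_path).stem  # file name without directory or suffix
    return f"{stem}_{index:04d}"


print(name_from_source(
    "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif",
    3,
))
# nyc_sentinel2_red_0003
```

The same logic could be wrapped in a Spark UDF or expressed with built-in string functions, then used to overwrite source before setting nameCol = "source".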

Materialization Pattern

Spark is lazy: every .display(), .count(), and downstream transform re-runs the plan unless the intermediate data is materialized. For multi-step raster pipelines, write intermediate stages to a Volume and read them back for the next step:

# Performance pattern: materialize intermediate results to avoid
# repeating expensive transforms. Spark is lazy; each .display()/.count()
# re-runs the plan unless the source is already materialized.
import databricks.labs.gbx.rasterx as rx
rx.register(spark)

stacked_df = (
spark.read.format("gtiff_gdal").load("{SAMPLE_RASTER_PATH}")
# ... add transforms here (reproject, retile, etc.) ...
)

# Materialize to a Volume directory before follow-on work
(
stacked_df
.filter(rx.rst_tryopen("tile")) # skip invalid tiles before writing
.write
.format("gdal")
.mode("append")
.option("ext", "tif")
.save("{OUTPUT_DIR}")
)

# Follow-on steps read the materialized output -- fast, no recompute
next_df = spark.read.format("gtiff_gdal").load("{OUTPUT_DIR}")

Tile Payloads vs Local Paths

After a write round-trip, raster expressions should consume the binary payload inside the tile struct, not any /tmp/... paths that may appear in tile metadata. Those local paths refer to transient staging files on whichever executor produced the tile and will not exist elsewhere. GeoBrix expressions (rst_*) use the binary payload by default; hand-rolled UDFs should too.

Viewing Output

Written rasters are standard GDAL-compatible files:

  • QGIS / ArcGIS / any GDAL-aware viewer opens them directly.
  • gdalinfo <file> (via the GDAL CLI) prints metadata, geotransform, and band statistics.
  • Read them back through GeoBrix with spark.read.format("gtiff_gdal").load(OUT_DIR) — covered by the round-trip docs test.
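Because ext only sets the suffix, a quick sanity check on written files is to compare the suffix with the file's leading "magic" bytes. The following is a minimal standard-library sketch, not a GeoBrix or GDAL feature: sniff_container and sniff_file are hypothetical helpers, and only the TIFF (including BigTIFF), JPEG 2000 (JP2), and PNG signatures are recognized.

```python
from typing import Optional

# Leading signature bytes for a few raster containers.
_TIFF_MAGICS = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")  # TIFF and BigTIFF, both byte orders
_JP2_MAGIC = b"\x00\x00\x00\x0cjP  \r\n\x87\n"                   # JPEG 2000 signature box
_PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_container(header: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of a file, or return None."""
    if header[:4] in _TIFF_MAGICS:
        return "tiff"
    if header.startswith(_JP2_MAGIC):
        return "jp2"
    if header.startswith(_PNG_MAGIC):
        return "png"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read just enough of `path` to identify its container format."""
    with open(path, "rb") as f:
        return sniff_container(f.read(12))
```

A file named *.jp2 for which sniff_file returns "tiff" is the suffix/driver mismatch described under How the Output Format Is Chosen.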

Next Steps