Raster Writer

Write raster tiles from the shared (source, tile) schema. The lightweight raster_gbx writer (rasterio-backed, JAR-free, supports overwrite) and the heavyweight GDAL-backed gdal writer (append-only) take the same schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) writers need no JAR or init script and are the only option on Serverless, standard (shared), and ARM clusters. The heavyweight raster/PMTiles writers require a classic x86 cluster (JAR + GDAL init script); where available they use native GDAL on the JVM. So your compute usually decides the tier — then data scale. See the Benchmarking page for timings and methodology.
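
The compute-first rule above can be written down as a small decision helper. This is only a sketch: `available_tiers` and the compute labels it accepts are hypothetical names chosen for illustration, not part of GeoBrix.

```python
# Hypothetical helper: which writer tiers a given compute type can run,
# following the rule described above. The labels are illustrative only.
LIGHTWEIGHT_ONLY = {"serverless", "standard", "arm"}


def available_tiers(compute: str) -> list:
    """Return the writer tiers usable on `compute`."""
    kind = compute.strip().lower()
    if kind in LIGHTWEIGHT_ONLY:
        # No JAR or init script can be installed here, so only *_gbx writers apply.
        return ["lightweight"]
    if kind == "classic-x86":
        # JAR + GDAL init script are possible, so both tiers are candidates.
        return ["lightweight", "heavyweight"]
    raise ValueError(f"unknown compute type: {compute!r}")
```

When both tiers come back, data scale and the Benchmarking page decide between them.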

Schema

Input schema — exactly (source, tile):

root
 |-- source: string
 |-- tile: struct
 |    |-- cellid: bigint
 |    |-- raster: binary
 |    |-- metadata: map<string,string>

This is the exact schema the raster readers emit. The writer requires these two columns and nothing else: both extra and missing columns raise an error, because there is no implicit projection. The on-disk format comes from tile.metadata (the GDAL driver/extension recorded at read time); the ext option controls only the filename suffix.

Typically you write a reader's output unchanged. To control output filenames, overwrite the source value in place and point the nameCol option at it; do not leave an extra column in the DataFrame (that breaks the exact-schema check):

# (source, tile) straight from a reader -> write as-is
df.write.format("gtiff_gbx").mode("append").save("/Volumes/cat/sch/vol/out")

# name outputs from an existing column: overwrite `source`, keep only
# (source, tile) so the helper column is dropped, then set nameCol
(df.withColumn("source", df["scene_id"])
   .select("source", "tile")
   .write.format("gtiff_gbx").mode("append")
   .option("nameCol", "source")
   .save("/Volumes/cat/sch/vol/out"))

Output: one raster file per input row, written under the target directory; the file format/extension is whatever tile.metadata records (e.g. GeoTIFF for gtiff_gbx).

Options

Both tiers require the exact (source, tile) schema and take the same two writer options. The on-disk driver, compression, and block layout come from tile.metadata, not from writer options (see Output encoding).

Option	Default	Description
`ext`	`"tif"`	Filename suffix appended to each written file. Does not change the on-disk format; that is driven by `tile.metadata["driver"]`.
`nameCol`	unset	Name of an existing string column whose value becomes the output filename (without extension). Must be one of the two table columns — in practice, overwrite `source` (see Filename Control). When unset, an opaque unique name is used.

Lightweight (`raster_gbx`)

Supports .mode("overwrite"). Passes whole-file GeoTIFF tiles through verbatim; re-encodes non-GTiff tiles (e.g. COG) via rasterio.

Heavyweight (`gdal`)

Append-only — .mode("overwrite") is not supported. Re-encodes every tile through native GDAL (decoded pixels are identical to the lightweight output).

Lightweight · raster_gbx
Heavyweight · gdal

Pure-Python/PySpark raster writer: the lightweight tier's drop-in for the GDAL-backed gdal writer. It requires the exact (source, tile) schema, the same as the heavy writer. Besides the save path, the writer options are nameCol and ext; the on-disk format and compression come from tile.metadata, not writer options.

Write raster tiles

# Catch-all lightweight writer (output driver from tile.metadata; default GTiff)
from databricks.labs.gbx.ds.register import register
register(spark)
df = spark.read.format("raster_gbx").load("{SAMPLE_RASTER_PATH}")
df.write.format("raster_gbx").mode("overwrite").save(OUT_DIR)

A whole-file GeoTIFF tile is written through verbatim (no re-encode); a tile whose metadata["driver"] is non-GTiff (e.g. COG) is re-encoded via rasterio.

For an explicit GeoTIFF-forced writer, see the Lightweight GeoTIFF Writer.

Like the lightweight readers, the lightweight writers are not auto-registered (Python Data Source V2 has no classpath auto-discovery, unlike the JAR-based heavyweight writers). Call register(spark) first, as shown above. See Lightweight Raster Readers → Register for the full explanation.

Control filenames (`nameCol`)

# Control output filenames: overwrite 'source', set nameCol
from pyspark.sql.functions import concat, lit, monotonically_increasing_id
(df.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
   .write.format("gtiff_gbx").mode("overwrite")
   .option("nameCol", "source").option("ext", "tif").save(OUT_DIR))

Output format & compression

# On-disk format/compression come from tile.metadata, NOT writer options
#   driver/format -> output driver (default GTiff; GTiff = passed through verbatim)
#   compression/blocksize/zlevel/zstd_level -> applied when re-encoding (non-GTiff)
# Change them via upstream transforms, then write.

Option	Default	Description
`nameCol`	unset	Existing string column whose value is the output filename (overwrite `source`). When unset, an opaque unique name is used.
`ext`	`"tif"`	Filename suffix. Does not change the on-disk format.

The driver / compression / blocksize / zlevel / zstd_level are read from tile.metadata (same as the heavy gdal writer).

The lightweight writer passes whole-file GeoTIFF tiles through verbatim, while the heavyweight writer re-encodes every tile (decoded pixels are identical either way). The heavyweight gdal writer is also append-only, whereas the lightweight writer supports overwrite. See Choosing an Execution Tier and the Benchmarking page for light-vs-heavy timings.

The raster_gbx writer is the lightweight counterpart of the heavyweight gdal writer and supports Python and SQL bindings (not Scala).

The GDAL writer emits each row's tile to a raster file under a target directory. It is the counterpart to the GDAL Reader.

Format Name

gdal

How the Output Format Is Chosen

The GDAL driver comes from the tile, not from the writer. Each tile carries its originating driver (e.g. GTiff, COG, HFA) in its metadata — that driver is what GDAL uses to serialize the tile to disk. The ext option only controls the filename suffix.

Source of output format	Value
Raster format on disk	`tile.metadata["driver"]` — set when the tile was read or produced.
Filename suffix	`ext` option (default `"tif"`).

That means:

A tile read as GeoTIFF is written back as GeoTIFF regardless of what ext says. Setting ext = "jp2" on a GeoTIFF tile produces a file named *.jp2 whose bytes are still GeoTIFF, which misleads any tool or person that trusts the suffix.
Pick ext to match the driver carried in your tiles. For a read-GeoTIFF → write pipeline, the default ext = "tif" is correct.
To change the on-disk format, change the driver in the tile (via upstream transforms), not via a writer option.
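
One way to keep ext aligned with the driver is to derive it from tile metadata before the write. The sketch below is an assumption-laden illustration: `ext_for_driver` and the `CONVENTIONAL_EXT` table are hypothetical, and the table lists only a few conventional GDAL suffixes.

```python
# Illustrative driver -> suffix table (conventional GDAL file extensions).
# Extend it for the drivers your tiles actually carry.
CONVENTIONAL_EXT = {
    "GTiff": "tif",
    "COG": "tif",
    "HFA": "img",
    "PNG": "png",
    "netCDF": "nc",
}


def ext_for_driver(metadata: dict, default: str = "tif") -> str:
    """Pick an `ext` value that matches the driver recorded in tile metadata."""
    # GTiff is the documented default when no driver is recorded.
    driver = metadata.get("driver", "GTiff")
    return CONVENTIONAL_EXT.get(driver, default)


print(ext_for_driver({"driver": "HFA"}))  # img
print(ext_for_driver({}))                 # tif
```

The result would be passed as .option("ext", ...). This only makes sense when every tile in a single write carries the same driver, since ext is one value per write.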

Basic Usage

Python

# Read, (optionally transform), then write back as raster files.
# Keep the reader's full schema (source, tile): the writer looks up both by name.
(
    spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
        .write
            .format("gdal")
            .mode("append")           # required -- other modes are not supported
            .option("ext", "tif")     # file extension (default: 'tif')
        .save("{OUTPUT_DIR}")
)

Example output
(no DataFrame is returned by .save(); list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...

Scala

// Read, (optionally transform), then write back as raster files.
// Keep the reader's full schema (source, tile): the writer looks up both by name.
spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
    .write
      .format("gdal")
      .mode("append")           // required -- other modes are not supported
      .option("ext", "tif")     // file extension (default: 'tif')
    .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")

Example output
(.save() returns Unit; list the output directory to inspect files)
$ ls /Volumes/.../out/writer-docs-example
946817315_0_0.tif
...

Required Conventions

1. Input schema must be exactly `(source, tile)`

The writer's table schema is fixed, and Spark's Data Source V2 write path enforces it by column count and name. The simplest way to satisfy this is to pass the reader's schema through untouched:

spark.read.format("gdal").load(IN).write.format("gdal")...   # ✅ (source, tile) preserved
df.withColumn("source", expr("..."))                         # ✅ overwrite source in place
  .write.format("gdal")...
df.select("tile").write.format("gdal")...                    # ❌ missing 'source'
df.withColumn("extra", lit(1)).write.format("gdal")...       # ❌ 3 columns, table has 2
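
A pre-flight check can surface a schema mismatch before the write is attempted. The helper below is a hypothetical sketch (`check_writer_schema` is not a GeoBrix function); it checks only column names and count, and assumes that column order does not matter.

```python
REQUIRED_COLUMNS = ("source", "tile")


def check_writer_schema(columns) -> None:
    """Raise ValueError unless `columns` is exactly (source, tile)."""
    cols = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in cols]
    extra = [c for c in cols if c not in REQUIRED_COLUMNS]
    # The length check also catches duplicated column names.
    if missing or extra or len(cols) != len(REQUIRED_COLUMNS):
        raise ValueError(
            f"writer needs exactly {REQUIRED_COLUMNS}; "
            f"got {cols} (missing={missing}, extra={extra})"
        )


check_writer_schema(["source", "tile"])  # passes silently
```

In a pipeline it would be called as check_writer_schema(df.columns) just before .write.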

2. Only `.mode("append")` is supported

overwrite, errorIfExists, and ignore are not implemented. Clear the target directory yourself (dbutils.fs.rm(OUT_DIR, recurse=True)) when you want overwrite semantics.

3. Target directory must already exist

The writer writes files into the directory you point at — it does not create Volume roots:

OUT_DIR = "/Volumes/main/default/test-data/geobrix-examples/out/writer-docs-example"
dbutils.fs.mkdirs(OUT_DIR)   # creates the directory and any missing parents; no-op if it already exists
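
Conventions 2 and 3 combine naturally into one "reset the target" step run before each write. The sketch below uses plain Python file APIs and therefore assumes the target is reachable as an ordinary filesystem path (as Volume paths generally are on Databricks compute); `reset_output_dir` is a hypothetical helper, and in a notebook the dbutils.fs calls shown above do the same job.

```python
import os
import shutil


def reset_output_dir(path: str) -> None:
    """Emulate overwrite semantics for an append-only writer."""
    # Remove previous outputs, if any, so a re-run does not duplicate files.
    shutil.rmtree(path, ignore_errors=True)
    # Recreate the now-empty target so the writer finds an existing directory.
    os.makedirs(path, exist_ok=True)
```

Calling it immediately before .save() keeps re-runs idempotent, at the cost of deleting everything already in the directory, so it should only be pointed at a directory dedicated to this output.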

Output encoding (from tile metadata)

The output driver, compression, and block layout are read from tile.metadata, not from writer options. They are set when the tile is read or produced (e.g. via RST_AsFormat); the writer honors them on serialization.

Metadata key	Default	Effect
`driver` / `format`	`GTiff`	GDAL output driver.
`compression`	`DEFLATE`	`DEFLATE` / `ZSTD` / `LZW` / … creation compression.
`blocksize`	`512`	Tile/block size in pixels (floored to a multiple of 16, clamped to the raster size).
`zlevel`	`6`	DEFLATE level.
`zstd_level`	`9`	ZSTD level.

# Output encoding is read from tile.metadata, not writer options:
#   format/driver (default GTiff), compression (DEFLATE), blocksize (512),
#   zlevel (6), zstd_level (9). Set them upstream (e.g. RST_AsFormat), then write.
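
To make the defaults in the table concrete, the sketch below resolves an effective encoding from a metadata map. It is not the writer's implementation: `resolve_encoding` is hypothetical, the precedence of `driver` over `format` is an assumption, and clamping the block size to the raster size is deliberately left out.

```python
DEFAULTS = {
    "driver": "GTiff",
    "compression": "DEFLATE",
    "blocksize": 512,
    "zlevel": 6,
    "zstd_level": 9,
}


def resolve_encoding(metadata: dict) -> dict:
    """Apply the documented defaults to a tile's metadata map (string values)."""
    # Assumption: `driver` wins over `format` when both are present.
    driver = metadata.get("driver") or metadata.get("format") or DEFAULTS["driver"]
    blocksize = int(metadata.get("blocksize", DEFAULTS["blocksize"]))
    return {
        "driver": driver,
        "compression": metadata.get("compression", DEFAULTS["compression"]),
        # Floor to a multiple of 16; clamping to the raster size is not modeled.
        "blocksize": (blocksize // 16) * 16,
        "zlevel": int(metadata.get("zlevel", DEFAULTS["zlevel"])),
        "zstd_level": int(metadata.get("zstd_level", DEFAULTS["zstd_level"])),
    }


enc = resolve_encoding({"compression": "ZSTD", "blocksize": "500"})
print(enc["driver"], enc["compression"], enc["blocksize"])  # GTiff ZSTD 496
```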

Named GeoTIFF writer (`gtiff_gdal`)

gtiff_gdal is the gdal writer with the GeoTIFF driver preset — use it to make GeoTIFF output explicit.

# Named GeoTIFF writer (gtiff_gdal = gdal writer with driver preset)
spark.read.format("gtiff_gdal").load(SAMPLE_RASTER_PATH) \
    .write.format("gtiff_gdal").mode("append").option("ext", "tif").save(OUT_DIR)

Filename Control

Default filenames look like 946817315_0_0.tif — collision-resistant within a single write, but with two rough edges:

Re-running duplicates files. .mode("append") can't clean up.
Names aren't traceable back to source rows.

To control names, overwrite the source column with your desired prefix and set nameCol = "source". (The schema is fixed at (source, tile), so you can't add a separate name column.)

# Overwrite the reader's 'source' column with your desired filename prefix,
# then point nameCol at it. The writer needs the fixed (source, tile) schema,
# so replacing an existing column is the only way to inject a name.
from pyspark.sql.functions import monotonically_increasing_id, concat, lit

(
    spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
        .withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
        .write
            .format("gdal")
            .mode("append")
            .option("nameCol", "source")    # 'source' now carries the filename
            .option("ext", "tif")
        .save("{OUTPUT_DIR}")
)

// Overwrite the reader's 'source' column with your desired filename prefix,
// then point nameCol at it. The writer needs the fixed (source, tile) schema,
// so replacing an existing column is the only way to inject a name.
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
    .withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
    .write
      .format("gdal")
      .mode("append")
      .option("nameCol", "source")   // 'source' now carries the filename
      .option("ext", "tif")
    .save("/Volumes/main/default/geobrix_samples/geobrix-examples/out/writer-docs-example")
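
Names built from monotonically_increasing_id are unique but still not traceable to their inputs. A possible alternative is to derive the name from the original source path. The pure-Python sketch below shows the derivation and a duplicate check; `stem_from_source` and `find_duplicates` are hypothetical helpers, the paths are made up, and how the writer treats two rows with the same name is not covered by this page, so duplicates are simply flagged.

```python
import posixpath


def stem_from_source(source: str) -> str:
    """Filename stem (no directory, no extension) for a source path."""
    base = posixpath.basename(source.rstrip("/"))
    stem, _ext = posixpath.splitext(base)
    return stem


def find_duplicates(stems):
    """Return the stems that occur more than once, sorted."""
    seen, dupes = set(), set()
    for s in stems:
        (dupes if s in seen else seen).add(s)
    return sorted(dupes)


sources = [
    "/Volumes/cat/sch/vol/in/scene_a.tif",  # made-up example paths
    "/Volumes/cat/sch/vol/in/scene_b.tif",
]
stems = [stem_from_source(s) for s in sources]
print(stems)                   # ['scene_a', 'scene_b']
print(find_duplicates(stems))  # []
```

If a reader emits several tiles per input file, the stems repeat and need a per-tile suffix before they can serve as filenames. In a pipeline the same derivation could be expressed with Spark string functions such as regexp_extract when overwriting source.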

Materialization Pattern

Spark is lazy: every .display(), .count(), and downstream transform re-runs the plan unless the intermediate data is materialized. For multi-step raster pipelines, write intermediate stages to a Volume and read them back for the next step:

# Performance pattern: materialize intermediate results to avoid
# repeating expensive transforms. Spark is lazy; each .display()/.count()
# re-runs the plan unless the source is already materialized.
import databricks.labs.gbx.rasterx as rx
rx.register(spark)

stacked_df = (
    spark.read.format("gtiff_gdal").load("{SAMPLE_RASTER_PATH}")
        # ... add transforms here (reproject, retile, etc.) ...
)

# Materialize to a Volume directory before follow-on work
(
    stacked_df
        .filter(rx.rst_tryopen("tile"))   # skip invalid tiles before writing
        .write
            .format("gdal")
            .mode("append")
            .option("ext", "tif")
        .save("{OUTPUT_DIR}")
)

# Follow-on steps read the materialized output -- fast, no recompute
next_df = spark.read.format("gtiff_gdal").load("{OUTPUT_DIR}")

Tile Payloads vs Local Paths

After a write round-trip, raster expressions should consume the binary payload inside the tile struct, not any /tmp/... paths that may appear in tile metadata. Those local paths refer to transient staging files on whichever executor produced the tile and will not exist elsewhere. GeoBrix expressions (rst_*) use the binary payload by default; hand-rolled UDFs should too.
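
As an illustration of working from the payload rather than a path, the sketch below inspects the first bytes of a tile's binary content to see whether it is a TIFF container. `looks_like_tiff` is a hypothetical helper; the signatures are the standard TIFF and BigTIFF magic numbers.

```python
TIFF_MAGICS = (
    b"II*\x00",  # little-endian classic TIFF
    b"MM\x00*",  # big-endian classic TIFF
    b"II+\x00",  # little-endian BigTIFF
    b"MM\x00+",  # big-endian BigTIFF
)


def looks_like_tiff(payload) -> bool:
    """True if the binary payload starts with a TIFF or BigTIFF signature."""
    # bytes(...) also accepts the bytearray that Spark hands to Python UDFs.
    return bytes(payload[:4]) in TIFF_MAGICS


print(looks_like_tiff(b"II*\x00" + b"\x00" * 16))  # True
print(looks_like_tiff(b"\x89PNG\r\n\x1a\n"))       # False
```

A hand-rolled UDF would receive the tile's raster field and pass it to a function like this one, never opening a path taken from metadata.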

Viewing Output

Written rasters are standard GDAL-compatible files:

QGIS / ArcGIS / any GDAL-aware viewer opens them directly.
gdalinfo <file> (via the GDAL CLI) prints metadata, geotransform, and band statistics.
Read them back through GeoBrix with spark.read.format("gtiff_gdal").load(OUT_DIR) — covered by the round-trip docs test.
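
For a quick sanity check after a write, the output directory can also be summarized by filename suffix. This sketch assumes the directory is reachable as an ordinary filesystem path; `count_by_suffix` is a hypothetical helper, and it reports suffixes only, which (as explained under How the Output Format Is Chosen) do not prove the on-disk format.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(out_dir: str) -> dict:
    """Count regular files directly under `out_dir`, grouped by suffix."""
    files = (p for p in Path(out_dir).iterdir() if p.is_file())
    return dict(Counter(p.suffix for p in files))
```

An unexpected suffix, or a count that does not match the number of input rows, is a cue to look more closely with gdalinfo.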

Next Steps

GDAL Reader — The corresponding read path.
Raster Functions — Transforms to run before the write.
