Raster Writer

Write raster tiles from the shared (source, tile) schema. The lightweight raster_gbx writer (rasterio-backed, JAR-free, supports overwrite) and the heavyweight GDAL-backed gdal writer (append-only) take the same schema — see Choosing an Execution Tier.

Benchmark & tradeoff

The lightweight (*_gbx) writers need no JAR or init script and are the only option on Serverless, standard (shared), and ARM clusters. The heavyweight raster/PMTiles writers require a classic x86 cluster (JAR + GDAL init script); where available they use native GDAL on the JVM. Your compute type therefore usually decides the tier first, and data scale decides it second. See the Benchmarking page for timings and methodology.
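The compute constraints above can be read as a small decision helper. This is only a sketch: `available_tiers`, its parameters, and the compute labels are hypothetical names chosen for illustration, not part of the library, and it encodes nothing beyond the constraints stated on this page.

```python
def available_tiers(compute: str, arch: str = "x86",
                    has_jar_and_gdal_init: bool = False) -> list:
    """List the writer tiers usable on a cluster, per the constraints on this page.

    compute: "serverless", "standard" (shared) or "classic" (hypothetical labels).
    arch: "x86" or "arm".
    has_jar_and_gdal_init: True if the JAR and the GDAL init script are installed.
    """
    compute, arch = compute.lower(), arch.lower()
    if compute not in {"serverless", "standard", "classic"}:
        raise ValueError(f"unknown compute type: {compute!r}")
    if arch not in {"x86", "arm"}:
        raise ValueError(f"unknown architecture: {arch!r}")

    # The lightweight writers need no JAR or init script, so they are always usable.
    tiers = ["lightweight"]
    # The heavyweight writers need a classic x86 cluster with JAR + GDAL init script.
    if compute == "classic" and arch == "x86" and has_jar_and_gdal_init:
        tiers.append("heavyweight")
    return tiers
```

When both tiers are returned, data scale and the Benchmarking page are what should settle the choice.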

Schema

Input schema — exactly (source, tile):

root
|-- source: string
|-- tile: struct
| |-- cellid: bigint
| |-- raster: binary
| |-- metadata: map<string,string>

This is the exact schema the raster readers emit. The writer requires these two columns and nothing else — extra OR missing columns both raise an error (there is no implicit projection). The on-disk format comes from tile.metadata (the GDAL driver/extension recorded at read time); the ext option controls only the filename suffix.
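Because there is no implicit projection, it can help to check the column set before calling the writer. The helper below is a minimal sketch of such a pre-flight check. `schema_problems` and `assert_exact_schema` are hypothetical names, and the check covers top-level column names only, not the nested field types of tile.

```python
REQUIRED_COLUMNS = ("source", "tile")


def schema_problems(columns):
    """Return (missing, extra) column names relative to the (source, tile) schema."""
    present = list(columns)
    missing = [c for c in REQUIRED_COLUMNS if c not in present]
    extra = [c for c in present if c not in REQUIRED_COLUMNS]
    return missing, extra


def assert_exact_schema(columns):
    """Raise ValueError if the column set is not exactly (source, tile)."""
    missing, extra = schema_problems(columns)
    if missing or extra:
        raise ValueError(
            f"writer needs exactly (source, tile); missing={missing}, extra={extra}"
        )
```

With a Spark DataFrame, `assert_exact_schema(df.columns)` gives an early, readable failure, and `df.select("source", "tile")` drops any extra columns before the write.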

Typically you write a reader's output unchanged. To control output filenames, overwrite the source column in place and pass it as the nameCol option; do not add a column, because that breaks the exact-schema check:

# (source, tile) straight from a reader -> write as-is
df.write.format("gtiff_gbx").mode("append").save("/Volumes/cat/sch/vol/out")

# name outputs from an existing string column: overwrite `source`, then drop
# the helper column so only (source, tile) reaches the writer
(df.withColumn("source", df["scene_id"])
   .drop("scene_id")
   .write.format("gtiff_gbx").mode("append")
   .option("nameCol", "source")
   .save("/Volumes/cat/sch/vol/out"))

Output: one raster file per input row, written under the target directory. The file format is whatever tile.metadata records (e.g. GeoTIFF for gtiff_gbx); the filename suffix comes from the ext option.

Options

Both tiers require the exact (source, tile) schema and take the same two writer options. The on-disk driver, compression, and block layout come from tile.metadata, not from writer options (see Output encoding).

| Option | Default | Description |
|---|---|---|
| ext | "tif" | Filename suffix appended to each written file. Does not change the on-disk format; that is driven by tile.metadata["driver"]. |
| nameCol | unset | Name of an existing string column whose value becomes the output filename (without extension). Must be one of the two table columns; in practice, overwrite source (see Filename Control). When unset, an opaque unique name is used. |

Lightweight (raster_gbx)

Supports .mode("overwrite"). Passes whole-file GeoTIFF tiles through verbatim; re-encodes non-GTiff tiles (e.g. COG) via rasterio.

Heavyweight (gdal)

Append-only — .mode("overwrite") is not supported. Re-encodes every tile through native GDAL (decoded pixels are identical to the lightweight output).
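The save-mode difference between the two tiers can be captured in a lookup table, which is handy for failing fast in a pipeline that targets both. This is a sketch under stated assumptions: `mode_supported` is a hypothetical helper, the table lists only the modes this page mentions, and any other Spark save mode is treated as unsupported simply because the page does not describe it.

```python
# Save modes per writer format, as described on this page.
# Assumption: the lightweight writers accept both "append" and "overwrite".
SUPPORTED_MODES = {
    "gdal": {"append"},                      # heavyweight: append-only
    "raster_gbx": {"append", "overwrite"},   # lightweight catch-all writer
    "gtiff_gbx": {"append", "overwrite"},    # lightweight GeoTIFF writer
}


def mode_supported(fmt: str, mode: str) -> bool:
    """Return True if this page documents `mode` as usable with writer `fmt`."""
    if fmt not in SUPPORTED_MODES:
        raise ValueError(f"writer format not covered here: {fmt!r}")
    return mode.lower() in SUPPORTED_MODES[fmt]
```

A job that must replace existing output on the heavyweight tier would need to clear the target directory itself before appending.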

The raster_gbx writer is a pure-Python/PySpark raster writer: the lightweight tier's drop-in replacement for the GDAL-backed gdal writer. It requires the exact (source, tile) schema, the same as the heavyweight writer. Beyond the output path, the writer options are nameCol and ext; the on-disk format and compression come from tile.metadata, not from writer options.

Write raster tiles

# Catch-all lightweight writer (output driver from tile.metadata; default GTiff)
from databricks.labs.gbx.ds.register import register
register(spark)
df = spark.read.format("raster_gbx").load("{SAMPLE_RASTER_PATH}")
df.write.format("raster_gbx").mode("overwrite").save(OUT_DIR)

A whole-file GeoTIFF tile is written through verbatim (no re-encode); a tile whose metadata["driver"] is non-GTiff (e.g. COG) is re-encoded via rasterio.
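The pass-through rule above can be sketched as a function of the tile's metadata map. The helper name `lightweight_write_path` is hypothetical, and the sketch assumes the driver is stored under the "driver" key with GTiff as the default when the key is absent. It cannot tell whether a tile is a whole file, so it describes only the driver half of the rule.

```python
def lightweight_write_path(metadata: dict) -> str:
    """Classify how the lightweight writer would handle a tile, by driver only.

    Returns "passthrough" for GTiff (bytes written verbatim) and "reencode"
    for any other driver (re-encoded via rasterio).
    """
    # Assumption: a missing "driver" entry means the default output driver, GTiff.
    driver = metadata.get("driver", "GTiff")
    return "passthrough" if driver == "GTiff" else "reencode"
```

Counting tiles by this label before a large write gives a rough idea of how much re-encoding work the job will do.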

For an explicit GeoTIFF-forced writer, see the Lightweight GeoTIFF Writer.

Register before writing

Like the lightweight readers, the lightweight writers are not auto-registered (Python Data Source V2 has no classpath auto-discovery, unlike the JAR-based heavyweight writers). Call register(spark) first, as shown above. See Lightweight Raster Readers → Register for the full explanation.

Control filenames (nameCol)

# Control output filenames: overwrite 'source', set nameCol
from pyspark.sql.functions import concat, lit, monotonically_increasing_id
(df.withColumn("source", concat(lit("tile_"), monotonically_increasing_id()))
.write.format("gtiff_gbx").mode("overwrite")
.option("nameCol", "source").option("ext", "tif").save(OUT_DIR))
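Since one file is written per row and the name comes from source, duplicate names are worth catching before the write; this page does not say what the writer does when two rows share a name. The sketch below uses hypothetical helpers, and it assumes the suffix is joined to the name with a single dot.

```python
from collections import Counter


def output_filename(name: str, ext: str = "tif") -> str:
    """Compose a filename from a nameCol value and the ext option.

    Assumption: the suffix is appended after one dot; a leading dot in
    `ext` is tolerated and stripped.
    """
    return f"{name}.{ext.lstrip('.')}"


def duplicate_names(names):
    """Return the sorted list of names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

Note that monotonically_increasing_id yields unique ids that are not consecutive, so names built from it are collision-free but will have gaps in their numbering.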

Output format & compression

# On-disk format/compression come from tile.metadata, NOT writer options
# driver/format -> output driver (default GTiff; GTiff = passed through verbatim)
# compression/blocksize/zlevel/zstd_level -> applied when re-encoding (non-GTiff)
# Change them via upstream transforms, then write.
| Option | Default | Description |
|---|---|---|
| nameCol | unset | Existing string column whose value is the output filename (overwrite source). When unset, an opaque unique name is used. |
| ext | "tif" | Filename suffix. Does not change the on-disk format. |

The driver / compression / blocksize / zlevel / zstd_level are read from tile.metadata (same as the heavy gdal writer).

The lightweight writer passes whole-file GeoTIFF tiles through verbatim, while the heavyweight writer re-encodes every tile (decoded pixels are identical either way). The heavyweight gdal writer is also append-only, whereas the lightweight writer supports overwrite. See Choosing an Execution Tier and the Benchmarking page for light-vs-heavy timings.

The raster_gbx writer is the lightweight counterpart of the heavyweight gdal writer and supports Python and SQL bindings (not Scala).

Next Steps