Raster Reader

Read rasters into the shared (source, tile) schema. GeoBrix offers two interchangeable tiers: a lightweight pure-Python/PySpark reader (raster_gbx, rasterio-backed, JAR-free, Serverless-safe) and a heavyweight GDAL-backed reader (gdal). They emit the same schema, so swapping is a one-line format(...) change — see Choosing an Execution Tier.

The heavyweight gdal reader supports the full set of GDAL drivers (NetCDF, HDF5, COG, …); the lightweight reader covers the common raster path. Each tier has a corresponding general-purpose raster reader, but the two are not feature-identical.

Benchmark & tradeoff

The lightweight (*_gbx) and heavyweight readers emit the same schema, but your compute usually decides the tier: the lightweight tier needs no JAR or init script and is the only option on Serverless, standard (shared), and ARM clusters. The heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is available it uses native GDAL on the JVM and tends to pull ahead on large workloads. See the Benchmarking page for light-vs-heavy timings and methodology.
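
The rule of thumb above can be written down as a small helper. This is a minimal sketch, not part of GeoBrix: `pick_raster_format` and its parameters are hypothetical names, and the function only encodes the constraints stated in this section.

```python
def pick_raster_format(*, serverless: bool, shared_access: bool,
                       arch: str, gdal_installed: bool) -> str:
    """Return a reader format name for the given compute (hypothetical helper)."""
    # The heavyweight tier needs a classic (non-Serverless, non-shared) x86
    # cluster with the JAR and the GDAL init script in place.
    heavy_ok = (
        not serverless
        and not shared_access
        and arch.lower() in {"x86_64", "amd64"}
        and gdal_installed
    )
    return "gdal" if heavy_ok else "raster_gbx"


fmt = pick_raster_format(serverless=True, shared_access=False,
                         arch="x86_64", gdal_installed=False)
print(fmt)  # raster_gbx
# On a cluster: df = spark.read.format(fmt).load(path)
```

Because both readers emit the same schema, the returned name is the only thing that changes in the read call.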

Options

Lightweight (raster_gbx)

| Option | Default | Description |
|---|---|---|
| sizeInMB | "-1" | Default (<= 0) = no split: one whole-image tile per file. Set a positive MB value to tile large rasters into multiple tiles. |
| filterRegex | ".*" | When loading a directory, keep files whose full path matches this regex. |

# Options: sizeInMB (tile split threshold) + filterRegex (directory listing)
df = (spark.read.format("raster_gbx")
      .option("sizeInMB", "16")
      .option("filterRegex", r".*\.tif$")
      .load("{SAMPLE_RASTER_PATH}"))

gtiff_gbx is raster_gbx with the GeoTIFF driver preset.
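
Before pointing the reader at a large directory, it can help to preview which files a filterRegex value would keep. The sketch below is an illustration only: `preview_filter` is a hypothetical helper, the candidate paths are made up, and it assumes the pattern must match the whole path (Python `re.fullmatch` semantics), which may differ from what the reader does internally.

```python
import re


def preview_filter(paths, filter_regex=".*"):
    """Return the paths a given filterRegex would keep (illustration only)."""
    pattern = re.compile(filter_regex)
    # Assumption: the entire path has to match, as with re.fullmatch.
    return [p for p in paths if pattern.fullmatch(p)]


candidates = [
    "/data/rasters/a.tif",
    "/data/rasters/a.tif.aux.xml",
    "/data/rasters/b.TIF",
    "/data/rasters/readme.txt",
]

print(preview_filter(candidates, r".*\.tif$"))
# ['/data/rasters/a.tif']
print(preview_filter(candidates, r"(?i).*\.tiff?$"))
# ['/data/rasters/a.tif', '/data/rasters/b.TIF']
```

The second pattern shows one way to make the filter case-insensitive and accept both .tif and .tiff.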

Heavyweight (gdal)

| Option | Default | Description |
|---|---|---|
| driver | auto-detected from extension | Explicitly specify the GDAL driver to use (regardless of extension). |
| sizeInMB | "-1" | Default (<= 0) = no split: one whole-image tile per file. Set a positive MB value to split large files into multiple tiles for parallel processing. |
| filterRegex | ".*" | Filter files by regex when reading from a directory. |
| readSubdatasets | "false" | Read subdatasets if present (e.g. HDF, NetCDF). |
| rasterAsGrid | "false" | Read as grid instead of tiles. |
| retile | "false" | Retile rasters for optimal processing. |
| tileSize | "256" | Tile size in pixels (if retiling enabled). |

Example — forcing the GDAL driver explicitly:

# Read with explicit driver (sample-data Volumes path)
df = spark.read.format("gdal") \
    .option("driver", "GTiff") \
    .load("{SAMPLE_RASTER_PATH}")
df.show()
Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+
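
When several heavyweight options are set together, collecting them in one dictionary keeps the read call short. The following is a minimal sketch: `gdal_reader_options` is a hypothetical helper (not a GeoBrix function) that uses the option names and defaults from the table above and renders every value as a string, matching how the defaults are listed.

```python
def gdal_reader_options(driver=None, size_in_mb=-1, filter_regex=".*",
                        read_subdatasets=False, raster_as_grid=False,
                        retile=False, tile_size=256):
    """Build a string-valued options dict for the gdal reader (hypothetical helper)."""
    opts = {
        "sizeInMB": str(size_in_mb),
        "filterRegex": filter_regex,
        # Booleans are rendered as "true"/"false", as in the options table.
        "readSubdatasets": str(read_subdatasets).lower(),
        "rasterAsGrid": str(raster_as_grid).lower(),
        "retile": str(retile).lower(),
        "tileSize": str(tile_size),
    }
    # Leave "driver" unset so the reader can auto-detect it from the extension.
    if driver is not None:
        opts["driver"] = driver
    return opts


opts = gdal_reader_options(driver="NetCDF", read_subdatasets=True)
print(opts["driver"], opts["readSubdatasets"])  # NetCDF true
# On a cluster: df = spark.read.format("gdal").options(**opts).load(path)
```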

Register

# Register the lightweight raster DataSources (once per session)
from databricks.labs.gbx.ds.register import register
register(spark)
Lightweight readers and writers are not auto-registered

This is a key difference from the heavyweight tier. The heavyweight readers/writers (gdal, gtiff_gdal, …) are auto-discovered from the JAR on the classpath via Spark's JVM DataSourceRegister service loader, so spark.read.format("gdal") works with no setup call. The lightweight readers/writers are Python Data Source V2 sources, and Python has no classpath auto-discovery equivalent — so you must register them explicitly with register(spark) (above) before using format("raster_gbx") / format("gtiff_gbx") for reads or writes.

Importing databricks.labs.gbx.ds will opportunistically register them if a Spark session is already active, but the explicit register(spark) call is the reliable path, in case your session is created after imports. (This mirrors the heavyweight gbx_rst_* SQL functions, which also require an explicit register(spark).)
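
In notebooks where several cells or modules may each try to register, a small guard can make the "once per session" rule explicit. This is a sketch under assumptions: `ensure_registered` is a hypothetical wrapper, and it takes the registration function as an argument so it does not presume anything about GeoBrix beyond the register(spark) call shown above.

```python
_registered_sessions = []  # sessions already registered (compared by identity)


def ensure_registered(spark, register_fn):
    """Call register_fn(spark) at most once per session object.

    Returns True if registration ran now, False if it had already run.
    """
    if any(s is spark for s in _registered_sessions):
        return False
    register_fn(spark)
    _registered_sessions.append(spark)
    return True


# On a cluster with GeoBrix installed:
# from databricks.labs.gbx.ds.register import register
# ensure_registered(spark, register)
```

Calling the wrapper again with the same session is a no-op, so it can sit at the top of any cell that reads or writes with the lightweight formats.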

Read (catch-all)

# Catch-all lightweight reader (any rasterio-readable raster)
df = spark.read.format("raster_gbx").load("{SAMPLE_RASTER_PATH}")
df.show()
Example output
+--------------------------------------------------+-----+
|source                                            |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+

gtiff_gbx is raster_gbx with the GeoTIFF driver preset; see the Lightweight GeoTIFF Reader for the named-reader page. The page-level Options section above lists the available reader options.

See Choosing an Execution Tier for the full tradeoff and the Benchmarking page for light-vs-heavy timings.

raster_gbx is the lightweight counterpart of the heavyweight gdal reader. It supports Python and SQL bindings (not Scala).

Common Raster Formats Explained

GeoTIFF - The Universal Choice

Best for: General-purpose geospatial rasters, aerial imagery, DEMs

GeoTIFF combines the TIFF image format with embedded geospatial metadata (coordinate system, geotransform). It's the de facto standard because it's simple and widely supported. Cloud-Optimized GeoTIFF (COG) adds internal tiling and overviews for efficient cloud storage access.

# Standard GeoTIFF
df = spark.read.format("gtiff_gdal").load("/path/to/elevation.tif")

# Works with COG too
df = spark.read.format("gtiff_gdal").load("s3://bucket/cog-file.tif")

NetCDF - Multi-Dimensional Science Data

Best for: Climate models, oceanographic data, time-series rasters

NetCDF excels at storing multi-dimensional arrays with labeled dimensions (time, latitude, longitude, elevation). Common in scientific computing for weather forecasts, climate projections, and oceanographic measurements.

# NetCDF with multiple variables/subdatasets
df = spark.read.format("gdal") \
    .option("driver", "NetCDF") \
    .option("readSubdatasets", "true") \
    .load("/path/to/climate_model.nc")
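
GDAL itself addresses each subdataset with a composite name such as NETCDF:"file.nc":variable. How the reader surfaces these names in its output is not specified here, so the sketch below only shows how such a name can be split into its parts. `parse_subdataset` is a hypothetical helper, and the variable name tas is a made-up example.

```python
import re

# GDAL-style subdataset name: DRIVER:"path":name
_SUBDATASET = re.compile(r'^(?P<driver>[A-Za-z0-9_]+):"(?P<path>.+)":(?P<name>.+)$')


def parse_subdataset(subdataset_name):
    """Split a GDAL-style subdataset name into (driver, path, name)."""
    match = _SUBDATASET.match(subdataset_name)
    if match is None:
        raise ValueError(f"not a quoted subdataset name: {subdataset_name!r}")
    return match.group("driver"), match.group("path"), match.group("name")


print(parse_subdataset('NETCDF:"/path/to/climate_model.nc":tas'))
# ('NETCDF', '/path/to/climate_model.nc', 'tas')
```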

HDF5 - Massive Hierarchical Data

Best for: Large scientific datasets, satellite products (e.g. VIIRS, ICESat-2)

HDF5 (Hierarchical Data Format, version 5) handles extremely large datasets with complex internal structures. NASA and ESA use it for satellite products. Like NetCDF, it often contains multiple subdatasets. Classic MODIS granules such as MOD13Q1.hdf are HDF4 rather than HDF5, so they need GDAL's HDF4 driver instead.

# HDF5 from satellite products (hypothetical .h5 path)
df = spark.read.format("gdal") \
    .option("driver", "HDF5") \
    .option("readSubdatasets", "true") \
    .load("/path/to/satellite_product.h5")

GRIB/GRIB2 - Weather Models

Best for: Numerical weather prediction, meteorological data

GRIB (GRIdded Binary) is the standard format for weather model outputs from agencies such as NOAA and ECMWF. It is highly compressed and optimized for meteorological variables. The example below uses NOAA HRRR weather data from sample-data (nyc/hrrr-weather).

Complete bundle only

The hrrr-weather dataset is included in the complete sample-data bundle, not the essential bundle. See Sample data for download options.

# GRIB2 weather data (sample-data HRRR)
df = spark.read.format("gdal") \
    .option("driver", "GRIB") \
    .load("{SAMPLE_HRRR_PATH}")
df.show()
Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|.../nyc/hrrr-weather/hrrr_nyc_....grib2           |{...}|
+--------------------------------------------------+-----+

Format Selection Guide

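| Format | Size | Compression | Multi-Band | Time-Series | Cloud-Friendly |
|---|---|---|---|---|---|
| GeoTIFF | Good | Good | Yes | No | Yes (COG) |
| NetCDF | Excellent | Good | Yes | Yes | Moderate |
| HDF5 | Excellent | Good | Yes | Yes | Poor |
| GRIB2 | Excellent | Excellent | Yes | Yes | Poor |
| JPEG2000 | Excellent | Excellent | Yes | No | Moderate |
| Zarr | Excellent | Good | Yes | Yes | Excellent |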

General Rules:

  • Start with GeoTIFF (use COG for cloud)
  • Use NetCDF for multi-dimensional scientific data
  • Use HDF5 when required by data provider (e.g., VIIRS; classic MODIS .hdf files are HDF4)
  • Use GRIB2 for weather models
  • Use Zarr for cloud-native analysis at scale
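
For mixed directories, the driver can also be chosen per file before calling the heavyweight reader. The mapping below is an illustrative sketch, not the reader's own auto-detection table: `DRIVER_HINTS` and `driver_hint` are hypothetical names, and only the drivers used in the examples on this page are listed.

```python
from pathlib import PurePosixPath

# Illustrative extension -> driver hints, limited to drivers shown on this page.
DRIVER_HINTS = {
    ".tif": "GTiff",
    ".tiff": "GTiff",
    ".nc": "NetCDF",
    ".h5": "HDF5",
    ".grib2": "GRIB",
}


def driver_hint(path):
    """Return a driver name for known extensions, else None (let the reader auto-detect)."""
    suffix = PurePosixPath(path).suffix.lower()
    return DRIVER_HINTS.get(suffix)


print(driver_hint("s3://bucket/cog-file.tif"))      # GTiff
print(driver_hint("/path/to/climate_model.nc"))     # NetCDF
print(driver_hint("/path/to/unknown.xyz"))          # None
```

Returning None for unknown extensions leaves the driver option unset, so the reader's default detection still applies.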

Next Steps