Raster Reader
Read rasters into the shared (source, tile) schema. GeoBrix offers two
interchangeable tiers: a lightweight pure-Python/PySpark reader (raster_gbx,
rasterio-backed, JAR-free, Serverless-safe) and a heavyweight GDAL-backed
reader (gdal). They emit the same schema, so swapping is a one-line
format(...) change — see Choosing an Execution Tier.
The heavyweight gdal reader supports the full set of GDAL drivers (NetCDF, HDF5, COG, …); the lightweight reader covers the common raster path. The two are counterparts, one general raster reader per tier, not feature-identical implementations.
The lightweight (*_gbx) and heavyweight readers emit the same schema, but your
compute usually decides the tier: the lightweight tier needs no JAR or init script
and is the only option on Serverless, standard (shared), and ARM clusters. The
heavyweight tier requires a classic x86 cluster (JAR + GDAL init script); where it is
available it uses native GDAL on the JVM and tends to pull ahead on large workloads. See
the Benchmarking page for light-vs-heavy timings and methodology.
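The compute constraints above can be sketched as a small decision helper that returns the name to pass to format(...). This is a hypothetical function, not part of GeoBrix: the argument names and the accepted values for access mode and architecture are assumptions made for illustration, and it assumes the JAR and GDAL init script are already installed whenever it returns "gdal".

```python
def choose_raster_format(serverless: bool, access_mode: str, arch: str) -> str:
    """Return "raster_gbx" or "gdal" from the compute constraints described above.

    Hypothetical helper: argument names and values are illustrative only.
    """
    # The lightweight tier is the only option on Serverless compute.
    if serverless:
        return "raster_gbx"
    # ...and on standard (shared) access mode clusters.
    if access_mode.lower() in {"standard", "shared"}:
        return "raster_gbx"
    # ...and on ARM clusters.
    if arch.lower() in {"arm", "arm64", "aarch64"}:
        return "raster_gbx"
    # Classic x86 cluster: the heavyweight tier is available
    # (assumes the JAR and GDAL init script are installed).
    return "gdal"


fmt = choose_raster_format(serverless=False, access_mode="standard", arch="x86_64")
print(fmt)  # raster_gbx
```

Because both tiers emit the same schema, the returned string can be passed straight to spark.read.format(...) without changing downstream code.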
Options
Lightweight (raster_gbx)
| Option | Default | Description |
|---|---|---|
| sizeInMB | "-1" | Default (<= 0) = no split: one whole-image tile per file. Set a positive MB value to tile large rasters into multiple tiles. |
| filterRegex | ".*" | When loading a directory, keep files whose full path matches this regex. |
# Options: sizeInMB (tile split threshold) + filterRegex (directory listing)
df = (spark.read.format("raster_gbx")
.option("sizeInMB", "16")
.option("filterRegex", r".*\.tif$")
.load("{SAMPLE_RASTER_PATH}"))
gtiff_gbx is raster_gbx with the GeoTIFF driver preset.
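Before pointing the reader at a large directory, it can help to preview which files a filterRegex would keep. The sketch below is a local, pure-Python approximation; it assumes the whole path must match the pattern (re.fullmatch semantics), which the table does not spell out, and the file paths are made up for illustration. Patterns that start with .* and end with $ behave the same under full-match and search semantics.

```python
import re


def preview_filter(paths, filter_regex=".*"):
    """Return the paths a given filterRegex would keep (local preview only)."""
    pattern = re.compile(filter_regex)
    # Assumption: the full path must match the pattern, as with re.fullmatch.
    return [p for p in paths if pattern.fullmatch(p)]


# Hypothetical directory listing.
candidates = [
    "/data/rasters/a.tif",
    "/data/rasters/b.TIF",
    "/data/rasters/c.tif.aux.xml",
    "/data/rasters/readme.txt",
]

kept = preview_filter(candidates, r".*\.tif$")
print(kept)  # ['/data/rasters/a.tif']
```

Note that the match is case-sensitive, so b.TIF is dropped; a pattern such as r".*\.(tif|TIF)$" would keep both.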
Heavyweight (gdal)
| Option | Default | Description |
|---|---|---|
| driver | auto-detected from extension | Explicitly specify the GDAL driver to use (regardless of extension). |
| sizeInMB | "-1" | Default (<= 0) = no split: one whole-image tile per file. Set a positive MB value to split large files into multiple tiles for parallel processing. |
| filterRegex | ".*" | Filter files by regex when reading from a directory. |
| readSubdatasets | "false" | Read subdatasets if present (e.g. HDF, NetCDF). |
| rasterAsGrid | "false" | Read as grid instead of tiles. |
| retile | "false" | Retile rasters for optimal processing. |
| tileSize | "256" | Tile size in pixels (if retiling enabled). |
Example — forcing the GDAL driver explicitly:
# Read with explicit driver (sample-data Volumes path)
df = spark.read.format("gdal") \
.option("driver", "GTiff") \
.load("{SAMPLE_RASTER_PATH}")
df.show()
+--------------------------------------------------+-----+
|path |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif |{...}|
+--------------------------------------------------+-----+
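Option names for the heavyweight reader are easy to mistype, and the values are passed as strings. The following is a minimal sketch of a hypothetical helper (not a GeoBrix API) that checks names against the table above and converts values to strings before they are handed to the reader. How the reader itself treats an unknown option name is not stated here, so the check is purely a local convenience.

```python
# Option names taken from the heavyweight (gdal) options table.
GDAL_READER_OPTIONS = {
    "driver",
    "sizeInMB",
    "filterRegex",
    "readSubdatasets",
    "rasterAsGrid",
    "retile",
    "tileSize",
}


def gdal_reader_options(**overrides):
    """Validate option names and return them as a dict of strings."""
    unknown = set(overrides) - GDAL_READER_OPTIONS
    if unknown:
        raise ValueError(f"Unknown gdal reader option(s): {sorted(unknown)}")
    options = {}
    for name, value in overrides.items():
        # Booleans become "true"/"false"; everything else is stringified.
        options[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return options


opts = gdal_reader_options(driver="GTiff", retile=True, tileSize=512)
print(opts)  # {'driver': 'GTiff', 'retile': 'true', 'tileSize': '512'}

# With an active session, the dict can be applied in one call:
# df = spark.read.format("gdal").options(**opts).load("{SAMPLE_RASTER_PATH}")
```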
- Lightweight · raster_gbx
- Heavyweight · gdal
Register
# Register the lightweight raster DataSources (once per session)
from databricks.labs.gbx.ds.register import register
register(spark)
This is a key difference from the heavyweight tier. The heavyweight readers/writers
(gdal, gtiff_gdal, …) are auto-discovered from the JAR on the classpath via
Spark's JVM DataSourceRegister service loader, so spark.read.format("gdal")
works with no setup call. The lightweight readers/writers are Python Data
Source V2 sources, and Python has no classpath auto-discovery equivalent — so you
must register them explicitly with register(spark) (above) before using
format("raster_gbx") / format("gtiff_gbx") for reads or writes.
Importing databricks.labs.gbx.ds will opportunistically register them if a
Spark session is already active, but the explicit register(spark) call is the
reliable path, in case your session is created after imports. (This mirrors
the heavyweight gbx_rst_* SQL functions, which also require an explicit
register(spark).)
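One way to make the explicit registration hard to forget is to wrap it in a small helper that every lightweight read goes through. The sketch below is a hypothetical wrapper, not part of GeoBrix. It tracks sessions by id(spark) so that register(spark) runs once per session, since this page does not say whether repeated calls are safe.

```python
_registered_sessions = set()


def lightweight_reader(spark, fmt="raster_gbx"):
    """Return spark.read.format(fmt), registering the lightweight sources first."""
    if id(spark) not in _registered_sessions:
        # Deferred import so this module loads even before GeoBrix is needed.
        from databricks.labs.gbx.ds.register import register

        register(spark)  # explicit registration, once per session
        _registered_sessions.add(id(spark))
    return spark.read.format(fmt)


# Usage (requires an active session and GeoBrix installed):
# df = lightweight_reader(spark).load("{SAMPLE_RASTER_PATH}")
# df = lightweight_reader(spark, "gtiff_gbx").load("{SAMPLE_RASTER_PATH}")
```

Taking the session as an argument, rather than importing a global, keeps the helper correct when the session is created after module imports.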
Read (catch-all)
# Catch-all lightweight reader (any rasterio-readable raster)
df = spark.read.format("raster_gbx").load("{SAMPLE_RASTER_PATH}")
df.show()
+--------------------------------------------------+-----+
|source |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif |{...}|
+--------------------------------------------------+-----+
gtiff_gbx is raster_gbx with the GeoTIFF driver preset; see the Lightweight GeoTIFF Reader for the named-reader page. The page-level Options section lists the available reader options.
See Choosing an Execution Tier for the full tradeoff and the Benchmarking page for light-vs-heavy timings.
raster_gbx is the lightweight counterpart of the heavyweight gdal reader. It supports Python and SQL bindings (not Scala).
The GDAL reader provides generic support for reading raster data formats through the GDAL library. This is the base reader that powers all raster format readers in GeoBrix.
GeoBrix currently focuses on the GeoTIFF format. While the GDAL reader can work with many formats, GeoTIFF receives the most testing and optimization. For other formats, your experience may vary depending on GDAL driver availability and maturity.
The generic gdal reader accepts any GDAL-supported format (~150+ drivers). Individual GDAL drivers have historically had parser vulnerabilities, so a malformed file from an untrusted source can exercise bugs deep inside the native library.
- Only load raster files from sources you trust. For third-party data, prefer validating with gdalinfo (or a sandboxed job) before ingesting into production pipelines.
- Network-capable drivers (WMS, WMTS, WCS, WFS, HTTP, CSW, OGCAPI) are disabled by default because they can trigger outbound HTTP fetches at open time. To re-enable them, set spark.gdal.GDAL_SKIP="" (disable all skipping) or a narrower space-separated list in your Spark cluster config. See Installation for cluster configuration guidance.
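When only some of the network-capable drivers are needed, the narrower list can be derived rather than typed by hand. The sketch below is a hypothetical helper: it starts from the default-disabled drivers named above and removes the ones to re-enable, producing the space-separated value for spark.gdal.GDAL_SKIP. It assumes the default skip list contains exactly those seven drivers.

```python
# Network-capable drivers listed above as disabled by default.
DEFAULT_SKIPPED = ["WMS", "WMTS", "WCS", "WFS", "HTTP", "CSW", "OGCAPI"]


def gdal_skip_value(reenable=()):
    """Build a GDAL_SKIP value that keeps every default-skipped driver except `reenable`."""
    wanted = {name.upper() for name in reenable}
    unknown = wanted - set(DEFAULT_SKIPPED)
    if unknown:
        raise ValueError(f"Not in the default skip list: {sorted(unknown)}")
    # Space-separated, preserving the original order.
    return " ".join(d for d in DEFAULT_SKIPPED if d not in wanted)


print(gdal_skip_value(["WMS", "WMTS"]))  # WCS WFS HTTP CSW OGCAPI
```

Re-enabling every driver yields an empty string, which corresponds to the "disable all skipping" setting described above.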
Format Name
gdal
Overview
The GDAL reader is a generic raster data reader that can handle any format supported by GDAL. While GeoBrix provides named readers for common formats (GeoTIFF), you can use the GDAL reader directly for any available format.
Understanding Raster Formats: Raster data represents geographic information as a grid of cells (pixels), where each cell contains a value. Unlike vector data (points, lines, polygons), rasters are ideal for continuous phenomena like elevation, temperature, or satellite imagery. Common raster use cases include Digital Elevation Models (DEMs) for terrain analysis, multispectral satellite imagery for land cover classification, weather model outputs (temperature, precipitation), and aerial photography. Raster formats vary in their compression methods, band organization, and metadata capabilities—GeoTIFF is the most universal, NetCDF excels at multi-dimensional scientific data, HDF5 handles massive hierarchical datasets, and GRIB2 is standard for meteorological models.
Available Formats
The GDAL reader can work with many GDAL raster drivers, including:
- GeoTIFF (.tif, .tiff) - Most common geospatial raster format
- Cloud-Optimized GeoTIFF (COG) - Web-optimized GeoTIFF variant
- NetCDF (.nc) - Multi-dimensional scientific data
- HDF5 (.h5, .hdf) - Hierarchical data format
- GRIB/GRIB2 (.grb, .grib2) - Meteorological data
- JPEG2000 (.jp2) - High-compression imagery
- ENVI (.hdr) - Remote sensing format
- Zarr - Cloud-native array storage
- And 150+ more formats
Experience varies across GDAL formats. Not all formats are available by default—some require additional packages or drivers to be installed in your environment. Refer to GDAL for driver names.
Basic Usage
Python
# Read raster file (sample-data Volumes path)
df = spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")
df.show()
+--------------------------------------------------+-----+
|path |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif |{...}|
+--------------------------------------------------+-----+
Scala
val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
df.show()
+--------------------------------------------------+-----+
|path |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif |{...}|
+--------------------------------------------------+-----+
SQL
-- Read raster in SQL (sample-data Volumes path)
SELECT * FROM gdal.`{SAMPLE_RASTER_PATH}` LIMIT 10;
+--------------------------------------------------+-----+
|path |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif |{...}|
+--------------------------------------------------+-----+
Output Schema
root
 |-- tile: struct (GeoBrix raster tile structure)
 |    |-- cellid: bigint (grid cell ID, nullable)
 |    |-- raster: binary (raster file content)
 |    |-- metadata: map<string,string> (driver, extension, etc.)
The tile column contains the complete raster data structure. See Tile Structure for detailed field descriptions.
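Because tile is a struct, its fields can be projected into flat columns with ordinary Spark SQL expressions. The sketch below is a hypothetical helper that works on any DataFrame produced by the reader. It assumes the metadata map uses "driver" as a key, which the schema comment suggests but does not guarantee.

```python
def summarize_tiles(df):
    """Project a few tile fields into flat columns for quick inspection."""
    return df.selectExpr(
        "tile.cellid AS cellid",
        # Assumption: "driver" is a key in the metadata map.
        "tile.metadata['driver'] AS driver",
        # length() on a binary column returns its size in bytes.
        "length(tile.raster) AS raster_bytes",
    )


# Usage (requires an active session):
# summarize_tiles(spark.read.format("gdal").load("{SAMPLE_RASTER_PATH}")).show()
```

Selecting only the size of tile.raster, rather than the binary itself, keeps the summary cheap to display.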
Named Readers vs GDAL
For common formats, GeoBrix provides named readers for convenience:
# Using named reader (recommended for GeoTIFF)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")
# Using GDAL (works but less convenient)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")
When to use each:
- Named readers (gtiff_gdal): Better for common formats, cleaner syntax
- GDAL: Useful for less common formats or when you need driver-specific options
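The choice between a named reader and the generic reader can be sketched as a lookup on the file extension. This is a hypothetical helper, not a GeoBrix function: the extension-to-driver mapping is an illustrative subset using the driver names that appear on this page, and unknown extensions fall back to the reader's own auto-detection.

```python
import os

# Illustrative subset; driver names as used elsewhere on this page.
_EXTENSION_DRIVERS = {
    ".nc": "NetCDF",
    ".h5": "HDF5",
    ".grb": "GRIB",
    ".grib2": "GRIB",
}


def reader_for(path):
    """Return (format_name, options) for a raster path, preferring the named reader."""
    ext = os.path.splitext(path)[1].lower()
    if ext in {".tif", ".tiff"}:
        # Named reader: cleaner syntax for the common GeoTIFF case.
        return "gtiff_gdal", {}
    if ext in _EXTENSION_DRIVERS:
        return "gdal", {"driver": _EXTENSION_DRIVERS[ext]}
    # Unknown extension: let the gdal reader auto-detect the driver.
    return "gdal", {}


fmt, options = reader_for("/path/to/climate_model.nc")
print(fmt, options)  # gdal {'driver': 'NetCDF'}

# With an active session:
# df = spark.read.format(fmt).options(**options).load("/path/to/climate_model.nc")
```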
Common Raster Formats Explained
GeoTIFF - The Universal Choice
Best for: General-purpose geospatial rasters, aerial imagery, DEMs
GeoTIFF combines TIFF image format with embedded geospatial metadata (coordinate system, geotransform). It's the de facto standard because it's simple, widely supported, and works everywhere. Cloud-Optimized GeoTIFF (COG) adds internal tiling and overviews for efficient cloud storage access.
# Standard GeoTIFF
df = spark.read.format("gtiff_gdal").load("/path/to/elevation.tif")
# Works with COG too
df = spark.read.format("gtiff_gdal").load("s3://bucket/cog-file.tif")
NetCDF - Multi-Dimensional Science Data
Best for: Climate models, oceanographic data, time-series rasters
NetCDF excels at storing multi-dimensional arrays with labeled dimensions (time, latitude, longitude, elevation). Common in scientific computing for weather forecasts, climate projections, and oceanographic measurements.
# NetCDF with multiple variables/subdatasets
df = spark.read.format("gdal") \
.option("driver", "NetCDF") \
.option("readSubdatasets", "true") \
.load("/path/to/climate_model.nc")
HDF5 - Massive Hierarchical Data
Best for: Large scientific datasets, satellite products (MODIS, Sentinel)
HDF5 (Hierarchical Data Format) handles extremely large datasets with complex internal structures. NASA and ESA use it for satellite products. Like NetCDF, it often contains multiple subdatasets.
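# HDF5 from satellite products (placeholder path)
# Note: legacy MODIS .hdf granules such as MOD13Q1 are HDF4, which GDAL reads with its HDF4 driver
df = spark.read.format("gdal") \
.option("driver", "HDF5") \
.option("readSubdatasets", "true") \
.load("/path/to/product.h5")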
GRIB/GRIB2 - Weather Models
Best for: Numerical weather prediction, meteorological data
GRIB (GRIdded Binary) is the standard format for weather model outputs from agencies such as NOAA and ECMWF. It is highly compressed and optimized for meteorological variables. The example uses NOAA HRRR weather data from sample-data (nyc/hrrr-weather).
The hrrr-weather dataset is included in the complete sample-data bundle, not the essential bundle. See Sample data for download options.
# GRIB2 weather data (sample-data HRRR)
df = spark.read.format("gdal") \
.option("driver", "GRIB") \
.load("{SAMPLE_HRRR_PATH}")
+--------------------------------------------------+-----+
|path |tile |
+--------------------------------------------------+-----+
|.../nyc/hrrr-weather/hrrr_nyc_....grib2 |{...}|
+--------------------------------------------------+-----+
Format Selection Guide
| Format | Size | Compression | Multi-Band | Time-Series | Cloud-Friendly |
|---|---|---|---|---|---|
| GeoTIFF | Good | Good | Yes | No | Yes (COG) |
| NetCDF | Excellent | Good | Yes | Yes | Moderate |
| HDF5 | Excellent | Good | Yes | Yes | Poor |
| GRIB2 | Excellent | Excellent | Yes | Yes | Poor |
| JPEG2000 | Excellent | Excellent | Yes | No | Moderate |
| Zarr | Excellent | Good | Yes | Yes | Excellent |
General Rules:
- Start with GeoTIFF (use COG for cloud)
- Use NetCDF for multi-dimensional scientific data
- Use HDF5 when required by the data provider (many NASA satellite products)
- Use GRIB2 for weather models
- Use Zarr for cloud-native analysis at scale
Next Steps
- GeoTIFF Reader - Named reader for GeoTIFF format
- Raster Functions - Raster processing operations
- Quick Start - Get started with GeoBrix