GDAL Reader

The GDAL reader provides generic support for reading raster data formats through the GDAL library. This is the base reader that powers all raster format readers in GeoBrix.

GeoTIFF Focus

GeoBrix currently focuses on the GeoTIFF format. The GDAL reader can work with many formats, but GeoTIFF receives the most testing and optimization. For other formats, results may vary with GDAL driver availability and maturity.

Untrusted inputs

The generic gdal reader accepts any GDAL-supported format (150+ drivers). Individual GDAL drivers have historically had parser vulnerabilities, so a malformed file from an untrusted source can trigger bugs deep inside the native library.

Only load raster files from sources you trust. For third-party data, prefer validating with gdalinfo (or a sandboxed job) before ingesting into production pipelines.
Network-capable drivers (WMS, WMTS, WCS, WFS, HTTP, CSW, OGCAPI) are disabled by default because they can trigger outbound HTTP fetches at open time. To re-enable them, set spark.gdal.GDAL_SKIP="" (disable all skipping) or a narrower space-separated list in your Spark cluster config. See Installation for cluster configuration guidance.
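
A narrower skip list can be derived from the default one. The sketch below is illustrative only: it assumes the default-skipped drivers are exactly the seven names listed above and that the value is space-separated, as described; the helper name `gdal_skip_value` is hypothetical and not part of GeoBrix.

```python
# Drivers the text above lists as disabled by default (assumed to be the full list).
DEFAULT_SKIPPED = ["WMS", "WMTS", "WCS", "WFS", "HTTP", "CSW", "OGCAPI"]


def gdal_skip_value(reenable):
    """Return a space-separated skip list that keeps every default-skipped
    driver disabled except the ones named in `reenable`."""
    wanted = {name.upper() for name in reenable}
    unknown = wanted - set(DEFAULT_SKIPPED)
    if unknown:
        raise ValueError(f"Not in the default skip list: {sorted(unknown)}")
    return " ".join(d for d in DEFAULT_SKIPPED if d not in wanted)


# Re-enable only WMS; every other network-capable driver stays disabled.
print(gdal_skip_value(["WMS"]))  # WMTS WCS WFS HTTP CSW OGCAPI
```

The resulting string is the value that would go into spark.gdal.GDAL_SKIP in the cluster config. Re-enabling all seven drivers yields an empty string, which matches the "disable all skipping" case.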

Format Name

gdal

Overview

The GDAL reader is a generic raster data reader that can handle any format supported by GDAL. While GeoBrix provides named readers for common formats (GeoTIFF), you can use the GDAL reader directly for any available format.

Understanding Raster Formats: Raster data represents geographic information as a grid of cells (pixels), where each cell contains a value. Unlike vector data (points, lines, polygons), rasters are ideal for continuous phenomena like elevation, temperature, or satellite imagery. Common raster use cases include Digital Elevation Models (DEMs) for terrain analysis, multispectral satellite imagery for land cover classification, weather model outputs (temperature, precipitation), and aerial photography. Raster formats vary in their compression methods, band organization, and metadata capabilities—GeoTIFF is the most universal, NetCDF excels at multi-dimensional scientific data, HDF5 handles massive hierarchical datasets, and GRIB2 is standard for meteorological models.

Available Formats

The GDAL reader can work with many GDAL raster drivers, including:

GeoTIFF (.tif, .tiff) - Most common geospatial raster format
Cloud-Optimized GeoTIFF (COG) - Web-optimized GeoTIFF variant
NetCDF (.nc) - Multi-dimensional scientific data
HDF5 (.h5, .hdf5) - Hierarchical data format
GRIB/GRIB2 (.grb, .grib2) - Meteorological data
JPEG2000 (.jp2) - High-compression imagery
ENVI (.hdr) - Remote sensing format
Zarr - Cloud-native array storage
And many others among GDAL's 150+ drivers

Format Availability

Experience varies across GDAL formats. Not all formats are available by default; some require additional packages or drivers to be installed in your environment. Refer to the GDAL driver documentation for driver names.
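
Driver availability can be checked before a job is launched. The sketch below assumes the GDAL Python bindings (`osgeo`) are installed where the code runs; the helper name `missing_drivers` and its `get_driver` parameter are hypothetical and exist only for illustration. It reports on the node where it executes, so executors may still differ.

```python
def missing_drivers(names, get_driver=None):
    """Return the subset of `names` that GDAL does not recognise as drivers."""
    if get_driver is None:
        # Imported lazily so the helper can be exercised without GDAL installed.
        from osgeo import gdal
        get_driver = gdal.GetDriverByName  # returns None for unknown drivers
    return [name for name in names if get_driver(name) is None]


# Usage (requires the GDAL Python bindings):
# print(missing_drivers(["GTiff", "netCDF", "HDF5", "GRIB", "Zarr"]))
```

An empty list means every requested driver was found; any names returned point to formats that need extra packages first.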

Basic Usage

Python

# Read a raster file (sample-data Volumes path)
df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
df.show()

Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+

Scala

val df = spark.read.format("gdal").load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")

Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+

SQL

-- Read a raster in SQL (sample-data Volumes path)
SELECT * FROM gdal.`/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif` LIMIT 10;

Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+

Options

`driver`

Default: Auto-detected from file extension if not specified

Explicitly specify the GDAL driver to use (regardless of extension).

# Read with an explicit driver (sample-data Volumes path)
df = spark.read.format("gdal") \
    .option("driver", "GTiff") \
    .load("/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif")
df.show()

Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|/Volumes/.../nyc_sentinel2_red.tif                |{...}|
+--------------------------------------------------+-----+

Other Options

Option	Default	Description
`sizeInMB`	`"16"`	Split threshold for large files (parallel processing)
`filterRegex`	`".*"`	Filter files by regex when reading from directory
`readSubdatasets`	`"false"`	Read subdatasets if present (e.g., HDF, NetCDF)
`rasterAsGrid`	`"false"`	Read as grid instead of tiles
`retile`	`"false"`	Retile rasters for optimal processing
`tileSize`	`"256"`	Tile size in pixels (if retiling enabled)
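
Several of these options are typically combined when reading a directory. The following is a minimal sketch, assuming the option names and string defaults from the table above; the helper `gdal_reader_options` and the directory path are hypothetical. How `filterRegex` is matched (file name or full path) is not stated here, so the pattern is written to work either way.

```python
def gdal_reader_options(filter_regex=".*", size_in_mb=16, retile=False, tile_size=256):
    """Build reader options as strings, using the defaults from the table above."""
    if size_in_mb <= 0 or tile_size <= 0:
        raise ValueError("size_in_mb and tile_size must be positive")
    return {
        "filterRegex": filter_regex,
        "sizeInMB": str(size_in_mb),
        "retile": "true" if retile else "false",
        "tileSize": str(tile_size),
    }


# Read only .tif files from a directory and retile them to 512-pixel tiles.
options = gdal_reader_options(filter_regex=r".*\.tif$", retile=True, tile_size=512)
print(options)

# With a live Spark session and GeoBrix installed (hypothetical directory):
# df = spark.read.format("gdal").options(**options).load("/path/to/raster_dir")
```

Building the options in one place keeps the values as strings, which is how the table lists them, and makes the combination easy to reuse across reads.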

Output Schema

root
 |-- tile: struct (GeoBrix raster tile structure)
     |-- cellid: bigint (grid cell ID, nullable)
     |-- raster: binary (raster file content)
     |-- metadata: map<string,string> (driver, extension, etc.)

The tile column contains the complete raster data structure. See Tile Structure for detailed field descriptions. As the example outputs above show, the reader also returns a path column alongside tile.
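
The nested fields can be projected into flat columns with Spark SQL expressions. This is a sketch under assumptions: the field names come from the schema above, while the `driver` metadata key is inferred from the schema comment and may differ in practice. The helper name `summarize_tiles` is hypothetical.

```python
def summarize_tiles(df):
    """Project a few nested tile fields into flat, easy-to-inspect columns."""
    return df.selectExpr(
        "tile.cellid AS cellid",
        # Metadata key assumed from the schema comment (driver, extension, etc.).
        "tile.metadata['driver'] AS driver",
        # length() on a binary column gives its size in bytes.
        "length(tile.raster) AS raster_bytes",
    )


# With a live Spark session and GeoBrix installed:
# summarize_tiles(spark.read.format("gdal").load("/path/to/file.tif")).show()
```

Projecting the byte length rather than the raster itself keeps `show()` output readable for large tiles.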

Named Readers vs GDAL

For common formats, GeoBrix provides named readers for convenience:

# Using named reader (recommended for GeoTIFF)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")

# Using GDAL (works but less convenient)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")

When to use each:

Named readers (gtiff_gdal): Better for common formats, cleaner syntax
GDAL: Useful for less common formats or when you need driver-specific options
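
The choice can be automated for mixed inputs. The sketch below is illustrative: it maps only the GeoTIFF extensions to the `gtiff_gdal` named reader mentioned above and falls back to the generic `gdal` reader for everything else; the helper `reader_format` and the mapping are hypothetical.

```python
import os

# Extensions with a named reader; only GeoTIFF is named in this section.
NAMED_READERS = {".tif": "gtiff_gdal", ".tiff": "gtiff_gdal"}


def reader_format(path):
    """Pick the named reader for known extensions, else the generic GDAL reader."""
    extension = os.path.splitext(path)[1].lower()
    return NAMED_READERS.get(extension, "gdal")


print(reader_format("/path/to/file.tif"))  # gtiff_gdal
print(reader_format("/path/to/model.nc"))  # gdal

# With a live Spark session:
# df = spark.read.format(reader_format(path)).load(path)
```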

Common Raster Formats Explained

GeoTIFF - The Universal Choice

Best for: General-purpose geospatial rasters, aerial imagery, DEMs

GeoTIFF combines the TIFF image format with embedded geospatial metadata (coordinate system, geotransform). It's the de facto standard because it's simple and supported by virtually all GIS software. Cloud-Optimized GeoTIFF (COG) adds internal tiling and overviews for efficient access from cloud storage.

# Standard GeoTIFF
df = spark.read.format("gtiff_gdal").load("/path/to/elevation.tif")

# Works with COG too
df = spark.read.format("gtiff_gdal").load("s3://bucket/cog-file.tif")

NetCDF - Multi-Dimensional Science Data

Best for: Climate models, oceanographic data, time-series rasters

NetCDF excels at storing multi-dimensional arrays with labeled dimensions (time, latitude, longitude, elevation). Common in scientific computing for weather forecasts, climate projections, and oceanographic measurements.

# NetCDF with multiple variables/subdatasets
df = spark.read.format("gdal") \
    .option("driver", "NetCDF") \
    .option("readSubdatasets", "true") \
    .load("/path/to/climate_model.nc")
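
Before enabling readSubdatasets, it can help to see which subdatasets GDAL reports for a file. The sketch below assumes the GDAL Python bindings (`osgeo`) are installed and the file is readable from the node running the code; the helper `list_subdatasets` and its `open_dataset` parameter are hypothetical.

```python
def list_subdatasets(path, open_dataset=None):
    """Return the subdataset names GDAL reports for `path` (empty if none)."""
    if open_dataset is None:
        # Imported lazily so the helper can be exercised without GDAL installed.
        from osgeo import gdal
        open_dataset = gdal.Open
    dataset = open_dataset(path)
    if dataset is None:
        raise ValueError(f"GDAL could not open {path!r}")
    # GetSubDatasets() yields (name, description) pairs.
    return [name for name, _description in dataset.GetSubDatasets()]


# Usage (requires the GDAL Python bindings):
# for name in list_subdatasets("/path/to/climate_model.nc"):
#     print(name)  # NetCDF names look like NETCDF:"<file>":<variable>
```

An empty result suggests the file holds a single raster, in which case the readSubdatasets option is unlikely to be needed.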

HDF5 - Massive Hierarchical Data

Best for: Large scientific datasets, satellite products

HDF5 (Hierarchical Data Format, version 5) handles extremely large datasets with complex internal structures. NASA and ESA use it for satellite products. Like NetCDF, it often contains multiple subdatasets. Note that classic MODIS products with the .hdf extension (e.g., MOD13Q1) are HDF4 files, which GDAL reads with its separate HDF4 driver rather than HDF5.

# HDF5 satellite product (path is illustrative)
df = spark.read.format("gdal") \
    .option("driver", "HDF5") \
    .option("readSubdatasets", "true") \
    .load("/path/to/product.h5")

GRIB/GRIB2 - Weather Models

Best for: Numerical weather prediction, meteorological data

GRIB (GRIdded Binary) is the standard format for weather model outputs from agencies such as NOAA and ECMWF. It is highly compressed and optimized for meteorological variables. The example below uses NOAA HRRR weather data from sample-data (nyc/hrrr-weather).

Complete bundle only

The hrrr-weather dataset is included in the complete sample-data bundle, not the essential bundle. See Sample data for download options.

# GRIB2 weather data (sample-data HRRR)
# Replace {SAMPLE_HRRR_PATH} with the path to an HRRR GRIB2 file from the complete bundle
df = spark.read.format("gdal") \
    .option("driver", "GRIB") \
    .load("{SAMPLE_HRRR_PATH}")

Example output
+--------------------------------------------------+-----+
|path                                              |tile |
+--------------------------------------------------+-----+
|.../nyc/hrrr-weather/hrrr_nyc_....grib2           |{...}|
+--------------------------------------------------+-----+

Format Selection Guide

Format	Size	Compression	Multi-Band	Time-Series	Cloud-Friendly
GeoTIFF	Good	Good	Yes	No	Yes (COG)
NetCDF	Excellent	Good	Yes	Yes	Moderate
HDF5	Excellent	Good	Yes	Yes	Poor
GRIB2	Excellent	Excellent	Yes	Yes	Poor
JPEG2000	Excellent	Excellent	Yes	No	Moderate
Zarr	Excellent	Good	Yes	Yes	Excellent

General Rules:

Start with GeoTIFF (use COG for cloud)
Use NetCDF for multi-dimensional scientific data
Use HDF5 when required by the data provider
Use GRIB2 for weather models
Use Zarr for cloud-native analysis at scale

Next Steps

GeoTIFF Reader - Named reader for GeoTIFF format
RasterX Functions - Raster processing operations
Quick Start - Get started with GeoBrix
