Readers Overview

GeoBrix provides Spark readers for geospatial file formats. Readers are automatically registered when the GeoBrix JAR is on the classpath.

Available Readers

Raster Readers (GDAL-based)

Reader           Format Name   Description
---------------  ------------  ------------------------------------------------------
GDAL Reader      gdal          Generic reader for any GDAL-supported raster format
GeoTIFF Reader   gtiff_gdal    Named reader for GeoTIFF files (preset driver="GTiff")

Vector Readers (OGR-based)

Reader             Format Name    Description
-----------------  -------------  --------------------------------------------------
OGR Reader         ogr            Generic reader for any OGR-supported vector format
Shapefile Reader   shapefile_ogr  Named reader for ESRI Shapefiles
GeoJSON Reader     geojson_ogr    Named reader for GeoJSON/GeoJSONSeq
GeoPackage Reader  gpkg_ogr       Named reader for GeoPackage
FileGDB Reader     file_gdb_ogr   Named reader for ESRI File Geodatabase

Basic Usage

All readers follow the same pattern:

# Python
df = spark.read.format("<reader_name>").load("/path/to/file")

// Scala
val df = spark.read.format("<reader_name>").load("/path/to/file")

-- SQL
SELECT * FROM <reader_name>.`/path/to/file`;

Examples

Raster (GeoTIFF):

df = spark.read.format("gtiff_gdal").load("/path/to/raster.tif")

Vector (Shapefile):

df = spark.read.format("shapefile_ogr").load("/path/to/data.shp")

Generic (any format):

# Generic GDAL for rasters
df = spark.read.format("gdal").option("driver", "NetCDF").load("/path/to/data.nc")

# Generic OGR for vectors
df = spark.read.format("ogr").option("driverName", "KML").load("/path/to/data.kml")

Output Schemas

Raster Output

Produces the standard tile format used by GeoBrix APIs.

root
|-- tile: struct
|    |-- cellid: bigint (grid cell ID, nullable)
|    |-- raster: binary (raster file content)
|    |-- metadata: map<string,string> (driver, extension, etc.)

See Tile Structure for detailed field descriptions.

Vector Output

The output schema varies with the format and driver conventions; a typical layout is:

root
|-- geom_0: binary (geometry in WKB format)
|-- geom_0_srid: integer (spatial reference ID)
|-- geom_0_srid_proj: string (projection definition)
|-- <attribute_columns>: various (feature attributes)
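The geometry columns hold raw WKB bytes. As an illustration of what that encoding looks like, here is a minimal sketch that decodes a simple little-endian WKB point using only the standard library (for real work you would use a geometry library such as shapely; this helper is purely illustrative and not part of GeoBrix):

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Decode a simple (non-extended) WKB point: a byte-order flag,
    a uint32 geometry type, then two IEEE-754 doubles (x, y)."""
    byte_order = "<" if wkb[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack(byte_order + "I", wkb[1:5])
    if geom_type != 1:  # 1 = Point in the WKB type codes
        raise ValueError(f"not a WKB point: type {geom_type}")
    x, y = struct.unpack(byte_order + "dd", wkb[5:21])
    return x, y

# Build a little-endian WKB point (x=2.5, y=48.0) and round-trip it.
wkb = struct.pack("<BIdd", 1, 1, 2.5, 48.0)
print(parse_wkb_point(wkb))  # (2.5, 48.0)
```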

Path Types

All readers support:

  • Single file: /path/to/file.tif
  • Directory: /path/to/directory/
  • Wildcard: /path/to/*.tif
  • Cloud storage: s3://bucket/path, abfss://..., gs://bucket/path
  • Unity Catalog Volumes: /Volumes/catalog/schema/volume/path
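The wildcard form follows the usual glob conventions; a quick stdlib illustration of the matching semantics (independent of GeoBrix, which resolves such paths internally):

```python
from fnmatch import fnmatch

# "*.tif" matches only names ending in .tif, not sidecar files.
files = ["dem_2024.tif", "dem_2024.tif.aux.xml", "notes.txt"]
matched = [f for f in files if fnmatch(f, "*.tif")]
print(matched)  # ['dem_2024.tif']
```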

Common Options

Raster Options

Option       Default      Description
-----------  -----------  -------------------------------------------
driver       Auto-detect  GDAL driver name (e.g., "GTiff", "NetCDF")
sizeInMB     "16"         Split threshold for large files
filterRegex  ".*"         Filter files by regex pattern
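filterRegex is a regular expression applied to file names. A hedged sketch of the matching behavior with Python's re module, assuming the pattern must match the whole name (as the ".*" default suggests):

```python
import re

# Keep only 2024 GeoTIFFs; sidecar files like .ovr fail the full match.
pattern = re.compile(r".*_2024.*\.tif")
files = ["dem_2024_utm.tif", "dem_2023_utm.tif", "dem_2024_utm.tif.ovr"]
kept = [f for f in files if pattern.fullmatch(f)]
print(kept)  # ['dem_2024_utm.tif']
```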

Vector Options

Option      Default      Description
----------  -----------  ----------------------------------------------
driverName  Auto-detect  OGR driver name (e.g., "ESRI Shapefile")
chunkSize   "10000"      Records per chunk for parallel reading
layerName   ""           Layer name for multi-layer formats
layerN      "0"          Layer index for multi-layer formats
asWKB       "true"       Output geometry as WKB (binary) vs WKT (text)
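chunkSize determines how a layer is split into parallel read tasks: roughly, the number of chunks is the feature count divided by the chunk size, rounded up. A back-of-the-envelope sketch (not GeoBrix internals, just the arithmetic behind tuning this option):

```python
from math import ceil

def num_chunks(feature_count: int, chunk_size: int = 10_000) -> int:
    """Approximate number of parallel read tasks for one layer."""
    return max(1, ceil(feature_count / chunk_size))

# A 1.25M-feature layer with chunkSize 50000 yields ~25 tasks.
print(num_chunks(1_250_000, 50_000))  # 25
```

Larger chunks mean fewer, heavier tasks; smaller chunks increase parallelism at the cost of per-task overhead.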

Reader Types Explained

Generic Readers (gdal, ogr):

  • Work with any format supported by GDAL/OGR
  • Require driver or driverName option for non-standard formats
  • Use when format doesn't have a named reader

Named Readers (gtiff_gdal, shapefile_ogr, etc.):

  • Preset the driver option for common formats
  • Cleaner syntax, no driver option needed
  • Recommended for supported formats

Example:

# Named reader (cleaner)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")

# Generic reader (more verbose, same result)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")

Performance Tips

  1. Use appropriate split size for rasters:

    df = spark.read.format("gdal").option("sizeInMB", "32").load("/path")
  2. Use chunk size for vectors:

    df = spark.read.format("ogr").option("chunkSize", "50000").load("/path")
  3. Filter files with regex:

    df = spark.read.format("gdal").option("filterRegex", ".*_2024.*\\.tif").load("/path")
  4. Persist to a table after reading, so the source files are not re-parsed on every query:

    spark.read.format("shapefile_ogr").load("/path").write.saveAsTable(...)

Next Steps