Readers Overview
GeoBrix provides Spark readers for geospatial file formats. Readers are automatically registered when the GeoBrix JAR is on the classpath.
Available Readers
Raster Readers (GDAL-based)
| Reader | Format Name | Description |
|---|---|---|
| GDAL Reader | gdal | Generic reader for any GDAL-supported raster format |
| GeoTIFF Reader | gtiff_gdal | Named reader for GeoTIFF files (preset driver="GTiff") |
Vector Readers (OGR-based)
| Reader | Format Name | Description |
|---|---|---|
| OGR Reader | ogr | Generic reader for any OGR-supported vector format |
| Shapefile Reader | shapefile_ogr | Named reader for ESRI Shapefiles |
| GeoJSON Reader | geojson_ogr | Named reader for GeoJSON/GeoJSONSeq |
| GeoPackage Reader | gpkg_ogr | Named reader for GeoPackage |
| FileGDB Reader | file_gdb_ogr | Named reader for ESRI File Geodatabase |
Basic Usage
All readers follow the same pattern:
# Python
df = spark.read.format("<reader_name>").load("/path/to/file")
# Scala
val df = spark.read.format("<reader_name>").load("/path/to/file")
-- SQL
SELECT * FROM <reader_name>.`/path/to/file`;
Examples
Raster (GeoTIFF):
df = spark.read.format("gtiff_gdal").load("/path/to/raster.tif")
Vector (Shapefile):
df = spark.read.format("shapefile_ogr").load("/path/to/data.shp")
Generic (any format):
# Generic GDAL for rasters
df = spark.read.format("gdal").option("driver", "NetCDF").load("/path/to/data.nc")
# Generic OGR for vectors
df = spark.read.format("ogr").option("driverName", "KML").load("/path/to/data.kml")
Output Schemas
Raster Output
Raster readers produce the standard tile format used by GeoBrix APIs.
root
 |-- tile: struct
 |    |-- cellid: bigint (grid cell ID, nullable)
 |    |-- raster: binary (raster file content)
 |    |-- metadata: map<string,string> (driver, extension, etc.)
See Tile Structure for detailed field descriptions.
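Because the raster field holds the file content as bytes, a tile collected to the driver can be written back to disk and inspected with other tools. The following is a minimal plain-Python sketch, not part of GeoBrix: the helper name write_tile_bytes is hypothetical, and it assumes the metadata map may carry an "extension" entry such as "tif" (the exact key names are described under Tile Structure, not here).

```python
import os
import tempfile


def write_tile_bytes(raster: bytes, metadata: dict, out_dir: str, stem: str = "tile") -> str:
    """Write one tile's raster bytes into out_dir and return the file path.

    Assumption: metadata may contain an "extension" entry. When it is
    missing, a generic "bin" suffix is used instead.
    """
    ext = metadata.get("extension", "bin").lstrip(".")
    path = os.path.join(out_dir, f"{stem}.{ext}")
    with open(path, "wb") as fh:
        fh.write(raster)
    return path


# Stand-in values for illustration only (not real raster content).
out_dir = tempfile.mkdtemp()
saved = write_tile_bytes(b"\x49\x49\x2a\x00", {"driver": "GTiff", "extension": "tif"}, out_dir)
```

In practice the raster and metadata arguments would come from a collected row's tile struct. Collecting large rasters to the driver is only sensible for a handful of tiles.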
Vector Output
The output schema varies with the format and driver conventions. A typical layout:
root
|-- geom_0: binary (geometry in WKB format)
|-- geom_0_srid: integer (spatial reference ID)
|-- geom_0_srid_proj: string (projection definition)
|-- <attribute_columns>: various (feature attributes)
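Since geom_0 is delivered as WKB, a collected value can be inspected without any geospatial library. The sketch below reads the standard WKB header (one byte-order byte followed by a 32-bit geometry type code) and decodes a plain 2D point. The function names are illustrative and not part of GeoBrix. It handles only the basic 2D point layout and assumes no EWKB SRID flag or Z/M dimensions.

```python
import struct


def wkb_header(wkb: bytes) -> tuple:
    """Return (byte_order, geometry_type_code) from the first 5 bytes of a WKB value."""
    if len(wkb) < 5:
        raise ValueError("WKB value is too short to contain a header")
    if wkb[0] not in (0, 1):
        raise ValueError(f"unexpected byte-order marker: {wkb[0]}")
    # First byte: 0 = big-endian (XDR), 1 = little-endian (NDR).
    order = "little" if wkb[0] == 1 else "big"
    endian = "<" if order == "little" else ">"
    (type_code,) = struct.unpack(endian + "I", wkb[1:5])
    return order, type_code


def wkb_point_xy(wkb: bytes) -> tuple:
    """Decode a plain 2D WKB point (type code 1) into (x, y)."""
    order, type_code = wkb_header(wkb)
    if type_code != 1:
        raise ValueError(f"not a 2D point, got type code {type_code}")
    endian = "<" if order == "little" else ">"
    return struct.unpack(endian + "2d", wkb[5:21])


# Hand-built little-endian point used as sample input.
sample = struct.pack("<BI2d", 1, 1, 10.0, 20.0)
point = wkb_point_xy(sample)
```

For anything beyond a quick sanity check, a full geometry library is the better choice. This only shows what the binary column contains.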
Path Types
All readers support:
- Single file: /path/to/file.tif
- Directory: /path/to/directory/
- Wildcard: /path/to/*.tif
- Cloud storage: s3://bucket/path, abfss://..., gs://bucket/path
- Unity Catalog Volumes: /Volumes/catalog/schema/volume/path
Common Options
Raster Options
| Option | Default | Description |
|---|---|---|
| driver | Auto-detect | GDAL driver name (e.g., "GTiff", "NetCDF") |
| sizeInMB | "16" | Split threshold for large files |
| filterRegex | ".*" | Filter files by regex pattern |
Vector Options
| Option | Default | Description |
|---|---|---|
| driverName | Auto-detect | OGR driver name (e.g., "ESRI Shapefile") |
| chunkSize | "10000" | Records per chunk for parallel reading |
| layerName | "" | Layer name for multi-layer formats |
| layerN | "0" | Layer index for multi-layer formats |
| asWKB | "true" | Output geometry as WKB (binary) vs WKT (text) |
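Option names are case-sensitive strings, so a typo such as "chunksize" can easily go unnoticed. One way to guard against that is to check names against the options listed in the tables above before handing them to the reader. This is a sketch under assumptions: reader_options is a hypothetical helper, and the allow-lists cover only the options documented on this page (individual readers may accept more).

```python
# Option names copied from the Raster Options and Vector Options tables.
DOCUMENTED_OPTIONS = {
    "raster": {"driver", "sizeInMB", "filterRegex"},
    "vector": {"driverName", "chunkSize", "layerName", "layerN", "asWKB"},
}


def reader_options(kind: str, **options) -> dict:
    """Validate option names for a reader kind and return them as strings."""
    if kind not in DOCUMENTED_OPTIONS:
        raise ValueError(f"kind must be 'raster' or 'vector', got {kind!r}")
    unknown = sorted(set(options) - DOCUMENTED_OPTIONS[kind])
    if unknown:
        raise ValueError(f"unknown {kind} option(s): {', '.join(unknown)}")
    result = {}
    for name, value in options.items():
        # Booleans become "true"/"false", matching the defaults shown above.
        result[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return result


vector_opts = reader_options("vector", chunkSize=50000, asWKB=True)
```

The resulting dictionary can be passed in one call with PySpark's DataFrameReader.options, for example spark.read.format("ogr").options(**vector_opts).load("/path").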
Reader Types Explained
Generic Readers (gdal, ogr):
- Work with any format supported by GDAL/OGR
- Require the driver or driverName option for non-standard formats
- Use when the format doesn't have a named reader
Named Readers (gtiff_gdal, shapefile_ogr, etc.):
- Preset the driver option for common formats
- Cleaner syntax, no driver option needed
- Recommended for supported formats
Example:
# Named reader (cleaner)
df = spark.read.format("gtiff_gdal").load("/path/to/file.tif")
# Generic reader (more verbose, same result)
df = spark.read.format("gdal").option("driver", "GTiff").load("/path/to/file.tif")
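When paths arrive from configuration or user input, the choice between a named and a generic reader can be made from the file extension. The mapping below is an illustrative assumption: the format names come from the tables in Available Readers, but the extension-to-reader pairing and the pick_reader helper are not part of GeoBrix.

```python
import os

# Assumed pairing of common file extensions with the named readers listed above.
NAMED_READERS = {
    ".tif": "gtiff_gdal",
    ".tiff": "gtiff_gdal",
    ".shp": "shapefile_ogr",
    ".geojson": "geojson_ogr",
    ".gpkg": "gpkg_ogr",
    ".gdb": "file_gdb_ogr",
}


def pick_reader(path: str, fallback: str) -> str:
    """Return a named reader for the path's extension, else the generic fallback.

    fallback should be "gdal" for rasters or "ogr" for vectors.
    """
    # Strip a trailing slash so directory-style paths such as "x.gdb/" still match.
    ext = os.path.splitext(path.rstrip("/"))[1].lower()
    return NAMED_READERS.get(ext, fallback)


reader_name = pick_reader("/path/to/raster.tif", fallback="gdal")
```

When the fallback is used, the driver or driverName option may still need to be set, as described for generic readers.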
Performance Tips
- Use an appropriate split size for rasters:
  df = spark.read.format("gdal").option("sizeInMB", "32").load("/path")
- Set the chunk size for vectors:
  df = spark.read.format("ogr").option("chunkSize", "50000").load("/path")
- Filter files with a regex:
  df = spark.read.format("gdal").option("filterRegex", ".*_2024.*\\.tif").load("/path")
- Write the data to a table after reading, to avoid reloading from the file each time (saveAsTable returns nothing, so its result is not assigned):
  spark.read.format("shapefile_ogr").load("/path").write.saveAsTable(...)
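A filterRegex pattern can be tried against a few sample paths locally before running a job. The sketch below uses Python's re module with full-match semantics. Both choices are assumptions: this page does not state whether GeoBrix matches the whole path or only part of it, and the reader's regex engine may differ from Python's in edge cases. The preview_filter helper and the sample paths are hypothetical.

```python
import re


def preview_filter(pattern: str, paths: list) -> list:
    """Return the paths that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.fullmatch(p)]


candidates = [
    "/data/scene_2024_01.tif",
    "/data/scene_2023_12.tif",
    "/data/scene_2024_01.tif.aux.xml",
]
# Same pattern string as in the filterRegex tip above.
kept = preview_filter(".*_2024.*\\.tif", candidates)
```

Under full-match semantics only the first path is kept. With partial-match semantics the .aux.xml sidecar would also pass, so anchoring the pattern explicitly (for example ending it with $) is a safer habit either way.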
Next Steps
- GDAL Reader - Generic raster reader
- GeoTIFF Reader - Named GeoTIFF reader
- OGR Reader - Generic vector reader
- Shapefile Reader - Named Shapefile reader
- GeoJSON Reader - Named GeoJSON reader
- GeoPackage Reader - Named GeoPackage reader
- FileGDB Reader - Named File Geodatabase reader