GDAL Command Line Integration
Learn how to complement GeoBrix with GDAL command-line utilities for preprocessing, format conversion, and specialized operations that prepare data for distributed processing.
Overview
GDAL provides powerful command-line utilities that can handle operations best done before or after distributed processing. These tools complement GeoBrix by:
- Preprocessing data before loading into Spark
- Format conversion and optimization
- Reprojection and warping operations
- Tiling large rasters for parallel processing
- Metadata extraction and manipulation
- Postprocessing results after Spark operations
Why Use GDAL CLI with GeoBrix?
Preprocessing Benefits
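Reprojecting and optimizing a raster once, up front, means Spark reads a file that is already in the target CRS and tiled for efficient access. The example below uses a TIF from sample-data (Volumes path); the commands are followed by their shell output: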
# 1. Reproject to common CRS
gdalwarp -t_srs EPSG:4326 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/reprojected.tif
# 2. Create Cloud-Optimized GeoTIFF from the reprojected file
gdal_translate -co TILED=YES -co COMPRESS=LZW -co COPY_SRC_OVERVIEWS=YES \
/tmp/reprojected.tif /tmp/output_cog.tif
# 3. Then load with GeoBrix (the path must be readable by the Spark cluster): spark.read.format("gdal").load("/tmp/output_cog.tif")
Step 1 — gdalwarp:
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.
Step 2 — gdal_translate:
Input file size is 10980, 10980
0...10...20...30...40...50...60...70...80...90...100 - done.
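The commands above only work where the GDAL binaries are installed. A minimal sketch of a pre-flight check, assuming the tools are expected on PATH (the helper name `missing_tools` is illustrative, not part of GeoBrix or GDAL):

```python
import shutil


def missing_tools(tools):
    """Return the tool names that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Tools used by the preprocessing commands in this section.
REQUIRED = ["gdalwarp", "gdal_translate"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Missing GDAL tools:", ", ".join(missing))
    else:
        print("All required GDAL tools are available.")
```

Running a check like this at the top of a notebook or job gives a clear message instead of a failure halfway through preprocessing.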
Complementary Workflows
GDAL CLI excels at:
- ✅ One-time transformations
- ✅ File format operations
- ✅ Creating VRT (Virtual Raster)
- ✅ Building overviews/pyramids
- ✅ Batch file operations
GeoBrix excels at:
- ✅ Distributed processing
- ✅ Per-row operations
- ✅ Spark DataFrame integration
- ✅ Columnar operations
Common GDAL Utilities
gdalinfo - Inspect Rasters
Get detailed raster information with the GDAL CLI. Example using a TIF from sample-data (Volumes path) and actual shell output:
gdalinfo /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif
Driver: GTiff/GeoTIFF
Files: /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif
Size is 10980, 10980
Coordinate System is:
PROJCS["WGS 84 / UTM zone 18N",
...
Origin = (-8239980.000000000000000,4960220.000000000000000)
Pixel Size = (10.000000000000000,-10.000000000000000)
Metadata:
...
Corner Coordinates:
Upper Left ( -8239980.000, 4960220.000) ( 74d15'56.10"W, 40d42'22.31"N)
Lower Right (-8129820.000, 4950320.000) ( 73d52'57.89"W, 40d38'32.56"N)
Other useful flags: gdalinfo -json (machine-readable output), gdalinfo -stats (compute band statistics), gdalinfo -checksum (compute a checksum per band).
Integration with GeoBrix: after inspecting, load the same path with spark.read.format("gdal").load(...).
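The -json flag makes the report easy to consume from Python. The sketch below pulls a few fields out of that JSON. The helper `summarize_gdalinfo` is hypothetical, and the sample document is hand-written to mirror the values shown above rather than captured from a real run.

```python
import json


def summarize_gdalinfo(json_text):
    """Extract a few fields from the text printed by `gdalinfo -json`."""
    info = json.loads(json_text)
    width, height = info["size"]
    # geoTransform is [origin_x, pixel_width, rot, origin_y, rot, pixel_height]
    gt = info.get("geoTransform")
    return {
        "driver": info.get("driverShortName"),
        "width": width,
        "height": height,
        "band_count": len(info.get("bands", [])),
        "pixel_size": (gt[1], gt[5]) if gt else None,
    }


# Hand-written sample (an assumption for illustration). On a machine with GDAL:
#   subprocess.run(["gdalinfo", "-json", path],
#                  capture_output=True, text=True, check=True).stdout
sample = json.dumps({
    "driverShortName": "GTiff",
    "size": [10980, 10980],
    "geoTransform": [-8239980.0, 10.0, 0.0, 4960220.0, 0.0, -10.0],
    "bands": [{"band": 1, "type": "UInt16"}],
})

print(summarize_gdalinfo(sample))
```

A summary like this can be collected for many files and used to decide which ones need reprojection or retiling before they are loaded into Spark.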
gdalwarp - Reproject and Warp
Reproject rasters before distributed processing. Example using a TIF from sample-data (Volumes path):
gdalwarp -t_srs EPSG:4326 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/reprojected.tif
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.
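By default gdalwarp picks the output resolution itself. To pin it instead, pass -tr with a pixel size in target units, which for EPSG:4326 means degrees. The sketch below converts a metric pixel size to approximate degrees. The constant of 111,320 m per degree is a rough spherical approximation, and the helper names are illustrative.

```python
import math

# Rough length of one degree at the equator, in meters (approximation).
METERS_PER_DEGREE = 111_320.0


def meters_to_degrees(pixel_m, lat_deg):
    """Approximate a pixel size in meters as (lon_res, lat_res) in degrees."""
    lat_res = pixel_m / METERS_PER_DEGREE
    # Degrees of longitude shrink with the cosine of the latitude.
    lon_res = pixel_m / (METERS_PER_DEGREE * math.cos(math.radians(lat_deg)))
    return lon_res, lat_res


def warp_command(src, dst, pixel_m, lat_deg, t_srs="EPSG:4326"):
    """Build a gdalwarp argument list with an explicit -tr resolution."""
    xres, yres = meters_to_degrees(pixel_m, lat_deg)
    return [
        "gdalwarp", "-t_srs", t_srs,
        "-tr", f"{xres:.8f}", f"{yres:.8f}",
        "-r", "bilinear",
        src, dst,
    ]


# 10 m pixels near New York City (about 40.7 degrees north).
print(warp_command("in.tif", "out.tif", 10, 40.7))
```

The resulting list can be passed to `subprocess.run(..., check=True)`. Treat the numbers as a starting point, not as survey-grade values.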
gdal_translate - Convert and Optimize
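Convert formats and create optimized files. Example: Cloud-Optimized GeoTIFF using a TIF from sample-data (Volumes path). COPY_SRC_OVERVIEWS=YES only has an effect when the source already contains overviews (build them with gdaladdo first); GDAL 3.1 and later also provide a dedicated COG driver (-of COG):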
gdal_translate -co TILED=YES -co COMPRESS=LZW -co COPY_SRC_OVERVIEWS=YES \
-co BLOCKXSIZE=512 -co BLOCKYSIZE=512 \
/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/output_cog.tif
Input file size is 10980, 10980
0...10...20...30...40...50...60...70...80...90...100 - done.
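One way to confirm that the creation options took effect is to look at the block layout and overviews that `gdalinfo -json` reports per band. The sketch below assumes `info` is that JSON already parsed into a dictionary. The "tiled" test is a heuristic (striped GeoTIFFs use blocks as wide as the raster), and `tiling_report` is a hypothetical helper.

```python
def tiling_report(info):
    """Summarize block layout and overviews from parsed `gdalinfo -json` output."""
    width, _height = info["size"]
    report = []
    for band in info.get("bands", []):
        # Fall back to a single-row strip when no block size is reported.
        block_w, block_h = band.get("block", [width, 1])
        report.append({
            "band": band.get("band"),
            "block": (block_w, block_h),
            # Heuristic: strips span the full raster width, tiles do not.
            "tiled": block_w < width,
            "overview_count": len(band.get("overviews", [])),
        })
    return report


# Hand-written examples (assumptions for illustration, not real output).
tiled_info = {
    "size": [10980, 10980],
    "bands": [{
        "band": 1,
        "block": [512, 512],
        "overviews": [{"size": [5490, 5490]}, {"size": [2745, 2745]}],
    }],
}
striped_info = {"size": [10980, 10980], "bands": [{"band": 1, "block": [10980, 1]}]}

print(tiling_report(tiled_info))
print(tiling_report(striped_info))
```

This does not validate full COG compliance. It only flags the two properties that matter most for partial reads: internal tiling and the presence of overviews.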
gdal_merge - Mosaic Multiple Rasters
Merge rasters before processing. Example using sample-data (Volumes path):
gdal_merge.py -co COMPRESS=LZW -co TILED=YES -o /tmp/merged.tif /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/*.tif
0...10...20...30...40...50...60...70...80...90...100 - done.
gdalbuildvrt - Virtual Raster
Create virtual raster from multiple files. Example using sample-data (Volumes path):
gdalbuildvrt -resolution highest /tmp/mosaic.vrt /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/*.tif
0...10...20...30...40...50...60...70...80...90...100 - done.
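With many inputs, a shell glob can exceed the command-line length limit. gdalbuildvrt also accepts a text file of inputs through -input_file_list. The sketch below collects the files in Python, writes that list, and returns the command. The helper name and the file locations are illustrative.

```python
import glob
import os


def vrt_command(input_dir, vrt_path, list_path, pattern="*.tif"):
    """Write an input file list and build the gdalbuildvrt argument list."""
    files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {input_dir}")
    # One input path per line, as expected by -input_file_list.
    with open(list_path, "w") as fp:
        fp.write("\n".join(files) + "\n")
    return [
        "gdalbuildvrt", "-resolution", "highest",
        "-input_file_list", list_path,
        vrt_path,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Sorting the inputs keeps the VRT source order stable between runs.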
gdaldem - Terrain Analysis
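Derive terrain products such as hillshade, slope, and aspect from a DEM. Example (hillshade) using sample-data elevation (Volumes path); -z 2 doubles the vertical exaggeration. For a DEM in geographic coordinates with elevations in meters, the GDAL documentation suggests also passing -s 111120: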
gdaldem hillshade -z 2 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/elevation/srtm_n40w073.tif /tmp/hillshade.tif
0...10...20...30...40...50...60...70...80...90...100 - done.
gdal_calc - Raster Calculator
Perform band math before processing. Example (threshold) using sample-data (Volumes path); --type=Int16 is set so that the negative NoData value fits the output data type:
gdal_calc.py -A /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif \
--type=Int16 --outfile=/tmp/threshold.tif --calc="A>100" --NoDataValue=-9999
0...10...20...30...40...50...60...70...80...90...100 - done.
ogr2ogr - Vector Operations
Preprocess vector data. Example (reproject) using sample-data (Volumes path):
ogr2ogr -progress -t_srs EPSG:4326 /tmp/reprojected.geojson /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson
0...10...20...30...40...50...60...70...80...90...100 - done.
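A quick sanity check after reprojecting to EPSG:4326 is that every coordinate falls inside the valid longitude/latitude range. The sketch below walks a GeoJSON FeatureCollection that has already been loaded with `json.load`. The helper names are illustrative, and GeometryCollection geometries are not handled.

```python
def iter_positions(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def looks_like_lonlat(feature_collection):
    """Return True if all coordinates fit within longitude/latitude bounds."""
    for feature in feature_collection.get("features", []):
        geometry = feature.get("geometry") or {}
        for x, y in iter_positions(geometry.get("coordinates", [])):
            if not (-180 <= x <= 180 and -90 <= y <= 90):
                return False
    return True


# Hand-written examples (assumptions for illustration).
lonlat = {"type": "FeatureCollection", "features": [{
    "type": "Feature", "properties": {},
    "geometry": {"type": "Polygon", "coordinates": [[
        [-74.26, 40.49], [-73.70, 40.49], [-73.70, 40.92], [-74.26, 40.49],
    ]]},
}]}
projected = {"type": "FeatureCollection", "features": [{
    "type": "Feature", "properties": {},
    "geometry": {"type": "Point", "coordinates": [583960.0, 4507523.0]},
}]}

print(looks_like_lonlat(lonlat), looks_like_lonlat(projected))
```

Passing the check does not prove the CRS is correct, but failing it is a strong sign that the data is still in a projected coordinate system.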
Notional Preprocessing Workflow
This workflow is notional: it demonstrates how custom preprocessing code can be chained together. You do not need every step, because GeoBrix offers distributed functions that overlap with some of them.
Scenario: Satellite Image Processing
Workflow using sample-data Volumes path (same pattern as prior examples):
#!/bin/bash
# preprocessing.sh: uses sample-data Volumes path
set -euo pipefail  # stop at the first failing command
INPUT_DIR="/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2"
OUTPUT_DIR="/tmp/processed"
mkdir -p "$OUTPUT_DIR"
echo "Step 1: Reproject to WGS84"
for f in "$INPUT_DIR"/*.tif; do
  base=$(basename "$f" .tif)
  gdalwarp -t_srs EPSG:4326 -r bilinear \
    -co COMPRESS=LZW -co TILED=YES \
    "$f" "$OUTPUT_DIR/${base}_wgs84.tif"
done
echo "Step 2: Create overviews"
for f in "$OUTPUT_DIR"/*_wgs84.tif; do
  gdaladdo -r average "$f" 2 4 8 16
done
echo "Step 3: Create VRT catalog"
gdalbuildvrt "$OUTPUT_DIR/catalog.vrt" "$OUTPUT_DIR"/*_wgs84.tif
echo "Preprocessing complete!"
Step 1: Reproject to WGS84
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.
Step 2: Create overviews
0...10...20...30...40...50...60...70...80...90...100 - done.
Step 3: Create VRT catalog
0...10...20...30...40...50...60...70...80...90...100 - done.
Preprocessing complete!
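The same three steps can be driven from Python, which makes it easy to review the plan before anything runs. The sketch below builds the argument lists for each step and takes the runner as a parameter. The function names are illustrative, and the flags are the ones used in the shell script.

```python
import os
import subprocess


def plan_preprocessing(inputs, output_dir, t_srs="EPSG:4326"):
    """Return the gdalwarp, gdaladdo, and gdalbuildvrt commands for the inputs."""
    commands, warped = [], []
    for src in inputs:
        base = os.path.splitext(os.path.basename(src))[0]
        dst = os.path.join(output_dir, f"{base}_wgs84.tif")
        commands.append([
            "gdalwarp", "-t_srs", t_srs, "-r", "bilinear",
            "-co", "COMPRESS=LZW", "-co", "TILED=YES", src, dst,
        ])
        warped.append(dst)
    for dst in warped:
        commands.append(["gdaladdo", "-r", "average", dst, "2", "4", "8", "16"])
    commands.append(["gdalbuildvrt", os.path.join(output_dir, "catalog.vrt"), *warped])
    return commands


def run_plan(commands, runner):
    """Execute each command in order with the supplied runner callable."""
    for cmd in commands:
        runner(cmd)


def run_with_gdal(cmd):
    """Runner that executes a command for real (requires GDAL on PATH)."""
    subprocess.run(cmd, check=True)


# Dry run: print the plan instead of executing it.
plan = plan_preprocessing(["/data/a.tif", "/data/b.tif"], "/tmp/processed")
run_plan(plan, lambda cmd: print(" ".join(cmd)))
```

Swapping the printing lambda for `run_with_gdal` executes the plan. Keeping the runner injectable also allows the planning logic to be unit-tested without GDAL installed.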
GDAL CLI in Spark UDFs
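Combine the GDAL CLI with custom UDFs when an operation is only available as a command-line tool. The UDF runs on the executors, so the GDAL binaries must be installed on every worker node. The example uses the sample-data Volumes path; the UDF definition is followed by the execution results.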
import os
import subprocess
import tempfile

from pyspark.sql import functions as f
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import BinaryType

@udf(BinaryType())
def apply_gdal_operation(tile_binary, operation):
    """Apply GDAL CLI operation via UDF (e.g. hillshade or slope)."""
    # Fail early instead of trying to read an output file that was never written
    if operation not in ("hillshade", "slope"):
        raise ValueError(f"Unsupported operation: {operation}")
    with tempfile.TemporaryDirectory() as tmpdir:
        input_path = os.path.join(tmpdir, "input.tif")
        output_path = os.path.join(tmpdir, "output.tif")
        # Write the tile to a temporary file so the CLI can read it
        with open(input_path, "wb") as fp:
            fp.write(bytes(tile_binary))
        # gdaldem takes the mode (hillshade or slope) as its first argument
        subprocess.run(["gdaldem", operation, input_path, output_path], check=True, capture_output=True)
        with open(output_path, "rb") as fp:
            return fp.read()

# Sample-data Volumes path (same as prior CLI examples)
# `spark` is the active SparkSession (predefined in notebooks)
raster_path = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif"
rasters = spark.read.format("gdal").load(raster_path)
processed = rasters.withColumn(
    "hillshade",
    apply_gdal_operation("tile", lit("hillshade"))
)
result = processed.select("path", f.length("hillshade").alias("hillshade_bytes"))
result.limit(2).show(truncate=50)
+--------------------------------------------------+----------------+
|path |hillshade_bytes |
+--------------------------------------------------+----------------+
|/Volumes/.../nyc/sentinel2/nyc_sentinel2_red.tif |1234567 |
|... |... |
+--------------------------------------------------+----------------+
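The file round trip inside such a UDF can be tested without Spark or GDAL by moving it into a plain function with an injectable runner. The sketch below is one way to do that. `gdaldem_bytes` and `fake_runner` are hypothetical names, and the fake runner simply copies the input to the output.

```python
import os
import subprocess
import tempfile

ALLOWED_MODES = {"hillshade", "slope"}


def gdaldem_bytes(tile_bytes, mode, runner=subprocess.run):
    """Write bytes to a temp file, run gdaldem on it, and return the output bytes."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"Unsupported gdaldem mode: {mode}")
    with tempfile.TemporaryDirectory() as tmpdir:
        src = os.path.join(tmpdir, "input.tif")
        dst = os.path.join(tmpdir, "output.tif")
        with open(src, "wb") as fp:
            fp.write(tile_bytes)
        runner(["gdaldem", mode, src, dst], check=True, capture_output=True)
        with open(dst, "rb") as fp:
            return fp.read()


def fake_runner(cmd, **kwargs):
    """Stand-in for gdaldem: copies input to output so no GDAL is needed."""
    src, dst = cmd[-2], cmd[-1]
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())


# With the fake runner, the bytes come back unchanged.
print(gdaldem_bytes(b"not-a-real-tiff", "hillshade", runner=fake_runner))
```

A UDF can then wrap the plain function with the default runner, while unit tests pass the fake one to exercise the temporary-file handling and the mode validation.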
Next Steps
- Custom UDFs - Use GDAL results in UDFs
- Library Integration - Combine with rasterio, xarray
- GDAL Documentation - Full GDAL CLI reference