
GDAL Command Line Integration

Learn how to complement GeoBrix with GDAL command-line utilities for preprocessing, format conversion, and specialized operations that prepare data for distributed processing.

Overview

GDAL provides powerful command-line utilities that can handle operations best done before or after distributed processing. These tools complement GeoBrix by:

  • Preprocessing data before loading into Spark
  • Format conversion and optimization
  • Reprojection and warping operations
  • Tiling large rasters for parallel processing
  • Metadata extraction and manipulation
  • Postprocessing results after Spark operations

Why Use GDAL CLI with GeoBrix?

Preprocessing Benefits

The examples below run the GDAL CLI on a GeoTIFF from the sample data (Volumes path); each command is shown with its shell output:

Preprocessing: gdalwarp and gdal_translate on sample-data (Volumes path)
# 1. Reproject to common CRS
gdalwarp -t_srs EPSG:4326 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/reprojected.tif

# 2. Create Cloud-Optimized GeoTIFF
gdal_translate -co TILED=YES -co COMPRESS=LZW -co COPY_SRC_OVERVIEWS=YES \
/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/output_cog.tif

# 3. Then load with GeoBrix: spark.read.format("gdal").load("/tmp/output_cog.tif")
Example output
Step 1 — gdalwarp:
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.

Step 2 — gdal_translate:
Input file size is 10980, 10980
0...10...20...30...40...50...60...70...80...90...100 - done.
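The two-step preprocessing above can also be driven from Python (e.g. a notebook cell) via `subprocess`. The sketch below is illustrative, not part of GeoBrix: the helper names (`reproject_cmd`, `cog_cmd`, `preprocess`) are hypothetical, and it assumes `gdalwarp` and `gdal_translate` are on the PATH.

```python
import subprocess

def reproject_cmd(src, dst, t_srs="EPSG:4326"):
    """Build the gdalwarp command for step 1 (reprojection)."""
    return ["gdalwarp", "-t_srs", t_srs, src, dst]

def cog_cmd(src, dst):
    """Build the gdal_translate command for step 2 (tiled, LZW-compressed GeoTIFF)."""
    return [
        "gdal_translate",
        "-co", "TILED=YES",
        "-co", "COMPRESS=LZW",
        "-co", "COPY_SRC_OVERVIEWS=YES",
        src, dst,
    ]

def preprocess(src, workdir="/tmp"):
    """Run both steps; raises CalledProcessError if a GDAL command fails."""
    reprojected = f"{workdir}/reprojected.tif"
    cog = f"{workdir}/output_cog.tif"
    subprocess.run(reproject_cmd(src, reprojected), check=True)
    subprocess.run(cog_cmd(reprojected, cog), check=True)
    return cog
```

Building the command lists separately from running them makes the pipeline easy to log and unit-test before touching any real data.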

Complementary Workflows

GDAL CLI excels at:

  • ✅ One-time transformations
  • ✅ File format operations
  • ✅ Creating VRT (Virtual Raster)
  • ✅ Building overviews/pyramids
  • ✅ Batch file operations

GeoBrix excels at:

  • ✅ Distributed processing
  • ✅ Per-row operations
  • ✅ Spark DataFrame integration
  • ✅ Columnar operations

Common GDAL Utilities

gdalinfo - Inspect Rasters

Get detailed raster information with the GDAL CLI. Example using a TIF from sample-data (Volumes path) and actual shell output:

gdalinfo on sample-data (Volumes path)
gdalinfo /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif
Example output
Driver: GTiff/GeoTIFF
Files: /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif
Size is 10980, 10980
Coordinate System is:
PROJCS["WGS 84 / UTM zone 18N",
...
Origin = (-8239980.000000000000000,4960220.000000000000000)
Pixel Size = (10.000000000000000,-10.000000000000000)
Metadata:
...
Corner Coordinates:
Upper Left ( -8239980.000, 4960220.000) ( 74d15'56.10"W, 40d42'22.31"N)
Lower Right (-8129820.000, 4950320.000) ( 73d52'57.89"W, 40d38'32.56"N)

Other useful flags: gdalinfo -json, gdalinfo -stats, gdalinfo -checksum. Integration with GeoBrix: after inspecting, load the same path with spark.read.format("gdal").load(...).
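Because `gdalinfo -json` emits machine-readable metadata, you can parse it with the standard library before deciding how to load a raster. The sketch below uses an abbreviated sample of the JSON output (real output has many more keys); the `raster_summary` helper is hypothetical.

```python
import json

# Abbreviated, illustrative gdalinfo -json output.
sample = json.loads("""
{
  "driverShortName": "GTiff",
  "size": [10980, 10980],
  "geoTransform": [-8239980.0, 10.0, 0.0, 4960220.0, 0.0, -10.0]
}
""")

def raster_summary(info):
    """Extract size and pixel resolution from parsed gdalinfo -json output."""
    width, height = info["size"]
    gt = info["geoTransform"]  # affine geotransform: origin + pixel sizes
    return {
        "driver": info["driverShortName"],
        "width": width,
        "height": height,
        "pixel_size": (gt[1], gt[5]),  # x and y resolution
    }

summary = raster_summary(sample)
```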


gdalwarp - Reproject and Warp

Reproject rasters before distributed processing. Example using a TIF from sample-data (Volumes path):

gdalwarp on sample-data (Volumes path)
gdalwarp -t_srs EPSG:4326 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/reprojected.tif
Example output
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.

gdal_translate - Convert and Optimize

Convert formats and create optimized files. Example: Cloud-Optimized GeoTIFF using a TIF from sample-data (Volumes path):

gdal_translate COG on sample-data (Volumes path)
gdal_translate -co TILED=YES -co COMPRESS=LZW -co COPY_SRC_OVERVIEWS=YES \
-co BLOCKXSIZE=512 -co BLOCKYSIZE=512 \
/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif /tmp/output_cog.tif
Example output
Input file size is 10980, 10980
0...10...20...30...40...50...60...70...80...90...100 - done.
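All GDAL creation options follow the same `-co KEY=VALUE` pattern, so when scripting conversions it can be cleaner to keep them in a dict. The `creation_flags` helper below is a hypothetical convenience, not part of GDAL:

```python
def creation_flags(options):
    """Expand a dict of GDAL creation options into -co KEY=VALUE flags."""
    flags = []
    for key, value in options.items():
        flags += ["-co", f"{key}={value}"]
    return flags

cog_options = {
    "TILED": "YES",
    "COMPRESS": "LZW",
    "COPY_SRC_OVERVIEWS": "YES",
    "BLOCKXSIZE": 512,
    "BLOCKYSIZE": 512,
}
# Same command as above, assembled programmatically (paths are placeholders).
cmd = ["gdal_translate"] + creation_flags(cog_options) + ["input.tif", "/tmp/output_cog.tif"]
```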

gdal_merge - Mosaic Multiple Rasters

Merge rasters before processing. Example using sample-data (Volumes path):

gdal_merge on sample-data (Volumes path)
gdal_merge.py -co COMPRESS=LZW -co TILED=YES -o /tmp/merged.tif /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/*.tif
Example output
0...10...20...30...40...50...60...70...80...90...100 - done.

gdalbuildvrt - Virtual Raster

Create virtual raster from multiple files. Example using sample-data (Volumes path):

gdalbuildvrt on sample-data (Volumes path)
gdalbuildvrt -resolution highest /tmp/mosaic.vrt /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/*.tif
Example output
0...10...20...30...40...50...60...70...80...90...100 - done.
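A `.vrt` file is plain XML listing its source rasters, so you can inspect a mosaic without GDAL installed. The snippet below is a minimal, illustrative VRT matching the structure `gdalbuildvrt` produces (`VRTDataset` → `VRTRasterBand` → `SimpleSource` → `SourceFilename`); `vrt_sources` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Minimal, illustrative VRT in the structure gdalbuildvrt emits.
vrt_xml = """
<VRTDataset rasterXSize="10980" rasterYSize="10980">
  <VRTRasterBand dataType="UInt16" band="1">
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/tmp/a.tif</SourceFilename>
    </SimpleSource>
    <SimpleSource>
      <SourceFilename relativeToVRT="0">/tmp/b.tif</SourceFilename>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
"""

def vrt_sources(xml_text):
    """List the source rasters referenced by a VRT mosaic."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("SourceFilename")]
```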

gdaldem - Terrain Analysis

Generate DEMs and terrain products. Example (hillshade) using sample-data elevation (Volumes path):

gdaldem hillshade on sample-data (Volumes path)
gdaldem hillshade -z 2 /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/elevation/srtm_n40w073.tif /tmp/hillshade.tif
Example output
0...10...20...30...40...50...60...70...80...90...100 - done.

gdal_calc - Raster Calculator

Perform band math before processing. Example (threshold) using sample-data (Volumes path):

gdal_calc on sample-data (Volumes path)
gdal_calc.py -A /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif \
--outfile=/tmp/threshold.tif --calc="A>100" --NoDataValue=-9999
Example output
0...10...20...30...40...50...60...70...80...90...100 - done.
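The `--calc="A>100"` expression is evaluated per pixel, writing 1 where the condition holds and 0 otherwise (`--NoDataValue` only sets the output nodata metadata). A plain-Python sketch of that per-pixel logic on a toy row of values:

```python
def threshold(pixels, cutoff=100):
    """Per-pixel equivalent of --calc="A>100": 1 where the condition holds, else 0."""
    return [int(v > cutoff) for v in pixels]

result = threshold([12, 150, 99, 101])  # → [0, 1, 0, 1]
```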

ogr2ogr - Vector Operations

Preprocess vector data. Example (reproject) using sample-data (Volumes path):

ogr2ogr on sample-data (Volumes path)
ogr2ogr -t_srs EPSG:4326 /tmp/reprojected.geojson /Volumes/main/default/geobrix_samples/geobrix-examples/nyc/boroughs/nyc_boroughs.geojson
Example output
0...10...20...30...40...50...60...70...80...90...100 - done.
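After an ogr2ogr conversion it is worth sanity-checking the output GeoJSON: the feature count and geometry types should survive reprojection unchanged. The sketch below uses a toy FeatureCollection standing in for `/tmp/reprojected.geojson`; `feature_summary` is a hypothetical helper.

```python
import json

# Toy FeatureCollection standing in for the converted file.
doc = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"boro_name": "Manhattan"},
     "geometry": {"type": "Polygon", "coordinates": [[[0,0],[1,0],[1,1],[0,0]]]}}
  ]
}
""")

def feature_summary(fc):
    """Count features and collect geometry types in a GeoJSON FeatureCollection."""
    types = {feat["geometry"]["type"] for feat in fc["features"]}
    return len(fc["features"]), sorted(types)
```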

Notional Preprocessing Workflow

This workflow only demonstrates how to wire in custom preprocessing code; you will rarely need every step, since GeoBrix offers overlapping distributed functions.

Scenario: Satellite Image Processing

Workflow using sample-data Volumes path (same pattern as prior examples):

Satellite preprocessing script (sample-data Volumes path)
#!/bin/bash
# preprocessing.sh — uses sample-data Volumes path

INPUT_DIR="/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2"
OUTPUT_DIR="/tmp/processed"

mkdir -p "$OUTPUT_DIR"

echo "Step 1: Reproject to WGS84"
for f in "$INPUT_DIR"/*.tif; do
  base=$(basename "$f" .tif)
  gdalwarp -t_srs EPSG:4326 -r bilinear \
    -co COMPRESS=LZW -co TILED=YES \
    "$f" "$OUTPUT_DIR/${base}_wgs84.tif"
done

echo "Step 2: Create overviews"
for f in "$OUTPUT_DIR"/*_wgs84.tif; do
  gdaladdo -r average "$f" 2 4 8 16
done

echo "Step 3: Create VRT catalog"
gdalbuildvrt "$OUTPUT_DIR/catalog.vrt" "$OUTPUT_DIR"/*_wgs84.tif

echo "Preprocessing complete!"
Example output
Step 1: Reproject to WGS84
Creating output file that is 10980P x 10980L.
Processing input file ...
0...10...20...30...40...50...60...70...80...90...100 - done.

Step 2: Create overviews
0...10...20...30...40...50...60...70...80...90...100 - done.

Step 3: Create VRT catalog
0...10...20...30...40...50...60...70...80...90...100 - done.

Preprocessing complete!
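The same workflow can be driven from Python (e.g. a notebook cell) by building the per-file commands with pathlib and running them with subprocess. This is a sketch under the same assumptions as the bash script (`gdalwarp`, `gdaladdo`, and `gdalbuildvrt` on the PATH); the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

def build_commands(input_dir, output_dir):
    """Mirror the bash workflow: per-file gdalwarp, then gdaladdo, then one gdalbuildvrt."""
    out = Path(output_dir)
    commands = []
    warped = []
    for src in sorted(Path(input_dir).glob("*.tif")):
        dst = out / f"{src.stem}_wgs84.tif"
        warped.append(str(dst))
        commands.append(["gdalwarp", "-t_srs", "EPSG:4326", "-r", "bilinear",
                         "-co", "COMPRESS=LZW", "-co", "TILED=YES",
                         str(src), str(dst)])
    for dst in warped:
        commands.append(["gdaladdo", "-r", "average", dst, "2", "4", "8", "16"])
    commands.append(["gdalbuildvrt", str(out / "catalog.vrt")] + warped)
    return commands

def run_all(commands):
    """Execute the commands in order, failing fast on the first error."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```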

GDAL CLI in Spark UDFs

The GDAL CLI can also be invoked from custom Spark UDFs. The example below uses the sample-data Volumes path and shows the UDF definition followed by the execution results.

import os
import subprocess
import tempfile

from pyspark.sql import functions as f
from pyspark.sql.functions import lit, udf
from pyspark.sql.types import BinaryType

@udf(BinaryType())
def apply_gdal_operation(tile_binary, operation):
    """Apply a GDAL CLI operation (e.g. hillshade or slope) to a raster tile."""
    with tempfile.TemporaryDirectory() as tmpdir:
        input_path = os.path.join(tmpdir, "input.tif")
        output_path = os.path.join(tmpdir, "output.tif")
        # Write the tile bytes to a temp file so the CLI can read it
        with open(input_path, "wb") as fp:
            fp.write(bytes(tile_binary))
        if operation == "hillshade":
            subprocess.run(["gdaldem", "hillshade", input_path, output_path],
                           check=True, capture_output=True)
        elif operation == "slope":
            subprocess.run(["gdaldem", "slope", input_path, output_path],
                           check=True, capture_output=True)
        with open(output_path, "rb") as fp:
            return fp.read()

# Sample-data Volumes path (same as prior CLI examples)
raster_path = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/sentinel2/nyc_sentinel2_red.tif"
rasters = spark.read.format("gdal").load(raster_path)
processed = rasters.withColumn(
    "hillshade",
    apply_gdal_operation("tile", lit("hillshade"))
)
result = processed.select("path", f.length("hillshade").alias("hillshade_bytes"))
result.limit(2).show(truncate=50)
Example output
+--------------------------------------------------+----------------+
|path |hillshade_bytes |
+--------------------------------------------------+----------------+
|/Volumes/.../nyc/sentinel2/nyc_sentinel2_red.tif |1234567 |
|... |... |
+--------------------------------------------------+----------------+

Next Steps