
Sample Data for Examples

This page provides copy-paste code to download sample datasets for use with GeoBrix examples.

Quick Start

Copy any code block below into a Databricks notebook cell and run it. The data is downloaded to a Unity Catalog Volume, where the examples can read it directly.


Overview

GeoBrix provides two curated sample data bundles with geographically coherent datasets for NYC and London. All datasets within each region cover the same area, so you can run realistic workflows (clip rasters to boundaries, zonal statistics, multi-scale spatial joins).

Essential Bundle (~355MB) — Get Started Quickly

  • NYC: Taxi Zones, Boroughs, Sentinel-2 imagery, SRTM elevation
  • London: Postcodes, Sentinel-2 imagery, SRTM elevation
  • Formats: GeoJSON, GeoTIFF (including elevation DEMs)
  • Use for: Learning GeoBrix basics, RasterX operations, GridX BNG examples, basic spatial analysis

Complete Bundle (~795MB) — Full Format Coverage

  • Everything in Essential, plus:
    • NYC Neighborhoods, Parks, Subway, HRRR weather (NetCDF/GRIB2), extra SRTM tile (NYC East), GeoPackage
    • London Boroughs
  • Formats: All of the above + Shapefile, GeoPackage. FileGDB can be created optionally via the Vector Data script.
  • Use for: Multi-scale analysis, all GeoBrix reader formats, advanced workflows, production-like examples

Which to choose? Use Essential for learning and for most of the documentation examples; use Complete if you need Shapefile, NetCDF, GeoPackage, or FileGDB data, or want the full set of datasets.
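
The choice can also be expressed as a small lookup. The sketch below is illustrative only: `choose_bundle` and the two format sets are hypothetical names, not part of GeoBrix, and the format lists are simply transcribed from the bundle descriptions above.

```python
# Hypothetical helper (not a GeoBrix API): pick a bundle from the formats you need.
# Format lists are transcribed from the bundle descriptions on this page.
ESSENTIAL_FORMATS = {"geojson", "geotiff"}
COMPLETE_ONLY_FORMATS = {"shapefile", "geopackage", "netcdf", "grib2", "filegdb"}


def choose_bundle(needed_formats):
    """Return "essential" or "complete" for an iterable of format names."""
    needed = {name.lower() for name in needed_formats}
    unknown = needed - ESSENTIAL_FORMATS - COMPLETE_ONLY_FORMATS
    if unknown:
        raise ValueError(f"formats not covered by either bundle: {sorted(unknown)}")
    # Any format outside the Essential set means the Complete bundle is required.
    return "complete" if needed & COMPLETE_ONLY_FORMATS else "essential"


print(choose_bundle(["GeoJSON", "GeoTIFF"]))
print(choose_bundle(["GeoJSON", "NetCDF"]))
```

Note that FileGDB is listed under Complete here only because the Vector Data script that creates it belongs to that workflow; it is not downloaded by default.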

Next: Setup — Configure storage location and run the Essential or Complete bundle scripts.


Geographic Coherence

All sample data is organized around two coherent geographic regions, enabling realistic workflows:

Regional Organization

Data is organized by region with parent folders:

  • nyc/ - All NYC datasets (vector, raster, weather, format examples)
  • london/ - All London datasets (vector, raster)

This structure makes it easy to find related datasets and understand geographic scope.
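
Because every dataset sits under a region folder, file paths can be built in one place. This is a minimal sketch: `sample_file` is a hypothetical helper, and the root path is the one used by the examples on this page, so adjust it if you configured a different Volume.

```python
import posixpath

# Root used by the examples on this page; change it to match your own Volume.
SAMPLE_ROOT = "/Volumes/main/default/geobrix_samples/geobrix-examples"
REGIONS = ("nyc", "london")


def sample_file(region, dataset, filename, root=SAMPLE_ROOT):
    """Build <root>/<region>/<dataset>/<filename> (hypothetical helper)."""
    if region not in REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {REGIONS}")
    # Volume paths always use forward slashes, so posixpath is used explicitly.
    return posixpath.join(root, region, dataset, filename)


print(sample_file("nyc", "boroughs", "nyc_boroughs.geojson"))
print(sample_file("london", "postcodes", "london_postcodes.geojson"))
```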

NYC Region (Primary)

Vector layers:

  • Taxi Zones (small areas)
  • Boroughs (large areas)
  • Neighborhoods (medium areas - Complete bundle)
  • Parks, Subway stations (Complete bundle)

Raster layers:

  • Sentinel-2 imagery (10m resolution)
  • SRTM elevation (30m resolution)
  • HRRR weather data (3km resolution, CONUS-only - Complete bundle)
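
The three resolutions differ by large factors, which matters when layers are combined. The following back-of-the-envelope sketch uses only the nominal resolutions listed above and assumes square, aligned cells; real grids use different projections, so treat the counts as approximate. `pixels_per_cell` is a hypothetical helper.

```python
# Nominal ground resolutions in metres, as listed on this page.
RESOLUTION_M = {"sentinel2": 10, "srtm": 30, "hrrr": 3000}


def pixels_per_cell(fine, coarse):
    """Approximate number of `fine` pixels covering one `coarse` cell."""
    ratio, remainder = divmod(RESOLUTION_M[coarse], RESOLUTION_M[fine])
    if ratio == 0 or remainder:
        raise ValueError("coarse resolution must be a whole multiple of fine")
    return ratio * ratio  # cells are assumed square and aligned


print(pixels_per_cell("sentinel2", "srtm"))  # 9
print(pixels_per_cell("srtm", "hrrr"))       # 10000
print(pixels_per_cell("sentinel2", "hrrr"))  # 90000
```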

Enables workflows like:

# Zonal statistics: average elevation per borough
# Assumed import path for the RasterX functions, mirroring the gridx.bng import used on this page
from databricks.labs.gbx.rasterx import functions as rx

sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
boroughs = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/boroughs/nyc_boroughs.geojson")
elevation = spark.read.format("gdal").load(f"{sample_path}/nyc/elevation/srtm_n40w074.tif")

# Spatial join and compute statistics
# Schematic join: plain PySpark reads "spatial_intersect" as a column name, so substitute a real spatial predicate
borough_elevation = (
    boroughs.join(elevation, "spatial_intersect")
    .groupBy("boro_name")
    .agg(rx.rst_avg("tile").alias("avg_elevation"))
)
borough_elevation.show()

Example output:

+----------+-------------+
|boro_name |avg_elevation|
+----------+-------------+
|Manhattan |45.2         |
|Bronx     |38.1         |
|Brooklyn  |22.3         |
|Queens    |28.7         |
|Staten Is.|42.0         |
+----------+-------------+
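
To make the idea of zonal statistics concrete without Spark, here is a toy version in plain Python. It is a sketch of the concept only, not how GeoBrix computes it: the grid values and zone labels are made up, and `zonal_mean` is a hypothetical name.

```python
from collections import defaultdict

# Toy elevation raster (metres) and a same-shaped grid of zone labels.
elevation = [
    [10.0, 12.0, 30.0],
    [11.0, 13.0, 32.0],
    [50.0, 52.0, 34.0],
]
zones = [
    ["A", "A", "B"],
    ["A", "A", "B"],
    ["C", "C", "B"],
]


def zonal_mean(values, labels):
    """Average the raster values that fall inside each labelled zone."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value_row, label_row in zip(values, labels):
        for value, zone in zip(value_row, label_row):
            if value is None:  # treat None as a no-data pixel
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


print(zonal_mean(elevation, zones))  # {'A': 11.5, 'B': 32.0, 'C': 51.0}
```

In the Spark example the zone labels come from the borough polygons rather than a label grid, but the aggregation step is the same per-zone average.
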
# Clip satellite imagery to a taxi zone
# Assumed import path for the RasterX functions, mirroring the gridx.bng import used on this page
from databricks.labs.gbx.rasterx import functions as rx

sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
zones = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/taxi-zones/nyc_taxi_zones.geojson")
sentinel = spark.read.format("gdal").load(f"{sample_path}/nyc/sentinel2/nyc_sentinel2_red.tif")

# Clip raster to a specific zone
jfk_zone = zones.filter("LocationID = 132")  # JFK Airport zone
jfk_imagery = sentinel.withColumn(
    "clipped",
    rx.rst_clip("tile", jfk_zone.geom)
)
jfk_imagery.select("path", "clipped").show(1, truncate=False)

Example output:

+--------------------+------------------+
|path                |clipped           |
+--------------------+------------------+
|.../nyc_sentinel2...|[RasterTile(...)] |
+--------------------+------------------+

# Multi-scale analysis: taxi zones within boroughs
sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
zones = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/taxi-zones/nyc_taxi_zones.geojson")
boroughs = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/boroughs/nyc_boroughs.geojson")
# Neighborhoods are part of the Complete bundle
neighborhoods = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/neighborhoods/nyc_nta.geojson")

# Hierarchical spatial joins
zones_with_context = (
    zones
    .join(boroughs, "spatial_within")
    .join(neighborhoods, "spatial_within")
)
zones_with_context.show(5)

Example output:

+----+--------+----------+-----+
|... |boro_...|ntaname   |...  |
+----+--------+----------+-----+
|132 |Queens  |...       |...  |
|... |...     |...       |...  |
+----+--------+----------+-----+

London Region (For BNG Examples)

Vector layers:

  • Postcode zones
  • Boroughs (Complete bundle)

Raster layers:

  • Sentinel-2 imagery (10m resolution)
  • SRTM elevation (30m resolution)

Perfect for GridX BNG examples:

from pyspark.sql import functions as f
from databricks.labs.gbx.gridx.bng import functions as bx

sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"

# Index London postcodes to the BNG grid
postcodes = spark.read.format("geojson_ogr").load(f"{sample_path}/london/postcodes/london_postcodes.geojson")
postcodes_bng = postcodes.withColumn(
    "bng_1km",
    bx.bng_polyfill(f.col("geometry"), 1000)
)

# Aggregate Sentinel-2 data to BNG cells
sentinel = spark.read.format("gdal").load(f"{sample_path}/london/sentinel2/london_sentinel2_red.tif")
bng_cells = bx.bng_tessellate(sentinel, 1000)
bng_cells.show(5)

Example output:

+-----------+-----------+-----+
|path       |bng_1km    |...  |
+-----------+-----------+-----+
|.../londo..|[550000... |...  |
+-----------+-----------+-----+
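
For readers new to the British National Grid, the sketch below shows how a 1 km cell reference is derived from an easting and northing in metres. It follows the standard Ordnance Survey lettering scheme and is independent of GeoBrix; `bng_1km_ref` is a hypothetical helper, and the cell identifiers GeoBrix returns may be formatted differently.

```python
# 25-letter alphabet used for grid squares (the letter I is skipped).
LETTERS = "ABCDEFGHJKLMNOPQRSTUVWXYZ"


def bng_1km_ref(easting, northing):
    """Return the 1 km OS National Grid reference for a point given in metres."""
    if not (0 <= easting < 700_000 and 0 <= northing < 1_300_000):
        raise ValueError("point lies outside the British National Grid extent")
    e100k, n100k = int(easting // 100_000), int(northing // 100_000)
    # First letter selects the 500 km square, second the 100 km square inside it.
    # Rows are lettered from north to south, hence the (19 - n100k) term.
    first = (19 - n100k) - (19 - n100k) % 5 + (e100k + 10) // 5
    second = (19 - n100k) * 5 % 25 + e100k % 5
    # Two digits each for the kilometre offset inside the 100 km square.
    e_km = int(easting % 100_000) // 1000
    n_km = int(northing % 100_000) // 1000
    return f"{LETTERS[first]}{LETTERS[second]}{e_km:02d}{n_km:02d}"


print(bng_1km_ref(530_000, 180_000))  # TQ3080 (central London)
```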

More

Sample Data subpages:

  • Setup — Configure storage and run Essential or Complete bundle scripts
  • Vector Data — Per-dataset download scripts (NYC Taxi Zones, Boroughs, London Postcodes, Neighborhoods, Parks, Subway, GeoPackage, FileGDB, London Boroughs)
  • Raster Data — Sentinel-2 (NYC & London), SRTM elevation, HRRR weather (GRIB2)
  • Additional — Synthetic data and alternative data sources (NYC Open Data, STAC, USGS, Ordnance Survey)