Sample Data for Examples
This page provides copy-paste code to download sample datasets for use with GeoBrix examples.
Copy any code block below into a Databricks notebook cell and run it. The data is downloaded to a Unity Catalog Volume and is then ready to use.
Overview
GeoBrix provides two curated sample data bundles with geographically coherent datasets for NYC and London. All datasets within each region cover the same area, so you can run realistic workflows (clip rasters to boundaries, zonal statistics, multi-scale spatial joins).
Essential Bundle (~355MB) — Get Started Quickly
- NYC: Taxi Zones, Boroughs, Sentinel-2 imagery, SRTM elevation
- London: Postcodes, Sentinel-2 imagery, SRTM elevation
- Formats: GeoJSON, GeoTIFF (including elevation DEMs)
- Use for: Learning GeoBrix basics, RasterX operations, GridX BNG examples, basic spatial analysis
- Jump to setup →
Complete Bundle (~795MB) — Full Format Coverage
- Everything in Essential, plus:
- NYC Neighborhoods, Parks, Subway, HRRR weather (NetCDF/GRIB2), extra SRTM tile (NYC East), GeoPackage
- London Boroughs
- Formats: All of the above + Shapefile, GeoPackage, NetCDF/GRIB2 (HRRR weather). FileGDB can be created optionally via the Vector Data script.
- Use for: Multi-scale analysis, all GeoBrix reader formats, advanced workflows, production-like examples
- Jump to setup →
Which to choose? Essential for learning and most docs; Complete if you need Shapefile/NetCDF/GeoPackage/FileGDB or want the full set of datasets.
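The choice above can be sketched as a small lookup. This is a hypothetical helper, not part of GeoBrix: the function name `choose_bundle` and the lowercase format labels are assumptions made for illustration, and the format sets are transcribed from the bundle descriptions on this page.

```python
# Hypothetical helper (not a GeoBrix API): pick a bundle from the formats you need.
# Format sets are transcribed from the bundle descriptions on this page.
ESSENTIAL_FORMATS = {"geojson", "geotiff"}
# FileGDB is included here although it is created optionally via the Vector Data script.
COMPLETE_ONLY_FORMATS = {"shapefile", "geopackage", "netcdf", "grib2", "filegdb"}


def choose_bundle(required_formats):
    """Return 'essential' or 'complete' for a collection of format names."""
    wanted = {name.strip().lower() for name in required_formats}
    unknown = wanted - ESSENTIAL_FORMATS - COMPLETE_ONLY_FORMATS
    if unknown:
        raise ValueError(f"Unrecognized format(s): {sorted(unknown)}")
    # Any format that only ships with Complete forces the larger bundle.
    return "complete" if wanted & COMPLETE_ONLY_FORMATS else "essential"


print(choose_bundle(["GeoJSON", "GeoTIFF"]))  # essential
print(choose_bundle(["GeoJSON", "NetCDF"]))   # complete
```

If in doubt, starting with Essential keeps the download small; the Complete bundle is a superset, so nothing is lost by upgrading later.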
Next: Setup — Configure storage location and run the Essential or Complete bundle scripts.
Geographic Coherence
All sample data is organized around two coherent geographic regions, each stored under its own parent folder:
- nyc/ - All NYC datasets (vector, raster, weather, format examples)
- london/ - All London datasets (vector, raster)
This structure makes it easy to find related datasets and understand geographic scope.
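As a rough illustration of the region-first layout, the sketch below builds dataset paths from a root, a region, a category folder, and a file name. The helper `dataset_path` is hypothetical; the root is the Volume path used in this page's examples, and the `region/category/filename` pattern is inferred from those same examples, so adjust both if your setup differs.

```python
from pathlib import PurePosixPath

# Root taken from this page's examples; change it to match your own Volume.
SAMPLE_ROOT = PurePosixPath("/Volumes/main/default/geobrix_samples/geobrix-examples")
REGIONS = ("nyc", "london")


def dataset_path(region, category, filename, root=SAMPLE_ROOT):
    """Build '<root>/<region>/<category>/<filename>' as a string (hypothetical helper)."""
    if region not in REGIONS:
        raise ValueError(f"Unknown region {region!r}; expected one of {REGIONS}")
    return str(PurePosixPath(root) / region / category / filename)


print(dataset_path("nyc", "boroughs", "nyc_boroughs.geojson"))
print(dataset_path("london", "postcodes", "london_postcodes.geojson"))
```

Keeping the root in one constant means a change of catalog, schema, or Volume only has to be made in one place.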
NYC Region (Primary)
Vector layers:
- Taxi Zones (small areas)
- Boroughs (large areas)
- Neighborhoods (medium areas - Complete bundle)
- Parks, Subway stations (Complete bundle)
Raster layers:
- Sentinel-2 imagery (10m resolution)
- SRTM elevation (30m resolution)
- HRRR weather data (3km resolution, CONUS-only - Complete bundle)
Enables workflows like:
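# Zonal statistics: Average elevation per borough
# Assumed import path for RasterX, mirroring the gridx import used in the BNG example on this page
from databricks.labs.gbx.rasterx import functions as rx
sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
boroughs = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/boroughs/nyc_boroughs.geojson")
elevation = spark.read.format("gdal").load(f"{sample_path}/nyc/elevation/srtm_n40w074.tif")
# Spatial join and compute statistics
# ("spatial_intersect" stands in for a spatial-predicate join condition; it is not a built-in Spark join type)
borough_elevation = (
    boroughs.join(elevation, "spatial_intersect")
    .groupBy("boro_name")
    .agg(rx.rst_avg("tile").alias("avg_elevation"))
)
borough_elevation.show()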
+----------+-----------------+
|boro_name |avg_elevation |
+----------+-----------------+
|Manhattan |45.2 |
|Bronx |38.1 |
|Brooklyn |22.3 |
|Queens |28.7 |
|Staten Is.|42.0 |
+----------+-----------------+
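To make the idea of zonal statistics concrete without Spark, here is a minimal pure-Python sketch. It assumes the raster and the zones have already been aligned to the same grid, which is the hard part that a real raster library handles; the function `zonal_mean` and the toy values are illustrative only and are not taken from the sample data.

```python
from collections import defaultdict


def zonal_mean(values, zones):
    """Average the cells of `values` grouped by the zone label at the same grid position.

    `values` and `zones` are equally shaped lists of rows; a zone of None
    marks a cell outside every zone and is skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None:
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Toy 2x3 elevation grid (metres) and a matching grid of zone labels.
elevation = [
    [10.0, 12.0, 30.0],
    [14.0, 16.0, 34.0],
]
zones = [
    ["A", "A", "B"],
    ["A", None, "B"],
]

print(zonal_mean(elevation, zones))  # {'A': 12.0, 'B': 32.0}
```

Zone A averages its three cells (10, 12, 14) and zone B its two (30, 34); the cell labelled None contributes to neither.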
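# Clip satellite imagery to taxi zone
# Assumed import path for RasterX, mirroring the gridx import used in the BNG example on this page
from databricks.labs.gbx.rasterx import functions as rx
sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
zones = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/taxi-zones/nyc_taxi_zones.geojson")
sentinel = spark.read.format("gdal").load(f"{sample_path}/nyc/sentinel2/nyc_sentinel2_red.tif")
# Clip raster to specific zone
jfk_zone = zones.filter("LocationID = 132")  # JFK Airport zone
jfk_geom = jfk_zone.select("geom")
# Cross join the single zone row so its geometry sits alongside each raster tile
jfk_imagery = sentinel.crossJoin(jfk_geom).withColumn(
    "clipped",
    rx.rst_clip("tile", jfk_geom.geom)
)
jfk_imagery.select("path", "clipped").show(1, truncate=False)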
+--------------------+------------------+
|path |clipped |
+--------------------+------------------+
|.../nyc_sentinel2...|[RasterTile(...)] |
+--------------------+------------------+
# Multi-scale analysis: Taxi zones within boroughs
sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
zones = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/taxi-zones/nyc_taxi_zones.geojson")
boroughs = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/boroughs/nyc_boroughs.geojson")
# Neighborhoods ship with the Complete bundle only
neighborhoods = spark.read.format("geojson_ogr").load(f"{sample_path}/nyc/neighborhoods/nyc_nta.geojson")
# Hierarchical spatial joins
# ("spatial_within" stands in for a spatial-predicate join condition; it is not a built-in Spark join type)
zones_with_context = (
    zones
    .join(boroughs, "spatial_within")
    .join(neighborhoods, "spatial_within")
)
zones_with_context.show(5)
+----+--------+----------+-----+
|... |boro_...|ntaname |... |
+----+--------+----------+-----+
|132 |Queens  |...       |...  |
|... |... |... |... |
+----+--------+----------+-----+
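The hierarchical "within" relationship behind the multi-scale example can be sketched with plain bounding boxes. This is a deliberate simplification: real boundaries are polygons and need a geometry library, and the `Box` type, the `find_parent` helper, and all coordinates below are made up for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Axis-aligned bounding box standing in for a polygon (illustrative only)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def within(inner: Box, outer: Box) -> bool:
    """True when `inner` lies entirely inside `outer` (shared edges allowed)."""
    return (
        inner.xmin >= outer.xmin
        and inner.ymin >= outer.ymin
        and inner.xmax <= outer.xmax
        and inner.ymax <= outer.ymax
    )


def find_parent(child: Box, candidates: Sequence[Box]) -> Optional[str]:
    """Return the name of the first candidate that fully contains `child`, else None."""
    for candidate in candidates:
        if within(child, candidate):
            return candidate.name
    return None


# Toy hierarchy: two large areas, two medium areas, one small zone.
boroughs = [Box("Borough A", 0, 0, 10, 10), Box("Borough B", 10, 0, 20, 10)]
neighborhoods = [Box("North", 0, 5, 10, 10), Box("South", 0, 0, 10, 5)]
zone = Box("Zone 1", 2, 6, 4, 8)

print(find_parent(zone, boroughs))       # Borough A
print(find_parent(zone, neighborhoods))  # North
```

A zone that straddles two parents matches neither under a strict "within" test, which is why real workflows often fall back to an intersects or largest-overlap rule for edge cases.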
London Region (For BNG Examples)
Vector layers:
- Postcode zones
- Boroughs (Complete bundle)
Raster layers:
- Sentinel-2 imagery (10m resolution)
- SRTM elevation (30m resolution)
Perfect for GridX BNG examples:
from pyspark.sql import functions as f
from databricks.labs.gbx.gridx.bng import functions as bx
sample_path = "/Volumes/main/default/geobrix_samples/geobrix-examples"
# Index London postcodes to BNG grid
postcodes = spark.read.format("geojson_ogr").load(f"{sample_path}/london/postcodes/london_postcodes.geojson")
postcodes_bng = postcodes.withColumn(
    "bng_1km",
    bx.bng_polyfill(f.col("geometry"), 1000)
)
# Aggregate Sentinel-2 data to BNG cells
sentinel = spark.read.format("gdal").load(f"{sample_path}/london/sentinel2/london_sentinel2_red.tif")
# (illustrative call shape: tessellate the raster at 1 km resolution)
bng_cells = bx.bng_tessellate(sentinel, 1000)
bng_cells.show(5)
+-----------+-----------+-----+
|path |bng_1km |... |
+-----------+-----------+-----+
|.../londo..|[550000... |... |
+-----------+-----------+-----+
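For intuition about what a 1 km BNG cell is, the sketch below snaps a British National Grid easting/northing (in metres) to the south-west corner of its containing cell. It is not the GeoBrix implementation and says nothing about how GeoBrix encodes cell identifiers; `bng_cell_origin` is a hypothetical helper and the coordinates are arbitrary illustrative values.

```python
def bng_cell_origin(easting, northing, cell_size=1000):
    """Return the south-west corner (metres) of the grid cell containing a point.

    Assumes coordinates are already in British National Grid metres (EPSG:27700).
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    # Floor division drops the offset of the point within its cell.
    return (
        int(easting // cell_size) * cell_size,
        int(northing // cell_size) * cell_size,
    )


print(bng_cell_origin(530123.4, 180987.6))        # (530000, 180000)
print(bng_cell_origin(530123.4, 180987.6, 100))   # (530100, 180900)
```

Every point that maps to the same corner belongs to the same cell, which is what makes grid indexing useful for grouping and joining.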
Next Steps
- Quick Start — Use this sample data in your first GeoBrix workflow
- API Reference — Explore all available functions
More
Sample Data subpages (same order as sidebar):
- Setup — Configure storage and run Essential or Complete bundle scripts
- Vector Data — Per-dataset download scripts (NYC Taxi Zones, Boroughs, London Postcodes, Neighborhoods, Parks, Subway, GeoPackage, FileGDB, London Boroughs)
- Raster Data — Sentinel-2 (NYC & London), SRTM elevation, HRRR weather (GRIB2)
- Additional — Synthetic data and alternative data sources (NYC Open Data, STAC, USGS, Ordnance Survey)