
Setup

Overview

Use the notebook below on a Databricks cluster to download the Essential and Complete sample-data bundles into a Unity Catalog Volume. The notebook uses the GeoBrix sample module (packaged with the GeoBrix WHL), so you only need to install GeoBrix and run the cells.

Prerequisites
  • Unity Catalog Volume must already exist: {CATALOG}.{SCHEMA}.{VOLUME}
  • Create volume in Databricks UI or via SQL: CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.{VOLUME}
  • Volume paths (/Volumes/...) work directly with GDAL, rasterio, and other libraries
  • No /dbfs translation needed in modern Databricks
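
The CREATE VOLUME statement from the prerequisites can also be issued from a notebook cell. The sketch below only builds the statement string. `create_volume_sql` is a hypothetical helper, not part of GeoBrix, and the commented-out `spark.sql` call assumes the `spark` session that Databricks notebooks provide.

```python
def create_volume_sql(catalog: str, schema: str, volume: str) -> str:
    """Build the CREATE VOLUME statement from the prerequisites (hypothetical helper)."""
    for part in (catalog, schema, volume):
        # Keep the sketch simple: reject names that would need backtick quoting.
        if not part.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {part!r}")
    return f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}"


statement = create_volume_sql("main", "default", "geobrix_samples")
print(statement)
# On a Databricks cluster (assumption: `spark` is the notebook's session):
# spark.sql(statement)
```

Validating the three name parts before formatting keeps stray characters out of the SQL text; names that need quoting would require extra handling that this sketch leaves out.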
Databricks vs local use

This configuration is intended for use with Databricks Unity Catalog volumes. The path format /Volumes/<catalog>/<schema>/<volume>/... can also be mimicked for local development. For example, create a local directory such as ./Volumes/main/default/geobrix_samples/geobrix-examples and set the base path to it, so the same code and examples work outside Databricks.
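
A local layout along these lines could be created as follows. This is a minimal sketch: `local_volumes_path` is a hypothetical helper, not part of GeoBrix, and it assumes the base path ends in a geobrix-examples subfolder as in the example path above. The demo uses a temporary directory as the root; for real use, pass a project directory such as `Path(".")`.

```python
import tempfile
from pathlib import Path


def local_volumes_path(root: Path, catalog: str, schema: str, volume: str) -> Path:
    """Mimic /Volumes/<catalog>/<schema>/<volume>/geobrix-examples under a local root.

    Hypothetical helper for local development; not part of GeoBrix.
    """
    base = Path(root) / "Volumes" / catalog / schema / volume / "geobrix-examples"
    base.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return base


# Demo under a throwaway root so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    base = local_volumes_path(Path(root), "main", "default", "geobrix_samples")
    existed = base.is_dir()
    rel = base.relative_to(root).as_posix()
    print(rel)
```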


Notebook

Run the cells in order. The notebook uses three functions from databricks.labs.gbx.sample (included in the GeoBrix package): get_volumes_path, run_essential_bundle, and run_complete_bundle. It skips datasets that already exist, uses a temp directory for interim downloads, and copies only the final files to the Volume.

Download notebook

setup_sample_data.ipynb: download it and import it into your Databricks workspace to run the setup.

GeoBrix Sample Data Setup

Run this notebook on a Databricks cluster to download the Essential and Complete sample-data bundles into a Unity Catalog Volume. It uses the GeoBrix `sample` module (included in the GeoBrix WHL), so install GeoBrix and run the cells in order.

  • Skips datasets that already exist at the Volumes path
  • Uses a local temp directory for interim downloads (zip, convert); only final files are copied to the Volume

1. Install dependencies (once per session)

%pip install -q requests pystac-client planetary-computer geopandas

2. Config: catalog, schema, volume

from databricks.labs.gbx.sample import get_volumes_path, run_essential_bundle, run_complete_bundle
from pathlib import Path

CATALOG = "main"
SCHEMA = "default"
VOLUME = "geobrix_samples"
SAMPLE_DATA_PATH = get_volumes_path(CATALOG, SCHEMA, VOLUME)
print(f"SAMPLE_DATA_PATH = {SAMPLE_DATA_PATH}")
## -- uncomment to remove everything under 'geobrix-examples' subfolder
#dbutils.fs.rm(SAMPLE_DATA_PATH, recurse=True)

3. Run Essential Bundle (~355 MB)

NYC & London vectors (GeoJSON), Sentinel-2 rasters, SRTM elevation. Skips files that already exist.

try:
    # Create the base path if needed; the Volume itself must already exist.
    Path(SAMPLE_DATA_PATH).mkdir(parents=True, exist_ok=True)
except Exception:
    print("... VOLUME path must already exist.")

result_essential = run_essential_bundle(SAMPLE_DATA_PATH)
print(f"Files: {result_essential['file_count']}, Total: {result_essential['total_size_mb']:.1f} MB")
if result_essential["errors"]:
    for name, err in result_essential["errors"]:
        print(f" ⚠️ {name}: {err[:80]}...")

4. Run Complete Bundle (~185 MB more)

Neighborhoods, extra elevation, HRRR weather, shapefiles (parks, subway), multi-layer GeoPackage. Run after Essential.

result_complete = run_complete_bundle(SAMPLE_DATA_PATH)
print(f"Files: {result_complete['file_count']}, Total: {result_complete['total_size_mb']:.1f} MB")
if result_complete["errors"]:
    for name, err in result_complete["errors"]:
        print(f" ⚠️ {name}: {err[:80]}...")
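
Steps 3 and 4 print their results the same way, so the reporting could be factored into one function. The sketch below is illustrative only: `summarize_bundle` is a hypothetical helper, and it assumes the result is a dict with the keys used in the cells above ('file_count', 'total_size_mb', and 'errors' as an iterable of name and message pairs). The sample dict is made up for the demo.

```python
from typing import Any, Dict, List


def summarize_bundle(label: str, result: Dict[str, Any], max_len: int = 80) -> List[str]:
    """Format a bundle result as report lines (hypothetical helper).

    Assumes the keys used in the notebook cells: 'file_count', 'total_size_mb',
    and 'errors' as an iterable of (name, message) pairs.
    """
    lines = [f"{label}: {result['file_count']} files, {result['total_size_mb']:.1f} MB"]
    for name, err in result.get("errors") or []:
        text = str(err)  # tolerate exception objects as well as strings
        suffix = "..." if len(text) > max_len else ""
        lines.append(f"  warning: {name}: {text[:max_len]}{suffix}")
    return lines


# Made-up result for illustration only; real values come from the bundle functions.
fake_result = {"file_count": 2, "total_size_mb": 6.7, "errors": [("demo", "timeout")]}
for line in summarize_bundle("Essential", fake_result):
    print(line)
```

Unlike the cells above, this version appends the ellipsis only when a message was actually truncated.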

5. Summary

print(f"Sample data base path: {SAMPLE_DATA_PATH}")
for item in sorted(Path(SAMPLE_DATA_PATH).rglob("*")):
    if item.is_file():
        rel = item.relative_to(SAMPLE_DATA_PATH)
        print(f" {rel}: {item.stat().st_size / (1024*1024):.1f} MB")
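
As a complement to the per-file listing, sizes can also be totaled per top-level folder. This is a sketch only: `size_by_top_folder` is a hypothetical helper, and the demo runs on placeholder files in a temporary directory rather than on the real sample data.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict


def size_by_top_folder(base: Path) -> Dict[str, float]:
    """Sum file sizes in MB per first-level folder under base (hypothetical helper)."""
    totals: Dict[str, float] = defaultdict(float)
    for item in Path(base).rglob("*"):
        if item.is_file():
            rel = item.relative_to(base)
            # Files directly under base are grouped under "."
            key = rel.parts[0] if len(rel.parts) > 1 else "."
            totals[key] += item.stat().st_size / (1024 * 1024)
    return dict(totals)


# Demo on a throwaway directory with placeholder files; not real sample data.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nyc" / "boroughs").mkdir(parents=True)
    (base / "london").mkdir()
    (base / "nyc" / "boroughs" / "a.bin").write_bytes(b"\0" * 1024 * 1024)
    (base / "london" / "b.bin").write_bytes(b"\0" * 2 * 1024 * 1024)
    totals = size_by_top_folder(base)
    for region, mb in sorted(totals.items()):
        print(f"{region}: {mb:.1f} MB")
```

On the notebook's data, the call would be `size_by_top_folder(Path(SAMPLE_DATA_PATH))`.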

Output

After running the notebook, the Volume (or your configured base path) contains the following files under geobrix-examples/. The table is ordered by bundle (Essential, then Complete) and within each by region (NYC, then London).

| Bundle | Source | Path | Size | Format & use |
| --- | --- | --- | --- | --- |
| Essential | NYC Boroughs | nyc/boroughs/nyc_boroughs.geojson | 3.0 MB | GeoJSON: vector boundaries; joins, overlay, mapping. Driver: GeoJSON |
| Essential | NYC Taxi Zones | nyc/taxi-zones/nyc_taxi_zones.geojson | 3.7 MB | GeoJSON: vector zones; spatial joins, aggregation. Driver: GeoJSON |
| Essential | SRTM NYC | nyc/elevation/srtm_n40w074.tif | 24.7 MB | GeoTIFF: elevation DEM; slope, aspect, hillshade. Driver: GTiff |
| Essential | NYC Sentinel-2 | nyc/sentinel2/nyc_sentinel2_red.tif | 205.4 MB | GeoTIFF: optical raster (red band); band math, indices. Driver: GTiff |
| Essential | London Postcodes | london/postcodes/london_postcodes.geojson | 0.9 MB | GeoJSON: vector postcode areas; point-in-polygon, mapping. Driver: GeoJSON |
| Essential | SRTM London | london/elevation/srtm_n51w001.tif | 24.7 MB | GeoTIFF: elevation DEM; terrain analysis. Driver: GTiff |
| Essential | London Sentinel-2 | london/sentinel2/london_sentinel2_red.tif | 92.7 MB | GeoTIFF: optical raster; NDVI, classification. Driver: GTiff |
| Complete | NYC Neighborhoods | nyc/neighborhoods/nyc_nta.geojson | 4.1 MB | GeoJSON: vector neighborhoods (NTA); aggregation, mapping. Driver: GeoJSON |
| Complete | SRTM NYC East | nyc/elevation/srtm_n40w073.tif | 24.7 MB | GeoTIFF: elevation DEM; extended terrain. Driver: GTiff |
| Complete | HRRR Weather | nyc/hrrr-weather/hrrr_nyc_20260213_12z.grib2 | 131.2 MB | GRIB2: weather model; temperature, wind, precipitation. Driver: GRIB |
| Complete | NYC Parks | nyc/parks/nyc_parks.shp.zip | 2.1 MB | Shapefile (zip): vector polygons; overlay, area stats. Driver: ESRI Shapefile (use /vsizip/ path) |
| Complete | NYC Subway | nyc/subway/nyc_subway.shp.zip | 0.1 MB | Shapefile (zip): vector points (stations); proximity, routing. Driver: ESRI Shapefile (use /vsizip/ path) |
| Complete | Multi-Layer GeoPackage | nyc/geopackage/nyc_complete.gpkg | 7.0 MB | GeoPackage: multi-layer vector (boroughs, zones, parks, subway); single-file DB. Driver: GPKG |
| Complete | London Boroughs | london/boroughs/london_boroughs.geojson | 1.9 MB | GeoJSON: vector boundaries; joins, overlay. Driver: GeoJSON |

Exact sizes may vary slightly; the HRRR filename date reflects the run date at download time.
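
The table notes that the zipped shapefiles should be opened through a /vsizip/ path. A minimal sketch of building such a path is shown below. `vsizip_path` is a hypothetical helper, the base path assumes the catalog, schema, and volume names from the notebook's config cell, and the commented-out geopandas call is an assumption about how the path might be consumed rather than a tested recipe.

```python
def vsizip_path(zip_path: str, inner: str = "") -> str:
    """Build a GDAL /vsizip/ virtual path for a zipped dataset (hypothetical helper)."""
    # An absolute zip path yields a double slash after "vsizip", which GDAL expects.
    path = "/vsizip/" + zip_path
    if inner:
        path += "/" + inner.lstrip("/")
    return path


# Assumed base path, matching the notebook's CATALOG, SCHEMA, and VOLUME values.
base = "/Volumes/main/default/geobrix_samples/geobrix-examples"
parks = vsizip_path(f"{base}/nyc/parks/nyc_parks.shp.zip")
print(parks)
# Assumption: geopandas (installed in step 1) can read this path via GDAL, e.g.
# import geopandas as gpd; gdf = gpd.read_file(parks)
```

The optional `inner` argument is there for archives where a specific file inside the zip has to be named; the file names inside the sample archives are not listed here, so the demo leaves it empty.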