Setup
Overview
Use the notebook below on a Databricks cluster to download the Essential and Complete sample-data bundles into a Unity Catalog Volume. The notebook uses the GeoBrix sample module (packaged with the GeoBrix WHL), so you only need to install GeoBrix and run the cells.
- Unity Catalog Volume must already exist: {CATALOG}.{SCHEMA}.{VOLUME}
- Create volume in Databricks UI or via SQL: CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.{VOLUME}
- Volumes paths (/Volumes/...) work directly with GDAL, rasterio, and other libraries
- No /dbfs translation needed in modern Databricks
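The SQL statement in the list above can be assembled from the same three names the notebook configures. The following is a minimal sketch; the helper name `create_volume_sql` is hypothetical and not part of GeoBrix, and the `spark.sql` call mentioned in the comment assumes a Databricks notebook where `spark` is predefined.

```python
def create_volume_sql(catalog, schema, volume):
    """Return the CREATE VOLUME statement for a Unity Catalog volume (hypothetical helper)."""
    return f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}"


statement = create_volume_sql("main", "default", "geobrix_samples")
print(statement)
# In a Databricks notebook, the statement could then be run with: spark.sql(statement)
```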
This configuration is intended for use with Databricks Unity Catalog volumes. The path format /Volumes/<catalog>/<schema>/<volume>/... can also be mimicked for local development: for example, use a local path such as ./Volumes/main/default/geobrix_samples/geobrix-examples as the base path so that the same code and examples work outside Databricks.
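One way to mimic the Volumes layout locally is sketched below. The function `local_volumes_path` and its `root` and `subfolder` parameters are hypothetical and not part of GeoBrix; the `geobrix-examples` subfolder is taken from the example path above.

```python
from pathlib import Path


def local_volumes_path(catalog, schema, volume, root=".", subfolder="geobrix-examples"):
    """Build <root>/Volumes/<catalog>/<schema>/<volume>/<subfolder> (hypothetical helper)."""
    return Path(root) / "Volumes" / catalog / schema / volume / subfolder


base = local_volumes_path("main", "default", "geobrix_samples")
print(base.as_posix())
# Create the directory before use, for example: base.mkdir(parents=True, exist_ok=True)
```

The resulting path can then be passed wherever the examples expect the sample-data base path.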
Notebook
Run the cells in order. The notebook uses databricks.labs.gbx.sample (included in the GeoBrix package): get_volumes_path, run_essential_bundle, and run_complete_bundle. It skips datasets that already exist, uses a temp directory for interim downloads, and copies only final files to the Volume.
setup_sample_data.ipynb — download and import into your Databricks workspace to run the setup.
GeoBrix Sample Data Setup
Run this notebook on a Databricks cluster to download the Essential and Complete sample-data bundles into a Unity Catalog Volume. It uses the GeoBrix `sample` module (included in the GeoBrix WHL), so install GeoBrix and run the cells in order.
- Skips datasets that already exist at the Volumes path
- Uses a local temp directory for interim downloads (zip, convert); only final files are copied to the Volume
1. Install dependencies (once per session)
%pip install -q requests pystac-client planetary-computer geopandas
2. Config: catalog, schema, volume
from databricks.labs.gbx.sample import get_volumes_path, run_essential_bundle, run_complete_bundle
from pathlib import Path
CATALOG = "main"
SCHEMA = "default"
VOLUME = "geobrix_samples"
SAMPLE_DATA_PATH = get_volumes_path(CATALOG, SCHEMA, VOLUME)
print(f"SAMPLE_DATA_PATH = {SAMPLE_DATA_PATH}")
## -- uncomment to remove everything under 'geobrix-examples' subfolder
#dbutils.fs.rm(SAMPLE_DATA_PATH, recurse=True)
3. Run Essential Bundle (~355 MB)
NYC & London vectors (GeoJSON), Sentinel-2 rasters, SRTM elevation. Skips files that already exist.
try:
    Path(SAMPLE_DATA_PATH).mkdir(parents=True, exist_ok=True)
except Exception:
    print("... VOLUME path must already exist.")
result_essential = run_essential_bundle(SAMPLE_DATA_PATH)
print(f"Files: {result_essential['file_count']}, Total: {result_essential['total_size_mb']:.1f} MB")
if result_essential["errors"]:
    for name, err in result_essential["errors"]:
        print(f" ⚠️ {name}: {err[:80]}...")
4. Run Complete Bundle (~185 MB more)
Neighborhoods, extra elevation, HRRR weather, shapefiles (parks, subway), multi-layer GeoPackage. Run after Essential.
result_complete = run_complete_bundle(SAMPLE_DATA_PATH)
print(f"Files: {result_complete['file_count']}, Total: {result_complete['total_size_mb']:.1f} MB")
if result_complete["errors"]:
    for name, err in result_complete["errors"]:
        print(f" ⚠️ {name}: {err[:80]}...")
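The reporting code is the same for both bundles, so it could be factored into one helper. This is a sketch only: `summarize_bundle` is a hypothetical name, and the dictionary keys (`file_count`, `total_size_mb`, `errors` as a list of name and message pairs) are assumed from how the cells above read the results. The `example` dictionary is a stand-in for illustration, not real output.

```python
def summarize_bundle(label, result, max_err_len=80):
    """Return report lines for a bundle result dict (keys assumed from the cells above)."""
    lines = [f"{label}: {result['file_count']} files, {result['total_size_mb']:.1f} MB"]
    for name, err in result.get("errors", []):
        text = str(err)
        # Truncate long messages, marking the cut only when one was made.
        if len(text) > max_err_len:
            text = text[:max_err_len] + "..."
        lines.append(f"  failed: {name}: {text}")
    return lines


# Stand-in result for illustration; real values come from run_essential_bundle.
example = {"file_count": 7, "total_size_mb": 355.1, "errors": [("SRTM London", "HTTP 503")]}
for line in summarize_bundle("Essential", example):
    print(line)
```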
5. Summary
print(f"Sample data base path: {SAMPLE_DATA_PATH}")
for item in sorted(Path(SAMPLE_DATA_PATH).rglob("*")):
    if item.is_file():
        rel = item.relative_to(SAMPLE_DATA_PATH)
        print(f" {rel}: {item.stat().st_size / (1024*1024):.1f} MB")
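Beyond listing what is present, it can be useful to check which expected files are absent. The sketch below does this for the Essential bundle, using the relative paths listed in the Output table; `missing_files` and `ESSENTIAL_FILES` are hypothetical names, not part of GeoBrix, and the demonstration runs against a throwaway directory rather than a real Volume.

```python
import tempfile
from pathlib import Path

# Relative paths of the Essential bundle, as listed in the Output table.
ESSENTIAL_FILES = [
    "nyc/boroughs/nyc_boroughs.geojson",
    "nyc/taxi-zones/nyc_taxi_zones.geojson",
    "nyc/elevation/srtm_n40w074.tif",
    "nyc/sentinel2/nyc_sentinel2_red.tif",
    "london/postcodes/london_postcodes.geojson",
    "london/elevation/srtm_n51w001.tif",
    "london/sentinel2/london_sentinel2_red.tif",
]


def missing_files(base_path, expected=ESSENTIAL_FILES):
    """Return the expected relative paths that are not files under base_path."""
    base = Path(base_path)
    return [rel for rel in expected if not (base / rel).is_file()]


# Demonstration: a temporary directory holding one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ESSENTIAL_FILES[0]
    target.parent.mkdir(parents=True)
    target.touch()
    print(missing_files(tmp))  # lists the six paths that were not created
```

In the notebook, the equivalent call would be `missing_files(SAMPLE_DATA_PATH)`.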
Output
After running the notebook, the Volume (or your configured base path) contains the following files under geobrix-examples/. The table is ordered by bundle (Essential, then Complete) and within each by region (NYC, then London).
| Bundle | Source | Path | Size | Format & use |
|---|---|---|---|---|
| Essential | NYC Boroughs | nyc/boroughs/nyc_boroughs.geojson | 3.0 MB | GeoJSON — vector boundaries; joins, overlay, mapping. Driver: GeoJSON |
| Essential | NYC Taxi Zones | nyc/taxi-zones/nyc_taxi_zones.geojson | 3.7 MB | GeoJSON — vector zones; spatial joins, aggregation. Driver: GeoJSON |
| Essential | SRTM NYC | nyc/elevation/srtm_n40w074.tif | 24.7 MB | GeoTIFF — elevation DEM; slope, aspect, hillshade. Driver: GTiff |
| Essential | NYC Sentinel-2 | nyc/sentinel2/nyc_sentinel2_red.tif | 205.4 MB | GeoTIFF — optical raster (red band); band math, indices. Driver: GTiff |
| Essential | London Postcodes | london/postcodes/london_postcodes.geojson | 0.9 MB | GeoJSON — vector postcode areas; point-in-polygon, mapping. Driver: GeoJSON |
| Essential | SRTM London | london/elevation/srtm_n51w001.tif | 24.7 MB | GeoTIFF — elevation DEM; terrain analysis. Driver: GTiff |
| Essential | London Sentinel-2 | london/sentinel2/london_sentinel2_red.tif | 92.7 MB | GeoTIFF — optical raster; NDVI, classification. Driver: GTiff |
| Complete | NYC Neighborhoods | nyc/neighborhoods/nyc_nta.geojson | 4.1 MB | GeoJSON — vector neighborhoods (NTA); aggregation, mapping. Driver: GeoJSON |
| Complete | SRTM NYC East | nyc/elevation/srtm_n40w073.tif | 24.7 MB | GeoTIFF — elevation DEM; extended terrain. Driver: GTiff |
| Complete | HRRR Weather | nyc/hrrr-weather/hrrr_nyc_20260213_12z.grib2 | 131.2 MB | GRIB2 — weather model; temperature, wind, precipitation. Driver: GRIB |
| Complete | NYC Parks | nyc/parks/nyc_parks.shp.zip | 2.1 MB | Shapefile (zip) — vector polygons; overlay, area stats. Driver: ESRI Shapefile (use /vsizip/ path) |
| Complete | NYC Subway | nyc/subway/nyc_subway.shp.zip | 0.1 MB | Shapefile (zip) — vector points (stations); proximity, routing. Driver: ESRI Shapefile (use /vsizip/ path) |
| Complete | Multi-Layer GeoPackage | nyc/geopackage/nyc_complete.gpkg | 7.0 MB | GeoPackage — multi-layer vector (boroughs, zones, parks, subway); single-file DB. Driver: GPKG |
| Complete | London Boroughs | london/boroughs/london_boroughs.geojson | 1.9 MB | GeoJSON — vector boundaries; joins, overlay. Driver: GeoJSON |
Exact sizes may vary slightly; the HRRR filename date reflects the run date at download time.
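The table notes that the zipped shapefiles should be opened through a /vsizip/ path. A minimal sketch of building such a path follows. The helper names are hypothetical, the Volume path assumes the default catalog, schema, and volume names from the notebook, and reading through geopandas is an assumption about your environment rather than a GeoBrix API.

```python
def vsizip_path(zip_path, inner=None):
    """Prefix a zip archive path with GDAL's /vsizip/ virtual file system handler."""
    path = "/vsizip/" + str(zip_path)
    if inner:
        # Optionally address a specific file inside the archive.
        path += "/" + inner.lstrip("/")
    return path


def read_zipped_shapefile(zip_path):
    """Read a zipped shapefile into a GeoDataFrame (assumes geopandas is installed)."""
    import geopandas as gpd  # imported lazily so vsizip_path works without it

    return gpd.read_file(vsizip_path(zip_path))


parks = "/Volumes/main/default/geobrix_samples/geobrix-examples/nyc/parks/nyc_parks.shp.zip"
print(vsizip_path(parks))
```

For an absolute archive path the result contains a double slash after `/vsizip/`, which is the form GDAL expects.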