Clipping - xView — Per-Object Raster Clipping with GeoBrix
An end-to-end example showing how to load high-resolution aerial GeoTIFFs from the xView Detection Challenge dataset into Lakehouse tables and clip rasters to labeled objects in the accompanying GeoJSON, using GeoBrix RasterX together with Databricks built-in Spatial SQL functions.
The single notebook moves from raw xView TGZ archives → binaryFile-loaded raster table → GeoJSON-derived object table (EWKT with SRID) → per-object clipped tiles written back to a Unity Catalog Volume as individual TIFs.
notebooks/examples/xview — download Clipping - xView.ipynb and import it into your Databricks workspace to run.
You need a free account at challenge.xviewdataset.org to obtain session-based download links for xView_train.tgz (training imagery) and the labels archive. Paste those URLs into the train_url / labels_url cells before running the download step.
Files
| File | Purpose |
|---|---|
Clipping - xView.ipynb | The full pipeline. Sets up the catalog / schema / Volume, downloads + extracts the xView training TGZ and label GeoJSON to /Volumes/<cat>/<schema>/data/, loads rasters via binaryFile → rst_fromfile, builds an object table from xView_train.geojson via st_geomfromgeojson + st_asewkt, joins objects to their source tiles, applies rst_clip(tile, wkt_clip, True) with EWKT-encoded polygons, and writes the clipped rasters back to the Volume as <index_right>_<type_id>_<feature_id>.tif. |
Prerequisites
- Databricks Runtime 17.3 LTS (Scala 2.13 / Spark 4 / Python 3.12).
- GeoBrix installed on the cluster (JAR + Python wheel). The notebook imports
databricks.labs.gbx.rasterxdirectly; use version >= 0.3.0. - Unity Catalog: set
catalog_name/schema_nameat the top of the notebook. A Volume nameddatamust already exist under<catalog>/<schema>; the notebook will create the schema but not the Volume. rasterio==1.4.3(installed via%pipat the top of the notebook) — used for in-notebook previewing of both full tiles and clipped outputs.- Compute sizing: an x86 cluster is required for the GDAL JNI natives. Memory/disk-optimized variants (e.g.
r6id.*,m5d.*) are recommended for the raster-processing step; xView training tiles are ~3000×3000 RGB GeoTIFFs.
Pipeline
xView train TGZ + xView_train.geojson (session-signed downloads)
│
▼ download_extract → /Volumes/<cat>/<schema>/data/{train_images, train_labels}
binaryFile reader over train_images/*.tif
│
▼ rst_fromfile("path", "GTiff") + bbox + srid → xview_raster
GeoJSON objects (features)
│
▼ st_geomfromgeojson → st_asewkt (SRID=4326) → xview_object
Join objects to rasters on path
│
▼ rst_clip(tile, wkt_clip, True) → xview_object_clip
foreachPartition → open/write bytes
│
▼ /Volumes/<cat>/<schema>/data/clip/<index_right>_<type_id>_<feature_id>.tif
Key GeoBrix / Databricks functions shown
- GeoBrix RasterX (
rx.rst_*):rst_fromfile(GTiff),rst_boundingbox,rst_srid,rst_summary,rst_clip(with EWKT input). - Databricks built-in ST (
DBF.*):st_geomfromgeojson,st_asewkt— used to emit each feature's polygon asSRID=4326;POLYGON(...)so that the SRID travels with the WKT intorst_clip. - Readers:
binaryFile(primary raster loader —rst_fromfileinlines the file bytes into the tile),json(multiline) for the xView labels GeoJSON. - Writers: plain
open(..., "wb")viaforeachPartitionto materialize clipped TIFs as individual files on the Volume.
Gotchas
- xView is session-signed: download links from
challenge.xviewdataset.orgare time-limited. Ifdownload_extracthangs or returns tiny files, regenerate your links and re-run — the helper cleans/tmpon each run, so it is safe to retry. - Pass
[E]WKTstrings or[E]WKBbytes — not native geometry columns:rst_clipexpects aString/Binarycolumn. Do not passst_geomfromtext(...)/st_geomfromwkb(...)/ DBR geometry or geography types directly. Serialize withst_asewkt(preferred — carries SRID) orst_aswkb/st_asewkbfirst. - EWKT vs WKT (SRID handling): in 0.3.0, if you pass plain WKT/WKB (no SRID),
rst_clipassumes the geometry is already in the raster's CRS and does not reproject. If you pass EWKT/EWKB (SRID set) and the SRID differs from the raster CRS, the cutline is reprojected before clipping. xView imagery is EPSG:4326, andst_asewkt(st_geomfromgeojson(...))emitsSRID=4326;..., so the two match for this dataset — but EWKT is the robust choice when sources differ. - Filter before clipping for demo runs:
xview_objectcontains hundreds of thousands of features across 60+ classes. The notebook filters totype_name = 'Yacht'(type_id=50) before the join + clip so the demo finishes quickly. Remove the filter for a full materialization. - Class label dictionary is inline:
xv_type_dictis copied from the xView baseline repo; update it if xView publishes new classes.