STAC Client

databricks.labs.gbx.stac.StacClient is a lightweight, Serverless-safe client for distributed STAC search, resilient asset download, and repair of invalid files — against any STAC catalog (default: Planetary Computer).

Where a single-node STAC script serializes search requests and downloads, StacClient fans both operations out across the Spark cluster — one task per AOI row (search) and one task per asset (download). On Serverless, parallelism is controlled via partitions= and DataFrame.repartition(), with no spark.conf.set calls.

Opt-in extra

StacClient requires geobrix[light,stac]. The [stac] extra pulls in pystac-client, planetary-computer, tenacity, and requests. Serverless environment version 5 (Python 3.12) is required.

Ready-made downloaders

For common sources you usually don't call StacClient directly — the gbx.sample package ships AOI-driven downloaders that wrap it (or the same distributed STAC pattern) with source-specific defaults, each following a discover → download → read flow:

| Downloader | Source | STAC backing |
| --- | --- | --- |
| Overture Maps Downloader (OvertureClient) | Overture Maps (buildings, places, …) | Overture's static STAC catalog via the overturemaps CLI |
| NAIP Aerial Imagery Downloader (NaipDownloader) | NAIP 1 m aerial imagery (US) | StacClient on Planetary Computer |
| 3DEP Downloader (DEM) (DemDownloader) | USGS 3DEP elevation (US) | StacClient on Planetary Computer |

Reach for StacClient directly when you need a catalog these don't cover, or full control over the search → download → repair flow described below.

Installation

pip install "geobrix[light,stac]"

From a Databricks notebook (Serverless or classic):

%pip install --quiet "geobrix[light,stac] @ file:///Volumes/<catalog>/<schema>/<volume>/geobrix-0.4.0-py3-none-any.whl"

Import

from databricks.labs.gbx.stac import StacClient

Constructor

StacClient(
    catalog="https://planetarycomputer.microsoft.com/api/stac/v1",  # default
    sign="planetary_computer",  # 'planetary_computer' | None | callable(href)->href
)

| Parameter | Default | Description |
| --- | --- | --- |
| catalog | Planetary Computer | URL of any STAC API endpoint. |
| sign | "planetary_computer" | Signing strategy. "planetary_computer" uses planetary_computer.sign_inplace; None skips signing (public catalogs); or pass any callable(href: str) -> str. |
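
The callable form of sign can be any function that maps an href to a signed href. The following is a minimal sketch, assuming a catalog whose assets accept a token in the query string. The helper make_query_signer and the token value are hypothetical and are not part of geobrix.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_query_signer(token_params):
    """Return a callable(href) -> href that appends token_params to the query.

    Hypothetical helper for illustration; not part of geobrix.
    """
    def sign(href: str) -> str:
        parts = urlsplit(href)
        # Keep query parameters already present on the href, then add the token.
        query = parse_qsl(parts.query, keep_blank_values=True)
        query.extend(token_params.items())
        return urlunsplit(parts._replace(query=urlencode(query)))

    return sign


sign = make_query_signer({"sig": "EXAMPLE-TOKEN"})
signed = sign("https://example.com/data/B02.tif?v=1")
print(signed)  # https://example.com/data/B02.tif?v=1&sig=EXAMPLE-TOKEN
```

A callable like this would then be passed as the sign argument of the constructor in place of "planetary_computer".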

Methods

search

Fan out AOI rows to the STAC catalog and return one row per (input-row, item, asset).

assets_df = client.search(
    df,                                # DataFrame with a GeoJSON-geometry column
    geojson_col="geojson",             # column name holding the GeoJSON geometry string
    collections=["sentinel-2-l2a"],    # list of STAC collection IDs
    datetime="2022-06-01/2022-09-01",  # ISO datetime or range (STAC datetime syntax)
    partitions=512,                    # repartition fan-out; no spark.conf
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | DataFrame | | Input rows. Each row is one AOI. |
| geojson_col | str | | Column containing GeoJSON geometry strings (intersects filter). |
| collections | List[str] | | STAC collection IDs to search. |
| datetime | str | | ISO datetime or range ("YYYY-MM-DD" or "start/end"). |
| partitions | int | 512 | Target partition count for the fan-out repartition. |
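
Each value in geojson_col is a GeoJSON geometry serialized as a string. As a rough sketch of how such a value could be produced from a lon/lat bounding box, the hypothetical helper below (not part of geobrix) builds a closed Polygon ring and serializes it with the standard library.

```python
import json


def bbox_to_geojson(west, south, east, north):
    """Serialize a lon/lat bounding box as a GeoJSON Polygon geometry string."""
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],  # GeoJSON rings are closed: the last point repeats the first
    ]
    return json.dumps({"type": "Polygon", "coordinates": [ring]})


aoi = bbox_to_geojson(-150.0, 61.0, -149.0, 61.5)
print(aoi)
```

One such string per row, in the column named by geojson_col, gives search one AOI per row.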

Output columns (in addition to all input columns, carried through):

| Column | Type | Description |
| --- | --- | --- |
| item_id | string | STAC item identifier. |
| date | string | Acquisition date from properties.datetime. |
| item_bbox | string | Item bounding box (GeoJSON). |
| asset_name | string | Asset key (e.g. "B02", "B03"). |
| href | string | Asset download URL at search time (may expire; re-signed per attempt in download). |
| item_properties | string | Full item properties JSON. |

One row is emitted per (input-row, item, asset). The same STAC item reached via multiple AOI rows produces multiple rows — download deduplicates internally to unique (item_id, asset_name).
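
The row shape described above can be illustrated in plain Python. This is only a sketch of the fan-out and the later deduplication, with made-up item and cell identifiers. It does not reflect how StacClient is implemented internally.

```python
# Two AOI rows; cell_id stands in for any input column carried through.
aoi_rows = [{"cell_id": "a"}, {"cell_id": "b"}]

# Pretend both AOIs intersect the same STAC item, which has two assets.
items_by_aoi = {
    "a": [{"item_id": "S2A_001", "assets": ["B02", "B03"]}],
    "b": [{"item_id": "S2A_001", "assets": ["B02", "B03"]}],
}

# search: one output row per (input-row, item, asset).
search_rows = [
    {**aoi, "item_id": item["item_id"], "asset_name": asset}
    for aoi in aoi_rows
    for item in items_by_aoi[aoi["cell_id"]]
    for asset in item["assets"]
]

# download: work is reduced to unique (item_id, asset_name) pairs.
unique_assets = sorted({(r["item_id"], r["asset_name"]) for r in search_rows})

print(len(search_rows))  # 4
print(unique_assets)     # [('S2A_001', 'B02'), ('S2A_001', 'B03')]
```

Four search rows collapse to two downloads because both AOIs reach the same item.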


download

Resilient, validated asset download — one Spark task per asset. Deduplicates to unique (item_id, asset_name) so the same item reached via multiple AOIs is fetched exactly once.

files_df = client.download(
    assets_df,
    out_dir,                                   # UC Volume path (FUSE-mounted)
    asset_names=["B02", "B03", "B04", "B08"],  # None = all assets present in df
    name="{asset_name}_{item_id}.tif",         # filename template
    validate=True,                             # rasterio read-validation per file
    max_tries=5,
    partitions=None,                           # default: one task per asset
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| df | DataFrame | | Must contain item_id and asset_name columns. The href column from search output is accepted but not required; the href is re-signed per attempt from item_id + asset_name. |
| out_dir | str | | Destination directory (e.g. a UC Volume FUSE path). |
| asset_names | List[str] or None | None | Filter to these asset keys. None downloads all assets present in the DataFrame. |
| name | str | "{asset_name}_{item_id}.tif" | Filename template. Supports {asset_name} and {item_id} placeholders. |
| validate | bool | True | Open and decode a raster window after download to reject throttled error bodies and truncated files that a size check would pass. |
| max_tries | int | 5 | Maximum download attempts per asset (exponential backoff between attempts). |
| partitions | int or None | None | Explicit repartition before download. None sets one partition per unique asset. |
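
To make the name template concrete, here is a small sketch of how a destination path could be derived from it. It assumes the placeholders are expanded with Python's str.format, which the page does not state. The helper render_out_path is hypothetical.

```python
import posixpath


def render_out_path(out_dir, name, item_id, asset_name):
    """Expand a filename template and join it to the destination directory."""
    filename = name.format(asset_name=asset_name, item_id=item_id)
    return posixpath.join(out_dir, filename)


path = render_out_path(
    "/Volumes/my_catalog/my_schema/data/alaska/B02",
    "{asset_name}_{item_id}.tif",
    item_id="S2A_001",
    asset_name="B02",
)
print(path)  # /Volumes/my_catalog/my_schema/data/alaska/B02/B02_S2A_001.tif
```

Keeping both placeholders in the template keeps filenames unique per (item_id, asset_name), which matches the unit of work that download deduplicates to.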

Output columns:

| Column | Type | Description |
| --- | --- | --- |
| item_id | string | STAC item identifier. |
| asset_name | string | Asset key. |
| out_file_path | string | Absolute path of the written file on the Volume. |
| out_file_sz | long | File size in bytes (0 if the download failed). |
| is_out_file_valid | boolean | true if the file passed read-validation; false otherwise. |
| last_update | timestamp | Time of the download attempt. |
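
The reason is_out_file_valid is tracked separately from out_file_sz is that a non-zero size does not imply a readable raster. The sketch below illustrates this with a much weaker stand-in than the rasterio window decode described above: a check of the TIFF signature bytes. It would catch an error body saved as a .tif, but unlike a real decode it would not catch a truncated file. The sample payloads are made up.

```python
TIFF_MAGIC = (
    b"II*\x00",  # classic TIFF, little-endian
    b"MM\x00*",  # classic TIFF, big-endian
    b"II+\x00",  # BigTIFF, little-endian
    b"MM\x00+",  # BigTIFF, big-endian
)


def looks_like_tiff(payload: bytes) -> bool:
    """Cheap header check: does the payload start with a TIFF signature?"""
    return payload[:4] in TIFF_MAGIC


# A throttle response body saved to disk has a size but is not a raster.
error_body = b"<?xml version='1.0'?><Error><Code>ServerBusy</Code></Error>"
tiff_header = b"II*\x00" + b"\x08\x00\x00\x00"

print(len(error_body) > 0, looks_like_tiff(error_body))    # True False
print(len(tiff_header) > 0, looks_like_tiff(tiff_header))  # True True
```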

Resilience behavior:

  • The href is re-signed on every attempt — signed URLs from search may expire before a retry; download always re-derives a fresh URL from item_id + asset_name.
  • HTTP errors (4xx/5xx, including throttle responses) trigger tenacity exponential backoff and retry up to max_tries.
  • Each file is staged to worker-local disk first; it is copied to the Volume only after passing read-validation. No partial or corrupt file is written to the Volume.
  • Files that already exist and are valid (is_out_file_valid = true) are skipped — the operation is idempotent.
  • A failed asset (exhausted retries, or read-validation failure) sets is_out_file_valid = false and out_file_sz = 0; use repair to retry those rows.
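
The retry, staging, and validate-before-copy behavior listed above can be sketched with the standard library alone. The page says the client uses tenacity for backoff; this sketch swaps in a plain loop so that it stays self-contained. The function download_with_retry and its fetch and validate arguments are hypothetical stand-ins, not geobrix APIs.

```python
import os
import shutil
import tempfile
import time


def download_with_retry(fetch, validate, final_path, max_tries=5,
                        base_delay=1.0, sleep=time.sleep):
    """Fetch to a local staging file, validate it, then copy it to final_path.

    fetch(local_path) writes the asset and is expected to derive a fresh
    URL on each call; validate(local_path) returns True for a readable
    file. Returns (size_in_bytes, is_valid).
    """
    for attempt in range(max_tries):
        with tempfile.TemporaryDirectory() as stage_dir:
            staged = os.path.join(stage_dir, os.path.basename(final_path))
            try:
                fetch(staged)
                if validate(staged):
                    # Only a validated file ever reaches the destination.
                    shutil.copyfile(staged, final_path)
                    return os.path.getsize(final_path), True
            except OSError:
                pass  # count any I/O error as a failed attempt
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return 0, False


# Simulated asset that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch(path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated throttle response")
    with open(path, "wb") as fh:
        fh.write(b"II*\x00payload")


delays = []  # record backoff delays instead of actually sleeping
with tempfile.TemporaryDirectory() as out_dir:
    target = os.path.join(out_dir, "B02_S2A_001.tif")
    size, ok = download_with_retry(
        flaky_fetch, lambda p: os.path.getsize(p) > 0, target,
        sleep=delays.append,
    )

print(size, ok, delays)  # 11 True [1.0, 2.0]
```

When every attempt fails, the function returns (0, False), which mirrors the out_file_sz = 0 and is_out_file_valid = false row described above.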

repair

Re-download invalid files and merge the results back to the Delta table.

repaired = client.repair(
    "band_b02",                         # Delta table name or DataFrame
    where="is_out_file_valid = false",  # SQL filter over the table
)

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| table_or_df | str or DataFrame | | Delta table name or a DataFrame with item_id, asset_name, out_file_path, is_out_file_valid. |
| where | str | "is_out_file_valid = false" | SQL predicate selecting rows to re-download. |

Behavior: reads the table, filters to matching rows, re-runs the resilient download on that subset, then merges updated out_file_path, out_file_sz, is_out_file_valid, and last_update back into the Delta table. Returns the repaired subset as a DataFrame.
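
The filter, re-download, and merge sequence can be pictured with plain Python rows. This is a conceptual sketch only: it assumes rows are keyed by (item_id, asset_name), which the page does not state explicitly. The function repair_rows and the redownload callback are hypothetical.

```python
def repair_rows(table, predicate, redownload):
    """Re-run redownload on rows matching predicate and merge results by key.

    table is a list of dict rows; it is updated in place.
    Returns the repaired subset.
    """
    by_key = {(r["item_id"], r["asset_name"]): r for r in table}
    repaired = []
    for row in by_key.values():
        if predicate(row):
            # redownload returns the refreshed status columns for this row.
            row.update(redownload(row))
            repaired.append(row)
    return repaired


table = [
    {"item_id": "S2A_001", "asset_name": "B02",
     "out_file_sz": 0, "is_out_file_valid": False},
    {"item_id": "S2A_002", "asset_name": "B02",
     "out_file_sz": 1024, "is_out_file_valid": True},
]
repaired = repair_rows(
    table,
    predicate=lambda r: not r["is_out_file_valid"],
    redownload=lambda r: {"out_file_sz": 2048, "is_out_file_valid": True},
)

print([r["item_id"] for r in repaired])            # ['S2A_001']
print(all(r["is_out_file_valid"] for r in table))  # True
```

Rows that were already valid are left untouched, so running the same repair again selects nothing.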


End-to-end example

This illustrates the full search → download → repair flow. For fully-executed, worked examples see EO Series (Sentinel-2 / Alaska) or Helios (3DEP COGs over San Francisco, NB03).

from databricks.labs.gbx.stac import StacClient
from pyspark.sql import functions as F

client = StacClient() # default: Planetary Computer, sign=planetary_computer

# 1 — Search: one row per (AOI cell, STAC item, asset)
# df_cells has a "geojson" column with one GeoJSON geometry per H3 cell
assets_df = client.search(
    df_cells,
    geojson_col="geojson",
    collections=["sentinel-2-l2a"],
    datetime="2022-06-01/2022-06-30",
    partitions=512,
)
# Write to Delta for an auditable handoff
assets_df.write.mode("overwrite").saveAsTable("cell_assets")

# 2 — Download: resilient, validated; one task per unique (item_id, asset_name)
assets = spark.read.table("cell_assets")
files_df = client.download(
    assets,
    out_dir="/Volumes/my_catalog/my_schema/data/alaska/B02",
    asset_names=["B02"],
    name="{asset_name}_{item_id}.tif",
    validate=True,
    max_tries=5,
)

# Join back per-item metadata (date) from the search output
item_meta = assets.select("item_id", "date").distinct()
band_df = (
    files_df
    .join(item_meta, on="item_id", how="left")
    .withColumn("band_name", F.lit("B02"))
    # asset_name is kept because repair expects it alongside item_id
    .select("item_id", "asset_name", "band_name", "date",
            "out_file_path", "out_file_sz", "is_out_file_valid", "last_update")
)
band_df.write.mode("overwrite").saveAsTable("band_b02")

# 3 — Repair: re-download any files that failed read-validation
repaired = client.repair("band_b02", where="is_out_file_valid = false")

Serverless usage

StacClient is designed for Serverless (environment version 5, Python 3.12):

  • No spark.conf.set. Parallelism is controlled entirely via partitions= in search and download, and via DataFrame.repartition(N, "col") in your notebook — hash by a column, since on Serverless a number-only repartition(N) is coalesced by AQE back toward one partition.
  • No .cache() / .persist(). Materialize search results and downloaded-file metadata to Delta tables — Delta time travel is a more durable alternative to in-memory caching and survives session restarts.
  • One task per asset. Each download task is independent; Serverless autoscaling routes tasks across available workers without pinning.

# Serverless: write search results to Delta immediately (no caching)
client.search(
    df_cells, geojson_col="geojson",
    collections=["sentinel-2-l2a"], datetime="2022-06-01",
).write.mode("overwrite").saveAsTable("cell_assets")

# Serverless: read back from Delta for the download step
assets = spark.read.table("cell_assets")
files_df = client.download(assets, out_dir="/Volumes/...", asset_names=["B02"])
files_df.write.mode("overwrite").saveAsTable("band_b02")

No doc-test backing for this page

StacClient is a network integration client — it requires live STAC catalog access and real asset URLs. The illustrative code blocks above show the API surface; the EO Series notebooks and the Helios notebooks are fully-executed, end-to-end examples with real data.


Non-goals

The following are explicitly out of scope for the initial StacClient:

  • No async / concurrent-within-task fetching. Parallelism comes from Spark tasks, not asyncio/threads inside a UDF.
  • No non-raster asset validation. validate=True open-and-decodes a raster window; JSON, thumbnails, and vector sidecars are downloaded but not validated.
  • No catalog or item publishing. StacClient is a read/consume client.
  • No credential management. Auth is expressed through the sign parameter; token storage and refresh are the caller's responsibility.
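
Since token storage and refresh are left to the caller, one option is to put that logic inside the callable passed as sign. The sketch below is one possible shape, assuming a token that is appended as a query string and expires after a known lifetime. RefreshingSigner and fetch_token are hypothetical names, not geobrix APIs.

```python
import time


class RefreshingSigner:
    """callable(href) -> href that caches a token and refreshes it on expiry.

    fetch_token is a caller-supplied function returning
    (query_string, ttl_seconds).
    """

    def __init__(self, fetch_token, clock=time.monotonic):
        self._fetch_token = fetch_token
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def __call__(self, href: str) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token, ttl = self._fetch_token()
            self._expires_at = now + ttl
        separator = "&" if "?" in href else "?"
        return f"{href}{separator}{self._token}"


# Demo with a fake clock and a fake token source.
fake_now = [0.0]
issued = []


def fetch_token():
    issued.append(f"sig=token{len(issued) + 1}")
    return issued[-1], 60.0


sign = RefreshingSigner(fetch_token, clock=lambda: fake_now[0])
first = sign("https://example.com/a.tif")
fake_now[0] = 30.0
second = sign("https://example.com/b.tif")  # token still cached
fake_now[0] = 61.0
third = sign("https://example.com/c.tif")   # token refreshed
print(first, second, third, sep="\n")
```

If such a callable is shipped to Spark workers, each worker process would presumably hold its own cached token, so fetch_token should be safe to call from many processes.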

See also

  • EO Series notebooks: the worked end-to-end example (search → download → repair → tessellate → stack).
  • Helios notebooks: StacClient used to discover and download 3DEP DEMs from Planetary Computer, converted to COGs and cataloged in a STAC Delta table (NB03).
  • Execution Tiers: lightweight vs heavyweight comparison.
  • RasterX Function Reference: rst_h3_tessellate, rst_fromcontent, rst_merge_agg, and the rest of the raster processing functions used downstream of STAC downloads.