Security

GeoBrix is an open-source Databricks Labs project. Because the published artifacts run inside customer clusters and read data from a wide range of formats, we treat supply-chain hygiene as a first-class concern. This page describes what we do upstream to keep the package trustworthy, and what you can do on your side to build on that foundation.

For private vulnerability disclosure, see SECURITY.md — email labs@databricks.com or reach out via your Databricks representative. Please do not open a public issue for a suspected vulnerability.

How we secure the package

These controls live in the GeoBrix repository itself — they apply to every release we publish.

Pinned third-party GitHub Actions

Every third-party Action referenced under .github/ is locked to a full commit SHA from a release published before the policy cutoff, rather than a movable tag like @v3. Tag names are kept inline as comments for readability, but the SHA is what GitHub resolves. This means a compromised upstream Action repo cannot retroactively change what runs in our CI.

First-party (databricks* / databrickslabs*) Actions are exempt and continue to use tag references.
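A minimal sketch of how such a pinning policy could be linted, assuming workflow files are scanned as plain text. The function name, the regular expressions, and the owner-prefix rule are illustrative assumptions, not the check GeoBrix actually runs.

```python
import re

# Matches "uses: owner/repo[/path]@ref" lines; local ("./...") references have no "@ref".
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")
# Assumption: first-party owners are recognized by these name prefixes.
FIRST_PARTY_PREFIXES = ("databricks", "databrickslabs")


def unpinned_actions(workflow_text):
    """Return (line_number, action, ref) for third-party Actions not pinned to a full SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            continue
        action, ref = match.groups()
        owner = action.split("/", 1)[0].lower()
        if owner.startswith(FIRST_PARTY_PREFIXES):
            continue  # first-party Actions may keep tag references
        if not FULL_SHA_RE.match(ref):
            findings.append((lineno, action, ref))
    return findings
```

Running this over each file under .github/ and failing the build on a non-empty result would flag a movable tag such as @v3 while accepting a 40-character commit SHA followed by a tag comment.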

PGP-verified Maven dependencies

Every Maven artifact pulled by the build — direct dependency, transitive dependency, plugin, plugin dependency — is checked against a PGP key allowlist maintained in the repository (.maven-keys.list). The pgpverify-maven-plugin runs as the first Maven step in every CI build, so verification happens before any compile, test, or install. CI Maven invocations also pass -C (strict-checksums), so any checksum mismatch from the registry aborts the build instead of warning.

Hash-pinned Python dependencies

Every Python install path that we control — CI, the development container, and the notebook test harness — uses a lockfile generated with uv pip compile --generate-hashes and installed via pip install --require-hashes. pip refuses to install any wheel whose sha256 doesn't match what was recorded at lock time, so a compromised mirror serving a same-version-but-different-bytes wheel fails closed.

Path                          Lockfile
CI (Scala + Python build)     python/geobrix/requirements-ci.txt
Dev container                 python/geobrix/requirements-dev-container.txt
Notebook test harness         notebooks/tests/requirements.txt

GDAL is the documented exception: its Python wheel must match the GDAL native version installed on the host, so it is installed separately against the detected version. The native side is pinned via the init script (see below).
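pip enforces hash checking itself once --require-hashes is set, so the following is only an illustrative lint, assuming a lockfile in the usual requirements format with backslash-continued --hash=sha256: lines. The function name is hypothetical; it shows what "every requirement carries a recorded hash" means in practice.

```python
def requirements_missing_hashes(lockfile_text):
    """Return the requirement specifiers that carry no --hash=sha256: entry."""
    # Join backslash-continued lines so each requirement becomes one logical line.
    logical = lockfile_text.replace("\\\n", " ")
    missing = []
    for line in logical.splitlines():
        line = line.strip()
        # Skip blanks, comments (including "# via ..." annotations), and global options.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        if "--hash=sha256:" not in line:
            missing.append(line.split()[0])
    return missing
```

An empty result for each of the three lockfiles above is the precondition for pip install --require-hashes to succeed.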

Pinned GDAL native + multi-layer trust chain

As of v0.3.0, the GDAL native install path is pre-built in CI and distributed as a single signed tarball rather than fetched and compiled on every cluster start. This both strengthens the supply-chain story (the PPA round-trip happens once, in a controlled environment) and cuts cluster start time from ~15 minutes to ~30–90 seconds.

The trust chain has four layers, each gating the next:

  1. GPG fingerprint check + reviewed Git-LFS commit (upstream of CI). The build script scripts/build-gdal-artifacts.sh embeds the UbuntuGIS signing key inline and refuses to proceed unless its fingerprint matches UBUNTUGIS_FPR. This is the same check the legacy on-cluster script ran, moved upstream. The script then packages three things into a single platform tarball: the resolved runtime .deb set (PPA versions of libgdal, libproj, libgeos, proj-data, and other transitive runtime deps), the source-compiled GDAL Python wheel (built with --no-binary :all: against the PPA headers), and the libgdalalljni.so JNI library.

    That platform tarball is committed to the repository under resources/static/geobrix-gdal-platform-noble.tar.gz via Git LFS. The PR that adds or updates it is the single human-review checkpoint for the bytes that will ship to every cluster — reviewers re-run the build script locally in ubuntu:24.04 and compare the resulting sha256 to the committed sidecar before approving. Subsequent GeoBrix releases reuse that committed platform tarball (the release workflow grafts the per-release JAR onto a copy of it); no new PPA round-trip happens unless the platform tarball itself is bumped.

  2. Per-file SHA256SUMS manifest (inside the tarball). Every .deb, .whl, .so, and .jar in the bundle is hashed at build time. The cluster verifies the manifest after extraction; any per-file tampering or transport corruption fails closed before install.

  3. Outer tarball sha256 sidecar (staged in your UC Volume). A <tarball>.sha256 sidecar file is published alongside the tarball in each GeoBrix release. The operator uploads both files to the Volume; the cluster init script reads the sidecar at runtime and verifies the tarball against it. A tampered tarball fails closed before extraction.

  4. Unity Catalog Volume ACL (your environment). The Volume's write permission is the boundary that lets the cluster trust the sidecar. Only the release/operator process should be able to write to the staging Volume; read access is broader (clusters need it). This is the only layer of the chain that lives in your workspace — keep write access tightly scoped.
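Layers 2 and 3 can be sketched as follows. This is not the init script's implementation; it assumes both the sidecar and the SHA256SUMS manifest use the sha256sum line format ("<hex digest>  <file name>"), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sidecar(tarball_path, sidecar_path):
    """Layer 3: check the outer tarball against its sidecar before extraction."""
    expected = Path(sidecar_path).read_text().split()[0].lower()
    actual = sha256_of(tarball_path)
    if actual != expected:
        raise ValueError(f"tarball sha256 mismatch: {tarball_path}")
    return actual


def verify_manifest(bundle_dir, manifest_name="SHA256SUMS"):
    """Layer 2: check every extracted file listed in the manifest; return the count."""
    bundle_dir = Path(bundle_dir)
    verified = 0
    for line in (bundle_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        target = bundle_dir / name
        if not target.is_file():
            raise ValueError(f"listed in manifest but missing: {name}")
        if sha256_of(target) != expected.lower():
            raise ValueError(f"sha256 mismatch: {name}")
        verified += 1
    return verified
```

Both functions raise on the first discrepancy, which mirrors the fail-closed behavior described above: nothing is extracted if the sidecar check fails, and nothing is installed if any per-file hash differs.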

The legacy on-cluster install path is still available as scripts/geobrix-gdal-init-ppa.sh for bootstrapping new bundles or debugging. It is slower because it repeats the PPA fetch and compile on every cluster boot.

Hardened, ephemeral CI runners

All CI jobs run on Databricks-managed hardened runner groups registered for the databrickslabs org. Each job gets a fresh, ephemeral VM that is destroyed at the end of the run, so nothing persists between jobs and no state can be carried forward by a malicious step. Org-level allowlisting controls which workflows can request those runners and which secrets they can see.

Short-lived registry tokens (JFrog OIDC)

pip, Maven, and npm in CI authenticate to our artifact mirror via short-lived OIDC tokens minted per run, rather than long-lived registry credentials stored in GitHub secrets. Each token lives only as long as the workflow run, so there is nothing durable to leak.

Gated deploy environment

Workflows that need elevated tokens (pushing back to PR branches, deploying docs) run under a GitHub Environment with deployment-branch restrictions (main only) and required CODEOWNER review. The environment-scoped REPO_ACCESS_TOKEN is a fine-grained PAT limited to contents:write on this repository. Secrets are not released until those gates pass.

How you can build on this foundation

The upstream controls above protect the artifacts we ship. The controls below are what you can do at install time and at runtime to keep the same guarantees in your environment.

1. Use the init script from the matching release verbatim

Each GeoBrix release attaches its own geobrix-gdal-init.sh to the release page so the script and tarball that ship together are unambiguously paired. The docs import the latest script from main for reference; for a specific GeoBrix version, use the script attached to that release.

The architecture check, the sidecar verification, the per-file SHA256SUMS verification, and the offline pip install are load-bearing for the trust chain. The only line you should change in the script is VOL_DIR. Replacing the script with a homegrown GDAL installer drops those guarantees on your cluster.

2. Stage the tarball + sidecar in a Volume you control

Both files belong in a Unity Catalog Volume whose write ACL you've restricted to the release/operator process. The cluster only needs read access. The recommended flow is:

  • Download geobrix-gdal-artifacts-v<version>-noble.tar.gz and its matching .sha256 sidecar from the GitHub release page.
  • Verify the tarball locally: sha256sum -c <tarball>.sha256.
  • Upload both files to the Volume — the cluster init script reads the sidecar at runtime to know which tarball to expect and what hash to verify against.
  • Refresh on a controlled cadence (when bumping GeoBrix versions or applying a security patch), not automatically.

The init script lives separately — typically as a workspace file or in a separate Volume the cluster reads as its init-script attachment. Pair it with the bundle by GeoBrix version.

3. Pin the GeoBrix version in your cluster libraries

GeoBrix is in Beta: APIs may change incompatibly as they stabilize, and there are no function aliases. Pin the exact wheel and JAR version in your cluster configuration and only bump deliberately. See the Beta Release Notes for the change list per version.

4. Restrict GDAL drivers for untrusted inputs

GeoBrix wraps GDAL/OGR, which can read a very large number of raster and vector formats. When you ingest data from third-party sources, narrow the driver list with GDAL_SKIP or the per-format options described in the Readers and Writers sections, rather than leaving every driver enabled by default. The fewer drivers you leave enabled, the smaller the attack surface exposed to a malicious input file.
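Because GDAL_SKIP names the drivers to disable, an allowlist has to be inverted into a skip list. The helper below is a hypothetical sketch of that inversion; the driver names are examples, and how the resulting value reaches your executors (for instance as a cluster environment variable) is an assumption about your setup rather than a documented GeoBrix mechanism.

```python
import os


def build_gdal_skip(registered_drivers, allowed_drivers, separator=" "):
    """Return a GDAL_SKIP value that disables every registered driver not in the allowlist."""
    allowed = {name.upper() for name in allowed_drivers}
    # GDAL documents GDAL_SKIP as a space-delimited list of driver short names;
    # check the GDAL docs for your version before skipping names that contain spaces.
    return separator.join(
        name for name in registered_drivers if name.upper() not in allowed
    )


# Example driver names only. With the GDAL Python bindings installed, the real list
# could come from gdal.GetDriverByIndex(i).ShortName for i in range(gdal.GetDriverCount()).
registered = ["GTiff", "VRT", "PDF", "netCDF"]
skip_value = build_gdal_skip(registered, allowed_drivers=["GTiff", "netCDF"])

# Assumption: the variable must be in place before GDAL registers its drivers.
os.environ["GDAL_SKIP"] = skip_value
```

Keeping the allowlist in configuration and deriving the skip list from it means a GDAL upgrade that adds drivers does not silently widen what your pipeline will open.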

5. Report suspected vulnerabilities privately

If you find something that looks like a security issue in GeoBrix itself, contact labs@databricks.com or your Databricks representative before publishing details. See SECURITY.md for what to include in the report.

6. VRT Python pixel functions: off by default by design

GDAL's VRT Python pixel function API lets a <PixelFunctionCode> element in a VRT XML file execute arbitrary Python in-process at band-read time. GeoBrix sets GDAL_VRT_ENABLE_PYTHON=NO at executor startup and only flips it to YES for the duration of an individual combineavg / derivedband call (via the internal GDALManager.withVrtPython bracket). The four built-in functions inject pyfunc source generated by GeoBrix itself, never by user input.

If your own code consumes Python-pixel VRTs from less-trusted sources (e.g. you pull VRT XML from object storage that other principals can write to), either keep the option NO and pre-translate to GTiff, or switch to GDAL_VRT_ENABLE_PYTHON=TRUSTED_MODULES with a narrow GDAL_VRT_PYTHON_TRUSTED_MODULES allowlist. See RasterX § VRT Python pixel functions for the full how-to.
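One way to triage VRT files from less-trusted locations is to inspect the XML before handing it to GDAL. The sketch below assumes the standard VRT element names PixelFunctionLanguage and PixelFunctionCode; the function name is hypothetical, and the check complements rather than replaces the GDAL_VRT_ENABLE_PYTHON setting.

```python
import xml.etree.ElementTree as ET


def vrt_uses_python(vrt_xml):
    """Return True if the VRT declares a Python pixel function or embeds inline code."""
    # Note: for hostile input, consider a hardened XML parser; the standard library
    # parser is used here only to keep the sketch self-contained.
    root = ET.fromstring(vrt_xml)
    for element in root.iter():
        if element.tag == "PixelFunctionCode":
            return True
        if element.tag == "PixelFunctionLanguage":
            if (element.text or "").strip().lower() == "python":
                return True
    return False
```

A file for which this returns True can be routed to the pre-translation path (converted to GTiff in a trusted context) instead of being opened directly.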

Next steps