Security
GeoBrix is an open-source Databricks Labs project. Because the published artifacts run inside customer clusters and read data from a wide range of formats, we treat supply-chain hygiene as a first-class concern. This page describes what we do upstream to keep the package trustworthy, and what you can do on your side to build on that foundation.
For private vulnerability disclosure, see SECURITY.md: email labs@databricks.com or reach out via your Databricks representative. Please do not open a public issue for a suspected vulnerability.
How we secure the package
These controls live in the GeoBrix repository itself — they apply to every release we publish.
Pinned third-party GitHub Actions
Every third-party Action referenced under .github/ is locked to a full commit
SHA from a release published before the policy cutoff, rather than a movable
tag like @v3. Tag names are kept inline as comments for readability, but the
SHA is what GitHub resolves. This means a compromised upstream Action repo
cannot retroactively change what runs in our CI.
First-party (databricks* / databrickslabs*) Actions are exempt and
continue to use tag references.
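As an illustration of the pinning rule, the following is a minimal sketch of a check that flags third-party `uses:` references not locked to a full 40-character commit SHA. It is not the tooling GeoBrix runs; the function name, the prefix-based first-party test, and the sample references are assumptions made for the example.

```python
import re

# A `uses:` entry in a workflow file, e.g. "uses: owner/repo@ref  # v3".
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")
# Assumption: first-party owners are recognised by prefix on the owner name.
FIRST_PARTY_PREFIXES = ("databricks", "databrickslabs")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return third-party action references that are not pinned to a full SHA."""
    offenders = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        target, _, version = ref.partition("@")
        owner = target.split("/")[0].lower()
        if owner.startswith(FIRST_PARTY_PREFIXES):
            continue  # first-party actions may keep tag references
        if not FULL_SHA_RE.match(version):
            offenders.append(ref)
    return offenders
```

A check like this could run over every file under .github/ and fail the build when the returned list is non-empty.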
PGP-verified Maven dependencies
Every Maven artifact pulled by the build — direct dependency, transitive
dependency, plugin, plugin dependency — is checked against a PGP key
allowlist maintained in the repository
(.maven-keys.list).
The
pgpverify-maven-plugin
runs as the first Maven step in every CI build, so verification happens before
any compile, test, or install. CI Maven invocations also pass -C
(strict-checksums), so any checksum mismatch from the registry aborts the
build instead of warning.
Hash-pinned Python dependencies
Every Python install path that we control — CI, the development container,
and the notebook test harness — uses a lockfile generated with
uv pip compile --generate-hashes and installed via
pip install --require-hashes. pip refuses to install any wheel whose
sha256 doesn't match what was recorded at lock time, so a compromised mirror
serving a same-version-but-different-bytes wheel fails closed.
| Path | Lockfile |
|---|---|
| CI (Scala + Python build) | python/geobrix/requirements-ci.txt |
| Dev container | python/geobrix/requirements-dev-container.txt |
| Notebook test harness | notebooks/tests/requirements.txt |
GDAL is the documented exception: its Python wheel must match the GDAL native version installed on the host, so it is installed separately against the detected version. The native side is pinned via the init script (see below).
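pip itself enforces the hashes at install time. A lockfile can also be linted before it is committed. The sketch below is an assumption-laden example, not part of the GeoBrix tooling: it reports requirement lines that carry no `--hash=sha256:` entry, assuming the backslash-continued layout that `uv pip compile --generate-hashes` writes.

```python
def unhashed_requirements(lock_text: str) -> list[str]:
    """Return requirement specifiers in a lockfile that carry no sha256 hash."""
    # Join backslash-continued lines so each requirement is one logical line.
    logical_lines = lock_text.replace("\\\n", " ").splitlines()
    missing = []
    for line in logical_lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comments, and option lines such as --index-url
        if "--hash=sha256:" not in line:
            missing.append(line.split()[0])
    return missing
```

An empty result means every requirement in the file is hash-pinned; any name in the result would make `pip install --require-hashes` refuse the install.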
Pinned GDAL native + multi-layer trust chain
As of v0.3.0, the GDAL native install path is pre-built in CI and distributed as a single signed tarball rather than fetched and compiled on every cluster start. This both strengthens the supply-chain story (the PPA round-trip happens once, in a controlled environment) and cuts cluster start time from ~15 minutes to ~30–90 seconds.
The trust chain has four layers, each gating the next:
- GPG fingerprint check + reviewed Git-LFS commit (upstream of CI). The build script scripts/build-gdal-artifacts.sh embeds the UbuntuGIS signing key inline and refuses to proceed unless its fingerprint matches UBUNTUGIS_FPR. This is the same check the legacy on-cluster script ran, moved upstream. The resolved runtime .deb set (PPA versions of libgdal, libproj, libgeos, proj-data, and other transitive runtime deps) plus the source-compiled GDAL Python wheel (--no-binary :all: against the PPA headers) and the libgdalalljni.so JNI are packaged into a single platform tarball.

  That platform tarball is committed to the repository under resources/static/geobrix-gdal-platform-noble.tar.gz via Git LFS. The PR that adds or updates it is the single human-review checkpoint for the bytes that will ship to every cluster: reviewers re-run the build script locally in ubuntu:24.04 and compare the resulting sha256 to the committed sidecar before approving. Subsequent GeoBrix releases reuse that committed platform tarball (the release workflow grafts the per-release JAR onto a copy of it); no new PPA round-trip happens unless the platform tarball itself is bumped.
- Per-file SHA256SUMS manifest (inside the tarball). Every .deb, .whl, .so, and .jar in the bundle is hashed at build time. The cluster verifies the manifest after extraction; any per-file tampering or transport corruption fails closed before install.
- Outer tarball sha256 sidecar (staged in your UC Volume). A <tarball>.sha256 sidecar file is published alongside the tarball in each GeoBrix release. The operator uploads both files to the Volume; the cluster init script reads the sidecar at runtime and verifies the tarball against it. A tampered tarball fails closed before extraction.
- Unity Catalog Volume ACL (your environment). The Volume's write permission is the boundary that lets the cluster trust the sidecar. Only the release/operator process should be able to write to the staging Volume; read access is broader (clusters need it). This is the only layer of the chain that lives in your workspace, so keep write access tightly scoped.
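To make the per-file manifest layer concrete, here is a minimal sketch of a fail-closed check over an extracted bundle. It is not the init script's implementation: the function names are hypothetical, and the manifest is assumed to use the `<hex digest>  <relative path>` layout that `sha256sum` writes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bundle_dir: Path, manifest_name: str = "SHA256SUMS") -> None:
    """Raise ValueError on the first listed file that is missing or altered."""
    for line in (bundle_dir / manifest_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # "*" marks binary mode in sha256sum output
        target = bundle_dir / name
        if not target.is_file():
            raise ValueError(f"listed file is missing: {name}")
        if sha256_of(target) != expected.lower():
            raise ValueError(f"sha256 mismatch: {name}")
```

Raising on the first failure, rather than collecting warnings, mirrors the fail-closed behaviour described above: nothing is installed unless every listed file matches.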
The legacy on-cluster install path is still available as
scripts/geobrix-gdal-init-ppa.sh
for bootstrapping new bundles or debugging. It is slower because it repeats
the PPA fetch and build on every cluster start.
Hardened, ephemeral CI runners
All CI jobs run on Databricks-managed hardened runner groups registered for
the databrickslabs org. Each job gets a fresh, ephemeral VM that is
destroyed at the end of the run, so nothing persists between jobs and no
state can be carried forward by a malicious step. Org-level allowlisting
controls which workflows can request those runners and which secrets they
can see.
Short-lived registry tokens (JFrog OIDC)
Pip, Maven, and npm in CI authenticate to our artifact mirror via short-lived OIDC tokens minted per run, not long-lived registry credentials stored as GitHub secrets. The token's lifetime is the workflow run; there is nothing durable to leak.
Gated deploy environment
Workflows that need elevated tokens (pushing back to PR branches, deploying
docs) run under a GitHub Environment with deployment-branch restrictions
(main only), required CODEOWNER review, and the environment-scoped
REPO_ACCESS_TOKEN as a fine-grained PAT limited to contents:write on this
repository. Secrets are not released until those gates pass.
How you can build on this foundation
The upstream controls above protect the artifacts we ship. The controls below are what you can do at install time and at runtime to keep the same guarantees in your environment.
1. Use the init script from the matching release verbatim
Each GeoBrix release attaches its own geobrix-gdal-init.sh to the
release page so the
script and tarball that ship together are unambiguously paired. The docs
import the latest script from main for reference; for a specific
GeoBrix version, use the script attached to that release.
The architecture check, the sidecar verification, the per-file
SHA256SUMS verification, and the offline pip install are load-bearing
for the trust chain. The only value you should change in the script is
VOL_DIR. Replacing the script with a homegrown GDAL installer drops those
guarantees on your cluster.
2. Stage the tarball + sidecar in a Volume you control
Both files belong in a Unity Catalog Volume whose write ACL you've restricted to the release/operator process. The cluster only needs read access. The recommended flow is:
- Download geobrix-gdal-artifacts-v<version>-noble.tar.gz and its matching .sha256 sidecar from the GitHub release page.
- Verify the tarball locally: sha256sum -c <tarball>.sha256.
- Upload both files to the Volume. The cluster init script reads the sidecar at runtime to know which tarball to expect and what hash to verify against.
- Refresh on a controlled cadence (when bumping GeoBrix versions or applying a security patch), not automatically.
The init script lives separately — typically as a workspace file or in a separate Volume the cluster reads as its init-script attachment. Pair it with the bundle by GeoBrix version.
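Where `sha256sum` is not available on the operator's machine, the local verification step can be approximated in Python. This is a sketch under the assumption that the sidecar starts with the hex digest, as `sha256sum` output does; the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_sidecar(tarball: Path, sidecar: Path) -> bool:
    """Return True when the tarball's sha256 equals the digest in the sidecar."""
    # Assumed sidecar layout: "<hex digest>  <file name>".
    expected = sidecar.read_text().split()[0].lower()
    digest = hashlib.sha256()
    with tarball.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

Upload the two files only when the function returns True; a False result means the download should be discarded and fetched again from the release page.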
3. Pin the GeoBrix version in your cluster libraries
GeoBrix is in Beta: APIs may change in breaking ways as they stabilize, and there are no function aliases. Pin the exact wheel and JAR version in your cluster configuration and only bump deliberately. See the Beta Release Notes for the change list per version.
4. Restrict GDAL drivers for untrusted inputs
GeoBrix wraps GDAL/OGR, which can read a very large number of raster and
vector formats. When you ingest data from third-party sources, narrow the
driver list with GDAL_SKIP or the per-format options described in the
Readers and Writers sections,
rather than leaving every driver enabled by default. The smaller the
allowlist, the smaller the attack surface from a malicious input file.
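GDAL_SKIP is a denylist, so an allowlist has to be inverted before it can be applied. The sketch below shows one way to derive the value. The function name and the driver names in the usage note are illustrative. The comma separator is used because some vector driver names contain spaces; this relies on GDAL accepting a comma-separated list, which should be checked against the GDAL version on your cluster.

```python
def gdal_skip_value(registered: list[str], allowed: set[str]) -> str:
    """Build a GDAL_SKIP value that disables every driver not in `allowed`."""
    allowed_upper = {name.upper() for name in allowed}
    skipped = [name for name in registered if name.upper() not in allowed_upper]
    return ",".join(skipped)
```

For example, `gdal_skip_value(["GTiff", "VRT", "PDF"], {"GTiff"})` yields `"VRT,PDF"`. The `registered` list would normally come from the drivers GDAL reports on the cluster, and the value has to be in the environment before GDAL registers its drivers.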
5. Report suspected vulnerabilities privately
If you find something that looks like a security issue in GeoBrix itself,
contact labs@databricks.com or your Databricks representative before
publishing details. See
SECURITY.md
for what to include in the report.
6. VRT Python pixel functions: off by default by design
GDAL's VRT Python pixel function API
lets a <PixelFunctionCode> element in a VRT XML file execute arbitrary
Python in-process at band-read time. GeoBrix sets GDAL_VRT_ENABLE_PYTHON=NO
at executor startup and only flips it to YES for the duration of an
individual combineavg / derivedband call (via the internal
GDALManager.withVrtPython bracket). The four built-in functions inject
pyfunc source generated by GeoBrix itself, never by user input.
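The bracket pattern described above can be sketched in Python as a context manager. This is not the GDALManager.withVrtPython implementation, which is internal to GeoBrix: the name `with_vrt_python` is hypothetical, and a plain dict stands in for GDAL's configuration options.

```python
from contextlib import contextmanager


@contextmanager
def with_vrt_python(options: dict):
    """Enable VRT Python for the enclosed block, then restore the old value."""
    previous = options.get("GDAL_VRT_ENABLE_PYTHON")
    options["GDAL_VRT_ENABLE_PYTHON"] = "YES"
    try:
        yield
    finally:
        # Runs even if the enclosed call raises, so the option never stays on.
        if previous is None:
            options.pop("GDAL_VRT_ENABLE_PYTHON", None)
        else:
            options["GDAL_VRT_ENABLE_PYTHON"] = previous
```

The point of the `try`/`finally` is that the option is restored on every exit path, so a failing call cannot leave Python pixel functions enabled for later reads.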
If your own code consumes Python-pixel VRTs from less-trusted sources
(e.g. you pull VRT XML from object storage that other principals can
write to), either keep the option NO and pre-translate to GTiff, or
switch to GDAL_VRT_ENABLE_PYTHON=TRUSTED_MODULES with a narrow
GDAL_VRT_PYTHON_TRUSTED_MODULES allowlist. See
RasterX § VRT Python pixel functions
for the full how-to.
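Before handing a VRT from a less-trusted source to GDAL, your own code can screen it for Python pixel functions. The following is a minimal sketch with a hypothetical function name. It inspects only the VRT it is given, so a VRT that references another VRT as a source is not covered, and for hostile XML a hardened parser may be preferable to the standard library one.

```python
import xml.etree.ElementTree as ET


def vrt_uses_python(vrt_xml: str) -> bool:
    """Return True if the VRT declares a Python pixel function or inline code."""
    root = ET.fromstring(vrt_xml)
    for band in root.iter("VRTRasterBand"):
        language = band.findtext("PixelFunctionLanguage", default="")
        if language.strip().lower() == "python":
            return True
        if band.find("PixelFunctionCode") is not None:
            return True
    return False
```

A file for which this returns True can be rejected or routed to the pre-translate-to-GTiff path instead of being read directly.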
Next steps
- Installation Guide — apply the init script as part of cluster setup.
- Readers overview — configuration knobs for narrowing the GDAL driver surface.
- SECURITY.md — vulnerability reporting policy.