Developers
This page is for contributors and developers working in the GeoBrix repository. It describes how the project is organized and how to use the gbx:* commands effectively.
How the project is organized
GeoBrix is a multi-artifact repo: Scala/JVM core, Python bindings, docs, and tooling share the same root and are wired for Databricks and local development.
Repository layout
| Path | Purpose |
|---|---|
| src/main/scala/com/databricks/labs/gbx/ | Core implementation: readers, expressions, RasterX, GridX, VectorX |
| src/test/scala/ | Scala unit and integration tests |
| python/geobrix/ | Python package: PySpark bindings and sample-data bundle |
| docs/ | Docusaurus site: docs/ (content), src/ (components), tests under docs/tests/ |
| notebooks/ | Sample notebooks (e.g. sample-data/setup_sample_data.ipynb) and notebooks/tests/ |
| scripts/ | CI, Docker, and one-off scripts |
| sample-data/ | Scripts and outputs for sample data (host); in-cluster uses Volumes path |
| scripts/commands/ | gbx:* palette commands: .md registration + .sh implementation (see below) |
| CLAUDE.md | Project conventions and working patterns; read this first when starting work here |
Packages and readers
- RasterX: Raster operations and expressions (GDAL-backed); rst_*/gbx_rst_*.
- GridX: Grid systems (BNG, H3); bng_*/gbx_bng_*.
- VectorX: Vector geometry and OGR-backed readers; st_*/gbx_st_*.
- Readers: Format-specific data sources (GDAL, OGR, GeoTIFF, Shapefile, GeoJSON, GeoPackage, etc.) registered as Spark data sources.
Tests and docs
- Unit tests: src/test/scala/ (Scala), python/geobrix/test/ (Python).
- Documentation tests: docs/tests/python/, docs/tests/scala/. These validate the code examples used in the docs and are their single source of truth.
- Notebook tests: notebooks/tests/ mirrors notebooks/; run via gbx:* commands or CI.
Development and CI use a Docker image (geobrix-dev) for a consistent environment; most gbx:* commands run inside that container.
Git LFS — required to clone the GDAL platform tarball
The GDAL platform tarball at resources/static/geobrix-gdal-platform-noble.tar.gz (~90 MB, ships in every GeoBrix release as the runtime GDAL bundle) is stored via Git LFS so the binary lives in LFS storage instead of the git pack. The matching .sha256 sidecar is small enough to live in git directly and is NOT LFS-tracked. The tracking rule is in .gitattributes at the repo root.
One-time install per machine
brew install git-lfs # macOS; or apt-get install git-lfs on Debian/Ubuntu
git lfs install # writes LFS filters into ~/.gitconfig
Cloning the repo
After git lfs install, a normal git clone of geobrix automatically fetches LFS objects:
git clone git@github.com:databrickslabs/geobrix.git
If you cloned before installing git-lfs, run git lfs pull from inside the working tree to fetch the binary. Without that step, resources/static/geobrix-gdal-platform-noble.tar.gz will be a ~130-byte LFS pointer file rather than the real 90 MB tarball, and the package-geobrix-artifacts.yml workflow's lfs: true checkout will fail an integrity check.
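One way to tell whether the tarball in a working tree is the real binary or still an LFS pointer is to look at its first bytes. The sketch below assumes the standard Git LFS pointer layout, whose first line is `version https://git-lfs.github.com/spec/v1`; the helper name `is_lfs_pointer` is hypothetical and not part of the repo.

```python
from pathlib import Path

# First line of a Git LFS pointer file (pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` looks like a Git LFS pointer rather than real content."""
    with path.open("rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


if __name__ == "__main__":
    tarball = Path("resources/static/geobrix-gdal-platform-noble.tar.gz")
    if not tarball.exists():
        print(f"{tarball} not found; run this from the repo root")
    elif is_lfs_pointer(tarball):
        print("LFS pointer only; run `git lfs pull` to fetch the real tarball")
    else:
        print(f"real tarball present ({tarball.stat().st_size} bytes)")
```

Run from the repo root, this reports which of the two states the checkout is in without invoking git at all.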
Updating the platform tarball
Rebuild only when GDAL_PPA_VERSION changes, when DBR moves to a new Ubuntu LTS, or for a security advisory against one of the bundled libs. See resources/static/README.md for the full Docker-based recipe. The short version:
- Run scripts/build-gdal-artifacts.sh --platform-only inside a fresh ubuntu:24.04 container (Docker recipe in the README).
- Move the resulting geobrix-gdal-platform-noble.tar.gz + .sha256 from dist/ into resources/static/.
- git add resources/static/geobrix-gdal-platform-noble.tar.gz: the LFS filter intercepts via .gitattributes. Verify with git lfs ls-files (should list the tarball) and git diff --cached --stat resources/static/geobrix-gdal-platform-noble.tar.gz (should show ~3 lines added, i.e. the pointer, not 90 MB).
- git add resources/static/geobrix-gdal-platform-noble.tar.gz.sha256: committed normally, not LFS.
- Open a PR. The reviewer re-runs the build script locally in their own ubuntu:24.04 container and confirms the resulting sha256 matches the committed sidecar before approving; that PR review is the trust anchor for every cluster that subsequently installs from this bundle. See Security for the full chain.
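The reviewer's check in the last step amounts to recomputing the tarball's SHA-256 and comparing it with the sidecar. A minimal sketch, assuming the sidecar follows the usual `sha256sum` layout (hex digest first, optionally followed by a filename); the actual sidecar format is not described on this page, and both function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so a ~90 MB tarball is never held in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(tarball: Path, sidecar: Path) -> bool:
    """Compare the tarball digest with the first token of the sidecar file.

    Assumption: the sidecar starts with the hex digest, as `sha256sum` writes it.
    """
    tokens = sidecar.read_text().split()
    if not tokens:
        return False  # an empty sidecar can never match
    return sha256_of(tarball) == tokens[0].lower()
```

A mismatch here means the rebuilt tarball differs from the committed one, which is the signal the review step is meant to catch.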
Storage considerations
LFS bandwidth and storage come from the databrickslabs GitHub org quota. Each tarball bump consumes both. Don't rebuild the tarball just to bump GeoBrix versions — the release workflow grafts the per-release JAR onto the committed platform tarball without changing it.
Testing on a Databricks cluster
You can run the Essential bundle and primitive Volume tests on a live Databricks cluster so that Volume paths are FUSE-mounted and the bundle uses pathlib/shutil only (no Databricks Files API).
Config — Copy notebooks/tests/databricks_cluster_config.example.env to notebooks/tests/databricks_cluster_config.env and set:
- DATABRICKS_HOST, DATABRICKS_TOKEN (or DATABRICKS_CONFIG_PROFILE)
- CLUSTER_ID (existing cluster to run the job)
- GBX_BUNDLE_VOLUME_CATALOG, GBX_BUNDLE_VOLUME_SCHEMA, GBX_BUNDLE_VOLUME_NAME: Volume root is /Volumes/<catalog>/<schema>/<volume_name>. The volume name must match Data Explorer exactly (e.g. sample-data not sample_data).
- GBX_ARTIFACT_VOLUME: directory for artifacts (e.g. /Volumes/.../artifacts). JAR and wheel are uploaded directly here (no subpaths). Wheel path for the notebook is derived as GBX_ARTIFACT_VOLUME/geobrix-<version>-py3-none-any.whl unless overridden.
- Optional: GBX_BUNDLE_WHEEL_VOLUME_PATH overrides the full wheel path for the notebook pip cells.
- Optional: GBX_BUNDLE_SKIP_WHEEL_UPLOAD=1 uses the existing wheel (no build/upload); the notebook still gets the pip and restart cells.
- Optional: GBX_BUNDLE_SKIP_JAR_UPLOAD=1 skips the JAR build/upload, whether it would run as part of push-wheel or through push-jar alone.
- Optional: GBX_RUNNER_DIR, GBX_BUNDLE_RUNNER_NOTEBOOK, GBX_PRIMITIVE_RUNNER_NOTEBOOK set where to upload the runner notebooks.
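The path rules in this list can be sketched as plain string composition. This is an illustration only: the function names and the example values (`main`, `default`, `0.4.0`) are hypothetical, and the runner scripts may derive these paths differently.

```python
from typing import Mapping


def volume_root(env: Mapping[str, str]) -> str:
    """Build /Volumes/<catalog>/<schema>/<volume_name> from the three bundle variables."""
    return "/Volumes/{}/{}/{}".format(
        env["GBX_BUNDLE_VOLUME_CATALOG"],
        env["GBX_BUNDLE_VOLUME_SCHEMA"],
        env["GBX_BUNDLE_VOLUME_NAME"],  # must match Data Explorer exactly, e.g. sample-data
    )


def wheel_path(env: Mapping[str, str], version: str) -> str:
    """Return the wheel path used by the notebook pip cells.

    GBX_BUNDLE_WHEEL_VOLUME_PATH wins when set; otherwise the wheel is expected
    directly under GBX_ARTIFACT_VOLUME (no subpaths).
    """
    override = env.get("GBX_BUNDLE_WHEEL_VOLUME_PATH")
    if override:
        return override
    artifact_dir = env["GBX_ARTIFACT_VOLUME"].rstrip("/")
    return f"{artifact_dir}/geobrix-{version}-py3-none-any.whl"


if __name__ == "__main__":
    example = {
        "GBX_BUNDLE_VOLUME_CATALOG": "main",      # hypothetical values
        "GBX_BUNDLE_VOLUME_SCHEMA": "default",
        "GBX_BUNDLE_VOLUME_NAME": "sample-data",
        "GBX_ARTIFACT_VOLUME": "/Volumes/main/default/artifacts",
    }
    print(volume_root(example))
    print(wheel_path(example, "0.4.0"))
```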
Commands — From the repo root:
- gbx:test:primitive-databricks: Pushes the primitive notebook and runs it on the cluster. Validates that the volume exists, then creates subdirs and reads, writes, and copies files via FUSE (pathlib/shutil). No GeoBrix dependency.
- gbx:test:bundle-databricks: Pushes the bundle runner notebook and runs it on the cluster. If GBX_BUNDLE_WHEEL_VOLUME_PATH is set, the notebook has: (1) %pip install --quiet <wheel>, (2) %pip install --quiet --no-deps --force-reinstall <wheel>, (3) dbutils.library.restartPython(), then the bundle cell. Run those cells in order so the restarted process loads the new GeoBrix code.
Convention — For Volume path handling (FUSE, pathlib, no random access), see the "Unity Catalog Volumes" section in CLAUDE.md.
gbx:* commands
The repo provides gbx:* commands so both humans and AI agents can run tests, coverage, docs, and Docker in a consistent way. Each command is a .md registration + .sh implementation under scripts/commands/ (the directory name is historical; the commands are usable from any shell).
Conventions and architectural guidance live in CLAUDE.md at the repo root — read that for cross-language naming, BNG resolution rules, GDAL resource management, doc-test single-source pattern, and the user-facing-docs voice rule. Agents (Cursor or Claude) read CLAUDE.md as their entry point.
Commands
Commands are invocable actions. Prefer them over ad-hoc shell for tests, coverage, docs, Docker, and data so behavior is consistent and reproducible.
How to invoke
- From Cursor UI: Use the command palette (/ or type the command name) and run the desired gbx:* command. Each command is backed by a .md (registration) and a .sh (implementation) in scripts/commands/.
- From a shell: Run the script directly, e.g. bash scripts/commands/gbx-test-scala.sh [OPTIONS]. This is the form most often used by terminals, CI, or AI agents (Claude, Cursor) invoking them via a shell tool.
Naming
Commands follow gbx:<category>:<action>:
| Category | Purpose |
|---|---|
| test | Run tests (Scala, Python, docs, SQL docs, notebooks, function-info) |
| coverage | Code coverage (Scala/Python, unit/docs, gaps, baseline, package-targeted) |
| data | Sample data: download (essential/complete bundle), generate minimal bundle, push JAR/wheel to Volume |
| docs | Documentation server (start, stop, restart, dev, serve-local, static-build, function-info) |
| docker | Container lifecycle (exec, start, stop, restart, rebuild, attach) |
| ci | CI / GitHub Actions: push, trigger, status, watch, logs, docs menu, setup |
| lint | Scala: scalastyle; Python: isort, black, flake8 (same as CI) |
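To make the naming scheme concrete, the sketch below splits a command name into category and action and maps it to a script path. The colon-to-hyphen mapping is inferred from the single example scripts/commands/gbx-test-scala.sh and is an assumption, not a documented rule; `parse_command` and `script_path` are hypothetical helpers.

```python
from typing import NamedTuple

# Categories listed in the table above.
CATEGORIES = {"test", "coverage", "data", "docs", "docker", "ci", "lint"}


class GbxCommand(NamedTuple):
    category: str
    action: str


def parse_command(name: str) -> GbxCommand:
    """Split 'gbx:<category>:<action>' and reject anything that does not fit."""
    prefix, _, rest = name.partition(":")
    category, _, action = rest.partition(":")
    if prefix != "gbx" or category not in CATEGORIES or not action:
        raise ValueError(f"not a gbx:<category>:<action> command: {name!r}")
    return GbxCommand(category, action)


def script_path(name: str) -> str:
    """Assumed mapping: colons become hyphens under scripts/commands/."""
    cmd = parse_command(name)
    return f"scripts/commands/gbx-{cmd.category}-{cmd.action}.sh"


if __name__ == "__main__":
    print(script_path("gbx:test:scala"))
```

If a script is not where this mapping suggests, list scripts/commands/ rather than trusting the sketch.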
When to use which command
Use the following by task; always prefer the command over manual shell for these operations.
Testing
| Command | When to use | What it does |
|---|---|---|
| gbx:test:scala | After Scala changes, before commit | Runs Scala unit tests (excludes doc tests); supports --suite |
| gbx:test:python | After Python changes | Runs Python unit tests in python/geobrix/test/ |
| gbx:test:scala-docs | After changing Scala doc examples | Runs Scala doc tests in docs/tests/scala/ |
| gbx:test:python-docs | After changing Python doc examples | Runs Python doc tests in docs/tests/python/ (default: no integration) |
| gbx:test:sql-docs | After changing SQL API examples | Runs SQL (and Python API) doc tests in docs/tests/python/api/ |
| gbx:test:docs | Before PR that touches docs | Runs all doc tests by invoking python-docs, sql-docs, and scala-docs in sequence; sets GBX_SAMPLE_DATA_ROOT to minimal bundle by default (use --no-sample-data-root for full bundle) |
| gbx:test:function-info | After changing function-info or doc SQL | Regenerates function-info and runs DESCRIBE/coverage tests |
| gbx:test:notebooks | After changing sample-data or notebook runner | Runs notebook tests; use --include-integration for full run |
| gbx:test:python-dbr | Validate Python on Databricks Runtime | DBR integration tests (spatial SQL, readers); excluded from regular CI; requires DBR environment |
| gbx:test:bundle-databricks | Validate Essential bundle on a live cluster | Pushes runner notebook and runs it on CLUSTER_ID; use --local to run bundle on host |
| gbx:test:primitive-databricks | Validate Volume access on cluster (FUSE) | Pushes primitive notebook; checks that the volume exists, creates subdirs, and reads/writes/copies via pathlib |
Coverage (Scala coverage is slow; use strategically)
| Command | When to use | What it does |
|---|---|---|
| gbx:coverage:gaps | See where to focus | Shows package-level gaps (no test run) |
| gbx:coverage:scala-package | After changes in one package | Runs coverage for one package (e.g. rasterx, gridx) |
| gbx:coverage:baseline | Weekly or before release | Full Scala or Python baseline |
| gbx:coverage:scala | Full Scala coverage (sparingly) | Full scoverage; use --report-only to view last run |
| gbx:coverage:python | After Python changes | Python unit test coverage (fast) |
| gbx:coverage:scala-docs / gbx:coverage:python-docs | After doc test changes | Coverage for doc test suites |
Data
Doc tests use the in-repo minimal bundle (no download step). Generate it once with gbx:data:generate-minimal-bundle; the Docker Volumes mount makes it available to tests. For full sample data locally, use gbx:data:download. Minimal-bundle Sentinel-2 rasters (and *_byte.tif variants) may appear black in QGIS or other viewers; the full-size Essential/Complete bundle rasters are the ones suited for visual inspection.
| Command | When to use | What it does |
|---|---|---|
| gbx:data:download | Need full sample data locally | Downloads essential and/or complete bundle to sample-data/ |
| gbx:data:generate-minimal-bundle | CI or doc tests; after full bundle if needed | Generates minimal bundle under sample-data/Volumes/.../test-data/geobrix-examples/ by bbox extraction (NYC/London, --bbox-size, --max-rows); doc test commands use this, not a download step |
| gbx:data:push-wheel | Put built JAR and wheel on Volume | Builds JAR first (push-jar) unless GBX_BUNDLE_SKIP_JAR_UPLOAD=1, then clears python/geobrix/dist, runs python3 -m build, uploads wheel to GBX_ARTIFACT_VOLUME/ (overwrite if exists); set GBX_BUNDLE_SKIP_WHEEL_UPLOAD=1 to skip wheel |
| gbx:data:push-jar | Put built JAR on a Volume | Runs mvn clean package -DskipTests, uploads JAR to GBX_ARTIFACT_VOLUME/ (overwrite if exists); set GBX_BUNDLE_SKIP_JAR_UPLOAD=1 to skip |
Documentation
| Command | When to use | What it does |
|---|---|---|
| gbx:docs:start | Serve docs (one-off) | Builds (optional) and starts server on port 3000 |
| gbx:docs:stop | Free port or before another docs command | Stops any running docs server |
| gbx:docs:dev | While editing docs | Dev server with hot reload |
| gbx:docs:serve-local | Preview production build | Build then serve static site |
| gbx:docs:static-build | Create offline/portable docs zip | Build with relative paths and hash router; zip to resources/static/ by default (use --output <path>); leaves docs/build/ unchanged for serving |
| gbx:docs:restart | Restart after stop | Stop + start with same options |
| gbx:docs:function-info | After changing doc SQL examples | Regenerates function-info.json from doc SQL |
Docker
| Command | When to use | What it does |
|---|---|---|
| gbx:docker:start | First time or after stop | Starts geobrix-dev container |
| gbx:docker:stop | When done developing | Stops container |
| gbx:docker:exec | Run Maven, pytest, etc. | Runs a command (or interactive shell) in container |
| gbx:docker:attach | Interactive shell in container | Attach to running container |
| gbx:docker:restart | After config change | Restart container |
| gbx:docker:rebuild | After Dockerfile or deps change | Rebuild image and optionally start |
| gbx:docker:clear-pycache | After editing Python code, stale imports | Clears .pyc and __pycache__ in container so tests see fresh code |
Lint
| Command | When to use | What it does |
|---|---|---|
| gbx:lint:scalastyle | Before pushing Scala changes | Runs ScalaStyle on src/main/scala (same config as CI); catches trailing whitespace, missing EOF newline, non-ASCII in comments |
| gbx:lint:python | Before pushing Python changes | Runs isort, black, flake8 on python/geobrix (same as CI). Default: check-only in Docker. Use --fix on host to apply isort/black (requires pip install -e "python/geobrix[dev]"). |
CI (require GitHub CLI gh; use gbx:ci:setup to install/authenticate)
| Command | When to use | What it does |
|---|---|---|
| gbx:ci:push | Initiate remote build on current branch (e.g. beta/0.4.0) | Pushes branch to origin, then watches the build main workflow run |
| gbx:ci:trigger | Push then manually trigger build main (e.g. workflow_dispatch) | Pushes branch, lists runs, prompts to trigger build main on current branch |
| gbx:ci:status | Check recent CI runs | Shows recent workflow runs for current branch (optional: [LIMIT]) |
| gbx:ci:watch | Stream a CI run | Watches latest run (or [RUN_ID]) in real time |
| gbx:ci:logs | Download CI logs | Fetches logs for latest run (or [RUN_ID]) into ci-logs/ |
| gbx:ci:docs | Doc-test CI menu | Run doc tests locally (python/scala), status, trigger, watch, logs (or no args for menu) |
| gbx:ci:setup | One-time CI setup | Install and authenticate GitHub CLI (gh) |
Most commands accept --help. Common options: --log <path> for test/output logs (truncated each run), --open for coverage reports, and command-specific flags (e.g. --suite, --path, --skip-build). Doc test commands set GBX_SAMPLE_DATA_ROOT=/Volumes/main/default/test-data in the container by default so the minimal bundle is used (required for remote/CI); use --no-sample-data-root to leave it unset and use the full-bundle path or your own env. They do not run a sample-data download; the minimal or full bundle must be present via the Volumes mount.
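The GBX_SAMPLE_DATA_ROOT default described above can be modelled as a small environment transform. This is a sketch of the documented behaviour, not the scripts' actual code: `doc_test_env` is a hypothetical name, and whether the real commands overwrite a value that is already set is not stated here (this version assumes they do).

```python
from typing import Dict, Mapping

# Default minimal-bundle root used by the doc test commands inside the container.
MINIMAL_BUNDLE_ROOT = "/Volumes/main/default/test-data"


def doc_test_env(base_env: Mapping[str, str], no_sample_data_root: bool = False) -> Dict[str, str]:
    """Return the environment a doc test run would see.

    By default GBX_SAMPLE_DATA_ROOT points at the minimal bundle; with
    no_sample_data_root=True (mirroring --no-sample-data-root) the caller's
    environment is passed through untouched.
    """
    env = dict(base_env)
    if not no_sample_data_root:
        env["GBX_SAMPLE_DATA_ROOT"] = MINIMAL_BUNDLE_ROOT
    return env
```

Either way, no download happens: the bundle the variable points at has to be present already via the Volumes mount.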
Working with agents
When working with Claude or Cursor in this repo, agents should:
- Read CLAUDE.md first: it documents project conventions and translates cross-project working patterns into geobrix-specific behavior (Docker container, gh account switching, etc.).
- Dispatch long-running work to subagents: gbx:* test/build commands typically take minutes and benefit from running in an isolated context, freeing the main session for review.
- Use gbx:* commands rather than ad-hoc shell for tests, coverage, docs, and Docker: they handle env vars, log paths, and container setup consistently.
- Add or fix a command rather than work around it: if a gbx:* command is broken, fix the script in scripts/commands/<name>.sh; don't invoke the underlying tool directly and let the command rot.
Quick reference
- Run tests: gbx:test:* (pick the scope: scala, python, docs, notebooks, bundle-databricks, primitive-databricks).
- Coverage: Prefer gbx:coverage:gaps and gbx:coverage:scala-package for Scala; gbx:coverage:python is fast.
- Docs: gbx:docs:dev while editing; gbx:docs:stop to stop.
- Docker: gbx:docker:start then gbx:docker:exec (or attach) for builds and tests.
- Databricks cluster: gbx:test:primitive-databricks then gbx:test:bundle-databricks with databricks_cluster_config.env set.
- Conventions and patterns: Read CLAUDE.md at the repo root.
- Full command list and options: See the "Commands" section above, or run any gbx:* command with --help.
CI / GitHub Actions
Workflows live in .github/workflows/. They define when and how tests and builds run on GitHub.
When things run
- Main build (build_main.yml): Runs on push to any branch (except python/** and scala/**), on all pull requests, and via workflow_dispatch (manual trigger from the Actions tab or gh workflow run "build main" --ref <branch>). Pipeline: checkout → build Scala → build Python → rebuild doc-snippet-inventory → (on push to main only) Scala doc tests → upload artifacts.
  - Scala tests in the main run use -Dsuites='com.databricks.labs.gbx.*', so only unit/integration tests run; Scala doc tests (docs/tests/scala/) are excluded from this step.
  - Scala doc tests run in a separate step only when the event is push to main. That step sets GBX_SAMPLE_DATA_ROOT and runs the doc test suites.
- Documentation tests (doc-tests.yml): Run after the "build main" workflow completes (on the same ref). Python doc tests and structure validation run here; Scala doc tests are run by build_main (see above). Also triggerable manually via workflow_dispatch.
- Branch-specific builds: build_python.yml runs on push to python/**; build_scala.yml runs on push to scala/**.
- CodeQL (codeql-analysis.yml): Security analysis on push and pull_request to main.
- Publish / release: Triggered by release or publish events (see the workflow files).
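The trigger rules for the main build can be summarised as two small predicates. This is a simplified model of the behaviour described in this list, not a transcription of build_main.yml: the function names are hypothetical, and the python/** and scala/** exclusions are approximated with a prefix check rather than GitHub's glob matching.

```python
# Branch prefixes handled by build_python.yml / build_scala.yml instead.
EXCLUDED_PREFIXES = ("python/", "scala/")


def main_build_runs(event: str, branch: str = "") -> bool:
    """Return True if the 'build main' workflow would start for this event."""
    if event in ("pull_request", "workflow_dispatch"):
        return True
    if event == "push":
        return not branch.startswith(EXCLUDED_PREFIXES)
    return False


def scala_doc_tests_run(event: str, branch: str = "") -> bool:
    """Scala doc tests run in their own step only on push to main."""
    return event == "push" and branch == "main"


if __name__ == "__main__":
    for event, branch in [("push", "beta/0.4.0"), ("push", "python/fix"), ("push", "main")]:
        print(event, branch, main_build_runs(event, branch), scala_doc_tests_run(event, branch))
```

The workflow YAML remains the authority; the model is only meant to make the combinations easy to read off.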
Initiating a build from a branch
Pushing to any branch (except python/** and scala/**) triggers the build main workflow. To run the main build on your current branch (e.g. beta/0.4.0):
- Push and watch: gbx:ci:push. Pushes the current branch to origin (push triggers build main), then streams the run.
- Trigger after push: gbx:ci:trigger. Pushes, then prompts to trigger build main (workflow_dispatch).
- Check status: gbx:ci:status. Recent workflow runs for the current branch.
- Watch a run: gbx:ci:watch or gbx:ci:watch RUN_ID.
- Fetch logs: gbx:ci:logs or gbx:ci:logs RUN_ID (saves to ci-logs/).
- First-time setup: gbx:ci:setup to install and authenticate the GitHub CLI (gh).
Build environment and caching
The Scala and Python build actions (.github/actions/scala_build/ and python_build/) use a shared environment so that when both run in the same job (e.g. build main), caches stay warm:
- Apt: Both actions restore .cache/apt-archives at the start of the GDAL step and save it at the end. Workflows cache .cache/apt-archives with a key derived from both action files, so one cache serves Scala and Python; changing either action’s apt steps invalidates the cache.
- Pip: Both use the same pip cache key (.ci-pip-cache-key, created by the workflow from ref + matrix). That avoids duplicate pip installs when Scala runs first and Python reuses the same interpreter and cache.
- Maven: Scala uses setup-java with cache: 'maven' and cache-dependency-path: 'pom.xml'.
The two actions intentionally mirror each other (same apt repos, same GDAL/natives, same pip stack: numpy, pyspark, gdal). Scala adds JDK, Maven, zip/unzip, and the JNI .so copy; Python adds pytest and pip install python/geobrix[dev]. A future refactor could extract a single “setup GDAL + pip” composite used by both; for now the duplication is small and the structure is aligned for easy comparison.
Lint (ScalaStyle, isort/black/flake8) runs in CI on every build, but the build fails on lint errors only for PRs targeting main; pushes and PRs to other branches do not fail on lint. Config: scalastyle-config.xml, python/geobrix/pyproject.toml. Use gbx:lint:scalastyle and gbx:lint:python locally (or gbx:lint:python --fix with dev deps on the host).
Summary
The main build runs on push to any branch (except python/** and scala/**), on all PRs, and via workflow_dispatch. Use gbx:ci:push to push and watch the build. Scala doc tests run only on push to main, in a separate step with GBX_SAMPLE_DATA_ROOT set. For full details and triggers, see the YAML files in .github/workflows/.