Developers

This page is for contributors and developers working in the GeoBrix repository. It describes how the project is organized and how to use the gbx:* commands effectively.

How the project is organized

GeoBrix is a multi-artifact repo: Scala/JVM core, Python bindings, docs, and tooling share the same root and are wired for Databricks and local development.

Repository layout

Path	Purpose
`src/main/scala/com/databricks/labs/gbx/`	Core implementation: readers, expressions, RasterX, GridX, VectorX
`src/test/scala/`	Scala unit and integration tests
`python/geobrix/`	Python package: PySpark bindings and sample-data bundle
`docs/`	Docusaurus site: `docs/` (content), `src/` (components), tests under `docs/tests/`
`notebooks/`	Sample notebooks (e.g. `sample-data/setup_sample_data.ipynb`) and `notebooks/tests/`
`scripts/`	CI, Docker, and one-off scripts
`sample-data/`	Scripts and outputs for sample data on the host; in-cluster runs use the Volumes path instead
`scripts/commands/`	`gbx:*` palette commands — `.md` registration + `.sh` implementation (see below)
`CLAUDE.md`	Project conventions and working patterns — read this first when starting work here

Packages and readers

RasterX — Raster operations and expressions (GDAL-backed); rst_* / gbx_rst_*.
GridX — Grid systems (BNG, H3); bng_* / gbx_bng_*.
VectorX — Vector geometry and OGR-backed readers; st_* / gbx_st_*.
Readers — Format-specific data sources (GDAL, OGR, GeoTIFF, Shapefile, GeoJSON, GeoPackage, etc.) registered as Spark data sources.

Tests and docs

Unit tests: src/test/scala/ (Scala), python/geobrix/test/ (Python).
Documentation tests: docs/tests/python/, docs/tests/scala/. These validate the code examples used in the docs and act as the single source of truth for those examples.
Notebook tests: notebooks/tests/ mirrors notebooks/; run via gbx:* commands or CI.

Development and CI use a Docker image (geobrix-dev) for a consistent environment; most gbx:* commands run inside that container.

Git LFS — required to clone the GDAL platform tarball

The GDAL platform tarball at resources/static/geobrix-gdal-platform-noble.tar.gz (~90 MB, ships in every GeoBrix release as the runtime GDAL bundle) is stored via Git LFS so the binary lives in LFS storage instead of the git pack. The matching .sha256 sidecar is small enough to live in git directly and is NOT LFS-tracked. The tracking rule is in .gitattributes at the repo root.

One-time install per machine

brew install git-lfs           # macOS; or apt-get install git-lfs on Debian/Ubuntu
git lfs install                # writes LFS filters into ~/.gitconfig

Cloning the repo

After git lfs install, a normal git clone of geobrix automatically fetches LFS objects:

git clone git@github.com:databrickslabs/geobrix.git

If you cloned before installing git-lfs, run git lfs pull from inside the working tree to fetch the binary. Without that step, resources/static/geobrix-gdal-platform-noble.tar.gz will be a ~130-byte LFS pointer file rather than the real 90 MB tarball, and the package-geobrix-artifacts.yml workflow's lfs: true checkout will fail an integrity check.
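
A quick way to tell whether the working tree holds the real tarball or only a pointer is to look at the start of the file: Git LFS pointer files are small text files whose first line is `version https://git-lfs.github.com/spec/v1`. The sketch below relies on that layout; `is_lfs_pointer` is a hypothetical helper name and is not part of the repo.

```shell
# Hypothetical helper: exit 0 if the file looks like a Git LFS pointer,
# non-zero if it looks like real content (e.g. the gzip tarball).
is_lfs_pointer() {
  # Pointer files begin with the LFS spec line; a gzip archive does not.
  head -c 200 "$1" | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}
```

Possible usage from the repo root: `is_lfs_pointer resources/static/geobrix-gdal-platform-noble.tar.gz && git lfs pull`.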

Updating the platform tarball

Rebuild only when GDAL_PPA_VERSION changes, when DBR moves to a new Ubuntu LTS, or for a security advisory against one of the bundled libs. See resources/static/README.md for the full Docker-based recipe. The short version:

Run scripts/build-gdal-artifacts.sh --platform-only inside a fresh ubuntu:24.04 container (Docker recipe in the README).
Move the resulting geobrix-gdal-platform-noble.tar.gz + .sha256 from dist/ into resources/static/.
git add resources/static/geobrix-gdal-platform-noble.tar.gz — the LFS filter intercepts via .gitattributes. Verify with git lfs ls-files (should list the tarball) and git diff --cached --stat resources/static/geobrix-gdal-platform-noble.tar.gz (should show ~3 lines added — the pointer — not 90 MB).
git add resources/static/geobrix-gdal-platform-noble.tar.gz.sha256 — committed normally, not LFS.
Open a PR. The reviewer re-runs the build script locally in their own ubuntu:24.04 container and confirms the resulting sha256 matches the committed sidecar before approving — that PR review is the trust anchor for every cluster that subsequently installs from this bundle. See Security for the full chain.
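
The reviewer's comparison in the last step can be sketched as a small shell check. This assumes the sidecar's first whitespace-separated field is the hex digest (the layout `sha256sum` writes); the actual sidecar format is defined by the build script, so treat `sha256_of` and `sidecar_matches` as hypothetical helpers.

```shell
# Print the SHA-256 hex digest of a file (sha256sum on Linux, shasum on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Hypothetical helper: succeed only if the digest of the tarball ($1) equals
# the first field on the first line of the sidecar ($2).
sidecar_matches() (
  expected=$(awk 'NR == 1 {print $1}' "$2")
  actual=$(sha256_of "$1")
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
)
```

A reviewer could run `sidecar_matches dist/geobrix-gdal-platform-noble.tar.gz resources/static/geobrix-gdal-platform-noble.tar.gz.sha256` after their own rebuild and approve only if it succeeds.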

Storage considerations

LFS bandwidth and storage come from the databrickslabs GitHub org quota. Each tarball bump consumes both. Don't rebuild the tarball just to bump GeoBrix versions — the release workflow grafts the per-release JAR onto the committed platform tarball without changing it.

Testing on a Databricks cluster

You can run the Essential bundle and primitive Volume tests on a live Databricks cluster so that Volume paths are FUSE-mounted and the bundle uses pathlib/shutil only (no Databricks Files API).

Config — Copy notebooks/tests/databricks_cluster_config.example.env to notebooks/tests/databricks_cluster_config.env and set:

DATABRICKS_HOST, DATABRICKS_TOKEN (or DATABRICKS_CONFIG_PROFILE)
CLUSTER_ID (existing cluster to run the job)
GBX_BUNDLE_VOLUME_CATALOG, GBX_BUNDLE_VOLUME_SCHEMA, GBX_BUNDLE_VOLUME_NAME — Volume root is /Volumes/<catalog>/<schema>/<volume_name>. The volume name must match Data Explorer exactly (e.g. sample-data not sample_data).
GBX_ARTIFACT_VOLUME — directory for artifacts (e.g. /Volumes/.../artifacts). JAR and wheel are uploaded directly here (no subpaths). Wheel path for the notebook is derived as GBX_ARTIFACT_VOLUME/geobrix-<version>-py3-none-any.whl unless overridden.
Optional: GBX_BUNDLE_WHEEL_VOLUME_PATH — override full wheel path for the notebook pip cells.
Optional: GBX_BUNDLE_SKIP_WHEEL_UPLOAD=1 — use existing wheel (no build/upload); notebook still gets the pip and restart cells.
Optional: GBX_BUNDLE_SKIP_JAR_UPLOAD=1 skips the JAR build/upload, both when push-jar runs alone and when it runs as the first step of push-wheel.
Optional: GBX_RUNNER_DIR, GBX_BUNDLE_RUNNER_NOTEBOOK, GBX_PRIMITIVE_RUNNER_NOTEBOOK — where to upload the runner notebooks.
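
To make the path rules above concrete, here is a minimal sketch of how the Volume root and the notebook wheel path could be derived from these variables. The function names are hypothetical, the version is passed in as an argument, and the real scripts may compute these differently.

```shell
# Volume root as described above: /Volumes/<catalog>/<schema>/<volume_name>.
volume_root() {
  printf '/Volumes/%s/%s/%s\n' \
    "$GBX_BUNDLE_VOLUME_CATALOG" "$GBX_BUNDLE_VOLUME_SCHEMA" "$GBX_BUNDLE_VOLUME_NAME"
}

# Wheel path for the notebook: an explicit override wins; otherwise the path
# is derived from GBX_ARTIFACT_VOLUME and the version given as $1.
wheel_path() {
  if [ -n "${GBX_BUNDLE_WHEEL_VOLUME_PATH:-}" ]; then
    printf '%s\n' "$GBX_BUNDLE_WHEEL_VOLUME_PATH"
  else
    printf '%s/geobrix-%s-py3-none-any.whl\n' "${GBX_ARTIFACT_VOLUME%/}" "$1"
  fi
}
```

With illustrative values such as catalog `main`, schema `default`, and volume `sample-data`, `volume_root` prints `/Volumes/main/default/sample-data`.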

Commands — From the repo root:

gbx:test:primitive-databricks pushes the primitive notebook and runs it on the cluster. It validates that the volume exists, then creates subdirectories and reads, writes, and copies files via FUSE (pathlib/shutil). It has no GeoBrix dependency.
gbx:test:bundle-databricks pushes the bundle runner notebook and runs it on the cluster. If GBX_BUNDLE_WHEEL_VOLUME_PATH is set, the notebook contains three cells ahead of the bundle cell: (1) %pip install --quiet <wheel>, (2) %pip install --quiet --no-deps --force-reinstall <wheel>, (3) dbutils.library.restartPython(). Run the cells in order so the restarted process loads the new GeoBrix code.

Convention — For Volume path handling (FUSE, pathlib, no random access), see the "Unity Catalog Volumes" section in CLAUDE.md.

`gbx:*` commands

The repo provides gbx:* commands so both humans and AI agents can run tests, coverage, docs, and Docker in a consistent way. Each command is a .md registration + .sh implementation under scripts/commands/ (the directory name is historical; the commands are usable from any shell).

Conventions and architectural guidance live in CLAUDE.md at the repo root — read that for cross-language naming, BNG resolution rules, GDAL resource management, doc-test single-source pattern, and the user-facing-docs voice rule. Agents (Cursor or Claude) read CLAUDE.md as their entry point.

Commands

Commands are invocable actions. Prefer them over ad-hoc shell for tests, coverage, docs, Docker, and data so behavior is consistent and reproducible.

How to invoke

From Cursor UI — Use the command palette (/ or type the command name) and run the desired gbx:* command. Each command is backed by a .md (registration) and a .sh (implementation) in scripts/commands/.
From a shell — Run the script directly, e.g. bash scripts/commands/gbx-test-scala.sh [OPTIONS]. This is the form most often used by terminals, CI, or AI agents (Claude, Cursor) invoking them via a shell tool.
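
The one example above (gbx:test:scala maps to scripts/commands/gbx-test-scala.sh) suggests that the script name is the command name with colons replaced by hyphens. The sketch below generalizes from that single example, so treat the rule as an assumption; `gbx_script_path` is a hypothetical helper.

```shell
# Hypothetical helper: map a gbx:<category>:<action> name to the script path,
# assuming colons simply become hyphens as in the gbx:test:scala example.
gbx_script_path() {
  printf 'scripts/commands/%s.sh\n' "$(printf '%s' "$1" | tr ':' '-')"
}
```

From the repo root this could be used as `bash "$(gbx_script_path gbx:test:scala)" --help`.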

Naming

Commands follow gbx:<category>:<action>:

Category	Purpose
`test`	Run tests (Scala, Python, docs, SQL docs, notebooks, function-info, DBR, Databricks cluster)
`coverage`	Code coverage (Scala/Python, unit/docs, gaps, baseline, package-targeted)
`data`	Sample data: download (essential/complete bundle), generate minimal bundle, push JAR/wheel to Volume
`docs`	Documentation server (start, stop, restart, dev, serve-local, static-build, function-info)
`docker`	Container lifecycle (exec, start, stop, restart, rebuild, attach, clear-pycache)
`ci`	CI / GitHub Actions: push, trigger, status, watch, logs, docs menu, setup
`lint`	Scala: scalastyle; Python: isort, black, flake8 (same as CI)
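
As an illustration of the naming scheme, the following sketch splits a command name and checks its category against the table above. `gbx_category` is a hypothetical helper written for this page, not something the repo is known to ship.

```shell
# Hypothetical helper: print the category of a gbx:<category>:<action> name,
# or fail if the name is malformed or the category is not in the table.
gbx_category() (
  case $1 in
    gbx:*:*) ;;            # must have the gbx: prefix and a second colon
    *) return 1 ;;
  esac
  rest=${1#gbx:}           # strip the prefix
  category=${rest%%:*}     # keep everything before the next colon
  case $category in
    test|coverage|data|docs|docker|ci|lint) printf '%s\n' "$category" ;;
    *) return 1 ;;
  esac
)
```

For example, `gbx_category gbx:docker:clear-pycache` prints `docker`, while `gbx_category gbx:foo:bar` fails.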

When to use which command

Use the following by task; always prefer the command over manual shell for these operations.

Testing

Command	When to use	What it does
`gbx:test:scala`	After Scala changes, before commit	Runs Scala unit tests (excludes doc tests); supports `--suite`
`gbx:test:python`	After Python changes	Runs Python unit tests in `python/geobrix/test/`
`gbx:test:scala-docs`	After changing Scala doc examples	Runs Scala doc tests in `docs/tests/scala/`
`gbx:test:python-docs`	After changing Python doc examples	Runs Python doc tests in `docs/tests/python/` (default: no integration)
`gbx:test:sql-docs`	After changing SQL API examples	Runs SQL (and Python API) doc tests in `docs/tests/python/api/`
`gbx:test:docs`	Before PR that touches docs	Runs all doc tests by invoking python-docs, sql-docs, and scala-docs in sequence; sets `GBX_SAMPLE_DATA_ROOT` to minimal bundle by default (use `--no-sample-data-root` for full bundle)
`gbx:test:function-info`	After changing function-info or doc SQL	Regenerates function-info and runs DESCRIBE/coverage tests
`gbx:test:notebooks`	After changing sample-data or notebook runner	Runs notebook tests; use `--include-integration` for full run
`gbx:test:python-dbr`	Validate Python on Databricks Runtime	DBR integration tests (spatial SQL, readers); excluded from regular CI; requires DBR environment
`gbx:test:bundle-databricks`	Validate Essential bundle on a live cluster	Pushes runner notebook and runs it on `CLUSTER_ID`; use `--local` to run bundle on host
`gbx:test:primitive-databricks`	Validate Volume access on cluster (FUSE)	Pushes primitive notebook; checks that the volume exists, creates subdirs, and reads/writes/copies via pathlib
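
The "When to use" column can be read as a rough mapping from the path you changed to the command to run. The sketch below encodes that reading for the local test commands only; `suggest_test_command` is a hypothetical helper, and the path patterns are taken from the repository layout described earlier.

```shell
# Hypothetical helper: suggest a gbx:test:* command for a changed file path.
# More specific patterns come first because case stops at the first match.
suggest_test_command() {
  case $1 in
    docs/tests/python/api/*) echo gbx:test:sql-docs ;;
    docs/tests/python/*)     echo gbx:test:python-docs ;;
    docs/tests/scala/*)      echo gbx:test:scala-docs ;;
    docs/*)                  echo gbx:test:docs ;;
    notebooks/*)             echo gbx:test:notebooks ;;
    python/geobrix/*)        echo gbx:test:python ;;
    src/main/scala/*|src/test/scala/*) echo gbx:test:scala ;;
    *) return 1 ;;           # no suggestion for other paths
  esac
}
```

Something like `git diff --name-only | while read -r f; do suggest_test_command "$f"; done | sort -u` would then list the commands worth running before a commit.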

Coverage (Scala coverage is slow; use strategically)

Command	When to use	What it does
`gbx:coverage:gaps`	See where to focus	Shows package-level gaps (no test run)
`gbx:coverage:scala-package`	After changes in one package	Runs coverage for one package (e.g. `rasterx`, `gridx`)
`gbx:coverage:baseline`	Weekly or before release	Full Scala or Python baseline
`gbx:coverage:scala`	Full Scala coverage (sparingly)	Full scoverage; use `--report-only` to view last run
`gbx:coverage:python`	After Python changes	Python unit test coverage (fast)
`gbx:coverage:scala-docs` / `gbx:coverage:python-docs`	After doc test changes	Coverage for doc test suites

Data

Doc tests use the in-repo minimal bundle (no download step). Generate it once with gbx:data:generate-minimal-bundle; the Docker Volumes mount makes it available to tests. For full sample data locally, use gbx:data:download. Minimal-bundle Sentinel-2 rasters (and *_byte.tif variants) may appear black in QGIS or other viewers; the full-size Essential/Complete bundle rasters are the ones suited for visual inspection.

Command	When to use	What it does
`gbx:data:download`	Need full sample data locally	Downloads essential and/or complete bundle to `sample-data/`
`gbx:data:generate-minimal-bundle`	CI or doc tests; after full bundle if needed	Generates minimal bundle under `sample-data/Volumes/.../test-data/geobrix-examples/` by bbox extraction (NYC/London, `--bbox-size`, `--max-rows`); doc test commands use this, not a download step
`gbx:data:push-wheel`	Put built JAR and wheel on Volume	Builds JAR first (push-jar) unless `GBX_BUNDLE_SKIP_JAR_UPLOAD=1`, then clears `python/geobrix/dist`, runs `python3 -m build`, uploads wheel to `GBX_ARTIFACT_VOLUME/` (overwrite if exists); set `GBX_BUNDLE_SKIP_WHEEL_UPLOAD=1` to skip wheel
`gbx:data:push-jar`	Put built JAR on a Volume	Runs `mvn clean package -DskipTests`, uploads JAR to `GBX_ARTIFACT_VOLUME/` (overwrite if exists); set `GBX_BUNDLE_SKIP_JAR_UPLOAD=1` to skip
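
The interaction between the two skip variables in gbx:data:push-wheel can be sketched as a dry-run plan. This only prints the steps described in the table and does not build or upload anything; `push_wheel_plan` is a hypothetical helper, not the command's implementation.

```shell
# Hypothetical dry run: print the steps gbx:data:push-wheel would take,
# based on the two skip variables described in the table above.
push_wheel_plan() {
  if [ "${GBX_BUNDLE_SKIP_JAR_UPLOAD:-}" != "1" ]; then
    echo "jar: mvn clean package -DskipTests, upload JAR to ${GBX_ARTIFACT_VOLUME:-}/"
  fi
  if [ "${GBX_BUNDLE_SKIP_WHEEL_UPLOAD:-}" != "1" ]; then
    echo "wheel: clear python/geobrix/dist, python3 -m build, upload wheel to ${GBX_ARTIFACT_VOLUME:-}/"
  fi
}
```

With neither variable set the plan has two lines (JAR first, then wheel); setting GBX_BUNDLE_SKIP_JAR_UPLOAD=1 leaves only the wheel step.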

Documentation

Command	When to use	What it does
`gbx:docs:start`	Serve docs (one-off)	Builds (optional) and starts server on port 3000
`gbx:docs:stop`	Free port or before another docs command	Stops any running docs server
`gbx:docs:dev`	While editing docs	Dev server with hot reload
`gbx:docs:serve-local`	Preview production build	Build then serve static site
`gbx:docs:static-build`	Create offline/portable docs zip	Build with relative paths and hash router; zip to `resources/static/` by default (use `--output <path>`); leaves `docs/build/` unchanged for serving
`gbx:docs:restart`	Restart after stop	Stop + start with same options
`gbx:docs:function-info`	After changing doc SQL examples	Regenerates `function-info.json` from doc SQL

Docker

Command	When to use	What it does
`gbx:docker:start`	First time or after stop	Starts `geobrix-dev` container
`gbx:docker:stop`	When done developing	Stops container
`gbx:docker:exec`	Run Maven, pytest, etc.	Runs a command (or interactive shell) in container
`gbx:docker:attach`	Interactive shell in container	Attach to running container
`gbx:docker:restart`	After config change	Restart container
`gbx:docker:rebuild`	After Dockerfile or deps change	Rebuild image and optionally start
`gbx:docker:clear-pycache`	After editing Python code, stale imports	Clears `.pyc` and `__pycache__` in container so tests see fresh code

Lint

Command	When to use	What it does
`gbx:lint:scalastyle`	Before pushing Scala changes	Runs ScalaStyle on `src/main/scala` (same config as CI); catches trailing whitespace, missing EOF newline, non-ASCII in comments
`gbx:lint:python`	Before pushing Python changes	Runs isort, black, flake8 on `python/geobrix` (same as CI). Default: check-only in Docker. Use `--fix` on host to apply isort/black (requires `pip install -e "python/geobrix[dev]"`).

CI (requires the GitHub CLI, gh; use gbx:ci:setup to install and authenticate)

Command	When to use	What it does
`gbx:ci:push`	Initiate remote build on current branch (e.g. beta/0.4.0)	Pushes branch to origin, then watches the build main workflow run
`gbx:ci:trigger`	Push then manually trigger build main (e.g. workflow_dispatch)	Pushes branch, lists runs, prompts to trigger build main on current branch
`gbx:ci:status`	Check recent CI runs	Shows recent workflow runs for current branch (optional: `[LIMIT]`)
`gbx:ci:watch`	Stream a CI run	Watches latest run (or `[RUN_ID]`) in real time
`gbx:ci:logs`	Download CI logs	Fetches logs for latest run (or `[RUN_ID]`) into `ci-logs/`
`gbx:ci:docs`	Doc-test CI menu	Run doc tests locally (python/scala), status, trigger, watch, logs (or no args for menu)
`gbx:ci:setup`	One-time CI setup	Install and authenticate GitHub CLI (`gh`)

Most commands accept --help. Common options: --log <path> for test/output logs (truncated each run), --open for coverage reports, and command-specific flags (e.g. --suite, --path, --skip-build). Doc test commands set GBX_SAMPLE_DATA_ROOT=/Volumes/main/default/test-data in the container by default so the minimal bundle is used (required for remote/CI); use --no-sample-data-root to leave it unset and use the full-bundle path or your own env. They do not run a sample-data download; the minimal or full bundle must be present via the Volumes mount.
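
One way to read the --no-sample-data-root behavior is sketched below: without the flag the minimal-bundle root is used, and with it whatever is already in the environment (possibly nothing) is left alone. This is an interpretation of the paragraph above, not the actual option parsing; `doc_test_sample_data_root` is a hypothetical helper.

```shell
# Hypothetical helper: print the GBX_SAMPLE_DATA_ROOT value a doc test command
# would end up with, given the arguments passed to it.
doc_test_sample_data_root() (
  for arg in "$@"; do
    if [ "$arg" = "--no-sample-data-root" ]; then
      # Flag given: keep the caller's value, which may be empty/unset.
      printf '%s\n' "${GBX_SAMPLE_DATA_ROOT:-}"
      return 0
    fi
  done
  # Default: point at the minimal bundle inside the container.
  printf '/Volumes/main/default/test-data\n'
)
```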

Working with agents

When working with Claude or Cursor in this repo, agents should:

Read CLAUDE.md first — it documents project conventions and translates cross-project working patterns into geobrix-specific behavior (Docker container, gh account switching, etc.).
Dispatch long-running work to subagents — gbx:* test/build commands typically take minutes and benefit from running in an isolated context, freeing the main session for review.
Use gbx:* commands rather than ad-hoc shell for tests, coverage, docs, and Docker — they handle env vars, log paths, and container setup consistently.
Add or fix a command rather than work around it — if a gbx:* command is broken, fix the script in scripts/commands/<name>.sh; don't invoke the underlying tool directly and let the command rot.

Quick reference

Run tests: gbx:test:* (pick the scope: scala, python, docs, notebooks, bundle-databricks, primitive-databricks).
Coverage: Prefer gbx:coverage:gaps and gbx:coverage:scala-package for Scala; gbx:coverage:python is fast.
Docs: gbx:docs:dev while editing; gbx:docs:stop to stop.
Docker: gbx:docker:start then gbx:docker:exec (or attach) for builds and tests.
Databricks cluster: gbx:test:primitive-databricks then gbx:test:bundle-databricks with databricks_cluster_config.env set.
Conventions and patterns: Read CLAUDE.md at the repo root.
Full command list and options: See the "Commands" section above, or run any gbx:* command with --help.

CI / GitHub Actions

Workflows live in .github/workflows/. They define when and how tests and builds run on GitHub.

When things run

Main build (build_main.yml) — Runs on push to any branch (except python/** and scala/**), on all pull requests, and via workflow_dispatch (manual trigger from the Actions tab or gh workflow run "build main" --ref <branch>). Pipeline: checkout → build Scala → build Python → rebuild doc-snippet-inventory → (on push to main only) Scala doc tests → upload artifacts.
- Scala tests in the main run use -Dsuites='com.databricks.labs.gbx.*', so only unit/integration tests run; Scala doc tests (docs/tests/scala/) are excluded from this step.
- Scala doc tests run in a separate step only when the event is push to main. That step sets GBX_SAMPLE_DATA_ROOT and runs the doc test suites.
Documentation tests (doc-tests.yml) — Run after the "build main" workflow completes (on the same ref). Python doc tests and structure validation run here; Scala doc tests are run by build_main (see above). Also triggerable manually via workflow_dispatch.
Branch-specific builds — build_python.yml runs on push to python/**; build_scala.yml runs on push to scala/**.
CodeQL (codeql-analysis.yml) — Security analysis on push and pull_request to main.
Publish / release — Triggered by release or publish events (see the workflow files).
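
The push routing and the Scala doc test condition described above can be summarized in a few lines of shell. This is only a restatement of the rules on this page with hypothetical function names; the workflow YAML files remain the authority.

```shell
# Hypothetical helper: which build workflow a push to the given branch triggers.
workflow_for_push() {
  case $1 in
    python/*) echo build_python.yml ;;
    scala/*)  echo build_scala.yml ;;
    *)        echo build_main.yml ;;
  esac
}

# Hypothetical helper: succeed if build_main would run the Scala doc test step
# for this event ($1) and branch ($2), i.e. only on push to main.
runs_scala_doc_tests() {
  [ "$1" = "push" ] && [ "$2" = "main" ]
}
```

For example, `workflow_for_push beta/0.4.0` prints `build_main.yml`, and `runs_scala_doc_tests pull_request main` fails because the event is not a push.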

Initiating a build from a branch

Pushing to any branch except python/** and scala/** triggers the build main workflow. To run the main build on your current branch (e.g. beta/0.4.0):

Push and watch — gbx:ci:push. Pushes the current branch to origin (push triggers build main), then streams the run.
Trigger after push — gbx:ci:trigger. Pushes, then prompts to trigger build main (workflow_dispatch).
Check status — gbx:ci:status. Recent workflow runs for the current branch.
Watch a run — gbx:ci:watch or gbx:ci:watch RUN_ID.
Fetch logs — gbx:ci:logs or gbx:ci:logs RUN_ID (saves to ci-logs/).
First-time setup — gbx:ci:setup to install and authenticate the GitHub CLI (gh).
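
For orientation, the plain GitHub CLI calls that correspond to these tasks look roughly like the following. This is a sketch of equivalent `gh` usage, not the contents of the gbx:ci:* scripts, which this page does not show; the function names and the DRY_RUN switch are hypothetical.

```shell
# Run a command, or only print it when DRY_RUN is set (hypothetical switch;
# the printed form joins arguments with spaces, so quoting is not preserved).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

ci_status()  { run gh run list --branch "$1" --limit "${2:-10}"; }  # recent runs
ci_watch()   { run gh run watch "$@"; }                             # optional RUN_ID
ci_logs()    { run gh run view "$1" --log; }                        # logs for RUN_ID
ci_trigger() { run gh workflow run "build main" --ref "$1"; }       # workflow_dispatch
```

Setting `DRY_RUN=1` before calling, say, `ci_trigger beta/0.4.0` prints the command instead of contacting GitHub, which is handy for checking what would run.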

Build environment and caching

The Scala and Python build actions (.github/actions/scala_build/ and python_build/) use a shared environment so that when both run in the same job (e.g. build main), caches stay warm:

Apt — Both actions restore .cache/apt-archives at the start of the GDAL step and save it at the end. Workflows cache .cache/apt-archives with a key derived from both action files, so one cache serves Scala and Python; changing either action’s apt steps invalidates the cache.
Pip — Both use the same pip cache key (.ci-pip-cache-key, created by the workflow from ref + matrix). That avoids duplicate pip installs when Scala runs first and Python reuses the same interpreter and cache.
Maven — Scala uses setup-java with cache: 'maven' and cache-dependency-path: 'pom.xml'.
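
The idea of one cache key derived from both action files can be illustrated with a checksum over their concatenated contents: editing either file changes the key, which is what invalidates the shared apt cache. GitHub Actions workflows usually express this with the `hashFiles(...)` expression; the shell version below is only an illustration, and `combined_cache_key` is a hypothetical helper.

```shell
# Hypothetical helper: derive one key from the contents of all files given.
# cksum is used for portability; it stands in for a real content hash.
combined_cache_key() {
  cat "$@" | cksum | awk '{print "apt-" $1 "-" $2}'
}
```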

The two actions intentionally mirror each other (same apt repos, same GDAL/natives, same pip stack: numpy, pyspark, gdal). Scala adds JDK, Maven, zip/unzip, and the JNI .so copy; Python adds pytest and pip install python/geobrix[dev]. A future refactor could extract a single “setup GDAL + pip” composite used by both; for now the duplication is small and the structure is aligned for easy comparison.

Lint (ScalaStyle, isort/black/flake8) runs in CI on every build, but the build fails on lint errors only for PRs targeting main; pushes and PRs to other branches do not fail on lint. Config: scalastyle-config.xml, python/geobrix/pyproject.toml. Use gbx:lint:scalastyle and gbx:lint:python locally (or gbx:lint:python --fix with dev deps on the host).

Summary

The main build runs on push to any branch (except python/** and scala/**), on all PRs, and via workflow_dispatch. Use gbx:ci:push to push and watch the build. Scala doc tests run only on push to main, in a separate step with GBX_SAMPLE_DATA_ROOT set. For full details and triggers, see the YAML files in .github/workflows/.
