
Contributing

First Principles

Favoring standard libraries over external dependencies, especially in specific contexts like Databricks, is a best practice in software development.

There are several reasons why this approach is encouraged:

  • Standard libraries are typically well-vetted, thoroughly tested, and maintained by the official maintainers of the programming language or platform. This ensures a higher level of stability and reliability.
  • External dependencies, especially lesser-known or unmaintained ones, can introduce bugs, security vulnerabilities, or compatibility issues that can be challenging to resolve. Adding external dependencies increases the complexity of your codebase.
  • Each dependency may have its own set of dependencies, potentially leading to a complex web of dependencies that can be difficult to manage. This complexity can lead to maintenance challenges, increased risk, and longer build times.
  • External dependencies can pose security risks. If a library or package has known security vulnerabilities and is widely used, it becomes an attractive target for attackers. Minimizing external dependencies reduces the potential attack surface and makes it easier to keep your code secure.
  • Relying on standard libraries enhances code portability. It ensures your code can run on different platforms and environments without being tightly coupled to specific external dependencies. This is particularly important in settings like Databricks, where you may need to run your code on different clusters or setups.
  • External dependencies may have their own versioning schemes and compatibility issues. When using standard libraries, you have more control over versioning and can avoid conflicts between different dependencies in your project.
  • Fewer external dependencies mean faster build and deployment times. Downloading, installing, and managing external packages can slow down these processes, especially in large-scale projects or distributed computing environments like Databricks.
  • External dependencies can be abandoned or go unmaintained over time. This can lead to situations where your project relies on outdated or unsupported code. When you depend on standard libraries, you have confidence that the core functionality you rely on will continue to be maintained and improved.

While minimizing external dependencies is essential, exceptions can be made case-by-case. There are situations where external dependencies are justified, such as when a well-established and actively maintained library provides significant benefits, like time savings, performance improvements, or specialized functionality unavailable in standard libraries.

First contribution

If you're interested in contributing, please create a PR, reach out to us or open an issue to discuss your ideas.

Here are the example steps to submit your first contribution:

  1. Fork the repo. You can also create a branch directly if you have been added as a writer to the repo.

  2. Clone the repo locally: git clone

  3. git checkout main (or gcm if you're using ohmyzsh).

  4. git pull (or gl if you're using ohmyzsh).

  5. git checkout -b FEATURENAME (or gcb FEATURENAME if you're using ohmyzsh).

  6. .. do the work

  7. make fmt

  8. make lint

  9. .. fix if any issues reported

  10. make setup_spark_remote, make test and make integration, and optionally make coverage (generate coverage report)

  11. .. fix if any issues reported

  12. git commit -S -a -m "message"

    Make sure to enter a meaningful commit message title. You need to sign commits with your GPG key (hence the -S option). To set up a GPG key in your GitHub account, follow these instructions. You can configure Git to sign all commits with your GPG key by default: git config --global commit.gpgsign true

    If you have not signed your commits initially, you can re-apply all of them and sign as follows:

    git reset --soft HEAD~<how-many-commit-go-back>
    git commit -S --reuse-message=ORIG_HEAD
    git push -f origin <remote-branch-name>
  13. git push origin FEATURENAME

    To access the repository, you must use the HTTPS remote with a personal access token, or SSH with an SSH key and passphrase that has been authorized for the databrickslabs organization.

  14. Go to GitHub UI and create PR. Alternatively, gh pr create (if you have GitHub CLI installed). Use a meaningful pull request title because it'll appear in the release notes. Use Resolves #NUMBER in pull request description to automatically link it to an existing issue.

Local Setup

This section provides a step-by-step guide to set up and start working on the project. These steps will help you set up your project environment and dependencies for efficient development.

To begin, install Hatch, which is our build tool.

On macOS, this is achieved using the following:

brew install hatch

Run the following command to create the default environment and install development dependencies, assuming you've already cloned the GitHub repo.

make dev

Before every commit, apply the standard code formatting so that the codebase stays consistent:

make fmt

Before every commit, run the automated bug detector and unit tests to ensure that the automated pull request checks pass before your code is reviewed by others:

make lint
make setup_spark_remote
make test

The command make setup_spark_remote sets up the environment for running unit tests and is required one time only. DQX uses Databricks Connect as a test dependency, which restricts the creation of a Spark session in local mode. To enable local Spark execution for unit testing, the command installs Spark remote.

Running integration tests and code coverage

Integration tests and code coverage are run automatically when you create a pull request in GitHub. You can also trigger the tests from a local machine by configuring authentication to a Databricks workspace. You can use any Unity Catalog enabled Databricks workspace.

Using terminal

If you want to run the tests from your local machine in the terminal, you need to set up the following environment variables:

export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>

# Authenticate to Databricks using OAuth generated for a service principal (recommended)
export DATABRICKS_CLIENT_ID=<oauth-client-id>
export DATABRICKS_CLIENT_SECRET=<oauth-client-secret>

# Optionally enable serverless compute to be used for the tests.
# Note that we run integration tests on standard and serverless compute clusters as part of the CI/CD pipelines
export DATABRICKS_SERVERLESS_COMPUTE_ID=auto

We recommend authenticating with Databricks using an OAuth access token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by setting the DATABRICKS_TOKEN environment variable. However, we do not recommend this method, as it is less secure than OAuth.
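Before running the tests, it can help to confirm that the variables described above are actually set. The following is a minimal sketch and not part of DQX: the helper name missing_auth_vars is hypothetical, and it assumes that DATABRICKS_HOST plus either the OAuth pair or DATABRICKS_TOKEN is enough to authenticate, as described in this section.

```python
import os
from collections.abc import Mapping

# Variable names come from this guide; the helper itself is hypothetical.
OAUTH_VARS = ("DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")


def missing_auth_vars(env: Mapping[str, str]) -> list[str]:
    """Return the names of authentication variables that are still unset."""
    missing = []
    if not env.get("DATABRICKS_HOST"):
        missing.append("DATABRICKS_HOST")
    # A PAT is accepted as a fallback; otherwise both OAuth values are needed.
    if not env.get("DATABRICKS_TOKEN"):
        missing.extend(name for name in OAUTH_VARS if not env.get(name))
    return missing


if __name__ == "__main__":
    problems = missing_auth_vars(os.environ)
    print("missing:", ", ".join(problems) if problems else "none")
```

Running the script in the same shell where you exported the variables prints any names that still need a value.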

Run integration tests with the following command:

make integration

Calculate test coverage and display report in html:

make coverage

Using IDE

If you want to run integration tests from your IDE, you must set up a .env or ~/.databricks/debug-env.json file (see instructions). The name of the debug environment that you must define is ws (see the debug_env_name fixture in conftest.py).

Minimal Configuration

Create the ~/.databricks/debug-env.json with the following content, replacing the placeholders:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>"
  }
}

You must provide an existing cluster which will be auto-started for you as part of the tests.

We recommend authenticating with Databricks using an OAuth access token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by providing the DATABRICKS_TOKEN field. However, we do not recommend this method, as it is less secure than OAuth.
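A malformed debug-env.json is easy to miss until the tests fail. The sketch below loads the file and checks the ws entry for the host and cluster fields listed in the minimal configuration. It is an illustration only: load_debug_env and REQUIRED_FIELDS are hypothetical names, not part of DQX or its test fixtures.

```python
import json
from pathlib import Path

# Fields taken from the minimal configuration in this guide.
REQUIRED_FIELDS = ("DATABRICKS_HOST", "DATABRICKS_CLUSTER_ID")


def load_debug_env(path: Path, name: str = "ws") -> dict[str, str]:
    """Load one named environment from a debug-env.json file and check it."""
    config = json.loads(path.read_text(encoding="utf-8"))
    if name not in config:
        raise KeyError(f"debug environment {name!r} not found in {path}")
    env = config[name]
    # Report every missing or empty field at once.
    missing = [field for field in REQUIRED_FIELDS if not env.get(field)]
    if missing:
        raise ValueError(f"missing fields in {name!r}: {', '.join(missing)}")
    return env
```

For example, load_debug_env(Path.home() / ".databricks" / "debug-env.json") returns the ws settings or raises an error naming what is missing.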

Running Tests on Serverless Compute

Integration tests are executed on both standard and serverless compute clusters as part of the CI/CD pipelines. To run integration tests on serverless compute, add the DATABRICKS_SERVERLESS_COMPUTE_ID field to your debug configuration:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>",
    "DATABRICKS_SERVERLESS_COMPUTE_ID": "auto"
  }
}

When DATABRICKS_SERVERLESS_COMPUTE_ID is set, DATABRICKS_CLUSTER_ID is ignored and the tests run on serverless compute.
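The precedence rule can be expressed in a few lines. This is a sketch of the behavior described here, not the code the test suite actually uses; resolve_compute is a hypothetical helper.

```python
def resolve_compute(env: dict[str, str]) -> tuple[str, str]:
    """Return (kind, identifier) for the compute the tests would target."""
    serverless = env.get("DATABRICKS_SERVERLESS_COMPUTE_ID")
    if serverless:
        # Serverless takes precedence; any cluster ID is ignored.
        return ("serverless", serverless)
    cluster_id = env.get("DATABRICKS_CLUSTER_ID")
    if cluster_id:
        return ("cluster", cluster_id)
    raise ValueError("set DATABRICKS_CLUSTER_ID or DATABRICKS_SERVERLESS_COMPUTE_ID")
```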

Manual testing of the framework

We require that all changes be covered by unit tests and integration tests. A pull request (PR) will be blocked if the code coverage is negatively impacted by the proposed change. However, manual testing may still be useful before creating or merging a PR.

To test DQX from your feature branch, you can install it directly as follows:

pip install git+https://github.com/databrickslabs/dqx.git@feature_branch_name

Replace feature_branch_name with the name of your branch.

Manual testing of the CLI commands from the current codebase

Once you have cloned the repo locally and installed the Databricks CLI, you can run labs CLI commands from the root of the repository. As with other Databricks CLI commands, you can specify the Databricks profile to use with --profile.

Build the project:

make dev

Authenticate your current machine to your Databricks Workspace:

databricks auth login --host <WORKSPACE_HOST>

Show info about the project:

databricks labs show .

Install dqx:

# use the current codebase
databricks labs install .

Show current installation username:

databricks labs dqx me

Uninstall DQX:

databricks labs uninstall dqx

Manual testing of the CLI commands from a pre-release version

In most cases, installing DQX directly from the current codebase is sufficient to test CLI commands. However, this approach may not be ideal in some cases because the CLI would use the current development virtual environment. When DQX is installed from a released version, it creates a fresh and isolated Python virtual environment locally and installs all the required packages, ensuring a clean setup. If you need to perform end-to-end testing of the CLI before an official release, follow the process outlined below.

Note: This is only available for GitHub accounts that have write access to the repository. If you contribute from a fork, this method is not available.

# create new tag
git tag v0.1.12-alpha

# push the tag
git push origin v0.1.12-alpha

# specify the tag (pre-release version)
databricks labs install dqx@v0.1.12-alpha

The release pipeline only triggers when a valid semantic version is provided (e.g. v0.1.12). Pre-release versions (e.g. v0.1.12-alpha) do not trigger the release pipeline, allowing you to test changes safely before making an official release.
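To see which tags fall on which side of that rule, the sketch below classifies a tag name. The regular expression is an assumption chosen to match the two examples given here (v0.1.12 and v0.1.12-alpha); the pattern the release workflow actually uses is not shown in this guide, and triggers_release is a hypothetical helper.

```python
import re

# Assumed pattern: "v" followed by exactly MAJOR.MINOR.PATCH, with no suffix.
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def triggers_release(tag: str) -> bool:
    """Return True if the tag looks like a plain release version."""
    # fullmatch rejects any suffix such as "-alpha".
    return RELEASE_TAG.fullmatch(tag) is not None
```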

Troubleshooting

If you encounter any package dependency errors after git pull, run make clean.

Common fixes for mypy errors

See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html for more details

..., expression has type "None", variable has type "str"

  • Add assert ... is not None if it's a body of a method. Example:
# error: Argument 1 to "delete" of "DashboardWidgetsAPI" has incompatible type "str | None"; expected "str"
self._ws.dashboard_widgets.delete(widget.id)

after:

assert widget.id is not None
self._ws.dashboard_widgets.delete(widget.id)
  • Add ... | None if it's in the dataclass. Example: cloud: str = None -> cloud: str | None = None

..., has incompatible type "Path"; expected "str"

Add .as_posix() to convert Path to str
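A minimal sketch of this fix, where upload is a hypothetical function that accepts only a str path:

```python
from pathlib import Path


def upload(path: str) -> str:
    """Hypothetical function whose signature requires a str path."""
    return f"uploading {path}"


notebook = Path("notebooks") / "setup.py"
# Passing `notebook` directly would fail type checking; as_posix() returns a
# str that uses forward slashes on every platform.
message = upload(notebook.as_posix())
```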

Argument 2 to "get" of "dict" has incompatible type "None"; expected ...

Add a valid default value for the dictionary return.

Example:

def viz_type(self) -> str:
    return self.viz.get("type", None)

after:

def viz_type(self) -> str:
    return self.viz.get("type", "UNKNOWN")