Installation

The framework can be installed on a Databricks workspace or used as a standalone library.

Prerequisites

  • Python 3.10 or later. See instructions.
  • Unity Catalog-enabled Databricks workspace.
  • Network access to the Databricks workspace used for the installation process.
  • (Optional) Databricks CLI v0.241 or later. See instructions.
  • Databricks Runtime with Spark 3.5.0 or higher. See instructions.

DQX installation as a Library

Install the project via pip:

pip install databricks-labs-dqx

Install a specific version of the project via pip (e.g. version 0.1.12):

pip install databricks-labs-dqx==0.1.12

DQX installation in a Databricks Workspace

If you choose to install DQX via PyPI and use it purely as a library, you don’t need to pre-install DQX in the workspace. However, installing DQX in the workspace offers additional benefits, such as the profiling job/workflow, pre-configured dashboards, and convenient configuration management.

Authentication

Once you install the Databricks CLI, authenticate your current machine to your Databricks workspace:

databricks auth login --host <WORKSPACE_HOST>

To enable debug logs, add the --debug flag to any command. More about authentication options here.

Install DQX

Install DQX in your Databricks workspace via Databricks CLI:

databricks labs install dqx

Install a specific version of DQX in your Databricks workspace via Databricks CLI (e.g. version 0.1.12):

databricks labs install dqx@v0.1.12

You'll be prompted to select a configuration profile created by the databricks auth login command, as well as other configuration options.

The CLI command installs the following components in the workspace installation folder:

  • A Python wheel file with the packaged library.
  • The DQX configuration file (config.yml).
  • A profiling workflow for generating quality rule candidates (not scheduled by default, so no costs are incurred).
  • Quality dashboards for monitoring and displaying information about data quality issues (not scheduled by default, so no costs are incurred).

By default, DQX is installed in the user home directory (under /Users/<user>/.dqx). You can also install DQX globally by setting the 'DQX_FORCE_INSTALL' environment variable. The following options are available:

  • DQX_FORCE_INSTALL=global databricks labs install dqx: will force the installation to be for root only (/Applications/dqx)
  • DQX_FORCE_INSTALL=user databricks labs install dqx: will force the installation to be for user only (/Users/<user>/.dqx)

Configuration file

The DQX configuration file can contain multiple run configurations for different pipelines or projects, each defining a specific set of input, output, and quarantine locations. During the installation, the "default" run configuration is created. When DQX is upgraded, the config is preserved.

Open the configuration file:

databricks labs dqx open-remote-config

You can add additional run configurations or update the default run configuration after the installation by editing the config.yml file. See the example config below:

log_level: INFO
version: 1
run_configs:
- name: default # <- unique name of the run config (default used during installation)
  input_location: s3://iot-ingest/raw # <- input location for profiling (UC table or cloud path)
  input_format: delta # <- format, required if a cloud path is provided
  output_table: main.iot.silver # <- output UC table
  quarantine_table: main.iot.quarantine # <- quarantine UC table used as input for quality dashboards
  checks_file: iot_checks.yml # <- location of the quality rules (checks)
  profile_summary_stats_file: iot_profile_summary_stats.yml # <- location of profiling summary stats
  warehouse_id: your-warehouse-id # <- warehouse id for refreshing dashboards
- name: another_run_config # <- unique name of the run config
  ...

To specify a particular run configuration when executing DQX Labs CLI commands, use the --run-config parameter. If no configuration is provided, the "default" run config is used.
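The lookup behavior described above can be sketched in plain Python (an illustrative helper, not DQX's actual implementation): given the parsed run_configs list from config.yml, the named config is selected, falling back to "default" when no name is passed:

```python
def get_run_config(run_configs, name=None):
    """Pick a run config by name; fall back to "default" when none is given.

    `run_configs` mirrors the `run_configs` list from config.yml after YAML
    parsing. This is a hypothetical sketch of the selection logic, not DQX code.
    """
    wanted = name or "default"
    for cfg in run_configs:
        if cfg.get("name") == wanted:
            return cfg
    raise ValueError(f"No run config named {wanted!r}")

# Example data mirroring the config above (table names are illustrative):
configs = [
    {"name": "default", "output_table": "main.iot.silver"},
    {"name": "another_run_config", "output_table": "main.iot.silver"},
]
print(get_run_config(configs)["name"])  # -> default
print(get_run_config(configs, "another_run_config")["name"])
```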

Workflows

The profiling workflow is intended as a one-time operation. It is not scheduled by default, ensuring that no costs are incurred.

List all installed workflows in the workspace and their latest run state:

databricks labs dqx workflows

Dashboards

DQX data quality dashboards are deployed to the installation directory. Dashboards are not scheduled to refresh by default, ensuring that no costs are incurred.

Open dashboards:

databricks labs dqx open-dashboards

Note: the dashboards use only the quarantined data as input, as defined during the installation process. If you change the quarantine table in the run config after deployment (the quarantine_table field), you need to update the dashboard queries accordingly.

Install DQX on Databricks cluster

You need to install the DQX package on a Databricks cluster before you can use it. You can install it either from PyPI or from the wheel file generated as part of the workspace installation.

There are multiple ways to install libraries in a Databricks cluster (see here). For example, you can install DQX directly from a notebook cell as follows:

# using PyPI package:
%pip install databricks-labs-dqx

# using wheel file, DQX installed for the current user:
%pip install /Workspace/Users/<user-name>/.dqx/wheels/databricks_labs_dqx-*.whl

# using wheel file, DQX installed globally:
%pip install /Applications/dqx/wheels/databricks_labs_dqx-*.whl
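The `databricks_labs_dqx-*.whl` pattern above relies on wildcard expansion to match the versioned wheel file name. A small sketch of how that pattern resolves (file names are illustrative):

```python
import fnmatch

# Hypothetical contents of the wheels folder; the version is illustrative.
files = [
    "databricks_labs_dqx-0.1.12-py3-none-any.whl",
    "some_other_package-1.0.0-py3-none-any.whl",
]

# The same shell-style pattern used in the %pip install commands above:
matches = fnmatch.filter(files, "databricks_labs_dqx-*.whl")
print(matches)  # only the DQX wheel matches
```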

Restart the kernel after the package is installed in the notebook:

# in a separate cell run:
dbutils.library.restartPython()

Upgrade DQX in the Databricks workspace

Verify that DQX is installed:

databricks labs installed

Upgrade DQX via Databricks CLI:

databricks labs upgrade dqx

Uninstall DQX from the Databricks workspace

Uninstall DQX via Databricks CLI:

databricks labs uninstall dqx

The Databricks CLI will ask you to confirm a few options:

  • Whether you also want to remove all DQX artefacts from the workspace. Defaults to 'no'.