
Testing Applications Using DQX

Standard testing with DQEngine

Testing applications that use DQEngine requires proper initialization of the Databricks workspace client. Detailed guidance on workspace client authentication is available in the Databricks SDK documentation.

For testing, we recommend:

  • pytester fixtures to set up a Databricks remote Spark session and workspace client. For pytester to authenticate to a workspace, you need to define the debug_env_name fixture. We recommend using the ~/.databricks/debug-env.json file to store different sets of environment variables (see more details below).
  • chispa for asserting the equality of Spark DataFrames.

These libraries are also used internally for testing DQX.
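
Both libraries are available on PyPI. As a sketch (the package names below are the current PyPI names; verify them if installation fails), you can add them to your test dependencies with:

pip install pytest chispa databricks-labs-pytester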

Example test:

import pytest
from chispa.dataframe_comparer import assert_df_equality
from databricks.labs.dqx.col_functions import is_not_null_and_not_empty
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRule


@pytest.fixture
def debug_env_name():
    return "ws"  # specify the name of the target environment from ~/.databricks/debug-env.json


def test_dq(ws, spark):  # use the ws and spark pytester fixtures to initialize the workspace client and Spark session
    schema = "a: int, b: int, c: int"
    expected_schema = schema + ", _errors: map<string,string>, _warnings: map<string,string>"
    test_df = spark.createDataFrame([[1, 3, 3]], schema)

    checks = [
        DQRule(name="col_a_is_null_or_empty", criticality="warn", check=is_not_null_and_not_empty("a")),
        DQRule(name="col_b_is_null_or_empty", criticality="error", check=is_not_null_and_not_empty("b")),
    ]

    dq_engine = DQEngine(ws)
    df = dq_engine.apply_checks(test_df, checks)

    expected_df = spark.createDataFrame([[1, 3, 3, None, None]], expected_schema)
    assert_df_equality(df, expected_df)
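
With the debug environment in place (see the sections below), the test runs like any other pytest test, for example:

pytest -k test_dq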

Setting up Databricks workspace client authentication in a terminal

If you want to run the tests from a terminal on your local machine, you need to set the following environment variables:

export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>

# Authenticate to Databricks using OAuth generated for a service principal (recommended)
export DATABRICKS_CLIENT_ID=<oauth-client-id>
export DATABRICKS_CLIENT_SECRET=<oauth-client-secret>

# Optionally enable serverless compute to be used for the tests
export DATABRICKS_SERVERLESS_COMPUTE_ID=auto

We recommend authenticating to Databricks with an OAuth token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by setting the DATABRICKS_TOKEN environment variable; however, we do not recommend this method, as it is less secure than OAuth.
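
For completeness, a PAT-based setup is sketched below; it replaces the OAuth client credentials with a personal access token:

export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>
export DATABRICKS_TOKEN=<personal-access-token>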

Setting up Databricks workspace client authentication in an IDE

If you want to run the tests from your IDE, you must set up a .env or ~/.databricks/debug-env.json file (see the pytester documentation for instructions). The name of the debug environment that you must define is ws (see the debug_env_name fixture in the example above).

Minimal Configuration

Create the ~/.databricks/debug-env.json with the following content, replacing the placeholders:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>"
  }
}

You must provide an existing cluster; it will be auto-started for you as part of the tests.

We recommend authenticating to Databricks with an OAuth token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by providing the DATABRICKS_TOKEN field; however, we do not recommend this method, as it is less secure than OAuth.
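
For example, a PAT-based variant of the same file (not recommended) replaces the OAuth client fields with the DATABRICKS_TOKEN field:

{
  "ws": {
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>",
    "DATABRICKS_TOKEN": "<personal-access-token>"
  }
}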

Running Tests on Serverless Compute

To run the integration tests on serverless compute, add the DATABRICKS_SERVERLESS_COMPUTE_ID field to your debug configuration:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>",
    "DATABRICKS_SERVERLESS_COMPUTE_ID": "auto"
  }
}

When DATABRICKS_SERVERLESS_COMPUTE_ID is set, DATABRICKS_CLUSTER_ID is ignored and the tests run on serverless compute.

Local testing with DQEngine

If workspace-level access is unavailable in your testing environment, you can perform local testing by installing the latest pyspark package and mocking the workspace client.
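
For example, the local dependencies can be installed from PyPI (the exact pyspark version supported may vary, so check the DQX documentation):

pip install pyspark chispa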

Note: This approach should be treated as experimental! It does not offer the same level of testing as the standard approach and it is only applicable to selected methods. We strongly recommend following the standard testing procedure outlined above, which includes proper initialization of the workspace client.

Example test:

from unittest.mock import MagicMock
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality
from databricks.labs.dqx.col_functions import is_not_null_and_not_empty
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRule


def test_dq():
    spark = SparkSession.builder.master("local[*]").getOrCreate()  # create a local Spark session
    ws = MagicMock(spec=WorkspaceClient, **{"catalogs.list.return_value": []})  # mock the workspace client

    schema = "a: int, b: int, c: int"
    expected_schema = schema + ", _errors: map<string,string>, _warnings: map<string,string>"
    test_df = spark.createDataFrame([[1, None, 3]], schema)

    checks = [
        DQRule(name="col_a_is_null_or_empty", criticality="warn", check=is_not_null_and_not_empty("a")),
        DQRule(name="col_b_is_null_or_empty", criticality="error", check=is_not_null_and_not_empty("b")),
    ]

    dq_engine = DQEngine(ws)
    df = dq_engine.apply_checks(test_df, checks)

    expected_df = spark.createDataFrame(
        [[1, None, 3, {"col_b_is_null_or_empty": "Column b is null or empty"}, None]], expected_schema
    )
    assert_df_equality(df, expected_df)