
Testing Applications Using DQX

Standard testing with DQEngine

Testing applications that use DQEngine requires proper initialization of the Databricks workspace client. Detailed guidance on workspace client authentication is available in the Databricks SDK documentation.

For testing, we recommend:

  • pytester fixtures to set up a Databricks remote Spark session and workspace client. For pytester to authenticate to a workspace, you need to define the debug_env_name fixture. We recommend using the ~/.databricks/debug-env.json file to store different sets of environment variables (see more details below).
  • chispa for asserting the equality of Spark DataFrames.

These libraries are also used internally for testing DQX.
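
Both libraries are available on PyPI. As a sketch (the package names below are the current PyPI names; verify them if installation fails), you can add them to your test dependencies with:

pip install pytest chispa databricks-labs-pytester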

Example test:

import pytest
from chispa.dataframe_comparer import assert_df_equality
from databricks.labs.dqx.col_functions import is_not_null_and_not_empty
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRule


@pytest.fixture
def debug_env_name():
    return "ws"  # specify the name of the target environment from ~/.databricks/debug-env.json


def test_dq(ws, spark):  # use the ws and spark pytester fixtures to initialize the workspace client and Spark session
    schema = "a: int, b: int, c: int"
    expected_schema = schema + ", _errors: map<string,string>, _warnings: map<string,string>"
    test_df = spark.createDataFrame([[1, 3, 3]], schema)

    checks = [
        DQRule(name="col_a_is_null_or_empty", criticality="warn", check=is_not_null_and_not_empty("a")),
        DQRule(name="col_b_is_null_or_empty", criticality="error", check=is_not_null_and_not_empty("b")),
    ]

    dq_engine = DQEngine(ws)
    df = dq_engine.apply_checks(test_df, checks)

    expected_df = spark.createDataFrame([[1, 3, 3, None, None]], expected_schema)
    assert_df_equality(df, expected_df)
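
With the debug environment in place (see the sections below), the test runs like any other pytest test, for example:

pytest -k test_dq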

Setting up Databricks workspace client authentication in a terminal

If you want to run the tests from a terminal on your local machine, you need to set the following environment variables:

export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>

# Authenticate to Databricks using OAuth generated for a service principal (recommended)
export DATABRICKS_CLIENT_ID=<oauth-client-id>
export DATABRICKS_CLIENT_SECRET=<oauth-client-secret>

# Optionally enable serverless compute to be used for the tests
export DATABRICKS_SERVERLESS_COMPUTE_ID=auto

We recommend authenticating to Databricks with an OAuth token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by setting the DATABRICKS_TOKEN environment variable; however, we do not recommend this method, as it is less secure than OAuth.
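
For completeness, a PAT-based setup is sketched below; it replaces the OAuth client credentials with a personal access token:

export DATABRICKS_HOST=https://<workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>
export DATABRICKS_TOKEN=<personal-access-token>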

Setting up Databricks workspace client authentication in an IDE

If you want to run the tests from your IDE, you must set up a .env or ~/.databricks/debug-env.json file (see the pytester documentation for instructions). The name of the debug environment that you must define is ws (see the debug_env_name fixture in the example above).

Minimal Configuration

Create the ~/.databricks/debug-env.json with the following content, replacing the placeholders:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>"
  }
}

You must provide an existing cluster; it will be auto-started for you as part of the tests.

We recommend authenticating to Databricks with an OAuth token generated for a service principal, as shown above. Alternatively, you can authenticate with a personal access token (PAT) by providing the DATABRICKS_TOKEN field; however, we do not recommend this method, as it is less secure than OAuth.
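
For example, a PAT-based variant of the same file (not recommended) replaces the OAuth client fields with the DATABRICKS_TOKEN field:

{
  "ws": {
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>",
    "DATABRICKS_TOKEN": "<personal-access-token>"
  }
}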

Running Tests on Serverless Compute

To run the integration tests on serverless compute, add the DATABRICKS_SERVERLESS_COMPUTE_ID field to your debug configuration:

{
  "ws": {
    "DATABRICKS_CLIENT_ID": "<oauth-client-id>",
    "DATABRICKS_CLIENT_SECRET": "<oauth-client-secret>",
    "DATABRICKS_HOST": "https://<workspace-url>",
    "DATABRICKS_CLUSTER_ID": "<databricks-cluster-id>",
    "DATABRICKS_SERVERLESS_COMPUTE_ID": "auto"
  }
}

When DATABRICKS_SERVERLESS_COMPUTE_ID is set, DATABRICKS_CLUSTER_ID is ignored and the tests run on serverless compute.

Local testing with DQEngine

If workspace-level access is unavailable in your testing environment, you can perform local testing by installing the latest pyspark package and mocking the workspace client.
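
For example, the local dependencies can be installed from PyPI (the exact pyspark version supported may vary, so check the DQX documentation):

pip install pyspark chispa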

Note: This approach should be treated as experimental! It does not offer the same level of testing as the standard approach and it is only applicable to selected methods. We strongly recommend following the standard testing procedure outlined above, which includes proper initialization of the workspace client.

Example test:

from unittest.mock import MagicMock
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality
from databricks.labs.dqx.col_functions import is_not_null_and_not_empty
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRule


def test_dq():
    spark = SparkSession.builder.master("local[*]").getOrCreate()  # create a local Spark session
    ws = MagicMock(spec=WorkspaceClient, **{"catalogs.list.return_value": []})  # mock the workspace client

    schema = "a: int, b: int, c: int"
    expected_schema = schema + ", _errors: map<string,string>, _warnings: map<string,string>"
    test_df = spark.createDataFrame([[1, None, 3]], schema)

    checks = [
        DQRule(name="col_a_is_null_or_empty", criticality="warn", check=is_not_null_and_not_empty("a")),
        DQRule(name="col_b_is_null_or_empty", criticality="error", check=is_not_null_and_not_empty("b")),
    ]

    dq_engine = DQEngine(ws)
    df = dq_engine.apply_checks(test_df, checks)

    expected_df = spark.createDataFrame(
        [[1, None, 3, {"col_b_is_null_or_empty": "Column b is null or empty"}, None]], expected_schema
    )
    assert_df_equality(df, expected_df)