Loading and Storing Quality Checks
DQX provides flexible methods to load and save quality checks (rules) defined as metadata (a list of dictionaries) from different storage backends, making it easier to manage, share, and reuse checks across workflows and environments.
Saving and loading methods accept a storage backend configuration as input. The following backend configurations are currently supported:
- FileChecksStorageConfig: local files (JSON/YAML)
- WorkspaceFileChecksStorageConfig: workspace files (JSON/YAML)
- VolumeFileChecksStorageConfig: Unity Catalog volume files (JSON/YAML)
- TableChecksStorageConfig: Unity Catalog tables
- InstallationChecksStorageConfig: installation-managed location; ignores the location parameter and infers it from the checks_location field in the run config
You can find details on how to define checks here.
Saving quality checks to storage
You can save quality checks defined as metadata (a list of dictionaries) or generated by the profiler to various storage locations.
For the save methods to work, checks must be defined declaratively as metadata (a list of dictionaries).
If you create checks as a list of DQRule objects, you can convert them to metadata format and back using the methods described here (see the sketch after the examples below).
- Python
- Workflows
import yaml

from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import (
    FileChecksStorageConfig,
    WorkspaceFileChecksStorageConfig,
    InstallationChecksStorageConfig,
    TableChecksStorageConfig,
    VolumeFileChecksStorageConfig,
)
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# define checks as a list of dictionaries
checks: list[dict] = yaml.safe_load("""
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col3
# ...
""")

# save checks as a YAML file in the local filesystem (overwrite the file)
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks.yml"))

# save checks as a YAML file in an arbitrary workspace location (overwrite the file)
dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))

# save checks in a Delta table under the "default" run config (append checks)
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="catalog.schema.checks_table", mode="append"))

# save checks in a Delta table under a specific run config used for filtering (overwrite checks)
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="catalog.schema.checks_table", run_config_name="workflow_001", mode="overwrite"))

# save checks as a YAML file in a Unity Catalog volume (overwrite the file)
dq_engine.save_checks(checks, config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

# save checks to the YAML file or table defined in 'checks_location' of the run config
# only works if DQX is installed in the workspace
dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default"))
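If you define checks programmatically as DQRule objects, the sketch below illustrates converting them to metadata before saving. It is a minimal sketch, not the definitive API: the DQRowRule class, the check_funcs module, and the constructor arguments shown are assumptions that may differ between DQX versions, and it assumes the serialize_checks conversion method (referenced in the loading section below) is available on DQEngine.
# a minimal sketch, not the definitive API: DQRowRule, check_funcs, and the
# constructor arguments below are assumptions and may differ in your DQX version
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.config import FileChecksStorageConfig
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# a single row-level rule defined as an object instead of a dictionary
rules = [
    DQRowRule(
        criticality="warn",
        check_func=check_funcs.is_not_null_and_not_empty,
        column="col3",
    ),
]

# convert DQRule objects to metadata (list of dictionaries); assumes serialize_checks
# is exposed on DQEngine, as referenced in the loading section below
checks: list[dict] = dq_engine.serialize_checks(rules)

# the metadata can then be saved with any of the storage configs shown above
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks.yml"))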
When using the profiler workflow to generate quality check candidates, the checks are saved to the location specified in the checks_location field of the configuration file.
Saving quality rules in a Delta table without using DQX methods
Quality rules can be stored in a Delta table using DQEngine, as showcased above.
You can also store them directly in a Delta table using the Spark DataFrame API.
- Python
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import TableChecksStorageConfig
from databricks.sdk import WorkspaceClient

schema = (
    "name STRING, "
    "criticality STRING, "
    "check STRUCT<function STRING, for_each_column ARRAY<STRING>, arguments MAP<STRING, STRING>>, "
    "filter STRING, "
    "run_config_name STRING, "
    "user_metadata MAP<STRING, STRING>"
)

# define checks as a list of dictionaries
checks: list[dict] = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "for_each_column": ["col1", "col2"], "arguments": {}},
        "filter": "col1 > 0",
        "user_metadata": {"check_owner": "someone@email.com"},
    },
    {
        "name": "column_not_less_than",
        "criticality": "warn",
        "check": {"function": "is_not_less_than", "arguments": {"column": "col_2", "limit": 1}},
    },
    {
        "criticality": "warn",
        "name": "column_in_list",
        "check": {"function": "is_in_list", "arguments": {"column": "col_2", "allowed": [1, 2]}},
    },
]

# checks can also be defined by specifying columns explicitly
checks = [
    [
        None,
        "error",
        {"function": "is_not_null", "for_each_column": ["col1", "col2"], "arguments": {}},
        "col1 > 0",
        "default",
        {"check_owner": "someone@email.com"},
    ],
    [
        "column_not_less_than",
        "warn",
        {"function": "is_not_less_than", "arguments": {"column": "col_2", "limit": 1}},
        None,
        "default",
        None,
    ],
    [
        "column_in_list",
        "warn",
        {"function": "is_in_list", "for_each_column": None, "arguments": {"column": "col_2", "allowed": [1, 2]}},
        None,
        "default",
        None,
    ],
]

# save checks
df = spark.createDataFrame(checks, schema)
df.write.format("delta").mode("overwrite").saveAsTable("main.default.dqx_checks_table")

# load checks as a list of dictionaries from the Delta table
dq_engine = DQEngine(WorkspaceClient())
checks = dq_engine.load_checks(config=TableChecksStorageConfig(location="main.default.dqx_checks_table"))

# validate loaded checks
assert not dq_engine.validate_checks(checks).has_errors
Loading quality checks from storage
You can load quality checks from various storage locations. You can then apply the loaded checks using the methods described here.
The load methods return checks as metadata (a list of dictionaries).
If you create checks as a list of DQRule objects, you can convert them to metadata using the serialize_checks method, as described here.
- Python
- Workflows
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import (
    FileChecksStorageConfig,
    WorkspaceFileChecksStorageConfig,
    InstallationChecksStorageConfig,
    TableChecksStorageConfig,
    VolumeFileChecksStorageConfig,
)
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# load checks from a local path
checks: list[dict] = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))

# load checks from an arbitrary workspace location
checks: list[dict] = dq_engine.load_checks(config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))

# load checks from a Delta table using the default run config name
checks: list[dict] = dq_engine.load_checks(config=TableChecksStorageConfig(location="catalog.schema.checks_table"))

# load checks from a Delta table with a specific run config used for filtering
checks: list[dict] = dq_engine.load_checks(config=TableChecksStorageConfig(location="catalog.schema.checks_table", run_config_name="workflow_001"))

# load checks from a Unity Catalog volume
checks: list[dict] = dq_engine.load_checks(config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

# load checks from the file or table defined in the run config ('checks_location' field)
# only works if DQX is installed in the workspace
checks: list[dict] = dq_engine.load_checks(config=InstallationChecksStorageConfig(run_config_name="default"))

# validate loaded checks
assert not dq_engine.validate_checks(checks).has_errors
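To then apply the loaded checks, a minimal sketch is shown below. It assumes an input DataFrame read from an illustrative table name and uses DQX's metadata-based apply methods (apply_checks_by_metadata and apply_checks_by_metadata_and_split); see the guide on applying checks for the full API.
# a minimal sketch; continues from the loading example above (dq_engine and checks are defined there)
input_df = spark.read.table("catalog.schema.input_table")  # illustrative input table name

# annotate each row with result columns based on the loaded checks
checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)

# alternatively, split the input into valid rows and quarantined rows
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)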
When using the quality checker or e2e workflows to apply quality checks, the checks are loaded from the checks_location field defined in the configuration file.