Loading and Storing Quality Checks
DQX provides flexible methods to load and save quality checks (rules) defined as metadata (a list of dictionaries) from different storage backends, making it easier to manage, share, and reuse checks across workflows and environments.
Saving and loading methods accept a storage backend configuration as input. The following backend configurations are currently supported:
- FileChecksStorageConfig: local files (JSON/YAML)
- WorkspaceFileChecksStorageConfig: workspace files (JSON/YAML)
- VolumeFileChecksStorageConfig: Unity Catalog volume files (JSON/YAML)
- TableChecksStorageConfig: Unity Catalog tables
- InstallationChecksStorageConfig: installation-managed location; ignores the location parameter and infers it from the checks_location field in the run config
You can find details on how to define checks here.
Saving quality checks to storage
You can save quality checks defined as metadata (a list of dictionaries) or generated by the profiler to various storage locations.
For the save methods to work, checks must be defined declaratively as metadata (a list of dictionaries).
If you create checks as a list of DQRule objects, you can convert them to metadata format and back using the methods described here (see the sketch after the examples below).
- Python
- Workflows
import yaml

from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import (
    FileChecksStorageConfig,
    WorkspaceFileChecksStorageConfig,
    InstallationChecksStorageConfig,
    TableChecksStorageConfig,
    VolumeFileChecksStorageConfig,
)
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# define checks as a list of dictionaries
checks: list[dict] = yaml.safe_load("""
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col3
# ...
""")

# save checks as a YAML file in the local filesystem (overwrite the file)
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks.yml"))

# save checks as a YAML file in an arbitrary workspace location (overwrite the file)
dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))

# save checks in a Delta table under the "default" run config (append checks)
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="catalog.schema.checks_table", mode="append"))

# save checks in a Delta table under a specific run config used for filtering (overwrite checks)
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="catalog.schema.checks_table", run_config_name="workflow_001", mode="overwrite"))

# save checks as a YAML file in a Unity Catalog volume (overwrite the file)
dq_engine.save_checks(checks, config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

# save checks to the YAML file or table defined in 'checks_location' of the run config
# only works if DQX is installed in the workspace
dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default"))
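If you define checks programmatically as DQRule objects, the sketch below illustrates converting them to metadata before saving. It is a minimal sketch, not the definitive API: the DQRowRule class, the check_funcs module, and the constructor arguments shown are assumptions that may differ between DQX versions, and it assumes the serialize_checks conversion method (referenced in the loading section below) is available on DQEngine.
# a minimal sketch, not the definitive API: DQRowRule, check_funcs, and the
# constructor arguments below are assumptions and may differ in your DQX version
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.config import FileChecksStorageConfig
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# a single row-level rule defined as an object instead of a dictionary
rules = [
    DQRowRule(
        criticality="warn",
        check_func=check_funcs.is_not_null_and_not_empty,
        column="col3",
    ),
]

# convert DQRule objects to metadata (list of dictionaries); assumes serialize_checks
# is exposed on DQEngine, as referenced in the loading section below
checks: list[dict] = dq_engine.serialize_checks(rules)

# the metadata can then be saved with any of the storage configs shown above
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks.yml"))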
When using the profiler workflow to generate quality check candidates, the checks are saved to the location specified in the checks_location field of the configuration file.
Saving quality rules in a Delta table without using DQX methods
Quality rules can be stored in a Delta table using DQEngine, as showcased above.
You can also store them directly in a Delta table using the Spark DataFrame API.
- Python
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import TableChecksStorageConfig
from databricks.sdk import WorkspaceClient

schema = (
    "name STRING, "
    "criticality STRING, "
    "check STRUCT<function STRING, for_each_column ARRAY<STRING>, arguments MAP<STRING, STRING>>, "
    "filter STRING, "
    "run_config_name STRING, "
    "user_metadata MAP<STRING, STRING>"
)

# define checks as a list of dictionaries
checks: list[dict] = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "for_each_column": ["col1", "col2"], "arguments": {}},
        "filter": "col1 > 0",
        "user_metadata": {"check_owner": "someone@email.com"},
    },
    {
        "name": "column_not_less_than",
        "criticality": "warn",
        "check": {"function": "is_not_less_than", "arguments": {"column": "col_2", "limit": 1}},
    },
    {
        "criticality": "warn",
        "name": "column_in_list",
        "check": {"function": "is_in_list", "arguments": {"column": "col_2", "allowed": [1, 2]}},
    },
]

# checks can also be defined by specifying columns explicitly
checks = [
    [
        None,
        "error",
        {"function": "is_not_null", "for_each_column": ["col1", "col2"], "arguments": {}},
        "col1 > 0",
        "default",
        {"check_owner": "someone@email.com"},
    ],
    [
        "column_not_less_than",
        "warn",
        {"function": "is_not_less_than", "arguments": {"column": "col_2", "limit": 1}},
        None,
        "default",
        None,
    ],
    [
        "column_in_list",
        "warn",
        {"function": "is_in_list", "for_each_column": None, "arguments": {"column": "col_2", "allowed": [1, 2]}},
        None,
        "default",
        None,
    ],
]

# save checks
df = spark.createDataFrame(checks, schema)
df.write.format("delta").mode("overwrite").saveAsTable("main.default.dqx_checks_table")

# load checks as a list of dictionaries from the Delta table
dq_engine = DQEngine(WorkspaceClient())
checks = dq_engine.load_checks(config=TableChecksStorageConfig(location="main.default.dqx_checks_table"))

# validate loaded checks
assert not dq_engine.validate_checks(checks).has_errors
Loading quality checks from storage
You can load quality checks from various storage locations. You can then apply the loaded checks using the methods described here.
The load methods return checks as metadata (a list of dictionaries).
If you create checks as a list of DQRule objects, you can convert them to metadata using the serialize_checks method, as described here.
- Python
- Workflows
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import (
    FileChecksStorageConfig,
    WorkspaceFileChecksStorageConfig,
    InstallationChecksStorageConfig,
    TableChecksStorageConfig,
    VolumeFileChecksStorageConfig,
)
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# load checks from a local path
checks: list[dict] = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))

# load checks from an arbitrary workspace location
checks: list[dict] = dq_engine.load_checks(config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))

# load checks from a Delta table using the default run config name
checks: list[dict] = dq_engine.load_checks(config=TableChecksStorageConfig(location="catalog.schema.checks_table"))

# load checks from a Delta table with a specific run config used for filtering
checks: list[dict] = dq_engine.load_checks(config=TableChecksStorageConfig(location="catalog.schema.checks_table", run_config_name="workflow_001"))

# load checks from a Unity Catalog volume
checks: list[dict] = dq_engine.load_checks(config=VolumeFileChecksStorageConfig(location="/Volumes/dq/config/checks_volume/App1/checks.yml"))

# load checks from the file or table defined in the run config ('checks_location' field)
# only works if DQX is installed in the workspace
checks: list[dict] = dq_engine.load_checks(config=InstallationChecksStorageConfig(run_config_name="default"))

# validate loaded checks
assert not dq_engine.validate_checks(checks).has_errors
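To then apply the loaded checks, a minimal sketch is shown below. It assumes an input DataFrame read from an illustrative table name and uses DQX's metadata-based apply methods (apply_checks_by_metadata and apply_checks_by_metadata_and_split); see the guide on applying checks for the full API.
# a minimal sketch; continues from the loading example above (dq_engine and checks are defined there)
input_df = spark.read.table("catalog.schema.input_table")  # illustrative input table name

# annotate each row with result columns based on the loaded checks
checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)

# alternatively, split the input into valid rows and quarantined rows
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)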
When using the quality checker or e2e workflows to apply quality checks, the checks are loaded from the checks_location field defined in the configuration file.