Additional Configuration
Adding user metadata to the results of all checks
You can provide user metadata to the results by specifying extra parameters when creating the engine.
The custom key-value metadata will be included in every quality check result inside the user_metadata field.
- Python
- Workflows
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import ExtraParams
user_metadata = {"key1": "value1", "key2": "value2"}
# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(user_metadata=user_metadata)
ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)
You can set the following fields in the configuration file to provide user metadata when using DQX workflows:
extra_params:
  user_metadata:
    custom_metadata: custom_value
Adding user metadata to the results of specific checks
You can also provide user metadata for specific checks when defining those checks programmatically or via configuration.
The custom key-value metadata will be included in every quality check result inside the user_metadata field.
If the same keys are defined in both the engine-level and check-level user metadata, the check-level values take precedence over those set in the engine.
- Python
import yaml

from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs
# define the checks programmatically using DQX classes with user metadata for an individual check
checks = [
    DQRowRule(  # check with user metadata
        name="col_5_is_null_or_empty",
        criticality="warn",
        check_func=check_funcs.is_not_null_and_not_empty,
        column="col5",
        user_metadata={"key1": "value1", "key2": "value2"}
    ),
    ...
]
# define the checks using yaml with user metadata for an individual check
checks = yaml.safe_load("""
# check with user metadata
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col5
  user_metadata:
    key1: value1
    key2: value2
""")
Customizing result columns
By default, DQX appends the _errors and _warnings result columns (and _dq_info when certain checks are used) to the output DataFrame or Table to flag quality issues.
You can customize the names of these result columns by specifying extra parameters when creating the engine.
| Default Column | Purpose | Customization Key |
|---|---|---|
| _errors | Array of critical quality check failures | errors |
| _warnings | Array of warning-level quality check issues | warnings |
| _dq_info | Structured metadata produced by certain dataset-level checks (e.g., row anomaly detection) | info |
- Python
- Workflows
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import ExtraParams
custom_column_names = {
    "errors": "dq_errors",
    "warnings": "dq_warnings",
    "info": "dq_info"  # used for some checks only, e.g. row anomaly detection
}
# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(result_column_names=custom_column_names)
ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)
You can set the following fields in the configuration file to customize the result columns when using DQX workflows:
extra_params:
  result_column_names:
    errors: dq_errors
    warnings: dq_warnings
    info: dq_info
The _dq_info reporting column is an array of structs: one element per dataset-level check that produces info (e.g. one per has_no_row_anomalies). Each element has a shared schema; for row anomaly detection the populated field is anomaly, with fields such as score, severity_percentile, is_anomaly, threshold, model, contributions, and confidence_std. For the full field list and types, see Schema of the info column (_dq_info) in the Row Anomaly Detection guide.
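As a rough sketch of how you might read this column (assuming checked_df is the output of apply_checks with a row anomaly check enabled; the nested field paths follow the schema described above and should be verified against the Row Anomaly Detection guide):

from pyspark.sql import functions as F

# explode _dq_info (one element per dataset-level check) and pull selected anomaly fields
info = (
    checked_df
    .select(F.explode("_dq_info").alias("i"))
    .select("i.anomaly.score", "i.anomaly.severity_percentile", "i.anomaly.is_anomaly")
)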
Suppressing skipped check entries
By default, when a check is skipped (e.g. because a referenced column or filter cannot be resolved in the input DataFrame), DQX records an entry in _errors or _warnings with skipped=True and a message describing why the check was skipped.
You can suppress these entries entirely by setting suppress_skipped=True in ExtraParams. When enabled, skipped checks produce no entry in _errors or _warnings, so the row is not included in the bad DataFrame.
- Python
- Workflows
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import ExtraParams
# suppress entries for checks skipped due to missing columns or invalid filters
extra_parameters = ExtraParams(suppress_skipped=True)
ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)
You can set the following fields in the configuration file to suppress skipped check entries when using DQX workflows:
extra_params:
  suppress_skipped: true
Identifying skipped checks in results
When a check is skipped and suppress_skipped is not enabled, the result struct includes a skipped boolean field set to True. This allows downstream consumers to distinguish skipped checks from actual data quality failures without parsing the message string.
from pyspark.sql import functions as F

checked_df = dq_engine.apply_checks(df, checks)

# keep only the skipped entries in _errors
skipped = checked_df.select(F.explode("_errors").alias("e")).filter(F.col("e.skipped") == True)
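Conversely, you can exclude skipped entries to keep only genuine failures. This is a sketch; whether skipped is false or null for regular failures is not specified here, so coalesce covers both cases:

# keep only actual data quality failures, excluding skipped checks
failures = (
    checked_df
    .select(F.explode("_errors").alias("e"))
    .filter(~F.coalesce(F.col("e.skipped"), F.lit(False)))
)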
Defining default variables for substitution
DQX allows you to define engine-level defaults for variables used in declarative check definitions (YAML, JSON, or Delta tables). These defaults are automatically applied during load_checks and save_checks unless overridden by the per-call variables parameter.
- Python
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import ExtraParams, FileChecksStorageConfig, TableChecksStorageConfig
from databricks.sdk import WorkspaceClient
# Initialize engine with default variables
dq_engine = DQEngine(
    WorkspaceClient(),
    extra_params=ExtraParams(
        variables={
            "min_temp": 0,
            "max_temp": 50,
            "region": "GLOBAL"
        }
    )
)
# Load checks - uses 'min_temp' and 'max_temp' from defaults,
# but overrides 'region' specifically for this call.
resolved_checks = dq_engine.load_checks(
    config=FileChecksStorageConfig(location="checks.yml"),
    variables={"region": "EMEA"},
)
# Save checks - resolves variables before computing fingerprints and persisting.
# Uses 'min_temp' and 'max_temp' from defaults, overrides 'region' for this call.
dq_engine.save_checks(
    checks=checks,  # list of check definitions created or loaded earlier
    config=TableChecksStorageConfig(location="catalog.schema.checks_table"),
    variables={"region": "EMEA"},
)
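Once loaded, the resolved checks can be applied like any other checks (a minimal sketch, assuming an input DataFrame df):

checked_df = dq_engine.apply_checks(df, resolved_checks)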
Variable substitution is not currently supported in DQX installable workflows. Variables can be defined and stored as YAML in the configuration file but will not be applied during workflow execution.
Variable substitution is only available when defining checks declaratively (as dictionaries or in files/tables). It is not supported when using DQX classes (e.g., DQRowRule) directly.
Overwriting run metadata
By default, DQX automatically generates a unique run_id for each engine instance and uses the current timestamp as the run_time. You can manually overwrite these values using ExtraParams if you need to align DQX results with external systems or re-run checks for a specific historical point in time.
- Python
- Workflows
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import ExtraParams
from databricks.sdk import WorkspaceClient
extra_params = ExtraParams(
    run_id_overwrite="custom-execution-id-123",
    run_time_overwrite="2024-01-01T12:00:00Z"
)
dq_engine = DQEngine(WorkspaceClient(), extra_params=extra_params)
You can set the following fields in the configuration file to overwrite the run metadata when using DQX workflows:
extra_params:
  run_id_overwrite: custom-execution-id-123
  run_time_overwrite: 2024-01-01T12:00:00Z