Quality rules
This page provides a reference for the quality checks (rule functions) available in DQX.
Row-level quality checks
The following row-level checks are currently available in DQX. These checks are applied to each row of a PySpark DataFrame and generate issue reports as additional columns. You can also define your own custom checks (see Creating custom checks).
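For example, applying a single check to a DataFrame looks like this (a minimal sketch, assuming an existing input_df; the functions used here are described in the sections below):

import yaml
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# one metadata-defined check; see the reference table below for all available checks
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: col1
""")

dq_engine = DQEngine(WorkspaceClient())
# the returned DataFrame carries the issue reports as additional columns
checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)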
Available row-level checks
Check | Description | Arguments |
---|---|---|
is_not_null | Checks whether the values in the input column are not null. | col_name: column to check (can be a string column name or a column expression) |
is_not_empty | Checks whether the values in the input column are not empty (but may be null). | col_name: column to check (can be a string column name or a column expression) |
is_not_null_and_not_empty | Checks whether the values in the input column are not null and not empty. | col_name: column to check (can be a string column name or a column expression); trim_strings: optional boolean flag to trim spaces from strings |
is_in_list | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). | col_name: column to check (can be a string column name or a column expression); allowed: list of allowed values |
is_not_null_and_is_in_list | Checks whether the values in the input column are not null and present in the list of allowed values. | col_name: column to check (can be a string column name or a column expression); allowed: list of allowed values |
is_not_null_and_not_empty_array | Checks whether the values in the array input column are not null and not empty. | col_name: column to check (can be a string column name or a column expression) |
is_in_range | Checks whether the values in the input column are in the provided range (inclusive of both boundaries). | col_name: column to check (can be a string column name or a column expression); min_limit: min limit as number, date, timestamp, column name or sql expression; max_limit: max limit as number, date, timestamp, column name or sql expression |
is_not_in_range | Checks whether the values in the input column are outside the provided range (inclusive of both boundaries). | col_name: column to check (can be a string column name or a column expression); min_limit: min limit as number, date, timestamp, column name or sql expression; max_limit: max limit as number, date, timestamp, column name or sql expression |
is_not_less_than | Checks whether the values in the input column are not less than the provided limit. | col_name: column to check (can be a string column name or a column expression); limit: limit as number, date, timestamp, column name or sql expression |
is_not_greater_than | Checks whether the values in the input column are not greater than the provided limit. | col_name: column to check (can be a string column name or a column expression); limit: limit as number, date, timestamp, column name or sql expression |
is_valid_date | Checks whether the values in the input column have valid date formats. | col_name: column to check (can be a string column name or a column expression); date_format: optional date format (e.g. 'yyyy-MM-dd') |
is_valid_timestamp | Checks whether the values in the input column have valid timestamp formats. | col_name: column to check (can be a string column name or a column expression); timestamp_format: optional timestamp format (e.g. 'yyyy-MM-dd HH:mm:ss') |
is_not_in_future | Checks whether the values in the input column contain a timestamp that is not in the future, where 'future' is defined as current_timestamp + offset (in seconds). | col_name: column to check (can be a string column name or a column expression); offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
is_not_in_near_future | Checks whether the values in the input column contain a timestamp that is not in the near future, where 'near future' is defined as greater than the current timestamp but less than the current_timestamp + offset (in seconds). | col_name: column to check (can be a string column name or a column expression); offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
is_older_than_n_days | Checks whether the values in the input column are at least N days older than the current date. | col_name: column to check (can be a string column name or a column expression); days: number of days; curr_date: current date, if not provided current_date() function is used |
is_older_than_col2_for_n_days | Checks whether the values in one input column are at least N days older than the values in another column. | col_name1: first column to check (can be a string column name or a column expression); col_name2: second column to check (can be a string column name or a column expression); days: number of days |
is_unique | Checks whether the values in the input column are unique and reports an issue for each row that contains a duplicate value. Null values are not considered duplicates, following the ANSI SQL standard. | col_name: column to check (can be a string column name or a column expression); window_spec: optional window specification as a string or column object; handle NULLs in the window column with coalesce() to prevent rows from being excluded |
regex_match | Checks whether the values in the input column match a given regex. | col_name: column to check (can be a string column name or a column expression); regex: regex to check; negate: if the condition should be negated (true) or not |
sql_expression | Checks whether the values meet the condition provided as an SQL expression, e.g. a = 'str1', a > b. | expression: SQL expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |
You can explore the implementation details of the rules here. If you have a custom check that could be broadly useful, feel free to submit a PR to DQX (see the contribution guide for details).
Usage examples of row-level checks
Below are fully specified examples of how to use each check in YAML format and with DQX classes. Both are equivalent and can be used interchangeably.
The criticality field can be either "error" (data goes only into the bad/quarantine DataFrame) or "warn" (data goes into both the good and bad DataFrames).
For brevity, the name field is omitted in the examples; it will be auto-generated in the results.
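For example, a check can be made non-blocking by switching its criticality to "warn" (a minimal sketch; flagged rows then appear in both the good and bad DataFrames):

- criticality: warn
  check:
    function: is_not_null
    arguments:
      col_name: col1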
Checks defined in YAML
# is_not_null check
- criticality: error
check:
function: is_not_null
arguments:
col_name: col1
# is_not_empty check
- criticality: error
check:
function: is_not_empty
arguments:
col_name: col1
# is_not_null_and_not_empty check
- criticality: error
check:
function: is_not_null_and_not_empty
arguments:
col_name: col1
trim_strings: true
# is_in_list check
- criticality: error
check:
function: is_in_list
arguments:
col_name: col2
allowed:
- 1
- 2
- 3
# is_not_null_and_is_in_list check
- criticality: error
check:
function: is_not_null_and_is_in_list
arguments:
col_name: col2
allowed:
- 1
- 2
- 3
# is_not_null_and_not_empty_array check
- criticality: error
check:
function: is_not_null_and_not_empty_array
arguments:
col_name: col4
# is_in_range check
- criticality: error
check:
function: is_in_range
arguments:
col_name: col2
min_limit: 1
max_limit: 10
- criticality: error
check:
function: is_in_range
arguments:
col_name: col5
min_limit: 2025-01-01
max_limit: 2025-02-24
- criticality: error
check:
function: is_in_range
arguments:
col_name: col6
min_limit: 2025-01-01 00:00:00
max_limit: 2025-02-24 01:00:00
- criticality: error
check:
function: is_in_range
arguments:
col_name: col3
min_limit: col2
max_limit: col2 * 2
# is_not_in_range check
- criticality: error
check:
function: is_not_in_range
arguments:
col_name: col2
min_limit: 11
max_limit: 20
- criticality: error
check:
function: is_not_in_range
arguments:
col_name: col5
min_limit: 2025-02-25
max_limit: 2025-02-26
- criticality: error
check:
function: is_not_in_range
arguments:
col_name: col6
min_limit: 2025-02-25 00:00:00
max_limit: 2025-02-26 01:00:00
- criticality: error
check:
function: is_not_in_range
arguments:
col_name: col3
min_limit: col2 + 10
max_limit: col2 * 10
# is_not_less_than check
- criticality: error
check:
function: is_not_less_than
arguments:
col_name: col2
limit: 0
- criticality: error
check:
function: is_not_less_than
arguments:
col_name: col5
limit: 2025-01-01
- criticality: error
check:
function: is_not_less_than
arguments:
col_name: col6
limit: 2025-01-01 01:00:00
- criticality: error
check:
function: is_not_less_than
arguments:
col_name: col3
limit: col2 - 10
# is_not_greater_than check
- criticality: error
check:
function: is_not_greater_than
arguments:
col_name: col2
limit: 10
- criticality: error
check:
function: is_not_greater_than
arguments:
col_name: col5
limit: 2025-03-01
- criticality: error
check:
function: is_not_greater_than
arguments:
col_name: col6
limit: 2025-03-24 01:00:00
- criticality: error
check:
function: is_not_greater_than
arguments:
col_name: col3
limit: col2 + 10
# is_valid_date check
- criticality: error
check:
function: is_valid_date
arguments:
col_name: col5
- criticality: error
name: col5_is_not_valid_date2
check:
function: is_valid_date
arguments:
col_name: col5
date_format: yyyy-MM-dd
# is_valid_timestamp check
- criticality: error
check:
function: is_valid_timestamp
arguments:
col_name: col6
timestamp_format: yyyy-MM-dd HH:mm:ss
- criticality: error
name: col6_is_not_valid_timestamp2
check:
function: is_valid_timestamp
arguments:
col_name: col6
# is_not_in_future check
- criticality: error
check:
function: is_not_in_future
arguments:
col_name: col6
offset: 86400
# is_not_in_near_future check
- criticality: error
check:
function: is_not_in_near_future
arguments:
col_name: col6
offset: 36400
# is_older_than_n_days check
- criticality: error
check:
function: is_older_than_n_days
arguments:
col_name: col5
days: 10000
# is_older_than_col2_for_n_days check
- criticality: error
check:
function: is_older_than_col2_for_n_days
arguments:
col_name1: col5
col_name2: col6
days: 2
# is_unique check
- criticality: error
check:
function: is_unique
arguments:
col_name: col1
# is_unique check with custom window
# provide a default value for NULLs in the time column of the window spec using coalesce() to prevent rows from being excluded!
- criticality: error
name: col1_is_not_unique2
check:
function: is_unique
arguments:
col_name: col1
window_spec: window(coalesce(col6, '1970-01-01'), '10 minutes')
# regex_match check
- criticality: error
check:
function: regex_match
arguments:
col_name: col2
regex: '[0-9]+'
negate: false
# sql_expression check
- criticality: error
check:
function: sql_expression
arguments:
expression: col3 >= col2 and col3 <= 10
msg: col3 is less than col2 or greater than 10
name: custom_output_name
negate: false
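To execute the YAML-defined checks above, load them with yaml.safe_load and pass them to the engine (a minimal sketch, assuming an existing input_df):

import yaml
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: col1
""")

dq_engine = DQEngine(WorkspaceClient())
# split the input into valid rows and quarantined (invalid) rows
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)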
Checks defined using DQX classes
import pyspark.sql.functions as F  # needed for the is_unique window_spec example below
from datetime import datetime
from databricks.labs.dqx.rule import DQColRule
from databricks.labs.dqx.col_check_functions import *
checks = [
DQColRule(
criticality="error",
check_func=is_not_null,
col_name="col1"
),
DQColRule(
criticality="error",
check_func=is_not_empty,
col_name="col1"
),
DQColRule(
criticality="error",
check_func=is_not_null_and_not_empty,
col_name="col1",
check_func_kwargs={"trim_strings": True}
),
DQColRule(
criticality="error",
check_func=is_in_list,
col_name="col2",
check_func_args=[[1, 2, 3]]
),
DQColRule(
criticality="error",
check_func=is_not_null_and_is_in_list,
col_name="col2",
check_func_args=[[1, 2, 3]]
),
DQColRule(
criticality="error",
check_func=is_not_null_and_not_empty_array,
col_name="col4"
),
DQColRule(
criticality="error",
check_func=is_in_range,
col_name="col2",
check_func_kwargs={"min_limit": 1, "max_limit": 10}
),
DQColRule(
criticality="error",
check_func=is_in_range,
col_name="col5",
check_func_kwargs={
"min_limit": datetime(2025, 1, 1).date(),
"max_limit": datetime(2025, 2, 24).date()
}
),
DQColRule(
criticality="error",
check_func=is_in_range,
col_name="col6",
check_func_kwargs={
"min_limit": datetime(2025, 1, 1, 0, 0, 0),
"max_limit": datetime(2025, 2, 24, 1, 0, 0)
}
),
DQColRule(
criticality="error",
check_func=is_in_range,
col_name="col3",
check_func_kwargs={
"min_limit": "col2",
"max_limit": "col2 * 2"
}
),
DQColRule(
criticality="error",
check_func=is_not_in_range,
col_name="col2",
check_func_kwargs={
"min_limit": 11,
"max_limit": 20
}
),
DQColRule(
criticality="error",
check_func=is_not_in_range,
col_name="col5",
check_func_kwargs={
"min_limit": datetime(2025, 2, 25).date(),
"max_limit": datetime(2025, 2, 26).date()
}
),
DQColRule(
criticality="error",
check_func=is_not_in_range,
col_name="col6",
check_func_kwargs={
"min_limit": datetime(2025, 2, 25, 0, 0, 0),
"max_limit": datetime(2025, 2, 26, 1, 0, 0)
}
),
DQColRule(
criticality="error",
check_func=is_not_in_range,
col_name="col3",
check_func_kwargs={
"min_limit": "col2 + 10",
"max_limit": "col2 * 10"
}
),
DQColRule(
criticality="error",
check_func=is_not_less_than,
col_name="col2",
check_func_kwargs={"limit": 0}
),
DQColRule(
criticality="error",
check_func=is_not_less_than,
col_name="col5",
check_func_kwargs={"limit": datetime(2025, 1, 1).date()}
),
DQColRule(
criticality="error",
check_func=is_not_less_than,
col_name="col6",
check_func_kwargs={"limit": datetime(2025, 1, 1, 1, 0, 0)}
),
DQColRule(
criticality="error",
check_func=is_not_less_than,
col_name="col3",
check_func_kwargs={"limit": "col2 - 10"}
),
DQColRule(
criticality="error",
check_func=is_not_greater_than,
col_name="col2",
check_func_kwargs={"limit": 10}
),
DQColRule(
criticality="error",
check_func=is_not_greater_than,
col_name="col5",
check_func_kwargs={"limit": datetime(2025, 3, 1).date()}
),
DQColRule(
criticality="error",
check_func=is_not_greater_than,
col_name="col6",
check_func_kwargs={"limit": datetime(2025, 3, 24, 1, 0, 0)}
),
DQColRule(
criticality="error",
check_func=is_not_greater_than,
col_name="col3",
check_func_kwargs={"limit": "col2 + 10"}
),
DQColRule(
criticality="error",
check_func=is_valid_date,
col_name="col5"
),
DQColRule(
criticality="error",
check_func=is_valid_date,
col_name="col5",
check_func_kwargs={"date_format": "yyyy-MM-dd"},
name="col5_is_not_valid_date2"
),
DQColRule(
criticality="error",
check_func=is_valid_timestamp,
col_name="col6"
),
DQColRule(
criticality="error",
check_func=is_valid_timestamp,
col_name="col6",
check_func_kwargs={"timestamp_format": "yyyy-MM-dd HH:mm:ss"},
name="col6_is_not_valid_timestamp2"
),
DQColRule(
criticality="error",
check_func=is_not_in_future,
col_name="col6",
check_func_kwargs={"offset": 86400}
),
DQColRule(
criticality="error",
check_func=is_not_in_near_future,
col_name="col6",
check_func_kwargs={"offset": 36400}
),
DQColRule(
criticality="error",
check_func=is_older_than_n_days,
col_name="col5",
check_func_kwargs={"days": 10000}
),
DQColRule(
criticality="error",
check_func=is_older_than_col2_for_n_days,
check_func_args=["col5", "col6", 2]
),
DQColRule(
criticality="error",
check_func=is_unique,
col_name="col1"
),
DQColRule(
criticality="error",
name="col1_is_not_unique2",
# provide a default value for NULLs in the time column of the window spec using coalesce() to prevent rows from being excluded!
check_func=is_unique,
col_name="col1",
check_func_kwargs={
"window_spec": F.window(F.coalesce(F.col("col6"),
F.lit(datetime(1970, 1, 1))), "10 minutes")
}
),
DQColRule(
criticality="error",
check_func=regex_match,
col_name="col2",
check_func_kwargs={
"regex": "[0-9]+", "negate": False
}
),
DQColRule(
criticality="error",
check_func=sql_expression,
check_func_kwargs={
"expression": "col3 >= col2 and col3 <= 10",
"msg": "col3 is less than col2 and col3 is greater than 10",
"name": "custom_output_name",
"negate": False
}
),
]
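The class-based checks above are executed with the non-metadata variants of the engine methods (a sketch, assuming an existing input_df):

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

dq_engine = DQEngine(WorkspaceClient())
# Option 1: split the input into valid rows and quarantined (invalid) rows
valid_df, quarantined_df = dq_engine.apply_checks_and_split(input_df, checks)
# Option 2: report issues as additional columns on a single DataFrame
checked_df = dq_engine.apply_checks(input_df, checks)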
Usage examples of row-level checks for complex column types (Struct, Map, Array)
You can apply checks to complex column types (StructType, MapType, ArrayType) by passing a column expression to the check function or by using the sql_expression check function.
Below are examples of how to apply checks for complex types in YAML format and with DQX classes.
Checks on complex column types defined in YAML
# is_not_null check applied to a struct column element (dot notation)
- criticality: error
check:
function: is_not_null
arguments:
col_name: col8.field1
# is_not_null check applied to a map column element
- criticality: error
check:
function: is_not_null
arguments:
col_name: try_element_at(col7, 'key1')
# is_not_null check applied to an array column element at the specified position
- criticality: error
check:
function: is_not_null
arguments:
col_name: try_element_at(col4, 1)
# is_not_greater_than check applied to an array column
- criticality: error
check:
function: is_not_greater_than
arguments:
col_name: array_max(col4)
limit: 10
# is_not_less_than check applied to an array column
- criticality: error
check:
function: is_not_less_than
arguments:
col_name: array_min(col4)
limit: 1
# sql_expression check applied to a map column element
- criticality: error
check:
function: sql_expression
arguments:
expression: try_element_at(col7, 'key1') < 10
msg: col7 element 'key1' is less than 10
name: col7_element_key1_less_than_10
negate: false
# sql_expression check applied to an array of map column elements
- criticality: error
check:
function: sql_expression
arguments:
expression: not exists(col4, x -> x >= 10)
msg: array col4 has an element greater than 10
name: col4_all_elements_less_than_10
negate: false
# apply check to multiple columns (simple col, struct, map and array)
- criticality: error
check:
function: is_not_null
arguments:
col_names:
- col1 # col
- col8.field1 # struct col
- try_element_at(col7, 'key1') # map col
- try_element_at(col4, 1) # array col
Checks on complex column types defined using DQX classes
import pyspark.sql.functions as F
from databricks.labs.dqx.rule import DQColRule, DQColSetRule
from databricks.labs.dqx.col_check_functions import *
checks = [
# is_not_null check applied to a struct column element (dot notation)
DQColRule(
criticality="error",
check_func=is_not_null,
col_name="col8.field1",
),
# is_not_null check applied to a map column element
DQColRule(
criticality="error",
check_func=is_not_null,
col_name=F.try_element_at("col7", F.lit("key1")),
),
# is_not_null check applied to an array column element at the specified position
DQColRule(
criticality="error",
check_func=is_not_null,
col_name=F.try_element_at("col4", F.lit(1)),
),
# is_not_greater_than check applied to an array column
DQColRule(
criticality="error",
check_func=is_not_greater_than,
col_name=F.array_max("col4"),
check_func_kwargs={"limit": 10},
),
# is_not_less_than check applied to an array column
DQColRule(
criticality="error",
check_func=is_not_less_than,
col_name=F.array_min("col4"),
check_func_kwargs={"limit": 1},
),
# sql_expression check applied to a map column element
DQColRule(
criticality="error",
check_func=sql_expression,
check_func_kwargs={
"expression": "try_element_at(col7, 'key1') < 10",
"msg": "col7 element 'key1' is less than 10",
"name": "col7_element_key1_less_than_10",
"negate": False,
},
),
# sql_expression check applied to an array of map column elements
DQColRule(
criticality="error",
check_func=sql_expression,
check_func_kwargs={
"expression": "not exists(col4, x -> x >= 10)",
"msg": "array col4 has an element greater than 10",
"name": "col4_all_elements_less_than_10",
"negate": False,
},
),
] + DQColSetRule( # apply check to multiple columns (simple col, struct, map and array)
check_func=is_not_null,
criticality="error",
columns=[
"col1", # col as string
F.col("col2"), # col
"col8.field1", # struct col
F.try_element_at("col7", F.lit("key1")), # map col
F.try_element_at("col4", F.lit(1)) # array col
]
).get_rules()
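DQColSetRule is a convenience for applying the same check to several columns: get_rules() expands the column list into one DQColRule per column, so the resulting rules can be concatenated with individually defined rules as in the example above.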
Applying filters on checks
You can apply checks to a subset of the DataFrame by using a filter.
For example, to check that column col1 is not null only when column col2 is positive, you can define the check as follows:
- criticality: error
filter: col2 > 0
check:
function: is_not_null
arguments:
col_name: col1
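The same filter can be attached to a rule defined with DQX classes; a sketch assuming the filter argument of DQColRule accepts the same SQL expression string as the YAML field:

from databricks.labs.dqx.rule import DQColRule
from databricks.labs.dqx.col_check_functions import is_not_null

checks = [
    DQColRule(
        criticality="error",
        filter="col2 > 0",  # assumption: same SQL filter string as in the YAML form
        check_func=is_not_null,
        col_name="col1",
    ),
]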
Creating custom checks
Custom checks with SQL Expression
You can define custom checks using the SQL expression rule (sql_expression), for example:
- criticality: error
check:
function: sql_expression
arguments:
expression: col1 LIKE '%foo'
msg: col1 ends with 'foo'
SQL expressions are also useful when you need cross-column validation, for example:
- criticality: error
check:
function: sql_expression
arguments:
expression: col1 > col2
msg: col1 is greater than col2
Custom checks as a Python function
If you need a reusable check, or want to implement logic that is too complex to express as a SQL expression, you can define your own custom check function.
A check function is a callable that uses make_condition to return a pyspark.sql.Column.
Custom check example
import pyspark.sql.functions as F
from pyspark.sql import Column
from databricks.labs.dqx.col_check_functions import make_condition
def ends_with_foo(col_name: str) -> Column:
column = F.col(col_name)
return make_condition(column.endswith("foo"), f"Column {col_name} ends with foo", f"{col_name}_ends_with_foo")
Execution of the custom check using DQX classes
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.rule import DQColRule
from databricks.labs.dqx.col_check_functions import is_not_null
checks = [
DQColRule(criticality="error", check_func=is_not_null, col_name="col1"),
DQColRule(criticality="error", check_func=ends_with_foo, col_name="col1"),
]
dq_engine = DQEngine(WorkspaceClient())
# Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
valid_df, quarantined_df = dq_engine.apply_checks_and_split(input_df, checks)
# Option 2: apply quality rules on the dataframe and report issues as additional columns
valid_and_quarantined_df = dq_engine.apply_checks(input_df, checks)
Execution of the custom check using YAML definition
import yaml
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
checks = yaml.safe_load("""
- criticality: error
check:
function: ends_with_foo
arguments:
col_name: col1
- criticality: error
check:
function: is_not_null
arguments:
col_name: col1
""")
dq_engine = DQEngine(WorkspaceClient())
custom_check_functions = {"ends_with_foo": ends_with_foo}  # mapping of function names to custom check functions
# or include all functions in scope with globals() for simplicity:
# custom_check_functions = globals()
# Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks, custom_check_functions)
# Option 2: apply quality rules on the dataframe and report issues as additional columns
valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks, custom_check_functions)
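Rows flagged by the checks can then be inspected directly; a sketch assuming the default DQX result columns (_errors and _warnings) carry the reported issues:

import pyspark.sql.functions as F

# keep only rows with at least one error-level issue
# (assumes the default _errors result column name)
failed_df = valid_and_quarantined_df.filter(F.col("_errors").isNotNull())
failed_df.select("col1", "_errors").show(truncate=False)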