Quality rules and functions reference

This page provides a reference for the quality rules and functions available in DQX.

Quality rules / functions

The following quality rules / functions are currently available:

| Check | Description | Arguments |
|-------|-------------|-----------|
| is_not_null | Check if input column is not null | col_name: column name to check |
| is_not_empty | Check if input column is not empty | col_name: column name to check |
| is_not_null_and_not_empty | Check if input column is not null or empty | col_name: column name to check; trim_strings: boolean flag to trim spaces from strings |
| value_is_in_list | Check if the provided value is present in the input column | col_name: column name to check; allowed: list of allowed values |
| value_is_not_null_and_is_in_list | Check if provided value is present if the input column is not null | col_name: column name to check; allowed: list of allowed values |
| is_in_range | Check if input column is in the provided range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value |
| is_not_in_range | Check if input column is not within the defined range (inclusive of both boundaries) | col_name: column name to check; min_limit: min limit value; max_limit: max limit value |
| not_less_than | Check if input column is not less than the provided limit | col_name: column name to check; limit: limit value |
| not_greater_than | Check if input column is not greater than the provided limit | col_name: column name to check; limit: limit value |
| not_in_future | Check if input column defined as date is not in the future (future defined as current_timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| not_in_near_future | Check if input column defined as date is not in the near future (near future defined as greater than current timestamp but less than current timestamp + offset) | col_name: column name to check; offset: offset to use; curr_timestamp: current timestamp, if not provided current_timestamp() function is used |
| is_older_than_n_days | Check if input column is older than n number of days | col_name: column name to check; days: number of days; curr_date: current date, if not provided current_date() function is used |
| is_older_than_col2_for_n_days | Check if one column is not older than another column by n number of days | col_name1: first column name to check; col_name2: second column name to check; days: number of days |
| regex_match | Check if input column matches a given regex | col_name: column name to check; regex: regex to check; negate: if the condition should be negated (true) or not |
| sql_expression | Check if input column matches the provided SQL expression, e.g. a = 'str1', a > b | expression: SQL expression to check; msg: optional message to output; name: optional name of the resulting column; negate: if the condition should be negated |
| is_not_null_and_not_empty_array | Check if input array column is not null or empty | col_name: column name to check |

You can check implementation details of the rules here.
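
For example, a built-in rule is declared in YAML by its function name plus the arguments listed in the table above. A minimal sketch using is_in_range (the column name and limits are illustrative):

- criticality: "error"
  check:
    function: "is_in_range"
    arguments:
      col_name: "age"
      min_limit: 0
      max_limit: 120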

Creating your own checks

Use SQL expression

If a check you need does not exist in DQX, you can define it using the SQL expression rule (sql_expression), for example:

- criticality: "error"
  check:
    function: "sql_expression"
    arguments:
      expression: "col1 LIKE '%foo'"
      msg: "col1 ends with 'foo'"

SQL expressions are also useful for cross-column validation, for example:

- criticality: "error"
  check:
    function: "sql_expression"
    arguments:
      expression: "a > b"
      msg: "a is greater than b"

Define custom check functions

If you need a reusable check or want to implement more complex logic, you can define your own check functions. A check is a function available from globals() that returns a pyspark.sql.Column, for example:

import pyspark.sql.functions as F
from pyspark.sql import Column
from databricks.labs.dqx.col_functions import make_condition

def ends_with_foo(col_name: str) -> Column:
    column = F.col(col_name)
    # make_condition takes the failure condition, the message to report,
    # and the name of the resulting condition column.
    return make_condition(column.endswith("foo"), f"Column {col_name} ends with foo", f"{col_name}_ends_with_foo")
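
As a quick sanity check, you can evaluate the returned Column directly on a DataFrame before wiring the function into the engine. A minimal sketch, assuming a local SparkSession and the ends_with_foo function defined above (the sample data is illustrative, and the expectation that passing rows yield null reflects how make_condition results are typically consumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abcfoo",), ("bar",)], ["col1"])

# Expect the condition column to carry the message for "abcfoo" and null for "bar"
df.select("col1", ends_with_foo("col1")).show(truncate=False)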

Then you can use the function as a check:

import yaml
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.col_functions import *

checks = yaml.safe_load("""
- criticality: "error"
  check:
    function: "ends_with_foo"
    arguments:
      col_name: "col1"
""")

dq_engine = DQEngine(WorkspaceClient())

# Option 1: apply quality rules on the dataframe and provide valid and invalid (quarantined) dataframes
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks, globals())

# Option 2: apply quality rules on the dataframe and report issues as additional columns (`_warning` and `_error`)
valid_and_quarantined_df = dq_engine.apply_checks_by_metadata(input_df, checks, globals())
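
Which option you choose depends on how you want to handle failing rows: splitting (Option 1) lets you route quarantined records to a separate sink for remediation, while annotating (Option 2) keeps all rows in a single DataFrame and leaves filtering to downstream consumers.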

You can see all existing DQX checks here.

Feel free to submit a PR to DQX with your own check so that others can benefit from it (see the contribution guide).