Skip to main content

Additional Configuration

Profiling Options

Profiler will sample the input data by default with a factor of 0.3 (30%) and limit the input to 1000 records. You can adjust these and other parameters as follows:

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

input_df = spark.read.table("catalog1.schema1.table1")

default_profile_options = {
"round": True, # round the min/max values
"max_in_count": 10, # generate is_in if we have less than 1 percent of distinct values
"distinct_ratio": 0.05, # generate is_distinct if we have less than 1 percent of distinct values
"max_null_ratio": 0.01, # generate is_null if we have less than 1 percent of nulls
"remove_outliers": True, # remove outliers
"outlier_columns": [], # remove outliers in the columns
"num_sigmas": 3, # number of sigmas to use when remove_outliers is True
"trim_strings": True, # trim whitespace from strings
"max_empty_ratio": 0.01, # generate is_empty if we have less than 1 percent of empty strings
"sample_fraction": 0.3, # fraction of data to sample (30%)
"sample_seed": None, # seed for sampling
"limit": 1000, # limit the number of samples
}
columns_to_profile = ["col1", "col2"]

# profile input data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df, cols=columns_to_profile, opts=profile_options)

# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"

Adding User Metadata to the Results of All Checks

You can provide user metadata to the results by specifying extra parameters when creating the engine. The custom key-value metadata will be included in every quality check result inside the user_metadata field.

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import ExtraParams

user_metadata = {"key1": "value1", "key2": "value2"}

# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(user_metadata=user_metadata)

ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)

Adding User Metadata to the Results of Specific Checks

You can also provide user metadata for specific checks when defining those checks in code or configuration. The custom key-value metadata will be included in every quality check result inside the user_metadata field.

When the same properties are defined in both the engine and check-level user metadata, the check-level values will override the values set in the engine.

from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs


# define the checks programmatically using DQX classes with user metadata for an individual check
checks = [
DQRowRule( # check with user metadata
name="col_5_is_null_or_empty",
criticality="warn",
check_func=check_funcs.is_not_null_and_not_empty,
column="col5",
user_metadata={"key1": "value1", "key2": "value2"}
),
...
]

# define the checks using yaml with user metadata for an individual check
checks = yaml.safe_load("""
# check with user metadata
- criticality: warn
check:
function: is_not_null_and_not_empty
arguments:
column: col5
user_metadata:
key1: value1
key2: value2
""")

Customizing Reporting Columns

By default, DQX appends _error and _warning reporting columns to the output DataFrame to flag quality issues. You can customize the names of these reporting columns by specifying extra parameters when creating the engine.

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import ExtraParams

custom_column_names = {"errors": "dq_errors", "warnings": "dq_warnings"}

# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(column_names=custom_column_names)

ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)