Additional Configuration
Profiling Options
The profiler samples the input data by default with a fraction of 0.3 (30%) and limits the input to 1000 records. You can adjust these and other parameters as follows:
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
input_df = spark.read.table("catalog1.schema1.table1")
default_profile_options = {
    "round": True,  # round the min/max values
    "max_in_count": 10,  # generate is_in rule only if the column has fewer distinct values than this count
    "distinct_ratio": 0.05,  # generate is_in rule only if the ratio of distinct values is below this threshold (5%)
    "max_null_ratio": 0.01,  # generate a null check rule if less than 1 percent of the values are null
    "remove_outliers": True,  # remove outliers
    "outlier_columns": [],  # columns to remove outliers from
    "num_sigmas": 3,  # number of sigmas to use when remove_outliers is True
    "trim_strings": True,  # trim whitespace from strings
    "max_empty_ratio": 0.01,  # generate an empty-string check rule if less than 1 percent of the strings are empty
    "sample_fraction": 0.3,  # fraction of data to sample (30%)
    "sample_seed": None,  # seed for sampling
    "limit": 1000,  # limit the number of sampled records
}
columns_to_profile = ["col1", "col2"]
# profile input data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df, cols=columns_to_profile, opts=default_profile_options)
# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"
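The generated checks can then be applied with the engine. Below is a minimal sketch, assuming the generated checks are in the metadata (dict) format accepted by the engine's apply_checks_by_metadata_and_split method and reusing the input_df defined above:
- Python
dq_engine = DQEngine(ws)

# apply the generated checks and split the data into valid and quarantined records
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)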
Adding User Metadata to the Results of All Checks
You can add user metadata to all check results by specifying extra parameters when creating the engine.
The custom key-value metadata will be included in every quality check result inside the user_metadata
field.
- Python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import ExtraParams
user_metadata = {"key1": "value1", "key2": "value2"}
# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(user_metadata=user_metadata)
ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)
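Any checks applied with this engine then carry the metadata in every result. Below is a minimal sketch; the check definition and input_df are illustrative:
- Python
import yaml

# illustrative check; input_df is assumed to be an existing DataFrame
checks = yaml.safe_load("""
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col1
""")

# each entry in the reporting columns now includes
# user_metadata: {"key1": "value1", "key2": "value2"}
result_df = dq_engine.apply_checks_by_metadata(input_df, checks)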
Adding User Metadata to the Results of Specific Checks
You can also provide user metadata for specific checks when defining those checks in code or configuration.
The custom key-value metadata will be included in every quality check result inside the user_metadata
field.
When the same properties are defined in both the engine and check-level user metadata, the check-level values will override the values set in the engine.
- Python
import yaml

from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs
# define the checks programmatically using DQX classes with user metadata for an individual check
checks = [
    DQRowRule(  # check with user metadata
        name="col_5_is_null_or_empty",
        criticality="warn",
        check_func=check_funcs.is_not_null_and_not_empty,
        column="col5",
        user_metadata={"key1": "value1", "key2": "value2"},
    ),
    ...
]
# define the checks using yaml with user metadata for an individual check
checks = yaml.safe_load("""
# check with user metadata
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col5
  user_metadata:
    key1: value1
    key2: value2
""")
Customizing Reporting Columns
By default, DQX appends _error and _warning reporting columns to the output DataFrame to flag quality issues.
You can customize the names of these reporting columns by specifying extra parameters when creating the engine.
- Python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import ExtraParams
custom_column_names = {"errors": "dq_errors", "warnings": "dq_warnings"}
# use ExtraParams to configure one or more optional parameters
extra_parameters = ExtraParams(column_names=custom_column_names)
ws = WorkspaceClient()
dq_engine = DQEngine(ws, extra_params=extra_parameters)
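With this configuration, applying checks reports quality issues in the renamed columns instead of the defaults. A minimal sketch, assuming checks and input_df already exist:
- Python
# quality issues are now reported in "dq_errors" and "dq_warnings"
# instead of the default reporting columns
result_df = dq_engine.apply_checks_by_metadata(input_df, checks)
print([c for c in result_df.columns if c.startswith("dq_")])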