Data Profiling and Quality Checks Generation

Data profiling can be run on input data to generate summary statistics and candidate data quality rules. The generated quality rule (check) candidates can be used as input for quality checking (see Adding quality checks to the application). In addition, the DLT generator can produce native Delta Live Tables (DLT) expectations.

Data profiling usage

Data profiling is typically performed as a one-time action on the input dataset to discover an initial set of quality rule candidates. The candidate checks should be reviewed manually before being applied to the data. Profiling is not intended to run as a continuously repeated or scheduled process, which also keeps compute usage and cost concerns to a minimum.

Profiling and Generating Checks

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

input_df = spark.read.table("catalog1.schema1.table1")

# profile input data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)

# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"

dq_engine = DQEngine(ws)

# save checks in arbitrary workspace location
dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")

# generate native DLT expectations from the same profiles
dlt_generator = DQDltGenerator(ws)

# as SQL expressions
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print(dlt_expectations)

# as Python code
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print(dlt_expectations)

# as a Python dictionary
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print(dlt_expectations)
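
Once generated, the candidate checks can be applied to data in the same way as manually defined checks (see Adding quality checks to the application). A minimal sketch, assuming the metadata-based apply API:

# apply the generated checks and split the input into valid and quarantined rows
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)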

The profiler samples 30% of the data (sample ratio = 0.3) and limits the input to 1000 records by default. These and other configuration options can be customized as detailed in the Profiling Options section.
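
For example, the sample fraction and record limit can be raised for larger datasets. A minimal sketch, assuming the profiler accepts an options dictionary with keys such as sample_fraction and limit (see the Profiling Options section for the exact supported keys):

# illustrative option names; consult the Profiling Options section for the supported keys
custom_options = {
    "sample_fraction": 0.5,  # sample 50% of the data instead of the default 30%
    "limit": 5000,           # profile up to 5000 records instead of the default 1000
}
summary_stats, profiles = profiler.profile(input_df, options=custom_options)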

Storing Quality Checks

Checks defined in code or generated by the profiler can be saved to a Delta table, or as YAML or JSON to a file in a local path, the workspace, or the installation folder.

import yaml

from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Checks can be defined in code as below or generated by the profiler
# Must be defined as list[dict]
checks = yaml.safe_load("""
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      column: col3
# ...
""")

# save checks in a local path
# always overwrite the file
dq_engine.save_checks_in_local_file(checks, path="checks.yml")

# save checks in arbitrary workspace location
# always overwrite the file
dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")

# save checks in file defined in 'checks_file' in the run config
# always overwrite the file
# only works if DQX is installed in the workspace
dq_engine.save_checks_in_installation(checks, method="file", assume_user=True, run_config_name="default")

# save checks in a Delta table with default run config for filtering
# append checks in the table
dq_engine.save_checks_in_table(checks, table_name="dq.config.checks_table", mode="append")

# save checks in a Delta table with specific run config for filtering
# overwrite checks in the table for the given run config
dq_engine.save_checks_in_table(checks, table_name="dq.config.checks_table", run_config_name="workflow_001", mode="overwrite")

# save checks in table defined in 'checks_table' in the run config
# always overwrite checks in the table for the given run config
# only works if DQX is installed in the workspace
dq_engine.save_checks_in_installation(checks, method="table", assume_user=True, run_config_name="default")
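
Checks stored in any of these locations can later be loaded back for quality checking. A minimal sketch, assuming load methods on DQEngine that mirror the save methods shown above (method names and signatures are assumptions; see Adding quality checks to the application for the supported loading APIs):

# load checks back from a workspace file (assumed counterpart to save_checks_in_workspace_file)
checks = dq_engine.load_checks_from_workspace_file(workspace_path="/Shared/App1/checks.yml")

# load checks back from a Delta table for a given run config (assumed counterpart to save_checks_in_table)
checks = dq_engine.load_checks_from_table(table_name="dq.config.checks_table", run_config_name="workflow_001")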