Data Profiling and Quality Checks Generation
The profiler analyzes input data to produce summary statistics and candidate data quality rules. The generated rule (check) candidates can be used as input for quality checking (see Adding quality checks to the application). In addition, the DLT generator can produce native Lakeflow Pipelines (formerly Delta Live Tables (DLT)) expectations.
Data profiling is typically performed as a one-time action on an input dataset to discover an initial set of quality rule candidates. Review the check candidates manually before applying them to your data. Profiling is not intended to run continuously or on a schedule, which also keeps compute usage and cost concerns to a minimum.
Profiling and Generating Checks
Profiling a DataFrame
Data loaded as a DataFrame can be profiled to generate summary statistics and candidate data quality rules.
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
input_df = spark.read.table("catalog1.schema1.table1")
# profile input data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)
# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"
dq_engine = DQEngine(ws)
# save checks in arbitrary workspace location
dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
# generate Lakeflow Pipeline (DLT) expectations
dlt_generator = DQDltGenerator(ws)
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print(dlt_expectations)
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print(dlt_expectations)
dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print(dlt_expectations)
The profiler samples 30% of the input data (sample_fraction = 0.3) and limits the input to 1000 records by default. These and other configuration options can be customized as described in the Profiling Options section.
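If the defaults don't fit your dataset, you can pass overrides directly. A minimal sketch that profiles the full dataset, using the same option names documented in Profiling Options below:
# Disable sampling and the record cap to profile all input data
summary_stats, profiles = profiler.profile(
    input_df,
    options={"sample_fraction": None, "limit": None},
)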
Profiling a Table
Tables can be loaded and profiled using the profile_table method.
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile_table(
table="catalog1.schema1.table1",
columns=["col1", "col2", "col3"], # specify columns to profile
)
print("Summary Statistics:", summary_stats)
print("Generated Profiles:", profiles)
Profiling Multiple Tables
The profiler can discover and profile multiple tables in Unity Catalog. Tables can be passed explicitly as a list, or included/excluded using wildcard patterns (e.g. "catalog.schema.*").
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
# Profile several tables by name
results = profiler.profile_tables(
tables=["main.data.table_001", "main.data.table_002"]
)
# Process results for each table
for summary_stats, profiles in results:
print(f"Table statistics: {summary_stats}")
print(f"Generated profiles: {profiles}")
# Profile several tables by wildcard patterns
results = profiler.profile_tables(
patterns=["main.*", "data.*"]
)
# Process results for each table
for summary_stats, profiles in results:
print(f"Table statistics: {summary_stats}")
print(f"Generated profiles: {profiles}")
# Exclude tables matching specific patterns
results = profiler.profile_tables(
patterns=["sys.*", "*_tmp"],
exclude_matched=True
)
# Process results for each table
for summary_stats, profiles in results:
print(f"Table statistics: {summary_stats}")
print(f"Generated profiles: {profiles}")
Profiling Options
The profiler supports extensive configuration options to customize the profiling behavior.
Profiling Options for a Single DataFrame or Table
You can use the options parameter to pass a dictionary of custom options when profiling a DataFrame or table.
- Python
- CLI
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient
# Custom profiling options
custom_options = {
# Sampling options
"sample_fraction": 0.2, # Sample 20% of the data
"sample_seed": 42, # Seed for reproducible sampling
"limit": 2000, # Limit to 2000 records after sampling
# Outlier detection options
"remove_outliers": True, # Enable outlier detection for min/max rules
"outlier_columns": ["price", "age"], # Only detect outliers in specific columns
"num_sigmas": 2.5, # Use 2.5 standard deviations for outlier detection
# Null value handling
"max_null_ratio": 0.05, # Generate is_not_null rule if <5% nulls
# String handling
"trim_strings": True, # Trim whitespace from strings before analysis
"max_empty_ratio": 0.02, # Generate is_not_null_or_empty rule if <2% empty strings
# Distinct value analysis
"distinct_ratio": 0.01, # Generate is_in rule if <1% distinct values
"max_in_count": 20, # Maximum items in is_in rule list
# Value rounding
"round": True, # Round min/max values for cleaner rules
}
ws = WorkspaceClient()
profiler = DQProfiler(ws)
# Apply custom options for profiling a DataFrame
summary_stats, profiles = profiler.profile(input_df, options=custom_options)
# Apply custom options for profiling a table
summary_stats, profiles = profiler.profile_table(
table="catalog1.schema1.table1",
columns=["col1", "col2", "col3"],
options=custom_options
)
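The profiles produced with custom options feed into rule generation the same way as before. A minimal sketch generating candidates at "warn" criticality, assuming generate_dq_rules accepts a level argument (as the default-level comment in the first example suggests; verify against your DQX version):
from databricks.labs.dqx.profiler.generator import DQGenerator

# Generate candidate checks from the profiles computed above.
# level="warn" is an assumption based on the "default level 'error'" comment earlier.
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles, level="warn")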
You can optionally install DQX in the workspace (see the Installation Guide). The installation creates a config, a dashboard, and a profiler workflow. The workflow can be run manually in the workspace UI or using the CLI as shown below.
DQX operates exclusively at the PySpark DataFrame level and does not interact directly with databases or storage systems. DQX does not persist data after performing quality checks, so it is the user's responsibility to store the processed data as needed: DQX does not manage the input location, output table, or quarantine table.
Open the config to check available run configs and adjust the settings if needed:
databricks labs dqx open-remote-config
See the example of the config in the installation guide.
Run profiler workflow:
databricks labs dqx profile --run-config "default"
The generated quality rule candidates and summary statistics are written to the installation folder, as defined in the run config. The run config selects a specific run configuration from 'config.yml'; if none is provided, the "default" run config is used.
The following DQX configuration from 'config.yml' is used by the profiler workflow:
- 'input_config': configuration for the input data.
- 'checks_file': relative location of the generated quality rule candidates as a yaml or json file inside the installation folder (default: checks.yml).
- 'profiler_config': configuration for the profiler containing:
  - 'summary_stats_file': relative location of the summary statistics inside the installation folder (default: profile_summary.yml).
  - 'sample_fraction': fraction of data to sample for profiling.
  - 'sample_seed': seed for reproducible sampling.
  - 'limit': maximum number of records to analyze.
- 'profiler_spark_conf': optional Spark configuration to use with the profiler job.
- 'profiler_override_clusters': optional cluster configuration to use with the profiler job.
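For orientation, a run config in 'config.yml' covering these fields might look like the sketch below; the exact layout may differ between DQX versions, so treat the installation guide's example as authoritative (all values are illustrative):
run_configs:
- name: default
  input_config:
    location: catalog1.schema1.table1
  checks_file: checks.yml
  profiler_config:
    summary_stats_file: profile_summary.yml
    sample_fraction: 0.3
    sample_seed: 42
    limit: 1000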
Logs are printed in the console and saved in the installation folder. You can display the logs from the latest profiler workflow run by executing:
databricks labs dqx logs --workflow profiler
Profiling Options for Multiple Tables
When profiling multiple tables, you can pass a list of dictionaries to apply different options to each one. Wildcard patterns are supported, allowing you to match table names and apply specific options based on those patterns.
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
profiler = DQProfiler(ws)
tables = [
"dqx.bronze.table_001",
"dqx.silver.table_001",
"dqx.silver.table_002",
]
# Custom options with wildcard patterns
custom_table_options = [
{
"table": "*", # matches all tables by pattern
"options": {"sample_fraction": 0.5}
},
{
"table": "dqx.silver.*", # matches tables in the 'dqx.silver' schema by pattern
"options": {"num_sigmas": 5}
},
{
"table": "dqx.silver.table_*", # matches tables in 'dqx.silver' schema and having 'table_' prefix
"options": {"num_sigmas": 5}
},
{
"table": "dqx.silver.table_002", # matches a specific table, overrides generic option
"options": {"sample_fraction": 0.1}
},
]
# Profile multiple tables using custom options
results = profiler.profile_tables(tables=tables, options=custom_table_options)
# Profile multiple tables by wildcard patterns using custom options
results = profiler.profile_tables(
patterns=["dqx.*"],
options=custom_table_options
)
Lakeflow Pipelines (formerly Delta Live Tables (DLT)) Expectations Generation
The DLT generator creates Lakeflow Pipelines expectation statements from profiler results.
- Python
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
from databricks.sdk import WorkspaceClient
# After profiling your data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)
# Generate Lakeflow Pipeline (DLT) expectations
dlt_generator = DQDltGenerator(ws)
# Generate SQL expectations
sql_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print("SQL Expectations:")
for expectation in sql_expectations:
print(expectation)
# Output example:
# CONSTRAINT user_id_is_null EXPECT (user_id is not null)
# CONSTRAINT age_isnt_in_range EXPECT (age >= 18 and age <= 120)
# Generate SQL expectations with actions
sql_with_drop = dlt_generator.generate_dlt_rules(profiles, language="SQL", action="drop")
print("SQL Expectations with DROP action:")
for expectation in sql_with_drop:
print(expectation)
# Output example:
# CONSTRAINT user_id_is_null EXPECT (user_id is not null) ON VIOLATION DROP ROW
# Generate Python expectations
python_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print("Python Expectations:")
print(python_expectations)
# Output example:
# @dlt.expect_all({
# "user_id_is_null": "user_id is not null",
# "age_isnt_in_range": "age >= 18 and age <= 120"
# })
# Generate Python dictionary format
dict_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print("Python Dictionary Expectations:")
print(dict_expectations)
# Output example:
# {
# "user_id_is_null": "user_id is not null",
# "age_isnt_in_range": "age >= 18 and age <= 120"
# }
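The dictionary form plugs directly into a pipeline definition. A minimal sketch using the standard dlt.expect_all decorator (the dataset names users_validated and users_raw are hypothetical):
import dlt

@dlt.table(name="users_validated")
@dlt.expect_all(dict_expectations)  # apply all generated expectations at once
def users_validated():
    # 'users_raw' is a hypothetical upstream dataset
    return dlt.read("users_raw")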
Storing Quality Checks
You can save checks defined in code or generated by the profiler to a table, or to a yaml or json file in a local path, the workspace, or the installation folder.
- Python
import yaml

from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
dq_engine = DQEngine(WorkspaceClient())
# Checks can be defined in code as below or generated by the profiler
# Must be defined as list[dict]
checks = yaml.safe_load("""
- criticality: warn
check:
function: is_not_null_and_not_empty
arguments:
column: col3
# ...
""")
# save checks in a local path
# always overwrite the file
dq_engine.save_checks_in_local_file(checks, path="checks.yml")
# save checks in arbitrary workspace location
# always overwrite the file
dq_engine.save_checks_in_workspace_file(checks, workspace_path="/Shared/App1/checks.yml")
# save checks in file defined in 'checks_file' in the run config
# always overwrite the file
# only works if DQX is installed in the workspace
dq_engine.save_checks_in_installation(checks, method="file", assume_user=True, run_config_name="default")
# save checks in a Delta table with default run config for filtering
# append checks in the table
dq_engine.save_checks_in_table(checks, table_name="dq.config.checks_table", mode="append")
# save checks in a Delta table with specific run config for filtering
# overwrite checks in the table for the given run config
dq_engine.save_checks_in_table(checks, table_name="dq.config.checks_table", run_config_name="workflow_001", mode="overwrite")
# save checks in table defined in 'checks_table' in the run config
# always overwrite checks in the table for the given run config
# only works if DQX is installed in the workspace
dq_engine.save_checks_in_installation(checks, method="table", assume_user=True, run_config_name="default")
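Saved checks can later be loaded back for quality checking. A minimal sketch, assuming the matching DQEngine load methods for each storage backend (verify the method names against your DQX version):
# load checks from a workspace file
checks = dq_engine.load_checks_from_workspace_file(workspace_path="/Shared/App1/checks.yml")

# load checks from a Delta table for a specific run config
checks = dq_engine.load_checks_from_table(table_name="dq.config.checks_table", run_config_name="workflow_001")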
Performance Considerations
When profiling large datasets, use sampling or limits for best performance.
- Python
# For large datasets, use aggressive sampling
large_dataset_opts = {
"sample_fraction": 0.01, # Sample only 1% for very large datasets
"limit": 10000, # Increase limit for better statistical accuracy
"sample_seed": 42, # Use consistent seed for reproducible results
}
# For medium datasets, use moderate sampling
medium_dataset_opts = {
"sample_fraction": 0.1, # Sample 10%
"limit": 5000, # Reasonable limit
}
# For small datasets, disable sampling
small_dataset_opts = {
"sample_fraction": None, # Use all data
"limit": None, # No limit
}
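The presets above plug straight into the profiling calls shown earlier, for example (reusing input_df from the first example):
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Profile a large dataset with aggressive sampling
summary_stats, profiles = profiler.profile(input_df, options=large_dataset_opts)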
Summary statistics computed from limited samples may not reflect the characteristics of the full dataset; for example, with sample_fraction set to 0.01, a 10-million-row table yields roughly 100,000 sampled rows, which the limit option then caps further. Balance the sampling rate and limits against the profile accuracy you need, and manually review and tune rules generated from sampled data to ensure correctness.