Profiling and Quality Checks Generation

Data profiling can be run on input data to generate summary statistics and candidate data quality rules. The generated data quality rule (check) candidates can be used as input for quality checking as described here. In addition, the Lakeflow (DLT) generator can produce native Lakeflow Pipelines expectations.

You can run profiling and quality checks generation using either of the following approaches: programmatically with DQX classes, or with the no-code profiler workflow. Both are described below.

Data profiling usage

Data profiling is typically performed as a one-time action on the input dataset to discover the initial set of quality rule candidates. The check candidates should be manually reviewed before being applied to the data. Profiling is not intended to run continuously or on a schedule, which also keeps compute usage and associated costs low.

Programmatic approach

You can use DQX classes to profile data and generate quality check candidates programmatically. There are multiple storage options for saving the generated quality checks, as described here.

Profiling a DataFrame

Data loaded as a DataFrame can be profiled using DQX classes to generate summary statistics and candidate data quality rules.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

input_df = spark.read.table("catalog1.schema1.table1")

# profile input data
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)

# generate DQX quality rules/checks
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles) # with default level "error"

dq_engine = DQEngine(ws)

# save checks as YAML in an arbitrary workspace location
dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))

# generate Lakeflow Pipeline (DLT) expectations
dlt_generator = DQDltGenerator(ws)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print(dlt_expectations)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print(dlt_expectations)

dlt_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print(dlt_expectations)
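
Once saved, the check candidates can be loaded back and used for quality checking. The following is a minimal sketch, assuming the load_checks and apply_checks_by_metadata methods described in the quality checking documentation:

# load the saved checks and annotate the input data with result columns
# (a sketch; see the quality checking documentation for the exact API and options)
loaded_checks = dq_engine.load_checks(
    config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml")
)
checked_df = dq_engine.apply_checks_by_metadata(input_df, loaded_checks)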

The profiler samples 30% of the data (sample_fraction = 0.3) and limits the input to 1000 records by default. These and other configuration options can be customized as detailed in the Profiling Options section.
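
For example, the defaults can be overridden by passing an options dictionary directly (the values below are illustrative; see the Profiling Options section for the full list of keys):

# profile a larger, reproducible sample than the defaults
summary_stats, profiles = profiler.profile(
    input_df,
    options={"sample_fraction": 0.5, "limit": 5000, "sample_seed": 42},
)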

Profiling a Table

Profiling and quality checks generation can be run on individual tables.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)

summary_stats, profiles = profiler.profile_table(
    table="catalog1.schema1.table1",
    columns=["col1", "col2", "col3"],  # specify columns to profile
)

print("Summary Statistics:", summary_stats)
print("Generated Profiles:", profiles)

Profiling multiple Tables

The profiler can discover and profile multiple tables accessible via Unity Catalog. Tables can be passed explicitly as a list or be included/excluded using wildcard patterns (e.g. "catalog.schema.*").

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Profile several tables by name
results = profiler.profile_tables(
    tables=["main.data.table_001", "main.data.table_002"]
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

# Profile several tables by wildcard patterns
results = profiler.profile_tables(
    patterns=["main.*", "data.*"]
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

# Exclude tables matching specific patterns
results = profiler.profile_tables(
    patterns=["sys.*", "*_tmp"],
    exclude_matched=True
)

# Process results for each table
for summary_stats, profiles in results:
    print(f"Table statistics: {summary_stats}")
    print(f"Generated profiles: {profiles}")

Profiling and expectations generation for Lakeflow Pipelines

You can integrate DQX quality checking directly with Lakeflow Pipelines (formerly Delta Live Tables, DLT). You can also generate Lakeflow Pipelines expectation statements from profiler results.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator
from databricks.sdk import WorkspaceClient

# After profiling your data (input_df is the DataFrame loaded earlier)
ws = WorkspaceClient()
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df)

# Generate Lakeflow Pipelines expectations
dlt_generator = DQDltGenerator(ws)

# Generate SQL expectations
sql_expectations = dlt_generator.generate_dlt_rules(profiles, language="SQL")
print("SQL Expectations:")
for expectation in sql_expectations:
    print(expectation)
# Output example:
# CONSTRAINT user_id_is_null EXPECT (user_id is not null)
# CONSTRAINT age_isnt_in_range EXPECT (age >= 18 and age <= 120)

# Generate SQL expectations with actions
sql_with_drop = dlt_generator.generate_dlt_rules(profiles, language="SQL", action="drop")
print("SQL Expectations with DROP action:")
for expectation in sql_with_drop:
    print(expectation)
# Output example:
# CONSTRAINT user_id_is_null EXPECT (user_id is not null) ON VIOLATION DROP ROW

# Generate Python expectations
python_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python")
print("Python Expectations:")
print(python_expectations)
# Output example:
# @dlt.expect_all({
# "user_id_is_null": "user_id is not null",
# "age_isnt_in_range": "age >= 18 and age <= 120"
# })

# Generate Python dictionary format
dict_expectations = dlt_generator.generate_dlt_rules(profiles, language="Python_Dict")
print("Python Dictionary Expectations:")
print(dict_expectations)
# Output example:
# {
# "user_id_is_null": "user_id is not null",
# "age_isnt_in_range": "age >= 18 and age <= 120"
# }
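
The Python dictionary output can be attached to a Lakeflow Pipelines (DLT) table definition. The following is a minimal sketch, assuming the dict_expectations generated above and the standard dlt Python API; the table name and source table are illustrative:

import dlt

@dlt.table(name="table1_checked")
@dlt.expect_all(dict_expectations)  # record violations in pipeline metrics without dropping rows
def table1_checked():
    return spark.read.table("catalog1.schema1.table1")  # illustrative source table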

No-code approach (Profiler Workflow)

You can run the profiler workflow to profile tables and generate quality check candidates. The profiler workflow automatically saves the generated checks to the checks location defined in the configuration file.

You need to install DQX as a tool in the workspace for the profiler workflow to be available (see the installation guide). The workflow is not scheduled to run automatically by default, which minimizes concerns about compute usage and associated costs. Profiling is typically done as a one-time action on the input dataset to discover the initial set of quality rule candidates.

You can run the profiler workflow from the Databricks UI or by using the Databricks CLI:

databricks labs dqx profile --run-config "default"

You can also provide the --timeout-minutes option.
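
For example, to cap the workflow run time (the timeout value is illustrative):

databricks labs dqx profile --run-config "default" --timeout-minutes 30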

Execution logs from the profiler are printed in the console and saved in the installation folder. You can display the logs from the latest profiler workflow run by executing:

databricks labs dqx logs --workflow profiler

The generated quality rule candidates and summary statistics will be saved in the checks_location set in the run config of the configuration file.

You can open the configuration file (.dqx/config.yml) in the installation folder from the Databricks UI or by executing the following Databricks CLI command:

databricks labs dqx open-remote-config

The following fields from the configuration file are used by the profiler workflow:

  • input_config: configuration for the input data.
  • checks_location: absolute or relative location (relative to the installation folder) of the generated quality rule candidates, stored as a 'yaml' or 'json' file (default: 'checks.yml').
  • profiler_config: configuration for the profiler containing:
    • summary_stats_file: relative location within the installation folder of the summary statistics (default: 'profile_summary.yml')
    • sample_fraction: fraction of data to sample for profiling.
    • sample_seed: seed for reproducible sampling.
    • limit: maximum number of records to analyze.
  • serverless_clusters: whether to use serverless clusters for running the workflow (default: 'true'). Using serverless clusters is recommended as it allows for automated cluster management and scaling.
  • profiler_spark_conf: (optional) spark configuration to use for the profiler workflow, only applicable if serverless_clusters is set to 'false'.
  • profiler_override_clusters: (optional) cluster configuration to use for the profiler workflow, only applicable if serverless_clusters is set to 'false'.

Example of the configuration file (relevant fields only):

serverless_clusters: true  # default is true, set to false to use standard clusters
run_configs:
- name: default
  checks_location: catalog.table.checks  # can be a table or file
  input_config:
    format: delta
    location: /databricks-datasets/delta-sharing/samples/nyctaxi_2019
    is_streaming: false  # set to true to read the input as a stream
  profiler_config:
    limit: 1000
    sample_fraction: 0.3
    summary_stats_file: profile_summary_stats.yml

Profiling options

The profiler supports extensive configuration options to customize the profiling behavior.

Profiling options for a single table

For code-level integration, you can use the options parameter to pass a dictionary of custom options when profiling a DataFrame or table.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

# Custom profiling options
custom_options = {
    # Sampling options
    "sample_fraction": 0.2,  # Sample 20% of the data
    "sample_seed": 42,  # Seed for reproducible sampling
    "limit": 2000,  # Limit to 2000 records after sampling

    # Outlier detection options
    "remove_outliers": True,  # Enable outlier detection for min/max rules
    "outlier_columns": ["price", "age"],  # Only detect outliers in specific columns
    "num_sigmas": 2.5,  # Use 2.5 standard deviations for outlier detection

    # Null value handling
    "max_null_ratio": 0.05,  # Generate is_not_null rule if <5% nulls

    # String handling
    "trim_strings": True,  # Trim whitespace from strings before analysis
    "max_empty_ratio": 0.02,  # Generate is_not_null_or_empty rule if <2% empty strings

    # Distinct value analysis
    "distinct_ratio": 0.01,  # Generate is_in rule if <1% distinct values
    "max_in_count": 20,  # Maximum items in is_in rule list

    # Value rounding
    "round": True,  # Round min/max values for cleaner rules
}

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Apply custom options for profiling a DataFrame
summary_stats, profiles = profiler.profile(input_df, options=custom_options)

# Apply custom options for profiling a table
summary_stats, profiles = profiler.profile_table(
    table="catalog1.schema1.table1",
    columns=["col1", "col2", "col3"],
    options=custom_options
)

Profiling options for multiple tables

When profiling multiple tables, you can pass a list of dictionaries to apply different options to each one. Wildcard patterns are supported, allowing you to match table names and apply specific options based on those patterns.

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
profiler = DQProfiler(ws)

tables = [
    "dqx.bronze.table_001",
    "dqx.silver.table_001",
    "dqx.silver.table_002",
]

# Custom options with wildcard patterns
custom_table_options = [
    {
        "table": "*",  # matches all tables by pattern
        "options": {"sample_fraction": 0.5}
    },
    {
        "table": "dqx.silver.*",  # matches tables in the 'dqx.silver' schema by pattern
        "options": {"num_sigmas": 5}
    },
    {
        "table": "dqx.silver.table_*",  # matches tables in 'dqx.silver' schema and having 'table_' prefix
        "options": {"num_sigmas": 5}
    },
    {
        "table": "dqx.silver.table_002",  # matches a specific table, overrides generic option
        "options": {"sample_fraction": 0.1}
    },
]

# Profile multiple tables using custom options
results = profiler.profile_tables(tables=tables, options=custom_table_options)

# Profile multiple tables by wildcard patterns using custom options
results = profiler.profile_tables(
    patterns=["dqx.*"],
    options=custom_table_options
)

Understanding summary statistics

How the data is profiled

To ensure efficiency, summary statistics are computed on a sampled subset of your data, not the entire dataset. By default, the profiler proceeds as follows:

  1. It samples the rows of the dataset you provide using Spark’s DataFrame.sample (default sample_fraction = 0.3). Sampling is probabilistic, so the sampled row count is not guaranteed to be exact; it may be slightly higher or lower than sample_fraction × total_rows.
  2. A limit is applied to the sampled rows (default limit = 1000).
  3. Statistics are computed on this final sampled-and-limited DataFrame.

These are defaults that can be customized programmatically as detailed in the Profiling Options section, or via profiler_config in the configuration file if DQX is installed as a tool.

Rule of thumb: count ≈ min(limit, sample_fraction × total_rows).
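
In PySpark terms, the default behavior is roughly equivalent to the following simplified sketch (the profiler's actual implementation may differ):

# approximation of how the profiled subset is built with the default options
sampled_df = input_df.sample(fraction=0.3)  # probabilistic sampling (sample_fraction)
profiled_df = sampled_df.limit(1000)  # hard cap on rows (limit)
# summary statistics and rule candidates are then derived from profiled_df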

Summary Statistics Reference

| Field | Meaning | Notes |
|---|---|---|
| count | Rows actually profiled (after sampling and limit) | ≈ min(limit, sample_fraction × total_rows) |
| mean | Arithmetic average of non-null numeric values | N/A for non-numeric |
| stddev | Sample standard deviation of non-null numeric values | N/A for non-numeric |
| min | Smallest non-null value | String = lexicographic; Date/Timestamp = earliest; Numeric = minimum |
| 25 / 50 / 75 | Approximate 25th/50th/75th percentiles of non-null numeric values | Uses Spark approximate quantiles |
| max | Largest non-null value | String = lexicographic; Date/Timestamp = latest; Numeric = maximum |
| count_non_null | Number of non-null entries within the profiled rows | |
| count_null | Number of null entries within the profiled rows | count_non_null + count_null = count |

Performance considerations

When profiling large datasets, use sampling and limits for best performance.

# For large datasets, use aggressive sampling
large_dataset_opts = {
    "sample_fraction": 0.01,  # Sample only 1% for very large datasets
    "limit": 10000,  # Increase limit for better statistical accuracy
    "sample_seed": 42,  # Use consistent seed for reproducible results
}

# For medium datasets, use moderate sampling
medium_dataset_opts = {
    "sample_fraction": 0.1,  # Sample 10%
    "limit": 5000,  # Reasonable limit
}

# For small datasets, disable sampling
small_dataset_opts = {
    "sample_fraction": None,  # Use all data
    "limit": None,  # No limit
}

Profiling with sampled data

Summary statistics from limited samples may not reflect the characteristics of the overall dataset. Balance the sampling rate and limits with your desired profile accuracy. Manually review and tune rules generated from profiles on sample data to ensure correctness.
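
One practical way to review the candidates is to dump them for inspection before saving or applying them. The following is a minimal sketch, assuming the generated checks are plain dictionaries (they are serialized to YAML by save_checks above):

import yaml

# print the generated check candidates so thresholds and rules can be reviewed
# and tuned before they are applied to data
print(yaml.safe_dump(checks, sort_keys=False))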