DQX Profiler

To profile data and generate candidate data quality rules with DQX, you can use the DQProfiler, DQGenerator, and DQDltGenerator classes. The profiler analyzes datasets to generate summary statistics and data quality rule candidates automatically. These components require a Databricks workspace client for authentication and interaction with the Databricks workspace.

When you run the code in a Databricks workspace, the workspace client is authenticated automatically, whether DQX is used in a notebook, script, or job/workflow. In that case, the following code is all you need to create the workspace client and the profiler components:

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.dlt_generator import DQDltGenerator

ws = WorkspaceClient()
profiler = DQProfiler(ws)
generator = DQGenerator(ws)
dlt_generator = DQDltGenerator(ws)

For external environments, such as CI servers or local machines, you can authenticate to Databricks using any method supported by the Databricks SDK. For detailed instructions, refer to the default authentication flow.

Profiler methods

The DQProfiler class provides methods to analyze datasets and generate data quality profiles for each column:

Available DQProfiler methods
| Method | Description | Arguments |
| --- | --- | --- |
| profile | Profiles a DataFrame to generate summary statistics and data quality rules. | df: DataFrame to profile; columns: Optional list of column names to include (default: all columns); options: Optional dictionary of profiling options (merged with defaults). |
| profile_table | Profiles a table to generate summary statistics and data quality rules. | table: Fully-qualified table name (e.g. 'catalog.schema.table'); columns: Optional list of column names to include (default: all columns); options: Optional dictionary of profiling options (merged with defaults). |
| profile_tables | Profiles multiple tables in Unity Catalog with pattern matching. | tables: Optional list of table names; patterns: Optional list of regex patterns to match tables; exclude_matched: Whether to exclude matched tables (default False); columns: Optional dictionary with table names as keys and lists of column names as values; options: Optional dictionary with table names as keys and profiling options as values (merged with defaults). |
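
To make the interaction between patterns and exclude_matched concrete, here is a minimal sketch of regex-based table selection in plain Python. The helper `select_tables` is hypothetical and is not part of DQX; whether DQX matches patterns with search or full-match semantics is not stated here, so the use of `re.search` is an assumption.

```python
import re


def select_tables(all_tables, patterns, exclude_matched=False):
    """Hypothetical helper: filter fully-qualified table names by regex patterns.

    This only illustrates the include/exclude idea; it is not DQX's code.
    """
    compiled = [re.compile(p) for p in patterns]

    def matches(name):
        # Assumption: a table matches if any pattern is found in its name.
        return any(rx.search(name) for rx in compiled)

    if exclude_matched:
        # Keep only the tables that match none of the patterns.
        return [t for t in all_tables if not matches(t)]
    # Keep only the tables that match at least one pattern.
    return [t for t in all_tables if matches(t)]


tables = [
    "main.sales.orders",
    "main.sales.orders_tmp",
    "main.hr.employees",
]

# Include every table in the sales schema.
included = select_tables(tables, [r"^main\.sales\."])

# Exclude temporary tables, keep everything else.
excluded = select_tables(tables, [r"_tmp$"], exclude_matched=True)
```

With exclude_matched left at its default of False the patterns act as an allow-list; setting it to True turns the same patterns into a deny-list.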

Profiling Options

The profiler supports extensive configuration options to customize behavior:

| Option | Default Value | Description |
| --- | --- | --- |
| round | True | Round min/max values for cleaner rules |
| max_in_count | 10 | Generate is_in rule if distinct values < this count |
| distinct_ratio | 0.05 | Generate is_in rule if distinct values < 5% of total |
| max_null_ratio | 0.01 | Generate is_not_null rule if null values < 1% of total |
| remove_outliers | True | Enable outlier detection for min/max rules |
| outlier_columns | [] | Specific columns for outlier detection (empty = all numeric) |
| num_sigmas | 3 | Number of standard deviations for outlier detection |
| trim_strings | True | Trim whitespace from strings before analysis |
| max_empty_ratio | 0.01 | Generate is_not_null_or_empty if empty strings < 1% of total |
| sample_fraction | 0.3 | Sample 30% of the data for profiling |
| sample_seed | None | Seed for sampling (None = random) |
| limit | 1000 | Maximum number of records to analyze |
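
The following sketch illustrates, in plain Python, how user-supplied options can be merged with the defaults listed above and how two of the thresholds (max_null_ratio and num_sigmas) could drive a decision. The helpers `merge_options`, `suggests_is_not_null`, and `sigma_bounds` are hypothetical, and the use of a population standard deviation is an assumption; DQX's internal logic may differ.

```python
import statistics

# Defaults copied from the options table above.
DEFAULT_OPTIONS = {
    "round": True,
    "max_in_count": 10,
    "distinct_ratio": 0.05,
    "max_null_ratio": 0.01,
    "remove_outliers": True,
    "outlier_columns": [],
    "num_sigmas": 3,
    "trim_strings": True,
    "max_empty_ratio": 0.01,
    "sample_fraction": 0.3,
    "sample_seed": None,
    "limit": 1000,
}


def merge_options(overrides=None):
    """Hypothetical helper: user-supplied keys win, all others keep defaults."""
    return {**DEFAULT_OPTIONS, **(overrides or {})}


def suggests_is_not_null(values, opts):
    """Hypothetical check: propose is_not_null if the null ratio is below the threshold."""
    if not values:
        return False
    null_ratio = sum(v is None for v in values) / len(values)
    return null_ratio < opts["max_null_ratio"]


def sigma_bounds(values, opts):
    """Hypothetical outlier bounds: mean +/- num_sigmas standard deviations.

    Assumption: population standard deviation; the real profiler may use another estimator.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    k = opts["num_sigmas"]
    return mean - k * std, mean + k * std


# Override two options; the remaining ten keep their defaults.
opts = merge_options({"max_null_ratio": 0.05, "num_sigmas": 2})

# A column with 2 nulls out of 100 values has a null ratio of 0.02.
column = [None, None] + list(range(98))
relaxed = suggests_is_not_null(column, opts)            # 0.02 < 0.05
strict = suggests_is_not_null(column, merge_options())  # 0.02 < 0.01 is false

lower, upper = sigma_bounds([2, 4, 4, 4, 5, 5, 7, 9], opts)
```

Passing only the keys you want to change keeps the options dictionary short and leaves the other defaults in effect.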

DQProfile Structure

The DQProfile dataclass represents a single data quality rule candidate generated by the profiler:

@dataclass
class DQProfile:
    name: str  # Type of rule (e.g., "is_not_null", "min_max", "is_in")
    column: str  # Column name the rule applies to
    description: str | None = None  # Optional description of how the rule was generated
    parameters: dict[str, Any] | None = None  # Optional parameters for the rule
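
As an illustration of how such profile objects can be built and inspected, the sketch below re-declares a local stand-in for the dataclass (so it runs without DQX installed) and groups rule candidates by column. The column names, the description text, and the "min"/"max" parameter keys are made up for the example.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional


@dataclass
class DQProfile:
    """Local stand-in mirroring the fields shown above; not imported from DQX."""

    name: str
    column: str
    description: Optional[str] = None
    parameters: Optional[Dict[str, Any]] = None


# Hand-written candidates; in practice these would come from the profiler.
profiles = [
    DQProfile(name="is_not_null", column="customer_id"),
    DQProfile(
        name="min_max",
        column="amount",
        description="Illustrative description",
        parameters={"min": 0, "max": 500},  # assumed parameter keys
    ),
    DQProfile(name="is_not_null", column="amount"),
]

# Group rule types by column to review what was proposed for each one.
by_column: Dict[str, list] = {}
for p in profiles:
    by_column.setdefault(p.column, []).append(p.name)

# Dataclasses convert cleanly to plain dictionaries, e.g. for logging.
as_dict = asdict(profiles[1])
```

Because description and parameters are optional, a candidate such as is_not_null needs only a rule name and a column.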

DQGenerator methods

The DQGenerator class converts profiling results into DQX quality rules:

Available DQGenerator methods
| Method | Description | Arguments | Supports local execution |
| --- | --- | --- | --- |
| generate_dq_rules | Generates a list of data quality rules from profiling results. | rules: List of DQProfile objects; level: Criticality level for generated rules (default "error"). | Yes |

DQDltGenerator methods

The DQDltGenerator class creates Delta Live Tables expectation statements from profiling results:

Available DQDltGenerator methods
| Method | Description | Arguments | Supports local execution |
| --- | --- | --- | --- |
| generate_dlt_rules | Generates Delta Live Tables rules in the specified language. | rules: List of DQProfile objects; action: Optional violation action ("drop", "fail", or None); language: Target language ("SQL", "Python", or "Python_Dict"). | Yes |
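
To show what the action argument means in terms of Delta Live Tables SQL, the sketch below builds an expectation clause by hand. The helper `sql_expectation` and the constraint name are hypothetical; the exact statements that DQDltGenerator emits (including how it names constraints) may differ.

```python
from typing import Optional

# Delta Live Tables SQL clauses for each violation action.
# None means the expectation only records violations and keeps the rows.
_ACTION_CLAUSES = {
    None: "",
    "drop": " ON VIOLATION DROP ROW",
    "fail": " ON VIOLATION FAIL UPDATE",
}


def sql_expectation(constraint_name: str, condition: str, action: Optional[str] = None) -> str:
    """Hypothetical helper: render one DLT SQL expectation clause."""
    if action not in _ACTION_CLAUSES:
        raise ValueError(f"Unsupported action: {action!r}")
    return f"CONSTRAINT {constraint_name} EXPECT ({condition}){_ACTION_CLAUSES[action]}"


# An is_not_null candidate on customer_id, rendered for each action.
warn_only = sql_expectation("customer_id_is_not_null", "customer_id IS NOT NULL")
drop_rows = sql_expectation("customer_id_is_not_null", "customer_id IS NOT NULL", "drop")
fail_update = sql_expectation("customer_id_is_not_null", "customer_id IS NOT NULL", "fail")
```

Choosing "drop" removes violating rows from the target dataset, "fail" stops the pipeline update, and None keeps the rows while still tracking the violation in pipeline metrics.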

Complete Profiling Guide

For comprehensive examples, advanced options, and best practices, see the Data Profiling Guide.