databricks.labs.dqx.config

InputConfig Objects

@dataclass
class InputConfig()

Configuration class for input data sources (e.g. tables or files).

OutputConfig Objects

@dataclass
class OutputConfig()

Configuration class for output data sinks (e.g. tables or files).

__post_init__

def __post_init__()

Normalize the trigger configuration by converting string representations of booleans (e.g. "true") to actual booleans. This is required because of a limitation of the config deserializer.
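
The normalization described above can be sketched with a stand-in dataclass. This is a minimal illustration, not the DQX implementation: the class name `OutputConfigSketch`, the helper `_to_bool_if_string`, and the shape of the `trigger` dictionary are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


def _to_bool_if_string(value: Any) -> Any:
    """Return True/False for the strings 'true'/'false' (any case); leave other values untouched."""
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    return value


@dataclass
class OutputConfigSketch:
    """Hypothetical stand-in for OutputConfig; field names are assumptions."""

    location: str
    trigger: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # A deserializer may hand back "true"/"false" as strings; convert them once here.
        if self.trigger:
            self.trigger = {key: _to_bool_if_string(val) for key, val in self.trigger.items()}


cfg = OutputConfigSketch(
    location="main.dq.output",
    trigger={"availableNow": "true", "processingTime": "10 seconds"},
)
```

After construction, `cfg.trigger["availableNow"]` is the boolean `True`, while the non-boolean string `"10 seconds"` is left unchanged.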

ProfilerConfig Objects

@dataclass
class ProfilerConfig()

Configuration class for the profiler.

summary_stats_file

file containing profile summary statistics

sample_fraction

fraction of data to sample (default 30%)

sample_seed

seed for sampling

limit

limit the number of records to profile

filter

filter to apply to the data before profiling

criticality

default criticality for generated rules ("error" or "warn")
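
How `filter`, `sample_fraction`, `sample_seed`, and `limit` might interact can be sketched in plain Python. This is an assumption-heavy illustration: the function name `select_rows_for_profiling`, the order of operations (filter, then sample, then limit), the default `limit` of 1000, and the use of a Python callable in place of a filter expression are all hypothetical and are not taken from DQX.

```python
import random
from typing import Any, Callable, List, Optional


def select_rows_for_profiling(
    rows: List[Any],
    sample_fraction: float = 0.3,
    sample_seed: Optional[int] = None,
    limit: int = 1000,  # hypothetical default, not stated in this reference
    row_filter: Optional[Callable[[Any], bool]] = None,
) -> List[Any]:
    """Filter rows, keep each remaining row with probability sample_fraction, then cap at limit."""
    if row_filter is not None:
        rows = [row for row in rows if row_filter(row)]
    rng = random.Random(sample_seed)  # a fixed seed makes the sample reproducible
    sampled = [row for row in rows if rng.random() < sample_fraction]
    return sampled[:limit]
```

Calling the function twice with the same `sample_seed` returns the same rows, which is the practical reason to set a seed when profiling results need to be comparable between runs.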

IsolationForestConfig Objects

@dataclass
class IsolationForestConfig()

Algorithm parameters for Spark ML IsolationForest.

TemporalAnomalyConfig Objects

@dataclass
class TemporalAnomalyConfig()

Configuration for temporal feature extraction.

FeatureEngineeringConfig Objects

@dataclass
class FeatureEngineeringConfig()

Configuration for multi-type feature engineering in anomaly detection.

max_input_columns

Soft limit - warns but proceeds if exceeded

max_engineered_features

Soft limit on total engineered features

categorical_cardinality_threshold

One-hot encoding for categorical columns with at most 20 distinct values; frequency encoding for columns with more than 20
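
The threshold rule can be expressed as a small decision function. This is a sketch only: `choose_categorical_encoding` and the returned labels are hypothetical names, and the default of 20 is taken from the description above.

```python
def choose_categorical_encoding(distinct_count: int, threshold: int = 20) -> str:
    """Pick an encoding for a categorical column based on its number of distinct values."""
    # Low-cardinality columns get one column per category; high-cardinality
    # columns are replaced by a single frequency value to keep the feature count small.
    return "onehot" if distinct_count <= threshold else "frequency"
```

For example, a column with exactly 20 distinct values would be one-hot encoded, while a column with 21 would be frequency encoded.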

AnomalyParams Objects

@dataclass
class AnomalyParams()

Optional tuning parameters for row anomaly detection.

Attributes:

  • sample_fraction - Fraction of data to sample for training (default 0.3).
  • max_rows - Maximum rows to use for training (default 1,000,000).
  • train_ratio - Train/validation split ratio (default 0.8).
  • ensemble_size - Number of models in the ensemble (default 3). Set to None for a single model. Ensemble models provide:
    • More robust anomaly scores (averaged across models)
    • Confidence scores via standard deviation
    • Better generalization
    Performance: optimized ensemble scoring keeps the overhead of using an ensemble negligible.
  • algorithm_config - Isolation Forest parameters (contamination, num_trees, seed).
  • feature_engineering - Feature engineering parameters (temporal features, scaling, etc.).

ensemble_size

Default 3-model ensemble for robustness, tie-breaking, and confidence scores
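
The idea of averaging scores across models and using the standard deviation as a confidence signal can be sketched as follows. This is an illustration under assumptions, not the DQX scoring code: `combine_ensemble_scores` is a hypothetical helper, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def combine_ensemble_scores(scores_per_model: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Combine per-model anomaly scores into (mean score, standard deviation) per row.

    scores_per_model[m][i] is the score that model m assigned to row i.
    """
    combined = []
    for row_scores in zip(*scores_per_model):  # one tuple of scores per row
        combined.append((mean(row_scores), pstdev(row_scores)))
    return combined


# Three hypothetical models scoring two rows.
result = combine_ensemble_scores([[0.25, 0.75], [0.25, 0.5], [0.25, 1.0]])
```

In this sketch the first row has identical scores from all three models, so its standard deviation is zero; the second row has a spread of scores, which would indicate lower agreement between the models.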

AnomalyConfig Objects

@dataclass
class AnomalyConfig()

Configuration for row anomaly detection.

columns

Auto-discovered if omitted

segment_by

Auto-discovered if omitted (when columns also omitted)

model_name

Optional in workflows; defaults to dqx_anomaly_<run_config.name>
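
The default naming rule can be sketched as a one-line fallback. `resolve_model_name` is a hypothetical helper written for this example; only the `dqx_anomaly_<run_config.name>` pattern comes from the description above.

```python
from typing import Optional


def resolve_model_name(model_name: Optional[str], run_config_name: str) -> str:
    """Return the explicit model name, or fall back to the dqx_anomaly_<run config name> pattern."""
    return model_name if model_name else f"dqx_anomaly_{run_config_name}"
```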

RunConfig Objects

@dataclass
class RunConfig()

Configuration class for the data quality checks.

name

name of the run configuration

quarantine_config

quarantined data table

metrics_config

summary metrics table

checks_user_requirements

user input for AI-assisted rule generation

warehouse_id

warehouse id to use in the dashboard

reference_tables

reference tables to use in the checks

anomaly_config

optional anomaly detection configuration

LLMModelConfig Objects

@dataclass
class LLMModelConfig()

Configuration for the LLM model.

api_key

When used with the profiler workflow, this should be a secret reference in the form secret_scope/secret_key.

api_base

When used with the profiler workflow, this should be a secret reference in the form secret_scope/secret_key.
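
A value in the secret_scope/secret_key form can be split into its two parts before looking up the secret. The helper below is a hypothetical sketch of that parsing step only; it does not show how DQX resolves secrets, and the name `split_secret_reference` is an assumption.

```python
from typing import Tuple


def split_secret_reference(reference: str) -> Tuple[str, str]:
    """Split 'secret_scope/secret_key' into (scope, key); raise ValueError if either part is missing."""
    scope, separator, key = reference.partition("/")
    if not separator or not scope or not key:
        raise ValueError(f"Expected 'secret_scope/secret_key', got: {reference!r}")
    return scope, key
```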

LLMConfig Objects

@dataclass(frozen=True)
class LLMConfig()

Configuration for LLM usage.

ExtraParams Objects

@dataclass(frozen=True)
class ExtraParams()

Class to represent extra parameters for DQEngine.

WorkspaceConfig Objects

@dataclass
class WorkspaceConfig()

Configuration class for the workspace.

extra_params

extra parameters to pass to the jobs, e.g. result_column_names

profiler_max_parallelism

max parallelism for profiling multiple tables

quality_checker_max_parallelism

max parallelism for quality checking multiple tables

custom_metrics

custom summary metrics tracked by the observer when applying checks

as_dict

def as_dict() -> dict

Convert the WorkspaceConfig to a dictionary for serialization. This method ensures that all fields, including boolean False values, are properly serialized. Used by blueprint's installation when saving the config (Installation.save()).

Returns:

A dictionary representation of the WorkspaceConfig.
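
The point about boolean False values can be illustrated with stand-in dataclasses. This is a sketch assuming a plain `dataclasses.asdict` conversion; the actual `as_dict` implementation may differ, and the class and field names below (`RunConfigDictSketch`, `WorkspaceConfigDictSketch`, `example_flag`, `example_toggle`) are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RunConfigDictSketch:
    """Hypothetical stand-in for RunConfig."""

    name: str = "default"
    example_flag: bool = False  # hypothetical boolean field


@dataclass
class WorkspaceConfigDictSketch:
    """Hypothetical stand-in for WorkspaceConfig."""

    run_configs: List[RunConfigDictSketch] = field(default_factory=list)
    example_toggle: bool = False  # hypothetical boolean field

    def as_dict(self) -> dict:
        # asdict() recurses into nested dataclasses and keeps every field,
        # including those whose value is False.
        return asdict(self)


as_dict_result = WorkspaceConfigDictSketch(run_configs=[RunConfigDictSketch()]).as_dict()
```

Keeping False values explicit matters when a serializer would otherwise drop falsy fields, because a dropped field is restored to its default on load and that default may be True.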

get_run_config

def get_run_config(run_config_name: str | None = "default") -> RunConfig

Get the run configuration for a given run name, or the default configuration if no run name is provided.

Arguments:

  • run_config_name - The name of the run configuration to get, e.g. input table or job name (use "default" if not provided).

Returns:

The run configuration.

Raises:

  • InvalidConfigError - If no run configurations are available or if the specified run configuration name is not found.
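
The lookup behaviour described above can be sketched with stand-in classes. This is not the DQX source: the class names are hypothetical, `InvalidConfigError` is redefined locally for the example, and returning the first run configuration when the name is None is an assumption about what "the default configuration" means.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class InvalidConfigError(ValueError):
    """Local stand-in for the DQX exception of the same name."""


@dataclass
class RunConfigLookupSketch:
    name: str = "default"


@dataclass
class WorkspaceConfigLookupSketch:
    run_configs: List[RunConfigLookupSketch] = field(default_factory=list)

    def get_run_config(self, run_config_name: Optional[str] = "default") -> RunConfigLookupSketch:
        if not self.run_configs:
            raise InvalidConfigError("No run configurations available")
        if not run_config_name:
            # Assumption: with no name given, fall back to the first configuration.
            return self.run_configs[0]
        for run_config in self.run_configs:
            if run_config.name == run_config_name:
                return run_config
        raise InvalidConfigError(f"No run configuration named '{run_config_name}' found")
```

A caller would typically catch `InvalidConfigError` to report a misspelled run configuration name rather than letting the job fail with an unhandled exception.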

BaseChecksStorageConfig Objects

@dataclass
class BaseChecksStorageConfig(abc.ABC)

Marker base class for storage configuration.

Arguments:

  • location - The file path or table name where checks are stored.

FileChecksStorageConfig Objects

@dataclass
class FileChecksStorageConfig(BaseChecksStorageConfig)

Configuration class for storing checks in a file.

Arguments:

  • location - The file path where the checks are stored.

WorkspaceFileChecksStorageConfig Objects

@dataclass
class WorkspaceFileChecksStorageConfig(BaseChecksStorageConfig)

Configuration class for storing checks in a workspace file.

Arguments:

  • location - The workspace file path where the checks are stored.

TableChecksStorageConfig Objects

@dataclass
class TableChecksStorageConfig(BaseChecksStorageConfig)

Configuration class for storing checks in a table.

Arguments:

  • location - The table name where the checks are stored.
  • run_config_name - The name of the run configuration to use for checks, e.g. input table or job name (use "default" if not provided).
  • mode - The mode for writing checks to a table ('append' or 'overwrite', default 'append').
    • overwrite: Replaces all rows for this run_config_name when the fingerprint differs. Skips write when the fingerprint already exists.
    • append: Adds new rows when the fingerprint differs; multiple versions can coexist. Skips write when the fingerprint already exists.
  • rule_set_fingerprint - Optional SHA-256 fingerprint of the rule set to load. When provided, loads rules matching this specific fingerprint instead of the latest batch. When None (default), loads the latest batch.

run_config_name

to filter checks by run config

rule_set_fingerprint

to filter checks by rule set fingerprint
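
The append and overwrite semantics, together with fingerprint-based loading, can be modelled with an in-memory list standing in for the table. This is a simplified sketch under several assumptions: the helper names are hypothetical, the fingerprint is computed here as SHA-256 over a canonical JSON dump (the way DQX derives it is not described in this reference), each rule set is stored as a single row, and the "already exists" test is scoped to the run config name.

```python
import hashlib
import json
from typing import List, Optional


def rule_set_fingerprint(checks: List[dict]) -> str:
    """Assumed fingerprint: SHA-256 of the checks serialized as canonical JSON."""
    canonical = json.dumps(checks, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_checks(table: List[dict], checks: List[dict],
                run_config_name: str = "default", mode: str = "append") -> bool:
    """Write a rule set to the in-memory table; return False when the write is skipped."""
    if mode not in ("append", "overwrite"):
        raise ValueError(f"Unsupported mode: {mode}")
    fingerprint = rule_set_fingerprint(checks)
    if any(row["run_config_name"] == run_config_name and row["fingerprint"] == fingerprint
           for row in table):
        return False  # identical rule set already stored: skip the write in both modes
    if mode == "overwrite":
        # Replace all rows for this run config; rows of other run configs are untouched.
        table[:] = [row for row in table if row["run_config_name"] != run_config_name]
    table.append({"run_config_name": run_config_name,
                  "fingerprint": fingerprint, "checks": checks})
    return True


def load_checks(table: List[dict], run_config_name: str = "default",
                fingerprint: Optional[str] = None) -> List[dict]:
    """Load the latest rule set, or the one matching the given fingerprint."""
    rows = [row for row in table if row["run_config_name"] == run_config_name]
    if fingerprint is not None:
        rows = [row for row in rows if row["fingerprint"] == fingerprint]
    return rows[-1]["checks"] if rows else []
```

In append mode several versions of a rule set coexist, so passing a fingerprint to `load_checks` pins a specific version, while omitting it returns the most recently written one.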

LakebaseChecksStorageConfig Objects

@dataclass
class LakebaseChecksStorageConfig(BaseChecksStorageConfig)

Configuration class for storing checks in a Lakebase table.

Arguments:

  • location - Fully qualified name of the Lakebase table to store checks in the format 'database.schema.table'.
  • instance_name - Name of the Lakebase instance.
  • client_id - ID of the Databricks service principal to use for the Lakebase connection.
  • port - The Lakebase port (default is '5432').
  • run_config_name - Name of the run configuration to use for checks (default is 'default').
  • mode - The mode for writing checks to a table ('append' or 'overwrite', default 'append').
    • overwrite: Replaces all rows for this run_config_name when the fingerprint differs. Skips write when the fingerprint already exists.
    • append: Adds new rows when the fingerprint differs; multiple versions can coexist. Skips write when the fingerprint already exists.
  • rule_set_fingerprint - Optional SHA-256 fingerprint of the rule set to load. When provided, loads rules matching this specific fingerprint instead of the latest batch. When None (default), loads the latest batch.

VolumeFileChecksStorageConfig Objects

@dataclass
class VolumeFileChecksStorageConfig(BaseChecksStorageConfig)

Configuration class for storing checks in a Unity Catalog volume file.

Arguments:

  • location - The Unity Catalog volume file path where the checks are stored.

InstallationChecksStorageConfig Objects

@dataclass
class InstallationChecksStorageConfig(WorkspaceFileChecksStorageConfig,
                                      TableChecksStorageConfig,
                                      VolumeFileChecksStorageConfig,
                                      LakebaseChecksStorageConfig)

Configuration class for storing checks in an installation.

Arguments:

  • location - The installation path where the checks are stored (e.g., table name, file path). When checks are accessed through the installation, this value is ignored and the location is retrieved from the installation config, unless overwrite_location is enabled.
  • run_config_name - The name of the run configuration to use for checks, e.g. input table or job name (use "default" if not provided).
  • product_name - The product name for retrieving checks from the installation (default is 'dqx').
  • assume_user - Whether to assume the user is the owner of the checks (default is True).
  • install_folder - The installation folder where DQX is installed. DQX will be installed in a default directory if no custom folder is provided:
    • User's home directory: "/Users/<your_user>/.dqx"
    • Global directory if DQX_FORCE_INSTALL=global: "/Applications/dqx"
  • overwrite_location - Whether to override the location from the run config with the location provided here (default is False).

location

retrieved from the installation config

run_config_name

to retrieve run config
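
The default install folder rule listed under install_folder can be sketched as a small function. `default_install_folder` is a hypothetical helper written for illustration; only the two paths and the DQX_FORCE_INSTALL=global condition come from the description above.

```python
from typing import Optional


def default_install_folder(user_name: str, force_install: Optional[str] = None) -> str:
    """Return the default DQX folder: global if force_install is 'global', else the user's home."""
    if force_install == "global":
        return "/Applications/dqx"
    return f"/Users/{user_name}/.dqx"
```

In practice the `force_install` argument would come from the DQX_FORCE_INSTALL environment variable, for example via `os.environ.get("DQX_FORCE_INSTALL")`.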