databricks.labs.dqx.config
InputConfig Objects
@dataclass
class InputConfig()
Configuration class for input data sources (e.g. tables or files).
OutputConfig Objects
@dataclass
class OutputConfig()
Configuration class for output data sinks (e.g. tables or files).
__post_init__
def __post_init__()
Normalize trigger configuration by converting string boolean representations to actual booleans. This is required due to the limitation of the config deserializer.
ProfilerConfig Objects
@dataclass
class ProfilerConfig()
Configuration class for profiler.
summary_stats_file
file containing profile summary statistics
sample_fraction
fraction of data to sample (default 0.3, i.e. 30%)
sample_seed
seed for sampling
limit
limit the number of records to profile
filter
filter to apply to the data before profiling
criticality
default criticality for generated rules ("error" or "warn")
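How filter, sample_fraction, sample_seed and limit might interact can be sketched in plain Python. The order of operations (filter, then sample, then limit) and the function name select_rows_for_profiling are assumptions for illustration; the real profiler operates on Spark DataFrames and may differ.

```python
import random


def select_rows_for_profiling(rows, sample_fraction=0.3, sample_seed=None,
                              limit=None, row_filter=None):
    """Hypothetical helper: filter, then sample, then limit (assumed order)."""
    if row_filter is not None:
        rows = [r for r in rows if row_filter(r)]
    if sample_fraction is not None:
        # A fixed seed makes the sample reproducible between runs.
        rng = random.Random(sample_seed)
        rows = [r for r in rows if rng.random() < sample_fraction]
    if limit is not None:
        rows = rows[:limit]
    return rows


data = list(range(1000))
first = select_rows_for_profiling(data, sample_seed=42, limit=50,
                                  row_filter=lambda x: x % 2 == 0)
second = select_rows_for_profiling(data, sample_seed=42, limit=50,
                                   row_filter=lambda x: x % 2 == 0)
print(len(first), first == second)
```

With the same seed the two calls select the same rows, which is the purpose of sample_seed.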
IsolationForestConfig Objects
@dataclass
class IsolationForestConfig()
Algorithm parameters for Spark ML IsolationForest.
TemporalAnomalyConfig Objects
@dataclass
class TemporalAnomalyConfig()
Configuration for temporal feature extraction.
FeatureEngineeringConfig Objects
@dataclass
class FeatureEngineeringConfig()
Configuration for multi-type feature engineering in anomaly detection.
max_input_columns
Soft limit - warns but proceeds if exceeded
max_engineered_features
Soft limit on total engineered features
categorical_cardinality_threshold
One-hot encoding if a column's cardinality is <= 20, frequency encoding if it is > 20
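The threshold rule for categorical columns can be illustrated as follows. The function name choose_encoding and the returned labels are hypothetical; only the <= 20 / > 20 split comes from the description above.

```python
def choose_encoding(distinct_values: int, threshold: int = 20) -> str:
    """Pick an encoding for a categorical column from its cardinality (sketch)."""
    # Low-cardinality columns get one column per category;
    # high-cardinality columns are replaced by category frequencies.
    return "onehot" if distinct_values <= threshold else "frequency"


print(choose_encoding(5))
print(choose_encoding(20))
print(choose_encoding(21))
```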
AnomalyParams Objects
@dataclass
class AnomalyParams()
Optional tuning parameters for row anomaly detection.
Attributes:
sample_fraction- Fraction of data to sample for training (default 0.3).
max_rows- Maximum rows to use for training (default 1,000,000).
train_ratio- Train/validation split ratio (default 0.8).
ensemble_size- Number of models in the ensemble (default 3). Set to None to use a single model. Ensemble models provide:
- More robust anomaly scores (averaged across models)
- Confidence scores via standard deviation
- Better generalization
Performance: optimized ensemble scoring keeps the overhead of the ensemble negligible.
algorithm_config- Isolation Forest parameters (contamination, num_trees, seed).
feature_engineering- Feature engineering parameters (temporal features, scaling, etc.).
ensemble_size
Default 3-model ensemble for robustness, tie-breaking, and confidence scores
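The idea of averaging scores across ensemble members and using their spread as a confidence signal can be sketched as below. The function name ensemble_score and the use of the population standard deviation are assumptions; the source does not state which estimator DQX uses.

```python
from statistics import mean, pstdev


def ensemble_score(model_scores):
    """Combine per-model anomaly scores for one row (illustrative sketch).

    Returns the averaged score and the standard deviation across models;
    a small deviation means the models agree with each other.
    """
    return mean(model_scores), pstdev(model_scores)


# Scores for the same row from a hypothetical 3-model ensemble.
score, spread = ensemble_score([0.62, 0.58, 0.66])
print(round(score, 4), round(spread, 4))
```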
AnomalyConfig Objects
@dataclass
class AnomalyConfig()
Configuration for row anomaly detection.
columns
Auto-discovered if omitted
segment_by
Auto-discovered if omitted (when columns also omitted)
model_name
Optional in workflows; defaults to dqx_anomaly_<run_config.name>
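The model_name default can be expressed as a small helper. The function name resolve_model_name is hypothetical; only the dqx_anomaly_<run_config.name> pattern comes from the description above.

```python
from typing import Optional


def resolve_model_name(run_config_name: str, model_name: Optional[str] = None) -> str:
    """Return the explicit model name, or the documented default pattern (sketch)."""
    return model_name or f"dqx_anomaly_{run_config_name}"


print(resolve_model_name("default"))
print(resolve_model_name("orders", model_name="my_custom_model"))
```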
RunConfig Objects
@dataclass
class RunConfig()
Configuration class for the data quality checks.
name
name of the run configuration
quarantine_config
quarantined data table
metrics_config
summary metrics table
checks_user_requirements
user input for AI-assisted rule generation
warehouse_id
warehouse id to use in the dashboard
reference_tables
reference tables to use in the checks
anomaly_config
optional anomaly detection configuration
LLMModelConfig Objects
@dataclass
class LLMModelConfig()
Configuration for the LLM model.
api_key
when used with Profiler Workflow, this should be a secret: secret_scope/secret_key
api_base
when used with Profiler Workflow, this should be a secret: secret_scope/secret_key
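A secret reference in the secret_scope/secret_key form can be split into its two parts as shown below. The helper parse_secret_reference is hypothetical and only illustrates the string format; it does not resolve the secret or call any Databricks API.

```python
def parse_secret_reference(reference: str) -> tuple:
    """Split 'secret_scope/secret_key' into (scope, key); hypothetical helper."""
    scope, separator, key = reference.partition("/")
    if not separator or not scope or not key:
        raise ValueError(f"Expected 'secret_scope/secret_key', got: {reference!r}")
    return scope, key


print(parse_secret_reference("llm_scope/api_key"))
```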
LLMConfig Objects
@dataclass(frozen=True)
class LLMConfig()
Configuration for LLM usage.
ExtraParams Objects
@dataclass(frozen=True)
class ExtraParams()
Class to represent extra parameters for DQEngine.
WorkspaceConfig Objects
@dataclass
class WorkspaceConfig()
Configuration class for the workspace.
extra_params
extra parameters to pass to the jobs, e.g. result_column_names
profiler_max_parallelism
max parallelism for profiling multiple tables
quality_checker_max_parallelism
max parallelism for quality checking multiple tables
custom_metrics
custom summary metrics tracked by the observer when applying checks
as_dict
def as_dict() -> dict
Convert the WorkspaceConfig to a dictionary for serialization. This method ensures that all fields, including boolean False values, are properly serialized. Used by blueprint's installation when saving the config (Installation.save()).
Returns:
A dictionary representation of the WorkspaceConfig.
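Why an explicit as_dict matters can be seen by comparing it with a serializer that drops falsy values. The sketch below uses a stand-in dataclass with made-up fields and default values; it is not the WorkspaceConfig definition, and drop_falsy only imitates the kind of lossy serialization that as_dict is meant to avoid.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class WorkspaceConfigSketch:
    # Hypothetical fields and defaults, chosen for illustration only.
    log_level: str = "INFO"
    upload_dependencies: bool = False
    profiler_max_parallelism: int = 4
    custom_metrics: list = field(default_factory=list)

    def as_dict(self) -> dict:
        # asdict keeps every field, including False and empty values.
        return asdict(self)


def drop_falsy(values: dict) -> dict:
    """Imitates a serializer that silently omits falsy values."""
    return {k: v for k, v in values.items() if v}


config = WorkspaceConfigSketch()
print(config.as_dict())
print(drop_falsy(config.as_dict()))
```

In the second output the False flag disappears, so a later load could not distinguish "explicitly disabled" from "not set".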
get_run_config
def get_run_config(run_config_name: str | None = "default") -> RunConfig
Get the run configuration for a given run name, or the default configuration if no run name is provided.
Arguments:
run_config_name- The name of the run configuration to get, e.g. input table or job name (use "default" if not provided).
Returns:
The run configuration.
Raises:
InvalidConfigError- If no run configurations are available or if the specified run configuration name is not found.
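The lookup behaviour described above can be sketched with stand-in types. RunConfigSketch and the local InvalidConfigError class are placeholders for the DQX types, and returning the first run configuration when the name is None is an assumption that the source does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


class InvalidConfigError(Exception):
    """Stand-in for the DQX exception of the same name."""


@dataclass
class RunConfigSketch:
    name: str = "default"


def get_run_config(run_configs: List[RunConfigSketch],
                   run_config_name: Optional[str] = "default") -> RunConfigSketch:
    if not run_configs:
        raise InvalidConfigError("No run configurations available")
    if not run_config_name:
        # Assumption: fall back to the first configuration when no name is given.
        return run_configs[0]
    for run_config in run_configs:
        if run_config.name == run_config_name:
            return run_config
    raise InvalidConfigError(f"Run configuration not found: {run_config_name}")


configs = [RunConfigSketch("default"), RunConfigSketch("orders")]
print(get_run_config(configs, "orders").name)
```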
BaseChecksStorageConfig Objects
@dataclass
class BaseChecksStorageConfig(abc.ABC)
Marker base class for storage configuration.
Arguments:
location- The file path or table name where checks are stored.
FileChecksStorageConfig Objects
@dataclass
class FileChecksStorageConfig(BaseChecksStorageConfig)
Configuration class for storing checks in a file.
Arguments:
location- The file path where the checks are stored.
WorkspaceFileChecksStorageConfig Objects
@dataclass
class WorkspaceFileChecksStorageConfig(BaseChecksStorageConfig)
Configuration class for storing checks in a workspace file.
Arguments:
location- The workspace file path where the checks are stored.
TableChecksStorageConfig Objects
@dataclass
class TableChecksStorageConfig(BaseChecksStorageConfig)
Configuration class for storing checks in a table.
Arguments:
location- The table name where the checks are stored.
run_config_name- The name of the run configuration to use for checks, e.g. input table or job name (use "default" if not provided).
mode- The mode for writing checks to a table ('append' or 'overwrite', default 'append').
- overwrite: Replaces all rows for this run_config_name when the fingerprint differs. Skips write when the fingerprint already exists.
- append: Adds new rows when the fingerprint differs; multiple versions can coexist. Skips write when the fingerprint already exists.
rule_set_fingerprint- Optional SHA-256 fingerprint of the rule set to load. When provided, loads rules matching this specific fingerprint instead of the latest batch. When None (default), loads the latest batch.
run_config_name
to filter checks by run config
rule_set_fingerprint
to filter checks by rule set fingerprint
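The append and overwrite semantics can be modelled with an in-memory list standing in for the checks table. Everything here is a sketch under assumptions: the fingerprint is computed as SHA-256 over a sorted JSON dump of the checks, the "already exists" test is scoped to the run configuration, and the names compute_fingerprint and save_checks are hypothetical. DQX may compute and compare fingerprints differently.

```python
import hashlib
import json


def compute_fingerprint(checks) -> str:
    """SHA-256 over a canonical JSON dump of the rule set (assumed scheme)."""
    payload = json.dumps(checks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_checks(table, run_config_name, checks, mode="append") -> bool:
    """Write checks into `table` (a list of row dicts); return False if skipped."""
    if mode not in ("append", "overwrite"):
        raise ValueError(f"Unsupported mode: {mode}")
    fingerprint = compute_fingerprint(checks)
    already_stored = any(
        row["run_config_name"] == run_config_name
        and row["rule_set_fingerprint"] == fingerprint
        for row in table
    )
    if already_stored:
        return False  # both modes skip the write for a known fingerprint
    if mode == "overwrite":
        # Replace every row that belongs to this run configuration.
        table[:] = [r for r in table if r["run_config_name"] != run_config_name]
    table.extend(
        {"run_config_name": run_config_name,
         "rule_set_fingerprint": fingerprint,
         "check": check}
        for check in checks
    )
    return True


table = []
v1 = [{"name": "id_not_null"}]
v2 = [{"name": "id_not_null"}, {"name": "amount_positive"}]
print(save_checks(table, "default", v1))               # new fingerprint: written
print(save_checks(table, "default", v1))               # same fingerprint: skipped
print(save_checks(table, "default", v2))               # append: both versions coexist
print(save_checks(table, "default", v1, "overwrite"))  # v1 already stored: skipped
```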
LakebaseChecksStorageConfig Objects
@dataclass
class LakebaseChecksStorageConfig(BaseChecksStorageConfig)
Configuration class for storing checks in a Lakebase table.
Arguments:
location- Fully qualified name of the Lakebase table to store checks in the format 'database.schema.table'.
instance_name- Name of the Lakebase instance.
client_id- ID of the Databricks service principal to use for the Lakebase connection.
port- The Lakebase port (default is '5432').
run_config_name- Name of the run configuration to use for checks (default is 'default').
mode- The mode for writing checks to a table ('append' or 'overwrite', default 'append').
- overwrite: Replaces all rows for this run_config_name when the fingerprint differs. Skips write when the fingerprint already exists.
- append: Adds new rows when the fingerprint differs; multiple versions can coexist. Skips write when the fingerprint already exists.
rule_set_fingerprint- Optional SHA-256 fingerprint of the rule set to load. When provided, loads rules matching this specific fingerprint instead of the latest batch. When None (default), loads the latest batch.
VolumeFileChecksStorageConfig Objects
@dataclass
class VolumeFileChecksStorageConfig(BaseChecksStorageConfig)
Configuration class for storing checks in a Unity Catalog volume file.
Arguments:
location- The Unity Catalog volume file path where the checks are stored.
InstallationChecksStorageConfig Objects
@dataclass
class InstallationChecksStorageConfig(WorkspaceFileChecksStorageConfig,
TableChecksStorageConfig,
VolumeFileChecksStorageConfig,
LakebaseChecksStorageConfig)
Configuration class for storing checks in an installation.
Arguments:
location- The installation path where the checks are stored (e.g., table name, file path). Not used when using installation method, as it is retrieved from the installation config, unless overwrite_location is enabled.
run_config_name- The name of the run configuration to use for checks, e.g. input table or job name (use "default" if not provided).
product_name- The product name for retrieving checks from the installation (default is 'dqx').
assume_user- Whether to assume the user is the owner of the checks (default is True).
install_folder- The installation folder where DQX is installed. DQX will be installed in a default directory if no custom folder is provided:
- User's home directory: "/Users/<your_user>/.dqx"
- Global directory if DQX_FORCE_INSTALL=global: "/Applications/dqx"
overwrite_location- Whether to overwrite the location from run config if provided (default is False).
location
retrieved from the installation config
run_config_name
to retrieve run config
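The default install folder rules listed for install_folder can be sketched as a small resolver. The function name resolve_install_folder and its parameters are hypothetical; only the two default paths and the DQX_FORCE_INSTALL=global switch come from the description above.

```python
import os
from typing import Mapping, Optional


def resolve_install_folder(user_name: str,
                           install_folder: Optional[str] = None,
                           env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the folder where DQX is assumed to be installed (illustrative sketch)."""
    env = os.environ if env is None else env
    if install_folder:
        return install_folder  # a custom folder always wins
    if env.get("DQX_FORCE_INSTALL") == "global":
        return "/Applications/dqx"
    return f"/Users/{user_name}/.dqx"


print(resolve_install_folder("jane.doe@example.com", env={}))
print(resolve_install_folder("jane.doe@example.com", env={"DQX_FORCE_INSTALL": "global"}))
```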