DQX Engine

To perform data quality checking with DQX, you must create a DQEngine object. The engine requires a Databricks workspace client for authentication and interaction with the Databricks workspace.

When running the code in a Databricks workspace, the workspace client is authenticated automatically, whether DQX is used in a notebook, script, or job/workflow. The following code is all you need to create the workspace client when running DQX in a Databricks workspace:

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

ws = WorkspaceClient()
dq_engine = DQEngine(ws)

For external environments, such as CI servers or local machines, you can authenticate to Databricks using any method supported by the Databricks SDK. For detailed instructions, refer to the default authentication flow. If you're using Databricks configuration profiles or Databricks-specific environment variables for authentication, you can create the workspace client without needing to provide additional arguments:

ws = WorkspaceClient()
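
If you use configuration profiles, you can also select a named profile explicitly (the profile name below is a placeholder):

ws = WorkspaceClient(profile="my-profile")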

By default, the DQEngine is initialized with the standard Spark session available in the current environment. If you need to use a custom Spark session, such as one from Databricks Connect, you can pass it as an argument when creating the DQEngine instance:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
dq_engine = DQEngine(ws, spark)

For local execution without a Databricks workspace, please refer to the local testing section.

DQX engine methods

The following list outlines the available DQX engine methods, their arguments, and whether they support local execution:

  • apply_checks: Applies quality checks to the DataFrame; returns a DataFrame with result columns and an optional Spark Observation with summary metrics.
    • Arguments: df: DataFrame to check; checks: list of checks defined using DQX classes, each an instance of the DQRule class; ref_dfs: (optional) reference DataFrames to use in the checks, if applicable.
    • Supports local execution: Yes
  • apply_checks_and_split: Applies quality checks to the DataFrame; returns valid and invalid (quarantine) DataFrames with result columns and an optional Spark Observation with summary metrics.
    • Arguments: df: DataFrame to check; checks: list of checks defined using DQX classes, each an instance of the DQRule class; ref_dfs: (optional) reference DataFrames to use in the checks, if applicable.
    • Supports local execution: Yes
  • apply_checks_by_metadata: Applies quality checks defined as dictionaries to the DataFrame; returns a DataFrame with result columns and an optional Spark Observation with summary metrics.
    • Arguments: df: DataFrame to check; checks: list of checks defined as dictionaries; custom_check_functions: (optional) dictionary with custom check functions (e.g. globals() of the calling module); ref_dfs: (optional) reference DataFrames to use in the checks, if applicable.
    • Supports local execution: Yes
  • apply_checks_by_metadata_and_split: Applies quality checks defined as dictionaries; returns valid and invalid (quarantine) DataFrames with result columns and an optional Spark Observation with summary metrics.
    • Arguments: df: DataFrame to check; checks: list of checks defined as dictionaries; custom_check_functions: (optional) dictionary with custom check functions (e.g. globals() of the calling module); ref_dfs: (optional) reference DataFrames to use in the checks, if applicable.
    • Supports local execution: Yes
  • apply_checks_and_save_in_table: Applies quality checks using DQRule objects, writes results to valid and invalid Delta table(s) with result columns, and optionally writes summary metrics to a Delta table.
    • Arguments: input_config: InputConfig object with the table name and options for reading the input data; checks: list of checks defined using DQX classes, each an instance of the DQRule class; output_config: OutputConfig object with the table name, output mode, and options for the output data; quarantine_config: (optional) OutputConfig object with the table name, output mode, and options for the quarantine data (if provided, data will be split); ref_dfs: (optional) reference DataFrames to use in the checks, if applicable; checks_location: (optional) location of the checks, only used for reporting in the summary metrics table.
    • Supports local execution: No
  • apply_checks_by_metadata_and_save_in_table: Applies quality checks defined as dictionaries, writes results to valid and invalid Delta table(s) with result columns, and optionally writes summary metrics to a Delta table.
    • Arguments: input_config: InputConfig object with the table name and options for reading the input data; checks: list of checks defined as dictionaries; output_config: OutputConfig object with the table name, output mode, and options for the output data; quarantine_config: (optional) OutputConfig object with the table name, output mode, and options for the quarantine data (if provided, data will be split); metrics_config: (optional) OutputConfig object with the table name, output mode, and options for the summary metrics; custom_check_functions: (optional) dictionary with custom check functions; ref_dfs: (optional) reference DataFrames to use in the checks, if applicable; checks_location: (optional) location of the checks, only used for reporting in the summary metrics table.
    • Supports local execution: No
  • apply_checks_and_save_in_tables: Applies quality checks persisted in storage to multiple tables and writes results to valid and invalid Delta table(s) with result columns.
    • Arguments: run_configs: list of run config objects (RunConfig), each containing an input config (InputConfig), an output config (OutputConfig), a quarantine config (OutputConfig; if provided, data will be split), 'checks_location', and, if provided, 'reference_tables' and 'custom_check_functions'; max_parallelism: (optional) maximum number of tables to check in parallel (defaults to the number of CPU cores).
    • Supports local execution: No
  • apply_checks_and_save_in_tables_for_patterns: Applies quality checks persisted in storage to multiple tables matching the provided wildcard patterns and writes results to valid and invalid Delta table(s) with result columns. Output and quarantine tables are skipped based on the specified suffixes.
    • Arguments: patterns: list of table names or filesystem-style wildcards (e.g. 'schema.*') to include (if None, all tables are included); exclude_patterns: (optional) list of table names or filesystem-style wildcards (e.g. '*_dq_output') to exclude, useful for excluding existing output or quarantine tables; checks_location: location of the checks files (e.g. an absolute workspace or volume directory, or a Delta table); for file-based locations, checks are expected to be found under 'checks_location/input_table_name.yml'; exclude_matched: (optional) whether to exclude matched tables (default False); run_config_template: (optional) run configuration template to use for all tables (skip the location in the 'input_config', 'output_config', and 'quarantine_config' fields as it is derived from the patterns; skip 'checks_location' of the run config as it is derived separately; 'input_config' and 'output_config' are autogenerated if not provided; 'reference_tables' and 'custom_check_functions' are used if provided); max_parallelism: (optional) maximum number of tables to check in parallel (defaults to the number of CPU cores); output_table_suffix: (optional) suffix to append to the output table name (default "_dq_output"); quarantine_table_suffix: (optional) suffix to append to the quarantine table name (default "_dq_quarantine").
    • Supports local execution: No
  • validate_checks: Validates the provided quality checks to ensure they conform to the expected structure and types.
    • Arguments: checks: list of checks to validate; custom_check_functions: (optional) dictionary of custom check functions that can be used; validate_custom_check_functions: (optional) if True, validates custom check functions (defaults to True).
    • Supports local execution: Yes
  • get_invalid: Retrieves records from the DataFrame that violate data quality checks (records with warnings and errors).
    • Arguments: df: input DataFrame.
    • Supports local execution: Yes
  • get_valid: Retrieves records from the DataFrame that pass all data quality checks.
    • Arguments: df: input DataFrame.
    • Supports local execution: Yes
  • load_checks: Loads quality rules (checks) from a storage backend. Multiple storage backends are supported, including tables, files, workspace files, and installation-managed sources inferred from the run config.
    • Arguments: config: configuration for loading checks from a storage backend, e.g. FileChecksStorageConfig (local YAML/JSON file or workspace file), WorkspaceFileChecksStorageConfig (workspace file with absolute path), VolumeFileChecksStorageConfig (Unity Catalog volume YAML/JSON file), TableChecksStorageConfig (table), InstallationChecksStorageConfig (installation-managed backend using checks_location in the run config).
    • Supports local execution: Yes (only with FileChecksStorageConfig)
  • save_checks: Saves quality rules (checks) to a storage backend. Multiple storage backends are supported, including tables, files, workspace files, and installation-managed targets inferred from the run config.
    • Arguments: checks: list of checks defined as dictionaries; config: configuration for saving checks in a storage backend, e.g. FileChecksStorageConfig (local YAML/JSON file or workspace file), WorkspaceFileChecksStorageConfig (workspace file with absolute path), VolumeFileChecksStorageConfig (Unity Catalog volume YAML/JSON file), TableChecksStorageConfig (table), InstallationChecksStorageConfig (installation-managed backend using checks_location in the run config).
    • Supports local execution: Yes (only with FileChecksStorageConfig)
  • save_results_in_table: Saves quality checking results and (optionally) summary metrics to Delta table(s).
    • Arguments: output_df: (optional) DataFrame containing the output data; quarantine_df: (optional) DataFrame containing invalid data; observation: (optional) Spark Observation tracking summary metrics; output_config: OutputConfig object with the table name, output mode, and options for the output data; quarantine_config: (optional) OutputConfig object with the table name, output mode, and options for the quarantine data; metrics_config: (optional) OutputConfig object with the table name, output mode, and options for the summary metrics data; run_config_name: name of the run config to use; install_folder: (optional) installation folder where DQX is installed (only required for a custom folder); assume_user: (optional) if True, assume user installation, otherwise global.
    • Supports local execution: No
  • save_summary_metrics: Saves quality checking summary metrics to a Delta table.
    • Arguments: observed_metrics: dict[str, Any] of summary metrics collected from the Spark Observation; metrics_config: OutputConfig object with the table name, output mode, and options for the summary metrics data; input_config: (optional) InputConfig object with the table name for reading the input data; output_config: (optional) OutputConfig object with the table name for the output data; quarantine_config: (optional) OutputConfig object with the table name for the quarantine data.
    • Supports local execution: No
  • get_streaming_metrics_listener: Gets a streaming metrics listener for writing metrics to an output table. Only required when using streaming DataFrames.
    • Arguments: metrics_config: OutputConfig object with the table name, output mode, and options for the summary metrics data; input_config: (optional) InputConfig object with the table name for reading the input data; output_config: (optional) OutputConfig object with the table name for the output data; quarantine_config: (optional) OutputConfig object with the table name for the quarantine data; checks_location: (optional) checks location; target_query_id: (optional) query ID of the specific streaming query to monitor; if provided, metrics will be collected only for this query.
    • Supports local execution: No

'Supports local execution' above indicates which methods can be used for local testing without a Databricks workspace (see the local testing section for usage).
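
As a quick orientation, the sketch below chains several of these methods: it validates metadata-defined checks, applies them, and then separates passing from flagged records. The check dictionary is illustrative only (the exact schema, including function names and argument keys, is covered in the Quality Checks Reference and may differ between DQX versions), and depending on your version apply_checks_by_metadata may also return a Spark Observation with summary metrics alongside the DataFrame.

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

dq_engine = DQEngine(WorkspaceClient())

# Illustrative metadata-defined check; see the Quality Checks Reference for the exact schema.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "col1"}},
    }
]

# Fail fast if the check definitions are malformed; the returned status reports any problems found.
status = dq_engine.validate_checks(checks)
assert not status.has_errors

# Placeholder input: any Spark DataFrame to validate (spark is the ambient session in a notebook).
input_df = spark.table("main.raw.orders")

# Apply the checks, then separate passing records from flagged ones.
checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)
valid_df = dq_engine.get_valid(checked_df)
invalid_df = dq_engine.get_invalid(checked_df)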

RunConfig supports the following parameters (a usage sketch follows the list):

  • name: The name of the run config.
  • input_config (InputConfig): Configuration of the input data.
  • output_config (OutputConfig): Configuration of the output data.
  • quarantine_config (OutputConfig): Configuration of the quarantine data (if provided, data will be split).
  • checks_location: Location of the checks.
  • reference_tables (dict[str, InputConfig]): Dictionary of reference DataFrames stored in tables/views to use in the checks.
  • custom_check_functions (dict[str, str]): Mapping of fully qualified custom check function name to the module location, such as:
    • absolute workspace path, e.g. {"my_func": "/Workspace/my_repo/my_module.py"}
    • relative workspace path (installation folder prefix applied), e.g. {"my_func": "my_module.py"}
    • UC volume path, e.g. {"my_func": "/Volumes/main/default/my_repo/my_module.py"}
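
As a rough sketch, a run config could be assembled as follows (table names, paths, and the databricks.labs.dqx.config import path are illustrative assumptions; adjust to your DQX version):

from databricks.labs.dqx.config import RunConfig, InputConfig, OutputConfig

run_config = RunConfig(
    name="orders",  # placeholder run config name
    input_config=InputConfig(location="main.raw.orders"),
    output_config=OutputConfig(location="main.curated.orders"),
    quarantine_config=OutputConfig(location="main.curated.orders_quarantine"),
    checks_location="/Workspace/dqx/checks/orders.yml",
    reference_tables={"customers": InputConfig(location="main.raw.customers")},
    custom_check_functions={"my_func": "/Workspace/my_repo/my_module.py"},
)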

InputConfig supports the following parameters (an example follows the list):

  • location: The location of the input data source (e.g. table name or file path).
  • format: The format of the input data (default is delta).
  • is_streaming: Whether the input data is a streaming source (default is False).
  • schema: Optional schema for the input data.
  • options: Additional options for reading the input data, such as partitioning or merge settings.
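
For example, a streaming Delta input could be described as follows (the table name is a placeholder and the import path is assumed):

from databricks.labs.dqx.config import InputConfig

input_config = InputConfig(
    location="main.raw.events",  # placeholder table name
    format="delta",
    is_streaming=True,
)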

OutputConfig supports the following parameters (an example follows the list):

  • location: The location of the output data (e.g. table name).
  • format: The format of the output data (default is delta).
  • mode: The write mode for the output data (overwrite or append, default is append).
  • options: Additional options for writing the output data, such as schema merge settings.
  • trigger: Optional trigger settings for streaming output, such as trigger={"availableNow": True} or trigger={"processingTime": "10 seconds"}.
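
A matching output written in append mode with a processing-time trigger might look like this (again a sketch; writer options are passed through to Spark, so the option keys shown are examples):

from databricks.labs.dqx.config import OutputConfig

output_config = OutputConfig(
    location="main.curated.events",  # placeholder table name
    mode="append",
    options={"mergeSchema": "true"},           # example writer option
    trigger={"processingTime": "10 seconds"},  # only relevant for streaming writes
)

Such configs can then be passed to methods like apply_checks_by_metadata_and_save_in_table, or embedded in a RunConfig as shown earlier.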

The following storage backend configurations (implementations of BaseChecksStorageConfig) are supported for the load_checks and save_checks methods (a usage sketch follows the list):

  • FileChecksStorageConfig can be used to save or load checks from a local filesystem, or workspace file if invoked from Databricks notebook or job, with fields:
    • location: absolute or relative file path in the local filesystem (JSON or YAML); also works with absolute or relative workspace file paths if invoked from Databricks notebook or job.
  • WorkspaceFileChecksStorageConfig can be used to save or load checks from a workspace file, with fields:
    • location: absolute workspace file path (JSON or YAML).
  • TableChecksStorageConfig can be used to save or load checks from a table, with fields:
    • location: table fully qualified name.
    • run_config_name: (optional) run configuration name to load (it can be any string), e.g. input table or job name (use "default" if not provided).
    • mode: (optional) write mode for saving checks (overwrite or append, default is overwrite). The overwrite mode will only replace checks for the specific run config and not all checks in the table.
  • VolumeFileChecksStorageConfig can be used to save or load checks from a Unity Catalog Volume file, with fields:
    • location: Unity Catalog Volume file path (JSON or YAML).
  • InstallationChecksStorageConfig can be used to save or load checks from workspace installation, with fields:
    • location: (optional) automatically set based on the checks_location field from the run configuration.
    • install_folder: (optional) installation folder where DQX is installed, only required when a custom installation folder is used.
    • run_config_name: (optional) run configuration name to load (it can be any string), e.g. input table or job name (use "default" if not provided).
    • product_name: (optional) name of the product (use "dqx" if not provided).
    • assume_user: (optional) if True, assume user installation, otherwise global installation (skipped if install_folder is provided).
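
For illustration, the sketch below saves metadata-defined checks to a local YAML file and to a Delta table, then loads them back, reusing the dq_engine created earlier (the file path, table name, and check contents are placeholders, and the storage config classes are assumed to be importable from databricks.labs.dqx.config):

from databricks.labs.dqx.config import FileChecksStorageConfig, TableChecksStorageConfig

checks = [
    {
        "criticality": "warn",
        "check": {"function": "is_not_null", "arguments": {"column": "col1"}},
    }
]

# Local YAML/JSON file (also works with workspace paths when run from a notebook or job).
dq_engine.save_checks(checks, config=FileChecksStorageConfig(location="checks/orders.yml"))
loaded_checks = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks/orders.yml"))

# Delta table keyed by run config name; overwrite mode replaces only this run config's checks.
dq_engine.save_checks(
    checks,
    config=TableChecksStorageConfig(location="main.dqx.checks", run_config_name="orders"),
)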

For details on how to prepare reference DataFrames (ref_dfs) and the custom check function mapping (custom_check_functions), refer to the Quality Checks Reference.