DQX Engine
To perform data quality checking with DQX, you must create a DQEngine
object.
The engine requires a Databricks workspace client for authentication and interaction with the Databricks workspace.
When running the code in a Databricks workspace, the workspace client is automatically authenticated, whether DQX is used in a notebook, script, or job/workflow. In that case, you only need the following code to create the workspace client:
```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

ws = WorkspaceClient()
dq_engine = DQEngine(ws)
```
For external environments, such as CI servers or local machines, you can authenticate to Databricks using any method supported by the Databricks SDK. For detailed instructions, refer to the default authentication flow. If you're using Databricks configuration profiles or Databricks-specific environment variables for authentication, you can create the workspace client without needing to provide additional arguments:
```python
ws = WorkspaceClient()
```
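For example, if you rely on a named configuration profile, you can point the client at it explicitly. A minimal sketch, assuming a profile defined in `~/.databrickscfg` (the profile name below is illustrative):

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# Select a named profile from ~/.databrickscfg (profile name is illustrative).
# With environment variables such as DATABRICKS_HOST/DATABRICKS_TOKEN set,
# WorkspaceClient() with no arguments works as well.
ws = WorkspaceClient(profile="my-profile")

dq_engine = DQEngine(ws)
```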
The DQEngine is initialized by default with the standard Spark session available in the current environment. If you need to use a custom Spark session, for example from Databricks Connect, you can pass it as an argument when creating the DQEngine instance:
```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
dq_engine = DQEngine(ws, spark)
```
For local execution without a Databricks workspace, please refer to the local testing section.
DQX engine methods
The following table outlines the available methods of the DQEngine
and their functionalities:
Available DQX engine methods
Method | Description | Arguments | Supports local execution |
---|---|---|---|
apply_checks | Applies quality checks to the DataFrame and returns a DataFrame with reporting columns. | df : DataFrame to check; checks : List of checks to apply to the DataFrame. Each check is an instance of the DQRule class; ref_dfs : Reference dataframes to use in the checks, if applicable. | Yes |
apply_checks_and_split | Applies quality checks to the DataFrame and returns valid and invalid (quarantine) DataFrames with reporting columns. | df : DataFrame to check; checks : List of checks to apply to the DataFrame. Each check is an instance of the DQRule class; ref_dfs : Reference dataframes to use in the checks, if applicable. | Yes |
apply_checks_and_save_in_table | Applies quality checks using DQRule objects and writes results to valid and invalid Delta table(s) with reporting columns. | input_config : InputConfig object with the table name and options for reading the input data; checks : List of DQRule instances to apply; output_config : OutputConfig object with the table name, output mode, and options for the output data; quarantine_config : OutputConfig object with the table name, output mode, and options for the quarantine data - if provided, data will be split; ref_dfs : Reference dataframes to use in the checks, if applicable. | No |
apply_checks_by_metadata | Applies quality checks defined as a dictionary to the DataFrame and returns a DataFrame with reporting columns. | df : DataFrame to check; checks : List of dictionaries describing checks; custom_check_functions : Optional dictionary with custom check functions (e.g., globals() of the calling module); ref_dfs : Reference dataframes to use in the checks, if applicable. | Yes |
apply_checks_by_metadata_and_split | Applies quality checks defined as a dictionary and returns valid and invalid (quarantine) DataFrames. | df : DataFrame to check; checks : List of dictionaries describing checks; custom_check_functions : Optional dictionary with custom check functions (e.g., globals() of the calling module); ref_dfs : Reference dataframes to use in the checks, if applicable. | Yes |
apply_checks_by_metadata_and_save_in_table | Applies quality checks defined as a dictionary and writes results to valid and invalid Delta table(s) with reporting columns. | input_config : InputConfig object with the table name and options for reading the input data; checks : List of metadata check dictionaries; output_config : OutputConfig object with the table name, output mode, and options for the output data; quarantine_config : OutputConfig object with the table name, output mode, and options for the quarantine data - if provided, data will be split; custom_check_functions : Optional dictionary with custom check functions; ref_dfs : Reference dataframes to use in the checks, if applicable. | No |
validate_checks | Validates the provided quality checks to ensure they conform to the expected structure and types. | checks : List of checks to validate; custom_check_functions : Optional dictionary of custom check functions that can be used. | Yes |
get_invalid | Retrieves records from the DataFrame that violate data quality checks (records with warnings and errors). | df : Input DataFrame. | Yes |
get_valid | Retrieves records from the DataFrame that pass all data quality checks. | df : Input DataFrame. | Yes |
load_checks_from_local_file | Loads quality rules from a local file (supports YAML and JSON). | path : Path to a file containing the checks. | Yes |
save_checks_in_local_file | Saves quality rules to a local file in YAML format. | checks : List of checks to save; path : Path to a file containing the checks. | Yes |
load_checks_from_workspace_file | Loads checks from a file (JSON or YAML) stored in the Databricks workspace. | workspace_path : Path to the file in the workspace. | No |
load_checks_from_installation | Loads checks from the workspace installation configuration file (checks_file field). | run_config_name : Name of the run config to use; product_name : Name of the product/installation directory; assume_user : If True, assume user installation. | No |
save_checks_in_workspace_file | Saves checks to a file (YAML) in the Databricks workspace. | checks : List of checks to save; workspace_path : Destination path for the checks file in the workspace. | No |
save_checks_in_installation | Saves checks to the installation folder as a YAML file. | checks : List of checks to save; run_config_name : Name of the run config to use; assume_user : If True, assume user installation. | No |
load_run_config | Loads run configuration from the installation folder. | run_config_name : Name of the run config to use; assume_user : If True, assume user installation. | No |
save_results_in_table | Saves results of quality checking to Delta table(s). | output_df : (optional) Dataframe containing the output data; quarantine_df : (optional) Dataframe containing the quarantine data; output_config : OutputConfig object with the table name, output mode, and options for the output data; quarantine_config : OutputConfig object with the table name, output mode, and options for the quarantine data - if provided, data will be split; run_config_name : Name of the run config to use; assume_user : If True, assume user installation. | No |
The 'Supports local execution' column in the table above indicates which methods can be used for local testing without a Databricks workspace (see the usage in the local testing section).
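As an illustration of the metadata-based methods above, the sketch below defines a single check as a dictionary, validates it, and splits a DataFrame into valid and quarantined records. The check function name, argument key, and table name are illustrative assumptions; refer to the checks documentation for the exact metadata format supported by your DQX version.

```python
import yaml
from pyspark.sql import SparkSession
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

spark = SparkSession.builder.getOrCreate()  # or the notebook-provided session
dq_engine = DQEngine(WorkspaceClient())

# A single metadata-defined check (illustrative; see the checks documentation
# for the full metadata format and the list of built-in check functions).
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: user_id
""")

# Validate the check definitions before applying them; inspect the returned
# status object for any reported problems.
status = dq_engine.validate_checks(checks)

input_df = spark.read.table("main.default.users")  # illustrative input table

# Split into records that pass all checks and records flagged with errors/warnings.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```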
InputConfig supports the following parameters:
- location : The location of the input data source (e.g. table name or file path).
- format : The format of the input data (default is delta).
- is_streaming : Whether the input data is a streaming source (default is False).
- schema : Optional schema for the input data.
- options : Additional options for reading the input data, such as partitioning or merge settings.
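For example, an input configuration for reading a Delta table might look like the following sketch (the import path and the table name are assumptions):

```python
from databricks.labs.dqx.config import InputConfig  # import path assumed

input_config = InputConfig(
    location="main.default.raw_events",  # table name (illustrative) or file path
    format="delta",                      # default format
    is_streaming=False,                  # read as a batch source
    options={"versionAsOf": "10"},       # illustrative Delta read option
)
```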
OutputConfig supports the following parameters:
- location : The location of the output data (e.g. table name).
- format : The format of the output data (default is delta).
- mode : The write mode for the output data (overwrite or append; default is append).
- options : Additional options for writing the output data, such as schema merge settings.
- trigger : Optional trigger settings for streaming output, such as trigger={"processingTime": "10 seconds"}.
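Putting the two together, the sketch below reads an input table, applies metadata-defined checks, and writes valid and quarantined records to separate Delta tables via apply_checks_by_metadata_and_save_in_table. Table names and the config import path are illustrative assumptions; checks and dq_engine are as defined in the earlier example.

```python
from databricks.labs.dqx.config import InputConfig, OutputConfig  # import path assumed

input_config = InputConfig(location="main.default.raw_events")

# Valid records are appended to the output table ...
output_config = OutputConfig(location="main.default.validated_events", mode="append")

# ... and, because a quarantine config is provided, failing records are written separately.
quarantine_config = OutputConfig(location="main.default.quarantined_events", mode="append")

dq_engine.apply_checks_by_metadata_and_save_in_table(
    input_config=input_config,
    checks=checks,
    output_config=output_config,
    quarantine_config=quarantine_config,
)
```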