
Motivation

Existing data quality frameworks often fall short of providing detailed explanations for specific row- or column-level issues, and they are primarily designed for complete datasets, which makes them difficult to integrate into streaming workloads. They typically cannot quarantine invalid data and often have compatibility issues with Databricks Runtime.

This project introduces a simple but powerful Python validation framework for assessing the data quality of PySpark DataFrames. It enables real-time quality validation during data processing rather than relying solely on post-factum monitoring. The validation output includes detailed information on why specific rows and columns have issues, allowing for quicker identification and resolution of data quality problems. The framework offers the ability to quarantine invalid data and investigate quality issues before they escalate.

What makes DQX unique?

✅  Ease of use: DQX is designed to be simple and intuitive, making it easy for various personas — such as data engineers, data scientists, analysts, and business users — to define data quality checks without extensive training (see the sketch of check definitions after this list).

✅  Detailed information on data quality issues: DQX provides granular insights into data quality problems, allowing users to understand the specific reasons behind data quality failures and reduce troubleshooting time to a minimum.

✅  Streaming support: DQX is built to work seamlessly with both batch and streaming data, enabling real-time data quality validation.

✅  Proactive monitoring: DQX allows you to monitor data quality before writing to the target table, ensuring that only valid data is persisted. Reactive monitoring is also supported.

✅  Custom quality rules: Users can define their own data quality rules easily, making DQX flexible and adaptable to various data quality requirements and domains.

✅  Integration with Databricks: DQX is designed to work seamlessly with Databricks across all engines (Spark Core, Spark Structured Streaming, and Lakeflow Pipelines / DLT) and cluster types (standard and serverless).

✅  Row, column, and dataset-level rules: DQX supports row-level, column-level, and dataset-level data quality checks, allowing for comprehensive validation of data at different granularities.
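
To make the declaration style concrete, here is a minimal sketch of checks expressed as metadata (a plain Python list of dicts, which can equally be kept in YAML). The column names are placeholders, and the function names and argument keys follow the style used in the DQX documentation but may differ between DQX versions.

```python
# Minimal sketch of DQX-style checks declared as metadata.
# Column names are placeholders; function names and argument keys follow
# the style in the DQX docs and may vary slightly between versions.
checks = [
    {
        "criticality": "error",  # rows failing this check are treated as invalid
        "check": {
            "function": "is_not_null",
            "arguments": {"column": "customer_id"},
        },
    },
    {
        "criticality": "warn",  # rows failing this check are only flagged
        "check": {
            "function": "is_not_null_and_not_empty",
            "arguments": {"column": "order_id"},
        },
    },
]
```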

How DQX works

Option 1: Apply checks and quarantine "bad" data.

Apply checks on the DataFrame and quarantine invalid records to ensure "bad" data is never written to the output.

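As a rough illustration of this option (a sketch, not a definitive API reference), the snippet below assumes the DQEngine entry point and the apply_checks_by_metadata_and_split method described in the DQX documentation, and reuses the checks list from the earlier sketch; input_df and the table names are placeholders.

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Split the input into valid and invalid rows; `input_df` is any PySpark
# DataFrame and `checks` is the metadata list from the earlier sketch.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Only valid rows reach the output table; invalid rows are quarantined
# together with the details of the failed checks.
valid_df.write.mode("append").saveAsTable("main.curated.orders")
quarantine_df.write.mode("append").saveAsTable("main.quarantine.orders")
```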

Option 2: Apply checks and annotate "bad" data.

Apply checks on the DataFrame and annotate invalid records with additional result columns.

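Under the same assumptions as the previous sketch, annotating (rather than splitting) would use the non-splitting variant of the same call: all rows are kept, and extra result columns describe which checks failed.

```python
# Keep all rows and add result columns (reported as _errors / _warnings in
# the DQX docs) describing which checks failed and why.
annotated_df = dq_engine.apply_checks_by_metadata(input_df, checks)

# Downstream consumers can filter or alert on the annotation columns.
annotated_df.write.mode("append").saveAsTable("main.curated.orders_annotated")
```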

DQX usage in the Lakehouse Architecture

In the Lakehouse architecture, new data should be validated as it enters the curated layer so that bad data does not propagate to subsequent layers. With DQX, you can implement the dead-letter pattern to quarantine invalid data and re-ingest it after curation to ensure data quality constraints are met. Data quality can be monitored in real time between layers, and the quarantine process can be automated.

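One possible shape of this dead-letter pattern in a streaming pipeline between the bronze and silver layers is sketched below. It reuses the engine and checks from the earlier sketches, relies on the ambient spark session available on Databricks, and uses hypothetical table and checkpoint names.

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

def validate_and_route(batch_df, batch_id):
    # Split each micro-batch into valid and invalid rows, then route valid
    # rows to the curated (silver) table and invalid rows to a dead-letter
    # (quarantine) table for later inspection and re-ingestion.
    valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(batch_df, checks)
    valid_df.write.mode("append").saveAsTable("main.silver.orders")
    quarantine_df.write.mode("append").saveAsTable("main.quarantine.orders")

(
    spark.readStream.table("main.bronze.orders")
    .writeStream
    .foreachBatch(validate_and_route)
    .option("checkpointLocation", "/Volumes/main/ops/checkpoints/orders_dq")
    .start()
)
```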

When to use DQX

DQX is a robust framework designed to define, validate, and enforce data quality rules within Databricks environments. Consider using DQX when you need:

  • Proactive Monitoring: Implement quality checks before data is saved.
  • Data Quarantining: Isolate and manage bad data to prevent it from affecting downstream processes.
  • Streaming Support: Monitor and validate data in real-time.
  • Granular Validation: Perform row, column and dataset-level quality checks.
  • Custom Quality Rules: Define and enforce tailored data quality rules specific to your business needs.

To monitor the quality of data already persisted in Delta tables (post-factum monitoring), try Databricks Lakehouse Monitoring.

DQX seamlessly integrates with other Databricks tools and frameworks such as:

  • Lakeflow Pipelines (formerly DLT), where you can use DQX to apply quality checks or to profile input data and generate Lakeflow expectation candidates.
  • dbt, where you can apply DQX quality checks on the output of dbt models.