Motivation

Current data quality frameworks often fall short of providing detailed explanations for specific row or column data quality issues and are primarily designed for complete datasets, making integration into streaming workloads difficult. They cannot quarantine invalid data and have compatibility issues with Databricks Runtime.

This project introduces a simple but powerful Python validation framework for assessing the data quality of PySpark DataFrames. It enables real-time quality validation during data processing rather than relying solely on post-factum monitoring. The validation output includes detailed information on why specific rows and columns have issues, allowing for quicker identification and resolution of data quality problems. The framework offers the ability to quarantine invalid data and investigate quality issues before they escalate.

How DQX works

Option 1: Apply checks and quarantine "bad" data.

Apply checks on the DataFrame and quarantine invalid records to ensure "bad" data is never written to the output.
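Below is a minimal sketch of this option, assuming the DQEngine API from the databricks-labs-dqx package and checks defined as metadata. The table names and the check definition are illustrative, and argument keys may differ between DQX versions:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Checks defined as metadata; "criticality: error" sends failing rows to quarantine.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"col_name": "customer_id"}},
    },
]

# "spark" is the active SparkSession in a Databricks notebook or job.
input_df = spark.read.table("raw.orders")  # illustrative source table

# Split the DataFrame: valid rows go to the output, invalid rows to quarantine.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

valid_df.write.mode("append").saveAsTable("curated.orders")
quarantine_df.write.mode("append").saveAsTable("quarantine.orders")
```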

Option 2: Apply checks and annotate "bad" data.

Apply checks on the DataFrame and annotate invalid records by adding result columns, keeping all rows in the output.
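Continuing the sketch above, this option keeps every row and adds result columns describing which checks failed and why (the output table name is again illustrative):

```python
# Keep all rows; DQX adds result columns with the failed checks and reasons.
annotated_df = dq_engine.apply_checks_by_metadata(input_df, checks)

annotated_df.write.mode("append").saveAsTable("curated.orders_annotated")
```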

DQX usage in the Lakehouse Architecture

In the Lakehouse architecture, data should be validated as it enters the curated layer so that bad data is not propagated to subsequent layers. With DQX, you can implement the dead-letter pattern: quarantine invalid data and re-ingest it after curation to ensure data quality constraints are met. Data quality can be monitored in real time between layers, and the quarantine process can be automated.
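As a rough sketch of the dead-letter pattern between layers, the same split from Option 1 can be applied in a streaming job so that valid rows flow to the silver layer while invalid rows land in a quarantine table for later review and re-ingestion. Table names and checkpoint locations are illustrative, and the checks are assumed to be defined as in the earlier example:

```python
# Stream new bronze records, validate them, and route failures to quarantine.
bronze_df = spark.readStream.table("bronze.orders")

valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(bronze_df, checks)

(valid_df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/silver_orders")
    .toTable("silver.orders"))

(quarantine_df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/quarantine_orders")
    .toTable("quarantine.orders"))
```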

When to use DQX

  • Use DQX if you need proactive monitoring (before data is written to a target table), streaming support, or custom quality rules.
  • For monitoring data quality of already persisted data in Delta tables (post-factum monitoring), try Databricks Lakehouse Monitoring.
  • DQX can be integrated with DLT to check data quality. However, your first choice for DLT pipelines should be DLT Expectations. DQX can be used to profile data and generate DLT expectation candidates.

You can integrate DQX with other data transformation frameworks that support PySpark, such as dbt, either by running DQX on the tables produced by dbt models or by running DQX inside dbt Python models as part of the dbt project.
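A rough sketch of the second approach is shown below: a dbt Python model applies DQX checks to an upstream model and returns only the valid rows. The model name, check definition, and handling of quarantined rows are illustrative; see the dbt and DQX documentation for exact configuration:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient


def model(dbt, session):
    # Upstream dbt model to validate (illustrative name).
    input_df = dbt.ref("stg_orders")

    dq_engine = DQEngine(WorkspaceClient())
    checks = [
        {
            "criticality": "error",
            "check": {"function": "is_not_null", "arguments": {"col_name": "order_id"}},
        },
    ]

    # Keep only valid rows in the model output; quarantined rows could instead
    # be written to a separate table rather than dropped.
    valid_df, _quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
    return valid_df
```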