Motivation

Current data quality frameworks often fall short in explaining why specific rows or columns have quality issues, and they are primarily designed for complete datasets, which makes them difficult to integrate into streaming workloads. They also lack the ability to quarantine invalid data and have compatibility issues with Databricks Runtime.

This project introduces a simple Python validation framework for assessing data quality of PySpark DataFrames. It enables real-time quality validation during data processing rather than relying solely on post-factum monitoring. The validation output includes detailed information on why specific rows and columns have issues, allowing for quicker identification and resolution of data quality problems. The framework offers the ability to quarantine invalid data and investigate quality issues before they escalate.


Invalid data can be quarantined to make sure bad data is never written to the output.


In the Lakehouse architecture, new data should be validated as it enters the Curated Layer to make sure bad data is not propagated to subsequent layers. With DQX you can easily quarantine invalid data and re-ingest it after curation to ensure that data quality constraints are met.
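
A minimal sketch of this split-and-quarantine flow, assuming the DQEngine metadata API (apply_checks_by_metadata_and_split); the table and column names are illustrative, and the exact check-definition keys may differ between DQX releases:

```python
import yaml
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

# Hypothetical input DataFrame to validate, read via the ambient Databricks `spark` session.
input_df = spark.read.table("catalog.schema.raw_orders")

# Checks defined as metadata; "error" criticality routes failing rows to quarantine,
# "warn" only annotates them in the result columns.
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: order_id
- criticality: warn
  check:
    function: is_not_null_and_not_empty
    arguments:
      col_name: customer_email
""")

dq_engine = DQEngine(WorkspaceClient())

# Split the input into rows that passed and rows to quarantine; the quarantined
# DataFrame carries result columns describing which checks failed and why.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Only valid rows reach the curated table; quarantined rows can be inspected,
# fixed, and re-ingested later.
valid_df.write.mode("append").saveAsTable("catalog.schema.curated_orders")
quarantine_df.write.mode("append").saveAsTable("catalog.schema.quarantined_orders")
```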


When to use DQX

  • Use DQX if you need proactive monitoring (before data is written to a target table).
  • For monitoring data quality of already persisted data in Delta tables (post-factum monitoring), try Databricks Lakehouse Monitoring.
  • DQX can be integrated with DLT for data quality checking, but your first choice for DLT pipelines should be DLT Expectations. DQX can be used to profile data and generate DLT expectation candidates.