DQX - Data Quality Framework
Provided by Databricks Labs
DQX is a data quality framework for Apache Spark that enables you to define, monitor, and address data quality issues in your Python-based data pipelines.
Capabilities
Details on Failed Checks
Get detailed insights into why a check has failed.
Data Format Agnostic
Works with PySpark DataFrames, regardless of the underlying data format.
Spark Batch & Spark Structured Streaming Support
Runs in both batch and streaming pipelines, including Delta Live Tables (DLT) integration.
Custom Reactions to Failed Checks
Drop, mark, or quarantine invalid data.
Check Levels
Assign a warning or error level to each check.
Row & Column Level Rules
Define quality rules at both row and column levels.
Profiling & Quality Rules Generation
Automatically profile input data and generate data quality rule candidates.
Code or Config Checks
Define checks as code or configuration.
Validation Summary & Quality Dashboard
Track and identify data quality issues across your pipelines.
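
To make the ideas of check levels and quarantining concrete, here is a minimal sketch in plain Python. It does not use the DQX API or Spark. The names Check and split_by_checks, the "_errors" and "_warnings" fields, and the sample rules are all hypothetical and exist only for illustration. The sketch assumes that a row failing an error-level check is quarantined, while a row failing only warning-level checks is kept and marked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """Hypothetical row-level rule; not a DQX class."""
    name: str
    level: str  # "error" or "warn"
    predicate: Callable[[dict], bool]  # returns True when the row passes


def split_by_checks(rows, checks):
    """Annotate each row with its failed checks and split valid from quarantined.

    Assumed semantics: any failed error-level check quarantines the row;
    failed warn-level checks only mark the row.
    """
    valid, quarantined = [], []
    for row in rows:
        errors = [c.name for c in checks
                  if c.level == "error" and not c.predicate(row)]
        warnings = [c.name for c in checks
                    if c.level == "warn" and not c.predicate(row)]
        annotated = {**row, "_errors": errors, "_warnings": warnings}
        (quarantined if errors else valid).append(annotated)
    return valid, quarantined


# Illustrative rules: a missing id is an error, an implausible age is a warning.
checks = [
    Check("id_is_not_null", "error", lambda r: r.get("id") is not None),
    Check("age_is_plausible", "warn",
          lambda r: r.get("age") is None or 0 <= r["age"] <= 120),
]

rows = [
    {"id": 1, "age": 34},
    {"id": None, "age": 27},
    {"id": 3, "age": 150},
]

valid, quarantined = split_by_checks(rows, checks)
```

In this sketch the row with a missing id lands in quarantined with the failed check name recorded, and the row with age 150 stays in valid but carries a warning. In a real pipeline the same logic would operate on PySpark DataFrames rather than lists of dictionaries.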