# databricks.labs.dqx.metrics\_observer
## DQMetricsObservation Objects

```python
@dataclass(frozen=True)
class DQMetricsObservation()
```

Observer metrics class used to persist summary metrics.
**Arguments**:

- `run_id` - Unique observation ID.
- `run_name` - Name of the observation (default is 'dqx').
- `observed_metrics` - Dictionary of observed metrics.
- `run_time_overwrite` - Run time when the data quality summary metrics were observed. If `None`, `current_timestamp()` is used.
- `error_column_name` - Name of the error column used when running quality checks.
- `warning_column_name` - Name of the warning column used when running quality checks.
- `input_location` - (optional) Location the input data is loaded from when running quality checks (fully-qualified table name or file path).
- `output_location` - (optional) Location the output data is persisted to when running quality checks (fully-qualified table name or file path).
- `quarantine_location` - (optional) Location the quarantined data is persisted to when running quality checks (fully-qualified table name or file path).
- `checks_location` - (optional) Location the checks are loaded from when running quality checks (fully-qualified table name or file path).
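To make the record's shape concrete, here is a minimal sketch using a simplified stand-in dataclass built from the argument list above. The field types and defaults shown are assumptions for illustration; the real class lives in `databricks.labs.dqx.metrics_observer`.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified, hypothetical stand-in mirroring the documented fields of
# DQMetricsObservation; types and defaults are assumptions, not the real API.
@dataclass(frozen=True)
class DQMetricsObservation:
    run_id: str
    run_name: str = "dqx"
    observed_metrics: Optional[dict] = None
    run_time_overwrite: Optional[str] = None  # None -> current_timestamp() at save time
    error_column_name: str = "_errors"
    warning_column_name: str = "_warnings"
    input_location: Optional[str] = None
    output_location: Optional[str] = None
    quarantine_location: Optional[str] = None
    checks_location: Optional[str] = None

# The record is frozen: fields are set once at construction time.
obs = DQMetricsObservation(
    run_id="123e4567-e89b-12d3-a456-426614174000",
    observed_metrics={"error_row_count": 3, "warning_row_count": 1},
    input_location="main.raw.orders",
)
print(obs.run_name)  # falls back to the default run name
```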
## DQMetricsObserver Objects

```python
@dataclass
class DQMetricsObserver()
```

Observation class used to track summary metrics about data quality when validating datasets with DQX.
**Arguments**:

- `name` - Name of the observation, displayed in listener metrics (default is 'dqx'). Also used as the `run_name` field when saving the metrics to a table.
- `custom_metrics` - Optional list of SQL expressions defining custom, dataset-level quality metrics.
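Custom metrics are plain Spark SQL aggregate expressions evaluated over the checked dataset. The sketch below shows how such a list might be assembled; the default metric names and the `_errors`/`_warnings` column references are illustrative assumptions, not the library's actual defaults.

```python
# Hypothetical dataset-level custom metrics, written as Spark SQL
# aggregate expressions (names here are assumptions for illustration).
custom_metrics = [
    "avg(amount) AS avg_amount",
    "count_if(amount < 0) AS negative_amount_count",
]

# Hypothetical default metrics counting rows flagged by the error and
# warning result columns produced by the quality checks.
default_metrics = [
    "count(1) AS input_row_count",
    "count_if(_errors IS NOT NULL) AS error_row_count",
    "count_if(_warnings IS NOT NULL) AS warning_row_count",
]

# The observer's `metrics` property combines defaults with custom metrics.
metrics = default_metrics + custom_metrics
print(len(metrics))
```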
### id

```python
@cached_property
def id() -> str
```

ID of the observer.

**Returns**:

A unique observer ID.
### metrics

```python
@cached_property
def metrics() -> list[str]
```

Gets the observer metrics as Spark SQL expressions.

**Returns**:

A list of Spark SQL expressions defining the observer metrics (both default and custom).
### observation

```python
@property
def observation() -> Observation
```

Spark `Observation` which can be attached to a DataFrame to track summary metrics. Metrics are collected
when the first action is triggered on the attached DataFrame; subsequent operations on that DataFrame
do not update the observed metrics. See the PySpark `Observation` documentation for complete details.

**Returns**:

A Spark `Observation` instance.
### set\_column\_names

```python
def set_column_names(error_column_name: str, warning_column_name: str) -> None
```

Sets the default column names (e.g. `_errors` and `_warnings`) for monitoring summary metrics.

**Arguments**:

- `error_column_name` - Error column name.
- `warning_column_name` - Warning column name.
### build\_metrics\_df

```python
@staticmethod
def build_metrics_df(spark: SparkSession,
                     observation: DQMetricsObservation) -> DataFrame
```

Builds a Spark DataFrame from a `DQMetricsObservation`.

**Arguments**:

- `spark` - `SparkSession` used to create the `DataFrame`.
- `observation` - `DQMetricsObservation` with summary metrics.

**Returns**:

A Spark DataFrame with summary metrics.
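Conceptually, this flattens one observation record into one row. A plain-Python sketch of that flattening (the helper, its column names, and the timestamp fallback are hypothetical; the real method returns a Spark DataFrame, not a dict):

```python
from datetime import datetime, timezone

def observation_to_row(run_id, run_name, observed_metrics,
                       run_time_overwrite=None,
                       error_column_name="_errors",
                       warning_column_name="_warnings",
                       input_location=None, output_location=None,
                       quarantine_location=None, checks_location=None):
    """Hypothetical flattening of a DQMetricsObservation into a single row,
    represented here as a dict keyed by assumed column names."""
    return {
        "run_id": run_id,
        "run_name": run_name,
        "observed_metrics": observed_metrics,
        # When no overwrite is supplied, fall back to the current time
        # (the documented behavior uses current_timestamp()).
        "run_time": run_time_overwrite or datetime.now(timezone.utc),
        "error_column_name": error_column_name,
        "warning_column_name": warning_column_name,
        "input_location": input_location,
        "output_location": output_location,
        "quarantine_location": quarantine_location,
        "checks_location": checks_location,
    }

row = observation_to_row("run-1", "dqx", {"error_row_count": 3})
print(row["run_name"])
```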