dbldatagen.constraints.unique_combinations module

This module defines the UniqueCombinations class

class UniqueCombinations(columns=None)[source]

Bases: NoFilterMixin, Constraint

Unique Combinations constraints

Applies a constraint ensuring that the supplied columns have unique combinations of values, i.e. each combination of values across those columns appears only once.

Parameters:

columns – string column name or list of column names as strings. If no columns are specified, all output columns will be considered when dropping duplicate combinations.

If the columns are not specified, or the column name ‘*’ is used, all columns that would be present in the final output are considered.

In effect, each combination of values across the named columns appears in at most one row of the output.

The uniqueness constraint may apply to columns that are omitted, i.e. columns that are not part of the final output. If no column or column list is supplied, all columns that would be present in the final output are considered.
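To make the semantics above concrete, here is a minimal pure-Python model of the effect on a list of rows. It is only an illustration, not the library's implementation: the helper name unique_combinations and the dict-based rows are hypothetical, and the choice to keep the first row seen is an assumption of this sketch.

```python
def unique_combinations(rows, columns=None):
    """Keep one row (the first seen) per distinct combination of values in `columns`.

    Hypothetical helper that models the constraint on plain dicts.
    """
    # A single column name is treated as a one-element list.
    if isinstance(columns, str) and columns != "*":
        columns = [columns]

    seen = set()
    kept = []
    for row in rows:
        # None or "*" means: compare on every column present in the row.
        names = sorted(row) if columns is None or columns == "*" else columns
        key = tuple((name, row[name]) for name in names)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"id": 1, "region": "east", "score": 10},
    {"id": 1, "region": "east", "score": 20},
    {"id": 1, "region": "west", "score": 10},
    {"id": 1, "region": "east", "score": 10},
]

by_pair = unique_combinations(rows, ["id", "region"])  # one row per (id, region)
by_all = unique_combinations(rows)                     # one row per full-row combination
by_id = unique_combinations(rows, "id")                # one row per id
```

With this sample data, by_pair keeps two rows, by_all keeps three, and by_id keeps one. A distributed deduplication generally gives no guarantee about which of the duplicate rows survives.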

This is useful for enforcing unique ids, unique keys, and similar requirements.
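A usage sketch for the unique-key case follows. It assumes the constraint is attached to a data generation spec with a withConstraint() method on DataGenerator; that call, the column names, and the value ranges are illustrative assumptions and should be checked against the installed dbldatagen version.

```python
def build_unique_pairs(spark, rows=10_000):
    """Build a DataFrame in which each (customer_id, device_id) pair appears once.

    `spark` is an existing SparkSession. Column names and ranges are made up
    for illustration.
    """
    # Imported lazily so this function can be defined without Spark installed.
    import dbldatagen as dg
    from dbldatagen.constraints.unique_combinations import UniqueCombinations

    spec = (
        dg.DataGenerator(sparkSession=spark, name="unique_pairs", rows=rows, partitions=4)
        .withColumn("customer_id", "int", minValue=1, maxValue=1000, random=True)
        .withColumn("device_id", "int", minValue=1, maxValue=50, random=True)
        .withColumn("reading", "float", minValue=0.0, maxValue=100.0, random=True)
        # Assumption: constraints are attached to the spec via withConstraint().
        .withConstraint(UniqueCombinations(columns=["customer_id", "device_id"]))
    )
    return spec.build()
```

Because the documentation describes the constraint as dropping duplicate combinations, the resulting dataframe may contain fewer rows than the number requested.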

Note: When applied to a streaming dataframe, it will perform deduplication only within a batch.

If stateful operation is needed, where duplicates are eliminated across the entire stream, it is recommended to use a watermark and apply deduplication logic to the dataframe produced by the build() method.

For high volume streaming dataframes, this may consume substantial resources when maintaining state - hence deduplication will only be performed within a batch.
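The stateful alternative mentioned in the note could look roughly like the sketch below, applied to the dataframe returned by build(). The helper name dedupe_across_stream, the event-time column, and the default delay are hypothetical; withWatermark and dropDuplicates are standard Spark DataFrame methods.

```python
def dedupe_across_stream(stream_df, key_columns, event_time_column, delay="10 minutes"):
    """Drop duplicate key combinations across micro-batches of a streaming DataFrame.

    Hypothetical helper; `stream_df` is the streaming DataFrame produced by build().
    """
    # The watermark bounds how long Spark keeps deduplication state.
    watermarked = stream_df.withWatermark(event_time_column, delay)
    # Including the event-time column in the key lets Spark discard state
    # for rows that are older than the watermark.
    return watermarked.dropDuplicates(list(key_columns) + [event_time_column])
```

A longer delay catches duplicates that arrive further apart but keeps more state in memory, which is the resource trade-off the note describes.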

prepareDataGenerator(dataGenerator)[source]

Prepare the data generator to generate data that matches the constraint

This method may modify the data generation rules to meet the constraint

Parameters:

dataGenerator – Data generation object that will generate the dataframe

Returns:

modified or unmodified data generator

transformDataframe(dataGenerator, dataFrame)[source]

Transform the dataframe to make data conform to constraint if possible

This method should not modify the dataGenerator - but may modify the dataframe

Parameters:
  • dataGenerator – Data generation object that generated the dataframe

  • dataFrame – generated dataframe

Returns:

modified or unmodified Spark dataframe

The default transformation returns the dataframe unmodified