dbldatagen.constraints.unique_combinations module
This module defines the UniqueCombinations class
- class UniqueCombinations(columns=None)
Bases: NoFilterMixin, Constraint
Unique Combinations constraints
Applies a constraint that ensures the supplied columns have unique combinations of values, i.e. each combination of values across those columns appears only once.
- Parameters:
columns – string column name or list of column names as strings. If no columns are specified, or the column name ‘*’ is used, all columns that would be present in the final output are considered when dropping duplicate combinations.
In effect, each combination of values in the named columns appears in at most one row.
The uniqueness constraint may also apply to omitted columns, i.e. columns that are not part of the final output.
This is useful to enforce unique ids, unique keys etc.
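A minimal usage sketch follows. It assumes the constraint is attached to a data generation spec with the generator's withConstraint() method and that the class can be imported from dbldatagen.constraints. The column names, value ranges and the helper name build_unique_pairs are hypothetical and chosen only for illustration.

```python
UNIQUE_COLUMNS = ["customer_id", "device_id"]  # hypothetical column names


def build_unique_pairs(spark, rows=1000):
    """Build a dataframe in which each (customer_id, device_id) pair occurs once."""
    # Imported lazily so this module can be loaded without Spark installed.
    import dbldatagen as dg
    from dbldatagen.constraints import UniqueCombinations

    spec = (
        dg.DataGenerator(sparkSession=spark, name="unique_pairs",
                         rows=rows, partitions=4)
        .withColumn("customer_id", "int", minValue=1, maxValue=50, random=True)
        .withColumn("device_id", "int", minValue=1, maxValue=20, random=True)
        # Assumption: constraints are attached with withConstraint().
        .withConstraint(UniqueCombinations(columns=UNIQUE_COLUMNS))
    )
    # Duplicate combinations are dropped, so the result can contain
    # fewer rows than the number requested.
    return spec.build()
```

Calling build_unique_pairs(spark) with an active SparkSession should return a dataframe in which no two rows share the same pair of values in the two named columns.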
Note: When applied to a streaming dataframe, deduplication is performed only within each batch.
If stateful operation is needed, where duplicates are eliminated across the entire stream, it is recommended to use a watermark and apply deduplication logic to the dataframe produced by the build() method.
For high-volume streaming dataframes, maintaining that state may consume substantial resources, which is why the constraint itself deduplicates only within a batch.
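The stateful alternative described in the note could be sketched as below, using the standard Spark Structured Streaming methods withWatermark() and dropDuplicates(). The helper name dedupe_across_batches, the event-time column name event_ts and the ten-minute delay are assumptions for illustration; the streaming dataframe is assumed to contain a timestamp column to use as event time.

```python
def dedupe_across_batches(streaming_df, key_columns,
                          event_time_column="event_ts", delay="10 minutes"):
    """Drop duplicate key combinations across micro-batches with bounded state."""
    # The watermark tells Spark how late data may arrive, so state older
    # than the threshold can be discarded.
    with_watermark = streaming_df.withWatermark(event_time_column, delay)
    # Including the event-time column in the subset lets Spark expire
    # old deduplication state instead of keeping it indefinitely.
    return with_watermark.dropDuplicates([*key_columns, event_time_column])
```

The helper would be applied to the streaming dataframe returned by build(), before the stream is written to its sink. A longer delay catches later duplicates at the cost of more retained state.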
- prepareDataGenerator(dataGenerator)
Prepare the data generator to generate data that matches the constraint
This method may modify the data generation rules to meet the constraint
- Parameters:
dataGenerator – Data generation object that will generate the dataframe
- Returns:
modified or unmodified data generator
- transformDataframe(dataGenerator, dataFrame)
Transform the dataframe to make data conform to constraint if possible
This method should not modify the dataGenerator - but may modify the dataframe
- Parameters:
dataGenerator – Data generation object that generated the dataframe
dataFrame – generated dataframe
- Returns:
modified or unmodified Spark dataframe
The default transformation returns the dataframe unmodified
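For a uniqueness constraint, the transformation step is where duplicate combinations would be removed. The sketch below shows an equivalent dataframe operation using Spark's dropDuplicates(). It is not the library's own code, and the helper name drop_duplicate_combinations is hypothetical. It follows the parameter rules described above: no columns, or ‘*’, means that all columns are considered.

```python
def drop_duplicate_combinations(data_frame, columns=None):
    """Keep one row per distinct combination of values in `columns`."""
    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]
    # No columns, or the wildcard '*', means every column is considered.
    if not columns or "*" in columns:
        return data_frame.dropDuplicates()
    return data_frame.dropDuplicates(list(columns))
```

Which of the duplicate rows survives is not something this sketch controls, so it suits cases where only the uniqueness of the combination matters.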