dbldatagen.distributions.data_distribution module
This file defines the base class for statistical distributions.
Each inherited version of the DataDistribution object is used to generate random numbers drawn from a specific distribution.
Because the test data generator needs to scale the generated values across different data ranges, the generate function is intended to produce values scaled to lie between 0 and 1.
As some distributions do not have easily predicted bounds, we scale each random data set by taking the minimum and maximum values of that generated data set and using them as its range.
For some distributions, there may be alternative, more efficient mechanisms for scaling the data to the [0, 1] interval.
Some data distributions are scaled to the [0,1] interval as part of their data generation and no further scaling is needed.
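The min/max scaling described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the library's own generation code, and the helper name `min_max_scale` is hypothetical, not part of dbldatagen. It only shows the idea of using the observed minimum and maximum of a generated data set as its range.

```python
import random


def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Exponential samples have no fixed upper bound, so the observed range is used.
rng = random.Random(42)
raw = [rng.expovariate(1.5) for _ in range(1000)]
scaled = min_max_scale(raw)
```

After scaling, every value in `scaled` lies in the [0, 1] interval, with the smallest raw sample mapped to 0 and the largest to 1. A distribution that already produces values in [0, 1] would not need this step.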
- class DataDistribution
Bases: ABC
Base class for all distributions
- abstract generateNormalizedDistributionSample()
Generate a sample of data for the distribution
- Returns:
random samples from distribution scaled to values between 0 and 1
Note: implementors should provide an implementation of this method.
The return value is expected to be a PySpark SQL column expression such as F.expr("rand()")
- static get_np_random_generator(random_seed)
Get a NumPy random number generator
- Parameters:
random_seed – Numeric random seed to use. If < 0, then no random
- Returns:
- property randomSeed
Get the randomSeed attribute
- property rounding
Get the rounding attribute