dbldatagen.distributions.data_distribution module

This file defines the base class for statistical distributions.

Each inherited version of the DataDistribution object is used to generate random numbers drawn from a specific distribution.

As the test data generator needs to scale the generated values across different data ranges, the generate function is intended to produce values scaled to the [0, 1] interval.

As some distributions don’t have easily predicted bounds, we scale each random data set by taking its minimum and maximum values and using them as the range for the generated data.
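The min/max scaling described above can be illustrated with a small sketch. It uses plain Python lists and the standard `random` module purely for illustration. It is not the library's implementation, which operates on Spark column expressions, and the helper name `min_max_scale` is hypothetical.

```python
import random


def min_max_scale(values):
    """Linearly rescale values so the smallest maps to 0.0 and the largest to 1.0.

    Hypothetical helper for illustration only; not part of dbldatagen.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by;
        # map everything to 0.0 rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Draws from a normal distribution have no fixed bounds, so the observed
# minimum and maximum of the sample are used as the range.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
scaled = min_max_scale(samples)
```

One consequence of this approach is that the scaled range depends on the particular sample drawn: the smallest observed value always maps to 0 and the largest to 1, whatever their original magnitudes were.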

For some distributions, there may be alternative, more efficient mechanisms for scaling the data to the [0, 1] interval.

Some data distributions are scaled to the [0,1] interval as part of their data generation and no further scaling is needed.

class DataDistribution[source]

Bases: ABC

Base class for all distributions

abstract generateNormalizedDistributionSample()[source]

Generate sample of data for distribution

Returns: random samples from distribution scaled to values between 0 and 1

Note: implementors should provide an implementation for this.

The return value is expected to be a PySpark SQL column expression such as F.expr("rand()").

static get_np_random_generator(random_seed)[source]

Get numpy random number generator

Parameters: random_seed – Numeric random seed to use. If < 0, then no random
Returns:

property randomSeed: get the randomSeed attribute

property rounding: get the rounding attribute

withRandomSeed(seed)[source]

Create copy of object and set the random seed attribute

Parameters: seed – random generator seed value to set. Should be an integer, float, or None
Returns: new instance of data distribution object with the random seed set

withRounding(rounding)[source]

Create copy of object and set the rounding attribute

Parameters: rounding – rounding value to set. Should be True or False
Returns: new instance of data distribution object with rounding set
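Both withRandomSeed and withRounding are documented as returning a modified copy rather than changing the object they are called on. The following is a minimal sketch of that copy-and-set pattern using a hypothetical stand-in class. The class name `ExampleDistribution`, the private attribute names, and the default values are assumptions for illustration, not the dbldatagen implementation.

```python
import copy


class ExampleDistribution:
    """Hypothetical stand-in showing the copy-and-set pattern (not dbldatagen code)."""

    def __init__(self):
        # Default values here are assumptions made for this sketch.
        self._randomSeed = None
        self._rounding = False

    @property
    def randomSeed(self):
        return self._randomSeed

    @property
    def rounding(self):
        return self._rounding

    def withRandomSeed(self, seed):
        # Copy first, then set the attribute on the copy only.
        new_obj = copy.copy(self)
        new_obj._randomSeed = seed
        return new_obj

    def withRounding(self, rounding):
        new_obj = copy.copy(self)
        new_obj._rounding = rounding
        return new_obj


base = ExampleDistribution()
# Each call returns a new object, so calls can be chained.
configured = base.withRandomSeed(42).withRounding(True)
```

Because each method returns a new instance, the original object (`base` here) keeps its settings and can be reused as a template for several differently configured distributions.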