Generating Data that Conforms to a Known Statistical Distribution
By default, data generated at random uses a uniform random number generator.
Sometimes it is useful to generate data that conforms to a known distribution.
While the weights option with discrete value lists can be used to introduce skew,
this can be awkward to manage for large sets of values.
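For contrast, a weights-based approach might look like the following sketch; the column name, values, and relative weights here are purely illustrative.

import dbldatagen as dg

# Skew introduced manually via relative weights on a discrete value list
weightedSpec = (
    dg.DataGenerator(spark, name="weighted_example", rows=100000)
    .withColumn("tier", "string",
                values=["bronze", "silver", "gold"],
                weights=[7, 2, 1],
                random=True)
)
dfWeighted = weightedSpec.build()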
To make this easier, we support the use of known distributions for randomly generated data on any field.
When the field is not numeric, the underlying seed value will be generated to conform to the specified distribution before being converted to the appropriate type as per the usual semantics.
Note that the distribution will be scaled to the possible range of values.
The following distributions are supported:
Normal or Gaussian distribution
Beta distribution
Gamma distribution
Exponential distribution
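For illustration, the following sketch (not part of the original example) shows how each of these distributions from dbldatagen.distributions might be supplied via the distribution option on numeric columns. The column names, ranges, and shape parameters are assumptions chosen for illustration; check the dbldatagen API documentation for the exact constructor signatures.

import dbldatagen as dg
import dbldatagen.distributions as dist

# Illustrative only: column names, ranges, and distribution parameters are assumptions
distributionSpec = (
    dg.DataGenerator(spark, name="distribution_examples", rows=100000)
    .withColumn("v_normal", "float", minValue=0.0, maxValue=100.0,
                random=True, distribution=dist.Normal(50.0, 10.0))
    .withColumn("v_beta", "float", minValue=0.0, maxValue=1.0,
                random=True, distribution=dist.Beta(2.0, 5.0))
    .withColumn("v_gamma", "float", minValue=0.0, maxValue=100.0,
                random=True, distribution=dist.Gamma(1.0, 2.0))
    .withColumn("v_exponential", "float", minValue=0.0, maxValue=100.0,
                random=True, distribution=dist.Exponential(1.5))
)
dfDistributions = distributionSpec.build()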
Note that the distribution option will have no effect for values that are not randomly generated, that is, when the random option is not used.
For values generated randomly, continuous distributions can still be used with discrete values such as strings, as the underlying random numbers used to select the appropriate discrete values will be drawn from the specified distribution. So, for discrete values, the frequency of occurrence of particular values should conform approximately to the underlying distribution.
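As a sketch of this behavior (the column name and value list are assumptions, not taken from the original documentation), the following pairs a string column with a Gamma distribution; the frequency of each string value should then approximately follow the scaled distribution.

import dbldatagen as dg
import dbldatagen.distributions as dist

# Illustrative only: the column name and value list are assumptions
prioritySpec = (
    dg.DataGenerator(spark, name="discrete_distribution_example", rows=100000)
    .withColumn("priority", "string",
                values=["low", "medium", "high", "critical"],
                random=True,
                distribution=dist.Gamma(1.0, 2.0))
)
dfPriority = prioritySpec.build()

# Inspect the resulting frequency of each discrete value
dfPriority.groupBy("priority").count().orderBy("count", ascending=False).show()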
Examples
In the following example (taken from the section on date ranges), we will simulate returns and ensure the return date is after the purchase date.
Here we specify an explicit date range and add a random number of days for the return.
However, unlike the example in the date range section, we will use a specific distribution to make returns more frequent in the period immediately following the purchase.
from pyspark.sql.types import IntegerType
import dbldatagen as dg
import dbldatagen.distributions as dist
row_count = 1000 * 100
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count)
    .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
    .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange("2017-10-01 00:00:00", "2018-10-06 11:55:00", "days=3"),
        random=True,
    )
    # create return delay, favoring short delay times
    .withColumn(
        "return_delay",
        "int",
        minValue=1,
        maxValue=100,
        random=True,
        distribution=dist.Gamma(1.0, 2.0),
        omit=True,
    )
    .withColumn(
        "return_date",
        "date",
        expr="date_add(purchase_date, return_delay)",
        baseColumn=["purchase_date", "return_delay"],
    )
)
dfTestData = testDataSpec.build()
Here we use a computed column, return_delay, for effect only. By specifying omit=True, it is omitted from the final data set.
You can view the distribution of the return delays using the following code sample in the Databricks environment.
import pyspark.sql.functions as F
dfDelays = dfTestData.withColumn("delay", F.expr("datediff(return_date, purchase_date)"))
display(dfDelays)
Use the plot options to plot the delay as a bar chart. Specify the key as delay, the values as delay, and the aggregation as COUNT to see the data distribution.
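Outside a Databricks notebook, where display() and the built-in plot options are not available, a plain PySpark aggregation gives a similar view of the delay distribution (a sketch using only standard PySpark calls):

# Count how many rows fall on each delay value and list them in order
delayCounts = dfDelays.groupBy("delay").count().orderBy("delay")
delayCounts.show(100)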