Generating Data that Conforms to a Known Statistical Distribution

By default, data generated at random uses a uniform random number generator.

Sometimes it is useful to generate data that conforms to a known distribution.

While the weights option can be used with discrete value lists to introduce skew, it can be awkward to manage for large sets of values.

To enable this, we support the use of known distributions for randomly generated data on any field.

When the field is not numeric, the underlying seed value will be generated to conform to the specified distribution before being converted to the appropriate type according to the usual semantics.

Note that the distribution will be scaled to the possible range of values.

The following distributions are supported:

  • Normal (Gaussian) distribution

  • Beta distribution

  • Gamma distribution

  • Exponential distribution
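
As a minimal sketch, a distribution is attached to a randomly generated column via the distribution option, as in the snippet below. The dist.Normal class and its (mean, stddev) arguments are assumptions here; check the dbldatagen.distributions module for the exact class names and constructor signatures.

import dbldatagen as dg
import dbldatagen.distributions as dist

# sketch only: dist.Normal and its (mean, stddev) arguments are assumed;
# the distribution is scaled to the column's value range when data is generated
df_scores = (
    dg.DataGenerator(spark, name="score_data", rows=100000)
    .withColumn("score", "int", minValue=0, maxValue=100, random=True,
                distribution=dist.Normal(50.0, 10.0))
    .build()
)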

Note that the distribution option has no effect for values that are not randomly generated, that is, when the random option is not used.

For values generated randomly, continuous distributions can still be used with discrete values such as strings, because the underlying random numbers used to select the discrete values are drawn from the specified distribution. As a result, the frequency of occurrence of particular discrete values should conform approximately to the underlying distribution.
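
For example, the following sketch draws discrete string values using the same Gamma distribution as the purchase example below. The column name and values are purely illustrative; with this distribution, values toward the start of the list should occur more frequently, analogous to the short return delays favored in that example.

import dbldatagen as dg
import dbldatagen.distributions as dist

# sketch: the underlying random numbers used to pick from the values list
# are drawn from the Gamma distribution, skewing the selection toward the
# start of the list (column name and values are illustrative only)
df_sizes = (
    dg.DataGenerator(spark, name="size_data", rows=100000)
    .withColumn("size", "string",
                values=["XS", "S", "M", "L", "XL"],
                random=True,
                distribution=dist.Gamma(1.0, 2.0))
    .build()
)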

Examples

In the following example (taken from the section on date ranges), we will simulate returns and ensure the return date is after the purchase date.

Here we specify an explicit date range and add a random number of days for the return.

However, unlike the example in the date range section, we will use a specific distribution to make returns more frequent in the period immediately following the purchase.

from pyspark.sql.types import IntegerType

import dbldatagen as dg
import dbldatagen.distributions as dist


row_count = 1000 * 100
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count)
    .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
    .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange("2017-10-01 00:00:00", "2018-10-06 11:55:00", "days=3"),
        random=True,
    )
    # create a return delay, favoring short delay times
    .withColumn(
        "return_delay",
        "int",
        minValue=1,
        maxValue=100,
        random=True,
        distribution=dist.Gamma(1.0, 2.0),
        omit=True,
    )
    .withColumn(
        "return_date",
        "date",
        expr="date_add(purchase_date, return_delay)",
        baseColumn=["purchase_date", "return_delay"],
    )
)

dfTestData = testDataSpec.build()

Here we use the computed column return_delay only as an intermediate value. By specifying omit=True, it is omitted from the final data set.

You can view the distribution of the return delays using the following code sample in the Databricks environment.

import pyspark.sql.functions as F
dfDelays = dfTestData.withColumn("delay", F.expr("datediff(return_date, purchase_date)"))

display(dfDelays)

Use the plot options to plot the delay as a bar chart.

Specify the key as delay, the values as delay, and the aggregation as COUNT to see the distribution of the data.
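
If the display function and its plot options are not available, for example when running outside a Databricks notebook, a plain PySpark aggregation over dfDelays gives a similar view of the delay distribution.

# count how many rows fall on each delay value, ordered by delay
delay_counts = dfDelays.groupBy("delay").count().orderBy("delay")
delay_counts.show(100)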