# Generating Data that Conforms to a Known Statistical Distribution

By default, randomly generated data uses a uniform random number generator. Sometimes it is useful to generate data that conforms to a known distribution. While the `weights` option with discrete value lists can be used to introduce skew, this can be awkward to manage for large sets of values.

To enable this, we support the use of known distributions for randomly generated data on any field. When the field is not numeric, the underlying seed value will be generated to conform to the known distribution before being converted to the appropriate type as per the usual semantics. Note that the distribution will be scaled to the possible range of values.

The following distributions are supported:

- normal (Gaussian) distribution
- Beta distribution
- Gamma distribution
- Exponential distribution

> Note that the `distribution` option has no effect for values that are not randomly generated, i.e. columns that
> do not use the `random` option.
>
> For randomly generated values, continuous distributions can still be used with discrete values such as strings,
> as the underlying random numbers used to select the appropriate discrete values are drawn from the specified
> distribution. So, for discrete values, the frequency of occurrence of particular values should conform approximately
> to the underlying distribution.

### Examples

In the following example (taken from the section on date ranges), we simulate returns and ensure that the return date falls after the purchase date. Here we specify an explicit date range and add a random number of days for the return. However, unlike the example in the date range section, we use a specific distribution to make returns more frequent in the period immediately following the purchase.

```python
from pyspark.sql.types import IntegerType

import dbldatagen as dg
import dbldatagen.distributions as dist

row_count = 1000 * 100
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=row_count)
    .withColumn("purchase_id", IntegerType(), minValue=1000000, maxValue=2000000)
    .withColumn("product_code", IntegerType(), uniqueValues=10000, random=True)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange("2017-10-01 00:00:00", "2018-10-06 11:55:00", "days=3"),
        random=True,
    )
    # create the return delay, favoring short delay times
    .withColumn(
        "return_delay",
        "int",
        minValue=1,
        maxValue=100,
        random=True,
        distribution=dist.Gamma(1.0, 2.0),
        omit=True,
    )
    .withColumn(
        "return_date",
        "date",
        expr="date_add(purchase_date, return_delay)",
        baseColumn=["purchase_date", "return_delay"],
    )
)

dfTestData = testDataSpec.build()
```

Here the computed column `return_delay` is used only to derive the return date; by specifying `omit=True`, it is omitted from the final data set.

You can view the distribution of the return delays using the following code sample in the Databricks environment.

```python
import pyspark.sql.functions as F

dfDelays = dfTestData.withColumn("delay", F.expr("datediff(return_date, purchase_date)"))
display(dfDelays)
```

Use the plot options to plot the delay as a bar chart. Specify the key as `delay`, the values as `delay`, and the aggregation as `COUNT` to see the data distribution.
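If you are not running in a Databricks notebook, where `display` and its plot options are available, you can still inspect the shape of the delay distribution with a plain aggregation. The following is a minimal sketch, assuming `dfTestData` was built as in the example above:

```python
import pyspark.sql.functions as F

# Count how many returns fall at each delay value; with the Gamma(1.0, 2.0)
# distribution used above, counts should be highest for short delays and tail off.
delayCounts = (
    dfTestData.withColumn("delay", F.datediff("return_date", "purchase_date"))
    .groupBy("delay")
    .count()
    .orderBy("delay")
)

delayCounts.show(20)
```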
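The same `distribution` option can be applied to any randomly generated column, not just ones used to derive dates. As an illustrative sketch, the column names below are hypothetical and we assume `dist.Normal(mean, stddev)` matches the distributions API in your installed version:

```python
import dbldatagen as dg
import dbldatagen.distributions as dist

# Hypothetical example: generate integer scores between 0 and 100 whose
# frequencies approximate a normal (Gaussian) distribution scaled to that range.
scoreSpec = (
    dg.DataGenerator(spark, name="score_data", rows=100000)
    .withColumn("customer_id", "long", minValue=1, maxValue=1000000, random=True)
    .withColumn(
        "score",
        "int",
        minValue=0,
        maxValue=100,
        random=True,
        distribution=dist.Normal(50.0, 10.0),  # assumed signature: Normal(mean, stddev)
    )
)

dfScores = scoreSpec.build()
```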