Repeatable Data Generation

One of the basic principles of the data generator is that all data can be generated multiple times and will produce the same results, unless a column is marked as random and no random seed is in use.

Data generation starts from an internal auto-incrementing field called id. The name of this column may be overridden by an argument when instantiating the DataGenerator object.

All additional columns are generated through some transformation of the id column or another designated baseColumn, either by using its value, a hash of its value, or an array of values when multiple baseColumn fields are used.

Additional transformations are then applied as directed by the options - including mapping the base field to discrete values, scaling it to an arbitrary range of numbers, dates or timestamps, or applying explicit SQL expressions to the base column.

For text data, the transformations may include applying a prefix, a suffix or a format string.
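
As a minimal sketch (assuming a live spark session, as in the example later in this section; column names are illustrative), the following shows value, prefix and format-string transformations of the id column:

import dbldatagen as dg

# each column below is a deterministic transformation of the internal id column
df = (dg.DataGenerator(spark, name="transform_example", rows=1000, partitions=4)
      .withIdOutput()
      # numeric range derived from the id column
      .withColumn("reading", "int", minValue=0, maxValue=100)
      # text derived from the id column via a prefix
      .withColumn("code1", "string", prefix="item", baseColumn="id")
      # text derived from the id column via a format string
      .withColumn("code2", "string", format="code-%05d", baseColumn="id")
      .build())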

In addition, there are a number of mechanisms for generating pseudo-random text - the ILText text generator, a template-based text generator and the ability to incorporate third-party libraries for text generation. As outlined in the next section, these use random number generators with predetermined random seeds.
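
For instance, a minimal sketch of the template and ILText generators (column names are illustrative); both generators are seeded, so the output is repeatable from run to run:

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="text_example", rows=1000, partitions=4)
      # template-based generation - each 'd' produces a random digit
      .withColumn("phone", "string", template=r"(ddd)-ddd-dddd")
      # ILText generates pseudo-random sentences grouped into paragraphs
      .withColumn("notes", "string",
                  text=dg.ILText(paragraphs=(1, 2), sentences=(2, 4)))
      .build())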

Note

The internal id field is omitted from the output by default. Calling the withIdOutput method will include the id field in the output.

What all of this means is that data generation is entirely repeatable: generating the same data set multiple times will produce the same data.

In addition, two columns in two different tables produced with the same generation rules and random seed will contain the same values, allowing creation of multiple tables that preserve referential integrity for use in joins.
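
For example, the following sketch (table and column names are hypothetical) derives a customer_id column identically in two specs so that the resulting tables join cleanly:

import dbldatagen as dg

# both specs derive customer_id deterministically from the id column with the
# same number of unique values, preserving referential integrity
customers = (dg.DataGenerator(spark, name="customers", rows=1000,
                              randomSeedMethod='hash_fieldname')
             .withColumn("customer_id", "long", uniqueValues=1000)
             .withColumn("name", "string", prefix="customer",
                         baseColumn="customer_id")
             .build())

orders = (dg.DataGenerator(spark, name="orders", rows=10000,
                           randomSeedMethod='hash_fieldname')
          .withColumn("customer_id", "long", uniqueValues=1000)
          .withColumn("amount", "decimal(10,2)", minValue=1.0, maxValue=500.0)
          .build())

# every orders.customer_id has a matching customers.customer_id
joined = orders.join(customers, "customer_id")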

Note

The exception to repeatability is where the data set contains the timestamp or date of when the data is written. In these cases, runs from a later date will have different values.

This is why we stress generating date or timestamp ranges with a specific begin, end and interval in this section on repeatability, rather than simply using SQL now(), current_timestamp(), etc.
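
For example, a fixed range such as the following (a sketch; names are illustrative) reproduces the same timestamps on every run:

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="ts_example", rows=1000)
      # an explicit begin, end and interval keeps the data repeatable,
      # unlike SQL now() or current_timestamp()
      .withColumn("event_ts", "timestamp",
                  begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
                  interval="1 minute")
      .build())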

Generating random data

What if random data is desired, to avoid data appearing in strict sequences? (Note that using hash as the baseColumnType to compute the derived value already performs some deterministic randomization.)

A field may be marked as random, in which case a random number generator will be used (either Spark SQL rand() or NumPy random number generators, depending on context).

By default, a predefined random seed will be used for all random columns - so by definition all data is repeatable.

All columns will use the same random seed unless the random seed method is specified as ‘hash_fieldname’ or the seed is overridden at the column level. With the ‘hash_fieldname’ method, a hash of the field name is used as the seed, so that each column has a different seed.
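
The following sketch (names and seed values are illustrative) shows a spec-level random seed shared by all random columns, with one column overriding it:

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="seed_example", rows=1000, randomSeed=42)
      # both columns are random, but the data remains repeatable because the
      # spec-level seed applies to all random columns
      .withColumn("a", "int", minValue=0, maxValue=100, random=True)
      # a column-level randomSeed overrides the spec-level seed for this column
      .withColumn("b", "int", minValue=0, maxValue=100, random=True,
                  randomSeed=99)
      .build())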

Note

The text generators for templates and ILText always generate random data irrespective of the base column. This means they will produce repeatable data from run to run if a random seed is used, but will not produce the same values for the same value of the base column.

True random data

To generate true random values, a random seed of -1 must be specified, either at the data spec level or at the individual column level. When specified at the individual column level, it applies only to that column.

In this case, there are no guarantees of data repeatability - but you can still constrain the generated data to specific ranges for use as foreign keys in other tables.
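
A minimal sketch (names are hypothetical) constraining true random values to a range usable as a foreign key:

import dbldatagen as dg

# randomSeed=-1 requests true random data, so runs are not repeatable;
# the minValue/maxValue constraint keeps the values valid as foreign keys
df = (dg.DataGenerator(spark, name="true_random_example", rows=1000,
                       randomSeed=-1)
      .withColumn("customer_id", "long", minValue=1, maxValue=1000,
                  random=True)
      .build())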

If columns are not marked as random, they will produce a repeatable set of data (with the exception of ILText, template generation and third-party library integrations). Because most columns are produced by a deterministic transformation of their corresponding base columns, their data will always be repeatable.

For columns generated using an inherently random process, such as those produced with template generation, ILText and text data generator plugins, the random process will be seeded with a constant value unless the corresponding column specification is marked as random.

Note

Again, this means data will be repeatable from run to run but not for a specific value of the base column. For some third-party libraries such as Faker, there is currently no integration with the random seeding capabilities, so their data will not be repeatable from run to run.

If a random seed is provided, either as an argument to the DataGenerator instance specification or as an option on the column specification, the random seed will be applied to fields when random data generation is used.

Unless a specific random seed is provided, a default random seed is used and the randomSeedMethod is set to ‘hash_fieldname’.

Use of Hashed field names

If two columns are generated with the same seed and the same data generation rules, they will produce the same values unless they are marked as random and a different random seed is used for each field.

To avoid having to pass an explicit seed to every column to generate different sequences of values, you can specify the randomSeedMethod as hash_fieldname. In this case, a hash of the field name will be used as the seed. While this cannot guarantee the absence of collisions, in practice the probability of one occurring is extremely low, even for tables with hundreds of columns.
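
For example (a sketch with illustrative names), the two random columns below have identical generation rules and would produce identical sequences under a single shared seed, but differ under ‘hash_fieldname’:

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="hash_seed_example", rows=1000,
                       randomSeedMethod='hash_fieldname')
      # each column is seeded from a hash of its own field name,
      # so the two sequences differ
      .withColumn("x", "int", minValue=0, maxValue=100, random=True)
      .withColumn("y", "int", minValue=0, maxValue=100, random=True)
      .build())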

Data Generation algorithm

The overall data generation algorithm operates as follows:

  • The order of column generation is computed based on which columns depend on which base columns. This is referred to in the code as computing the build plan.

  • For the number of rows specified, generate an implicit id column ranging from 0 to n - 1.

  • For each column in the schema or interim set of columns, generate the column according to its column spec. Computation follows the build plan ordering.

    • If the data spec was initialized from a schema and no column specs are specified, columns are generated as id column values converted to the column type.

    • For date or timestamp columns, the base value is first interpreted as a numeric timestamp and then converted to a date or timestamp.

    • For string values, the default generation pattern is the string ‘item-’ followed by the value of the id column.

  • If a column spec is supplied, column generation will proceed according to the spec. Note that a column spec does not add a column; it only describes a previously added column (added via the schema or withColumn). See the sketch following this list.

    • If a column spec already exists, a later column spec supersedes it.

    • If a column already exists (due to being in the schema or added via withColumn), adding another column with the same name causes an error, unless the column is marked as implicit. Note that all schema columns are marked as implicit, allowing them to be overridden.

  • The output will drop columns that are marked as omit.

  • If the id column is to be retained in the output, use withIdOutput() to flag this on the data spec.

  • A schema is not required; the resulting schema will be the result of all withColumn statements.

  • The final output layout will be derived from any schema + any additional columns added via withColumn
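
The following sketch (schema and names are hypothetical) illustrates the interplay of a schema, a withColumnSpec call describing an implicit schema column, and a withColumn call adding a new column:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

import dbldatagen as dg

schema = StructType([
    StructField("site_id", IntegerType(), True),
    StructField("site_cd", StringType(), True),
])

df = (dg.DataGenerator(spark, name="schema_example", rows=1000)
      .withSchema(schema)
      # schema columns are implicit, so withColumnSpec may describe how an
      # existing column is generated without adding a new one
      .withColumnSpec("site_cd", prefix="site", baseColumn="site_id")
      # withColumn adds a column not present in the schema
      .withColumn("status", "string", values=["active", "inactive"])
      .build())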

Column Value Generation

  • Each column will be generated based on its base column value, or on a random value in the range 0 to the number of rows. By default, the base column is the id column; this may be overridden.

  • If a column has a limited number of values, modulo arithmetic is applied to the base value to compute the value. For array-based feature columns, the column spec determines the generation of each value.

  • The column spec may specify an explicit SQL expression, which will override default column data generation. The column spec may also specify minValue, maxValue and step values to control value cycling. See the sketch following this list.

  • The column spec may specify a list of values to restrict the column to specific discrete values.

  • String values are generated by combining the prefix, the base value and the suffix, separated by underscores.

  • Use of omit=True omits the column from the final output.

  • Columns do not generate nulls unless the percentNulls option is specified.
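
A sketch combining several of these options (names are illustrative; percentNulls is assumed here to take a fraction between 0 and 1):

import dbldatagen as dg

df = (dg.DataGenerator(spark, name="options_example", rows=1000)
      # cycle values between 0.0 and 1.0 in steps of 0.1
      .withColumn("reading", "decimal(5,2)", minValue=0.0, maxValue=1.0,
                  step=0.1)
      # restrict the column to discrete values
      .withColumn("state", "string", values=["open", "closed", "pending"])
      # an explicit SQL expression overrides default generation
      .withColumn("reading_pct", "decimal(7,2)", expr="reading * 100",
                  baseColumn="reading")
      # roughly 5% of values will be null
      .withColumn("optional_tag", "string", prefix="tag", percentNulls=0.05)
      .build())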

Revisiting the IOT data example

Let’s look at the IOT data example again.

We want to simulate events from devices such as phones.

Each phone has a manufacturer and model / line.

We have a distinct set of users and multiple IOT events for each user, but for every user we want to generate the same manufacturer and phone model or line. (We could do this via a join, but that's a separate topic.)

Here we want to generate a random set of events but ensure that the device properties remain the same for the device from event to event.

from pyspark.sql.types import LongType, IntegerType, StringType

import dbldatagen as dg

shuffle_partitions_requested = 8
device_population = 100000
data_rows = 20 * 1000000
partitions_requested = 20

spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
                 'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
                   17]

manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

testDataSpec = (
    dg.DataGenerator(spark, name="device_data_set", rows=data_rows,
                     partitions=partitions_requested,
                     randomSeedMethod='hash_fieldname')
    .withIdOutput()
    # we'll use hash of the base field to generate the ids to
    # avoid a simple incrementing sequence
    .withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
                uniqueValues=device_population, omit=True, baseColumnType="hash")

    # the format string is applied to the underlying long value of the
    # base column, producing a zero-padded 13-digit hex device id
    .withColumn("device_id", StringType(), format="0x%013x",
                baseColumn="internal_device_id")

    # the device / user attributes will be the same for the same device id
    # so let's use the internal device id as the base column for these attributes
    .withColumn("country", StringType(), values=country_codes,
                weights=country_weights,
                baseColumn="internal_device_id")
    .withColumn("manufacturer", StringType(), values=manufacturers,
                baseColumn="internal_device_id")

    # use omit = True if you don't want a column to appear in the final output
    # but just want to use it as part of generation of another column
    .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                baseColumnType="hash", omit=True)
    .withColumn("model_ser", IntegerType(), minValue=1, maxValue=11,
                baseColumn="device_id",
                baseColumnType="hash", omit=True)

    .withColumn("model_line", StringType(), expr="concat(line, '#', model_ser)",
                baseColumn=["line", "model_ser"])
    .withColumn("event_type", StringType(),
                values=["activation", "deactivation", "plan change",
                        "telecoms activity", "internet activity", "device error"],
                random=True)
    .withColumn("event_ts", "timestamp",
                begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
                interval="1 minute",
                random=True)

    )

dfTestData = testDataSpec.build()

display(dfTestData)

  • The withColumn method call for the internal_device_id column uses the uniqueValues option to control the number of unique values.

  • The withColumn method call for the manufacturer column uses the baseColumn option to ensure we get the same manufacturer value for each internal_device_id. This allows us to generate IOT-style events randomly while still constraining device properties whenever the same internal_device_id occurs.

Note

A column may be based on one or more other columns. The values of the base columns will be used as a seed for generating the new column. The baseColumnType option determines whether the actual value or a hash of the value is used as the seed value.

  • The withColumn method call for the line column introduces a temporary column used to generate other columns but, through the use of the omit option, excludes it from the final data set.

  • To ensure deterministic matching of values such as country, we base them on the internal device id.