Generating Column Data

The data generation specification object controls how the data is to be generated. This includes:

The number of rows to be generated
Whether columns are generated from an existing schema
Whether implicit columns used in the generation process are output in the final data set
Whether the generated data set will be a streaming or batch data set
How the column data should be generated and what the dependencies for each column are
How random and psuedo-random data is generated
The structure for structured columns and JSON valued columns

Initializing a data generation spec

The DataGenerator object instance defines a specification for how data is to be generated. It consists of a set of column generation specifications in addition to various instance level attributes. These control the data generation process.

The data generation process itself is deferred until the data generation instance build method is executed.

So until the build method is invoked, the data generation specification is in initialization mode.

Once build has been invoked, the data generation instance holds state about the data set generated.

While build can be invoked a subsequent time, making further modifications to the definition post build before calling build again is not recommended. We recommend the use of the clone method to make a new data generation specification similar to an existing one if further modifications are needed.

See the method clone for further information.

Adding columns to a data generation spec

During creation of the data generator object instance (i.e before build is called) each invocation of withColumnSpec, withColumnSpecs or withColumn adds or modifies a column generation specification.

When building the data generation spec, the withSchema method may be used to add columns from an existing schema. This does _not_ prevent the use of withColumn to add new columns.

Use withColumn to define a new column. This method takes a parameter to specify the data type.

See the method withColumn for further details.

Use withColumnSpec to define how a column previously defined in a schema should be generated. This method does not take a data type property, but uses the data type information defined in the schema. See the method withColumnSpec for further details.

Use withColumnSpecs to define how multiple columns imported from a schema should be generated. As the pattern matching may inadvertently match an unintended column, it is permitted to override the specification added through this method by a subsequent call to withColumnSpec to change the definition of how a specific column should be generated.

See the method withColumnSpecs for further details.

Use the method withStructColumn for simpler creation of struct and JSON valued columns.

By default all columns are marked as being dependent on an internal id seed column. Use the baseColumn attribute to mark a column as being dependent on another column or set of columns. Use of the base column attribute has several effects:

The order in which columns are built will be changed
The value of the base column (or the hash of the values, depending on the setting for baseColumnType) will be used to control the generation of data for the new column. The exception to this is when the random attribute is set - in this case, the root value will be a random number.

Note

Using the same column names for new columns as existing columns may conflict with existing columns. This may cause errors during data generation.

If you need to generate a field with the same name as the seed column (by default id), you may override the default seed column name in the constructor of the data generation spec through the use of the seedColumnName parameter.

Note that Spark SQL is case insensitive with respect to column names. So a column name that differs only in case to an existing column may cause issues.

Generating complex columns - structs, maps, arrays

Complex column types are supported - that is a column may have its type specified as an array, map or struct. This can be specified in the datatype parameter to the withColumn method as a string such as “array<string>” or as a composite of datatype object instances.

If the column type is based on a struct, map or array, then either the expr or the values attributes must be specified to provide a value or range of possible values for the column.

If the values attribute is being used to specify a range of possible values, each of the values elements must be of the same type as the column.

If neither the expr or values attributes are specified, then the default column value will be NULL.

For array valued columns, where all of the elements of the array are to be generated with the same column specification, an alternative method is also supported.

You can specify that a column has a specific number of features with structType of ‘array’ to control the generation of the column. In this case, the datatype should be the type of the individual element, not of the array.

For example, the following code will generate rows with varying numbers of synthetic emails for each customer:

import dbldatagen as dg

ds = (
     dg.DataGenerator(sparkSession=spark, name="test_dataset1", rows=1000, partitions=4,
                      random=True)
     .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
     .withColumn("emails", "string", template=r'\\w.\\w@\\w.com', random=True,
                 numFeatures=(1, 6), structType="array")
)

df = ds.build()

The helper method withStructColumn of the DataGenerator class enables simpler definition of structured and JSON valued columns.

See the documentation for the method withStructColumn for further details.

The mechanics of column data generation

The data set is generated when the build method is invoked on the data generation instance.

This performs the following actions:

A pseudo build plan will be computed for debugging purposes
The set of columns is reordered to control the order in which column data is generated. The ordering is based on the baseColumn attribute of individual column generation spec.
Cumulative density functions will be computed for columns where weighted values are specified
The data set will be computed as a Spark data frame for the data in the order of the computed column ordering
Percent nulls transformations will be applied to columns where the percentNulls attribute was specified
The final set of output fields will be selected (omitting any columns where the omit attribute was set to True)

Note

Normally the columns will be built in the order specified in the spec. Use of the baseColumn attribute may change the column build ordering.

This has several implications:

If a column is referred to in an expression, the baseColumn attribute may need to be defined with a dependency on that column
If a column uses a base column with a restricted range of values then it is possible that the column will not generate the full range of values in the column generation spec
If the base column is of type boolean or some other restricted range type, computations on that base value may not produce the expected range of values
If base column is not specified, you may see errors reporting that the column in an expression does not exist. This may be fixed by specifying a column dependency using the baseColumn attribute

Note

The implementation performs primitive scanning of SQL expressions (specified using the expr attribute) to determine if the sql expression depends on earlier columns and if so, will put the building of the column in a separate phase.

However it does not reorder the building sequence if there is a reference to a column that will be built later in the SQL expression. To enforce the dependency, you must use the baseColumn attribute to indicate the dependency.