Troubleshooting

Tools and aids to troubleshooting

Use of the DataGenerator explain method

To aid in debugging data generation issues, you may use the explain method of the DataGenerator class to produce a synopsis of how the data will be generated.

If run after the build method has been invoked, the output will also include an execution history explaining how the data was generated.
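For example, a minimal sketch (assuming a dataspec variable holding a DataGenerator instance, defined as in the examples below):

dataspec.explain()    # synopsis of how the data will be generated

df = dataspec.build()

dataspec.explain()    # now also includes the execution history of the build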

You may also configure the data generator to produce more verbose output when building the dataspec and the resulting data set.

To do this, set the verbose option to True when creating the dataspec.

Additionally, setting the debug option to True will produce further debug-level output. The verbose and debug options correspond to the info and debug levels of the internal logging messages, respectively.

For example:

import dbldatagen as dg
import pyspark.sql.functions as F

data_rows = 10 * 1000 * 1000

uniqueCustomers = 10 * 1000000

dataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=4, verbose=True)
               .withColumn("customer_id","long", uniqueValues=uniqueCustomers)
               .withColumn("city", "string", template=r"\w")
               .withColumn("name", "string", template=r"\w \w|\w \w \w")
               .withColumn("email", "string", template=r"\w@\w.com|\w@\w.org|\w.\w@\w.com")
               )
df = dataspec.build()

display(df)
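To also produce debug-level output, the debug option can be set in the same way; a minimal sketch, assuming debug is accepted alongside verbose when constructing the DataGenerator as described above:

dataspec = dg.DataGenerator(spark, rows=data_rows, partitions=4,
                            verbose=True, debug=True)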

Operational message logging

By default, the data generation process produces error and warning messages via the Python logging module.

In addition, during execution of the build method, the data generation process records messages describing how the data is being, or was, generated to an internal explain log.

The explain method displays the contents of the explain log from the last build invocation. If build has not yet been run, it displays the explain log messages from the build planning process.

Regular logging messages are generated using the standard logger.

You can display additional logging messages by specifying the verbose option during creation of the DataGenerator instance.
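Because these messages are emitted through the standard Python logging module, their visibility also depends on the active logging configuration. The following is a minimal sketch of lowering the root logger threshold so that info-level and debug-level messages become visible; it uses only the standard logging module and makes no assumption about the logger names used internally:

import logging

# make info- and debug-level messages visible on the root logger
logging.getLogger().setLevel(logging.DEBUG)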

Note

Build planning performs pre-build tasks such as computing the order in which columns need to be generated. Build planning messages are available via the explain method.

Examining log outputs

Logging outputs will be displayed automatically when using the data generator in a Databricks notebook environment.

Common issues and resolution

Attempting to add a column named id

By default, the data generator reserves the column named id to act as the seed column from which other columns in the data generation spec are derived. However, you may need to use the name id for a specific column in the generated data whose behavior differs from that of the default seed column.

In this case, you can give the seed column an alternative name via the seedColumnName parameter when constructing the DataGenerator instance.

The following code shows its use:

import dbldatagen as dg
import pyspark.sql.functions as F

data_rows = 10 * 1000 * 1000

uniqueCustomers = 10 * 1000000

dataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=4, seedColumnName='_id')
               .withColumn("id","long", uniqueValues=uniqueCustomers)
               .withColumn("city", "string", template=r"\w")
               .withColumn("name", "string", template=r"\w \w|\w \w \w")
               .withColumn("email", "string", template=r"\w@\w.com|\w@\w.org|\w.\w@\w.com")
               )
df = dataspec.build()

display(df)

Attempting to compute a column before dependent columns are computed

By default, the value for a column is computed from some transformation of the seed column (named id by default). You can use other columns as the seed for a given column via the baseColumn attribute, which takes either the name of a column as a string or, if the column depends on multiple columns, a Python list of column names.

Use of the expr attribute (which allows for the use of arbitrary SQL expressions) can also create dependencies on other columns.

If a column references other columns in the body of the expression specified in the expr attribute, you must ensure that those columns are computed first. Use the baseColumn attribute to enforce this ordering; it may specify either a string naming the column on which the current column depends or a list of such column names as strings.

For example, the following code has dependencies in some of the expr SQL expressions on earlier columns. In these cases, we use the baseColumn attribute to ensure the correct column build order.

import dbldatagen as dg
from pyspark.sql.types import StringType


country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU', 'SG',
                 'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126, 109, 58, 8,
                   17]

device_population = 100000

manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']

lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

testDataSpec = (
    dg.DataGenerator(spark, name="device_data_set", rows=1000000,
                     partitions=8,
                     randomSeedMethod='hash_fieldname')
    # we'll use hash of the base field to generate the ids to
    # avoid a simple incrementing sequence
    .withColumn("internal_device_id", "long", minValue=0x1000000000000,
                uniqueValues=device_population, omit=True, baseColumnType="hash")

    # note for format strings, we must use "%lx" not "%x" as the
    # underlying value is a long
    .withColumn("device_id", "string", format="0x%013x",
                baseColumn="internal_device_id")

    # the device / user attributes will be the same for the same device id,
    # so let's use the internal device id as the base column for these attributes
    .withColumn("country", "string", values=country_codes,
                weights=country_weights,
                baseColumn="internal_device_id")

    .withColumn("manufacturer", "string", values=manufacturers,
                baseColumn="internal_device_id", omit=True)

    .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                baseColumnType="hash", omit=True)

    # note use of baseColumn to control column build ordering
    .withColumn("manufacturer_info", "string",
                 expr="to_json(named_struct('line', line, 'manufacturer', manufacturer))",
                baseColumn=["line", "manufacturer"]
               )

    .withColumn("event_type", "string",
                values=["activation", "deactivation", "plan change",
                        "telecoms activity", "internet activity", "device error"],
                random=True, omit=True)

    .withColumn("event_ts", "timestamp",
                begin="2020-01-01 01:00:00",
                end="2020-12-31 23:59:00",
                interval="1 minute",
                random=True,
                omit=True)

    # note use of baseColumn to control column build ordering
    .withColumn("event_info", "string",
                 expr="to_json(named_struct('event_type', event_type, 'event_ts', event_ts))",
                 baseColumn=["event_type", "event_ts"])
    )

dfTestData = testDataSpec.build()

display(dfTestData)