.. Databricks Labs Data Generator documentation master file, created by
   sphinx-quickstart on Sun Jun 21 10:54:30 2020.

Troubleshooting
===============

Tools and aids to troubleshooting
---------------------------------

Use of the data generator `explain` method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To aid in debugging data generation issues, you may use the `explain` method of the data generator class
to produce a synopsis of how the data will be generated. If run after the `build` method has been invoked,
the output will also include an execution history explaining how the data was generated.

See:

* :data:`~dbldatagen.data_generator.DataGenerator.explain`

You may also configure the data generator to produce more verbose output when building the dataspec and the
resulting data set. To do this, set the ``verbose`` option to ``True`` when creating the dataspec.
Additionally, setting the ``debug`` option to ``True`` will produce additional debug-level output.
These correspond to the ``info`` and ``debug`` log levels of the internal messages.

For example:

.. code-block:: python

    import dbldatagen as dg
    import pyspark.sql.functions as F

    data_rows = 10 * 1000 * 1000
    uniqueCustomers = 10 * 1000000

    dataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=4, verbose=True)
                .withColumn("customer_id", "long", uniqueValues=uniqueCustomers)
                .withColumn("city", "string", template=r"\w")
                .withColumn("name", "string", template=r"\w \w|\w \w \w")
                .withColumn("email", "string", template=r"\w@\w.com|\w@\w.org|\w.\w@\w.com")
                )
    df = dataspec.build()

    display(df)

    df1 = dataspec.build()

See:

* :data:`~dbldatagen.data_generator.DataGenerator`

Operational message logging
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. sidebar:: Logging

    Warning, error and info messages are available via standard logging capabilities.

By default, the data generation process produces error and warning messages via the Python `logging` module.

In addition, during execution of the `build` method, the data generation process records messages describing
how the data is being, or was, generated to an internal explain log. The `explain` method essentially displays
the contents of this explain log from the last `build` invocation. If `build` has not yet been run, it displays
the explain log messages produced by the build planning process.

Regular logging messages are generated using the standard logger. You can display additional logging messages
by specifying the `verbose` option when creating the `DataGenerator` instance.

.. note::

    Build planning performs pre-build tasks such as computing the order in which columns need to be generated.
    Build planning messages are available via the `explain` method.

Examining log outputs
^^^^^^^^^^^^^^^^^^^^^

Logging outputs will be displayed automatically when using the data generator in a Databricks notebook
environment.
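When running outside a notebook environment (for example, in unit tests or a local PySpark session), the same
messages can be surfaced through the standard Python `logging` module, and the explain log can be reviewed
explicitly. The following is a minimal sketch that reuses the ``dataspec`` from the example above; the
``logging.basicConfig`` call is only one way to raise the log level and may need adjusting for your environment.

.. code-block:: python

    import logging

    # raise the root logger level so that info-level messages emitted while
    # building the dataspec and the data set are visible on the console
    logging.basicConfig(level=logging.INFO)

    df = dataspec.build()

    # print a synopsis of how the data was generated; run after `build`,
    # this also includes the execution history
    dataspec.explain()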
Common issues and resolution
----------------------------

Attempting to add a column named `id`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. sidebar:: Customizing the seed column

    Use the `seedColumnName` attribute when creating the `DataGenerator` instance to customize the seed
    column name.

By default, the data generator reserves the column named `id` to act as the seed column from which the other
columns in the data generation spec are derived. However, you may need to use the name `id` for a specific
column definition in the generated data, with semantics that differ from those of the default seed column.

In this case, you may give the seed column an alternative name via the `seedColumnName` parameter when
constructing the `DataGenerator` instance.

The following code shows its use:

.. code-block:: python

    import dbldatagen as dg
    import pyspark.sql.functions as F

    data_rows = 10 * 1000 * 1000
    uniqueCustomers = 10 * 1000000

    dataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=4, seedColumnName='_id')
                .withColumn("id", "long", uniqueValues=uniqueCustomers)
                .withColumn("city", "string", template=r"\w")
                .withColumn("name", "string", template=r"\w \w|\w \w \w")
                .withColumn("email", "string", template=r"\w@\w.com|\w@\w.org|\w.\w@\w.com")
                )
    df = dataspec.build()

    display(df)

Attempting to compute column before dependent columns are computed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the value for a column is computed based on some transformation of the seed column (named `id` by
default). You can base a given column on other columns via the `baseColumn` attribute, which takes either the
name of a column as a string, or a Python list of column names if the column depends on multiple columns.

.. sidebar:: Column build ordering

    Column build order is optimized for best performance during data generation. To ensure columns are
    computed in the correct order, use the `baseColumn` attribute.

Use of the `expr` attribute (which allows for the use of arbitrary SQL expressions) can also create dependencies
on other columns. If a column refers to other columns in the body of the expression specified in the `expr`
attribute, it is necessary to ensure that the columns the expression depends on are computed first.

Use the `baseColumn` attribute to ensure that dependent columns are computed first. The `baseColumn` attribute
may specify either a string naming the single column on which the current column depends, or a list of strings
naming several such columns.
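As a minimal sketch of this pattern, the following spec builds a ``full_name`` column from two other generated
columns. The column names are illustrative only; because the SQL expression references ``first_name`` and
``last_name``, both are listed in `baseColumn` so that they are generated first.

.. code-block:: python

    import dbldatagen as dg

    namesSpec = (dg.DataGenerator(spark, rows=1000, partitions=1)
                 .withColumn("first_name", "string", template=r"\w")
                 .withColumn("last_name", "string", template=r"\w")
                 # full_name references both columns in its SQL expression,
                 # so baseColumn lists them to force them to be built first
                 .withColumn("full_name", "string",
                             expr="concat(first_name, ' ', last_name)",
                             baseColumn=["first_name", "last_name"])
                 )

    df = namesSpec.build()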
For example, the following code has dependencies in some of the `expr` SQL expressions on earlier columns.
In these cases, we use the `baseColumn` attribute to ensure the correct column build order.

.. code-block:: python

    import dbldatagen as dg
    from pyspark.sql.types import StringType

    country_codes = ['CN', 'US', 'FR', 'CA', 'IN', 'JM', 'IE', 'PK', 'GB', 'IL', 'AU',
                     'SG', 'ES', 'GE', 'MX', 'ET', 'SA', 'LB', 'NL']
    country_weights = [1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 126,
                       109, 58, 8, 17]

    device_population = 100000

    manufacturers = ['Delta corp', 'Xyzzy Inc.', 'Lakehouse Ltd', 'Acme Corp', 'Embanks Devices']
    lines = ['delta', 'xyzzy', 'lakehouse', 'gadget', 'droid']

    testDataSpec = (
        dg.DataGenerator(spark, name="device_data_set", rows=1000000, partitions=8,
                         randomSeedMethod='hash_fieldname')
        # we'll use a hash of the base field to generate the ids to
        # avoid a simple incrementing sequence
        .withColumn("internal_device_id", "long", minValue=0x1000000000000,
                    uniqueValues=device_population, omit=True, baseColumnType="hash")
        # note for format strings, we must use "%lx" not "%x" as the
        # underlying value is a long
        .withColumn("device_id", "string", format="0x%013x",
                    baseColumn="internal_device_id")
        # the device / user attributes will be the same for the same device id
        # so let's use the internal device id as the base column for these attributes
        .withColumn("country", "string", values=country_codes, weights=country_weights,
                    baseColumn="internal_device_id")
        .withColumn("manufacturer", "string", values=manufacturers,
                    baseColumn="internal_device_id", omit=True)
        .withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
                    baseColumnType="hash", omit=True)
        # note use of baseColumn to control column build ordering
        .withColumn("manufacturer_info", "string",
                    expr="to_json(named_struct('line', line, 'manufacturer', manufacturer))",
                    baseColumn=["line", "manufacturer"])
        .withColumn("event_type", "string",
                    values=["activation", "deactivation", "plan change",
                            "telecoms activity", "internet activity", "device error"],
                    random=True, omit=True)
        .withColumn("event_ts", "timestamp",
                    begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00",
                    interval="1 minute", random=True, omit=True)
        # note use of baseColumn to control column build ordering
        .withColumn("event_info", "string",
                    expr="to_json(named_struct('event_type', event_type, 'event_ts', event_ts))",
                    baseColumn=["event_type", "event_ts"])
        )

    dfTestData = testDataSpec.build()

    display(dfTestData)
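In this example, the columns marked with ``omit=True`` (such as `line`, `manufacturer`, `event_type` and
`event_ts`) act as intermediate values: they are consumed by the JSON-producing `expr` columns but are excluded
from the final output. Listing them in the `baseColumn` attribute of the columns that reference them guarantees
they are generated first. If the build order is in doubt, the `explain` method described earlier can be used to
review how the spec will be, or was, built:

.. code-block:: python

    # after `build` has run, the explain output also includes the execution history
    testDataSpec.explain()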