.. Databricks Labs Data Generator documentation master file, created by
   sphinx-quickstart on Sun Jun 21 10:54:30 2020.

Options and Additional Features
===============================

Options for column specification
--------------------------------

Data generation for each column begins with generating a base expression for the column, which will be
a random value if `random` is True, or some transformation of the `baseColumns` if not. The options are
then applied - each option successively modifying the generated value. Where possible, the effects of an
option are applied over the effects of other options so that the effect is cumulative. Finally, type
conversion is applied.

.. note:: Options such as `minValue`, `maxValue` and `step` can also be applied to strings if their
          underlying generation is from a numeric root or `baseColumn`. The default base column `id`
          is of type long.

The following table lists some of the common options that can be applied with the ``withColumn`` and
``withColumnSpec`` methods.

================ =========================================================================================
Parameter        Usage
================ =========================================================================================
minValue         Minimum value for range of generated value. Alternatively, use ``dataRange``
maxValue         Maximum value for range of generated value. Alternatively, use ``dataRange``
step             Step to use for range of generated value. Alternatively, use ``dataRange``
prefix           Prefix text to apply to expression
random           If `True`, will generate random values for the column value. Defaults to `False`
randomSeedMethod Determines how the random seed will be used. See `Generating random values`.
                 If set to the value 'fixed', a fixed random seed will be used. If set to
                 'hash_fieldname', a hash of the field name will be used as the random seed for the
                 specific column.
randomSeed       Random seed for the generation of random numbers. See `Generating random values`.
                 If not set, behavior depends on the column `randomSeedMethod` setting and the
                 top-level data generator `randomSeed` and `randomSeedMethod` settings.
                 If `randomSeedMethod` is 'hash_fieldname' for this column or for the data
                 specification as a whole, the random seed for each column is a hash value based on
                 the field name. This is the default unless these settings are overridden.
                 If `randomSeed` has a value of -1, the random value will be drawn from the uniform
                 distribution and the data generated will not be the same from run to run.
distribution     Controls the statistical distribution of random values when the column is generated
                 randomly. Accepts the value "normal" or a `Distribution` object instance
baseColumn       Either the string name of the base column, or a list of columns to use to control
                 data generation
values           List of discrete values for the column. Discrete values can be numeric, dates,
                 timestamps, strings etc.
weights          List of discrete weights for the column. Controls spread of values
percentNulls     Percentage of nulls to generate for the column, expressed as a fraction between
                 0.0 and 1.0
uniqueValues     Number of distinct unique values for the column. Use as an alternative to `dataRange`
begin            Beginning of range for date and timestamp fields
end              End of range for date and timestamp fields
interval         Interval of range for date and timestamp fields
dataRange        An instance of an `NRange` or `DateRange` object. This can be used in place of
                 ``minValue``, ``maxValue`` and ``step``
template         Template controlling text generation
omit             If True, omit the column from the final output. Use when a column is only needed to
                 compute other columns
expr             SQL expression to control data generation
numColumns       Number of columns when generating multiple columns with the same specification
numFeatures      Synonym for `numColumns`
structType       If set to `array`, generates an array value from multiple columns
================ =========================================================================================

.. note:: If the `dataRange` parameter is specified as well as `minValue`, `maxValue` or `step`, the
          results are undetermined. For more information, see :data:`~dbldatagen.daterange.DateRange`
          or :data:`~dbldatagen.daterange.NRange`.
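As a brief illustration of some of these options, the following minimal sketch combines ranged, weighted,
templated and nullable columns in a single specification. The column names, value ranges and weights here
are purely illustrative and not part of the library API:

.. code-block:: python

    import dbldatagen as dg

    exampleSpec = (
        dg.DataGenerator(sparkSession=spark, name="option_examples", rows=1000, partitions=4)
        # numeric range controlled by minValue / maxValue / step
        .withColumn("reading", "float", minValue=0.0, maxValue=100.0, step=0.5, random=True)
        # discrete values with weights controlling their relative frequency
        .withColumn("status", "string", values=['active', 'inactive', 'suspended'],
                    weights=[8, 1, 1], random=True)
        # roughly 5% of the generated values will be null
        .withColumn("nickname", "string", template=r'\\w', percentNulls=0.05)
    )

    dfExample = exampleSpec.build()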
Generating multiple columns with same generation spec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You may generate multiple columns with the same column generation spec by specifying `numFeatures` or
`numColumns` with an integer value to generate a specific number of columns.

The generated columns will be suffixed with a number representing the column - for example "email_0",
"email_1" etc.

If you specify the attribute ``structType="array"``, the multiple columns will be combined into a single
array-valued column.

Generating random values
^^^^^^^^^^^^^^^^^^^^^^^^

By default, each column's data is generated by applying various transformations to a root value for the
column. The root value is generated from the base column(s) when the `random` attribute is not true.
The base column value is used directly or indirectly depending on the value of `baseColumnMethod`.

If the attribute `random` is True, the root column value is generated from a random base column value.

For random columns, the `randomSeedMethod` and `randomSeed` attributes determine how the random root
value is generated.

When the `randomSeedMethod` attribute value is `fixed`, the root value will be generated using a random
number generator with the designated `randomSeed`, unless the `randomSeed` value is -1.

When the `randomSeed` value is -1, the values will be generated without a fixed random seed, so data
will be different from run to run.

If the `randomSeedMethod` value is `hash_fieldname`, the random seed for each column is computed using
a hash function over the field name.

This guarantees that data generation is repeatable unless the `randomSeed` attribute has a value of -1
and the `randomSeedMethod` value is `fixed`.

The following example illustrates some of these features.

.. code-block:: python

    import dbldatagen as dg

    ds = (
        dg.DataGenerator(sparkSession=spark, name="test_dataset1", rows=1000, partitions=4,
                         random=True)
        .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
        .withColumn("emails", "string", template=r'\\w.\\w@\\w.com', random=True,
                    numFeatures=(2, 6), structType="array")
    )

    df = ds.build()

The use of `random=True` at the DataGenerator instance level applies `random=True` to all columns.

The combination of `numFeatures=(2, 6)` and `structType='array'` will generate array values with a
varying number of elements according to the underlying value generation rules - in this case, the use
of a template to generate text.

By default, random number seeds are derived from field names, and in the case of columns with multiple
features, the seed will be different for each feature element.

Using custom SQL to control data generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `expr` attribute can be used to specify an arbitrary Spark SQL expression to control how the data
is generated for a column.

If the body of the SQL expression references other columns, you will need to ensure that those columns
are created first. By default, the columns are created in the order specified. However, you can control
the order of column creation using the `baseColumn` attribute.
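The following short sketch illustrates this; the column names and pricing formula are illustrative only.
The derived column lists the columns it depends on in `baseColumn`, so they are created before the SQL
expression is evaluated:

.. code-block:: python

    import dbldatagen as dg

    pricingSpec = (
        dg.DataGenerator(sparkSession=spark, name="pricing_example", rows=1000, partitions=4)
        .withColumn("base_price", "float", minValue=10.0, maxValue=1000.0, random=True)
        .withColumn("discount_pct", "int", minValue=0, maxValue=50, random=True)
        # the SQL expression references the two columns above, so they are listed in `baseColumn`
        # to ensure they are generated first
        .withColumn("sale_price", "float", expr="base_price * (1 - discount_pct / 100.0)",
                    baseColumn=["base_price", "discount_pct"])
    )

    dfPricing = pricingSpec.build()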
More details
^^^^^^^^^^^^

The full set of options for column specification which may be used with the ``withColumn``,
``withColumnSpec`` and ``withColumnSpecs`` methods can be found at:

* :data:`~dbldatagen.column_spec_options.ColumnSpecOptions`

Example
^^^^^^^

The following example shows use of these options to generate user records, each having a variable set
of randomly generated emails.

.. code-block:: python

    import logging

    import dbldatagen as dg
    from pyspark.sql.types import ArrayType, StringType

    dataspec = dg.DataGenerator(spark, rows=10 * 1000000)
    logging.info(dataspec.partitions)

    dataspec = (
        dataspec
        .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
        .withColumn("serial_number", "string", minValue=1000000, maxValue=10000000,
                    prefix="dr", random=True)
        # generate a fixed-length array of email addresses, omitted from the final output
        .withColumn("email", "string", template=r'\\w.\\w@\\w.com', omit=True,
                    numColumns=5, structType="array", random=True, randomSeed=-1)
        # number of emails to retain for each record
        .withColumn("emailCount", "int", expr="(abs(hash(id)) % 4) + 1")
        # take a variable-length slice of the generated email addresses
        .withColumn("emails", ArrayType(StringType()), expr="slice(email, 1, emailCount)",
                    baseColumn=["email", "emailCount"])
        .withColumn("license_plate", "string", template=r'\\n-\\n')
    )
    dfTestData = dataspec.build()

    display(dfTestData)

Generating views automatically
------------------------------

Views can be automatically generated when the data set is generated. The view name will use the
``name`` argument specified when creating the data generator instance.

See the following links for more details:

* :data:`~dbldatagen.data_generator.DataGenerator.build`

Generating streaming data
-------------------------

By default, the data generator produces data suitable for use in batch data frame processing.

The following code sample illustrates generating a streaming data frame:

.. code-block:: python

    import os
    import time

    from pyspark.sql.types import IntegerType, StringType

    import dbldatagen as dg

    # various parameter values
    row_count = 100000
    time_to_run = 15
    rows_per_second = 5000

    time_now = int(round(time.time() * 1000))
    base_dir = "/tmp/datagenerator_{}".format(time_now)
    test_dir = os.path.join(base_dir, "data")
    checkpoint_dir = os.path.join(base_dir, "checkpoint")

    # build our data spec
    dataSpec = (dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=row_count,
                                 partitions=4, randomSeedMethod='hash_fieldname')
                .withIdOutput()
                .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
                .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
                .withColumn("code3", StringType(), values=['a', 'b', 'c'])
                .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True)
                .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True,
                            weights=[9, 1, 1])
                )

    # generate the data using a streaming data frame
    dfData = dataSpec.build(withStreaming=True, options={'rowsPerSecond': rows_per_second})

    (dfData
     .writeStream
     .format("delta")
     .outputMode("append")
     .option("path", test_dir)
     .option("checkpointLocation", checkpoint_dir)
     .start())

    start_time = time.time()

    # let the stream run for the specified duration
    time.sleep(time_to_run)

    # note stopping the stream may produce exceptions - these can be ignored
    for x in spark.streams.active:
        x.stop()

    end_time = time.time()
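Once the streams have been stopped, the data written by the streaming query can be read back for a
quick check. The following sketch assumes the `test_dir` and `base_dir` paths defined in the snippet
above and a Databricks environment for the `dbutils` cleanup call:

.. code-block:: python

    # read back the data written by the streaming query for a quick sanity check
    dfWritten = spark.read.format("delta").load(test_dir)
    print("rows written:", dfWritten.count())

    # optionally clean up the temporary output and checkpoint directories
    dbutils.fs.rm(base_dir, True)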