dbldatagen.data_generator module

This module defines the DataGenError and DataGenerator classes

class DataGenerator(sparkSession=None, name=None, randomSeedMethod=None, rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False, batchSize=None, debug=False, seedColumnName='id', random=False, **kwargs)[source]

Bases: object

Main class for test data set generation

This class acts as the entry point to all test data generation activities.

Parameters:
  • sparkSession – Spark session object to use

  • name – name of the data set

  • randomSeedMethod – seed method for random numbers - either None, 'fixed', or 'hash_fieldname'

  • rows – number of rows to generate

  • startingId – starting value for the generated seed column

  • randomSeed – seed for the random number generator

  • partitions – number of partitions to generate; if not provided, uses spark.sparkContext.defaultParallelism

  • verbose – if True, generate verbose output

  • batchSize – number of rows per batch to pass via Apache Arrow to Pandas UDFs

  • debug – if True, output debug-level information

  • seedColumnName – if set, the name of the seed or logical id column. Defaults to id

  • random – if set, specifies the default value of the random attribute for all columns where it is not explicitly set

By default the seed column is named id. If you need to use this column name in your generated data, it is recommended that you use a different name for the seed column - for example _id.

This may be specified by setting the seedColumnName attribute to _id

Note: in a shared spark session, the sparkContext is not available, so the default parallelism is set to 200. We recommend passing an explicit value for partitions in this case.
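For example, a small specification might be constructed and built as follows. This is a minimal sketch - the column names, types and value ranges are purely illustrative:

    import dbldatagen as dg
    from pyspark.sql.types import IntegerType, StringType

    # illustrative specification - adjust rows, partitions and columns as needed
    testDataSpec = (
        dg.DataGenerator(sparkSession=spark, name="test_data", rows=100000,
                         partitions=8, randomSeedMethod="hash_fieldname")
        .withIdOutput()
        .withColumn("code", IntegerType(), minValue=1, maxValue=20)
        .withColumn("site", StringType(), prefix="site", random=True)
    )

    dfTestData = testDataSpec.build()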

build(withTempView=False, withView=False, withStreaming=False, options=None)[source]

build the test data set from the column definitions and return a dataframe for it

if withStreaming is True, generates a streaming data set. Use options to control the rate of generation of test data if streaming is used.

For example:

dfTestData = testDataSpec.build(withStreaming=True, options={'rowsPerSecond': 5000})

Parameters:
  • withTempView – if True, automatically creates temporary view for generated data set

  • withView – If True, automatically creates global view for data set

  • withStreaming – If True, generates data using Spark Structured Streaming Rate source suitable for writing with writeStream

  • options – optional Dict of options to control generating of streaming data

Returns:

Spark SQL dataframe of generated test data
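For example, to build a batch data set and also register a temporary view over it (testDataSpec is assumed to be a previously defined specification):

    # batch build, also registering a temporary view for the generated data
    dfTestData = testDataSpec.build(withTempView=True)
    dfTestData.show(5)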

property build_order

return the build order minus the seed column (which defaults to id)

The build order will be a list of lists - each list specifying columns that can be built at the same time

clone()[source]

Make a clone of the data spec via deep copy preserving same spark session

Returns:

deep copy of test data generator definition
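For example, a cloned specification can be adjusted without affecting the original (testDataSpec is assumed to be an existing DataGenerator instance):

    # derive a larger data set from an existing spec, leaving the original untouched
    largerDataSpec = testDataSpec.clone().withRowCount(10000000)
    dfLarger = largerDataSpec.build()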

computeBuildPlan()[source]

prepare for building by computing a pseudo build plan

The build plan is not a true build plan - it is only used for debugging purposes, but does not actually drive the column generation order.

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

describe()[source]

return description of the dataset generation spec

Returns:

Dict object containing key attributes of test data generator instance

explain(suppressOutput=False)[source]

Explain the test data generation process

Parameters:

suppressOutput – If True, suppress display of build plan

Returns:

String containing explanation of test data generation for this specification

static flatten(lst)[source]

flatten list

Parameters:

lst – list to flatten

classmethod generateName()[source]

get a name for the data set

Uses the untitled name prefix and nextNameIndex to generate a dummy dataset name

Returns:

string containing generated name

getColumnSpec(name)[source]

get column spec for column having name supplied

Parameters:

name – name of column to find spec for

Returns:

column spec for named column if any

getColumnType(colName)[source]

Get column Spark SQL datatype for specified column

Parameters:

colName – name of column as string

Returns:

Spark SQL datatype for named column

getInferredColumnNames()[source]

get list of output columns

getOutputColumnNames()[source]

get the list of output columns by flattening the list of lists of column names. Normal columns will have a single column name, but column definitions that result in multiple columns will produce a list of multiple names

Returns:

list of column names to be output in generated data set

getOutputColumnNamesAndTypes()[source]

get the list of output columns and their types by flattening the list of lists of column names and types. Normal columns will have a single column name, but column definitions that result in multiple columns will produce a list of multiple names

hasColumnSpec(colName)[source]

returns true if there is a column spec for the column

Parameters:

colName – name of column to check for

Returns:

True if column has spec, False otherwise

property inferredSchema

infer spark interim schema definition from the field specifications

Note

If the data generation specification contains columns for which the datatype is inferred, the schema type for inferred columns may not be correct until the build command has completed.

isFieldExplicitlyDefined(colName)[source]

return True if a column generation spec has been explicitly defined for the column, else False

Note

A column is not considered explicitly defined if it was inferred from a schema or added with a wildcard statement. This impacts whether the column can be redefined.

option(optionKey, optionValue)[source]

set option to option value for later processing

Parameters:
  • optionKey – key for option

  • optionValue – value for option

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

options(**kwargs)[source]

set options in bulk

Allows for multiple options with option=optionValue style of option passing

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

property random

return the data generation spec default random setting for columns, used when an explicit random attribute setting is not supplied

property randomSeed

return the data generation spec random seed

classmethod reset()[source]

reset any state associated with the data

property rowCount

Return the row count

This may differ from the original specified row counts, if counts need to be adjusted for purposes of keeping the ratio of rows to unique keys correct or other heuristics

property schema

infer spark output schema definition from the field specifications

Returns:

Spark SQL StructType for schema

Note

If the data generation specification contains columns for which the datatype is inferred, the schema type for inferred columns may not be correct until the build command has completed.

property schemaFields

get list of schema fields for final output schema

Returns:

list of fields in schema

scriptMerge(tgtName=None, srcName=None, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None, insertExpr=None, useExplicitNames=True, updateColumns=None, updateColumnExprs=None, insertColumns=None, insertColumnExprs=None, srcAlias='src', tgtAlias='tgt', asHtml=False)[source]

generate merge table script suitable for format of test data set

Parameters:
  • tgtName – name of target table to use in generated script

  • tgtAlias – alias for target table - defaults to tgt

  • srcName – name of source table to use in generated script

  • srcAlias – alias for source table - defaults to src

  • updateExpr – optional string representing the update condition. If not present, then any row that does not match the join condition is considered an update

  • delExpr – optional string representing the delete condition - for example, src.action='DEL'. If not present, no delete clause is generated

  • insertExpr – optional string representing insert condition - If not present, there is no condition on insert other than no match

  • joinExpr – string representing join condition. For example, tgt.id=src.id

  • timeExpr – optional time travel expression - for example : TIMESTAMP AS OF timestamp_expression or VERSION AS OF version

  • insertColumns – Optional list of strings designating columns to insert. If not supplied, uses all columns defined in spec

  • insertColumnExprs – Optional list of strings designating column expressions for insert. By default, the src column will be used as the insert value into the target table. This should have the form [("insert_column_name", "insert column expr"), ...]

  • updateColumns – List of strings designating columns to update. If not supplied, uses all columns defined in spec

  • updateColumnExprs – Optional list of strings designating column expressions for update. By default, the src column will be used as the update value for the target table. This should have the form [("update_column_name", "update column expr"), ...]

  • useExplicitNames – If True, generate explicit column names in insert and update statements

  • asHtml – if true, generate output suitable for use with displayHTML method in notebook environment

Returns:

SQL string for scripted merge statement
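For example, a merge script might be generated as follows. Table names, aliases and conditions are illustrative only:

    mergeSql = testDataSpec.scriptMerge(
        tgtName="customers",
        srcName="customers_changes",
        joinExpr="src.customer_id = tgt.customer_id",
        updateExpr="src.action = 'UPDATE'",
        delExpr="src.action = 'DEL'"
    )
    print(mergeSql)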

scriptTable(name=None, location=None, tableFormat='delta', asHtml=False)[source]

generate create table script suitable for format of test data set

Parameters:
  • name – name of table to use in generated script

  • location – path to location of data. If specified (default is None), will generate an external table definition.

  • tableFormat – table format for table

  • asHtml – if true, generate output suitable for use with displayHTML method in notebook environment

Returns:

SQL string for scripted table
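For example (the table name, location and format are illustrative only):

    createSql = testDataSpec.scriptTable(name="test_data_table",
                                         location="/tmp/test_data",
                                         tableFormat="delta")
    print(createSql)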

property seedColumnName

return the name of data generation seed column

setRowCount(rc)[source]

Modify the row count - useful when starting a new spec from a clone

Warning

Method is deprecated - use withRowCount instead

Parameters:

rc – The count of rows to generate

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

classmethod useSeed(seedVal)[source]

set seed for random number generation

Parameters:

seedVal – new value for the random number seed

use_seed()


withColumn(colName, colType=StringType, minValue=None, maxValue=None, step=1, dataRange=None, prefix=None, random=None, distribution=None, baseColumn=None, nullable=True, omit=False, implicit=False, noWarn=False, **kwargs)[source]

add a new column to the synthetic data generation specification

Parameters:
  • colName – Name of column to add. If this conflicts with the underlying seed column (id), it is recommended that the seed column name is customized during the construction of the data generator spec.

  • colType – Data type for column. This may be specified as either a type from one of the possible pyspark.sql.types (e.g. StringType, DecimalType(10,3) etc) or as a string containing a Spark SQL type definition (i.e String, array<Integer>, map<String, Float>)

  • omit – if True, the column will be omitted from the final set of columns in the generated data. Used to create columns that are used by other columns as intermediate results. Defaults to False

  • expr – Specifies SQL expression used to create column value. If specified, overrides the default rules for creating column value. Defaults to None

  • baseColumn – String or list of columns to control order of generation of columns. If not specified, column is dependent on base seed column (which defaults to id)

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

Note

if the value None is used for the colType parameter, the method will try to use the underlying datatype derived from the base columns.

If the value INFER_DATATYPE is used for the colType parameter and a SQL expression has been supplied via the expr parameter, the method will try to infer the column datatype from the SQL expression when the build() method is called.

Inferred data types can only be used if the expr parameter is specified.

Note that properties which return a schema based on the specification may not be accurate until the build() method is called. Prior to this, the schema may indicate a default column type for those fields.

You may also add a variety of additional options to further control the test data generation process. For full list of options, see dbldatagen.column_spec_options module.
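For example, the following sketch adds columns using an explicit range, an omitted intermediate column, and an inferred datatype. Names, types and expressions are illustrative only, and INFER_DATATYPE is assumed here to be importable from the dbldatagen package:

    import dbldatagen as dg
    from pyspark.sql.types import IntegerType, DecimalType

    testDataSpec = (
        dg.DataGenerator(spark, name="sales_data", rows=100000)
        .withColumn("qty", IntegerType(), minValue=1, maxValue=100, random=True)
        .withColumn("unit_price", DecimalType(10, 2), minValue=1.0, maxValue=500.0)
        # intermediate column used by later columns but omitted from the output
        .withColumn("discount", "decimal(4,2)", minValue=0.0, maxValue=0.25, omit=True)
        # datatype inferred from the SQL expression when build() is called
        .withColumn("net_price", dg.INFER_DATATYPE,
                    expr="qty * unit_price * (1.0 - discount)",
                    baseColumn=["qty", "unit_price", "discount"])
    )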

withColumnSpec(colName, minValue=None, maxValue=None, step=1, prefix=None, random=None, distribution=None, implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs)[source]

add a column specification for an existing column

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

You may also add a variety of options to further control the test data generation process. For full list of options, see dbldatagen.column_spec_options module.

withColumnSpecs(patterns=None, fields=None, matchTypes=None, **kwargs)[source]
Add column specs for columns matching any of:

  1. a list of explicit field names,

  2. one or more regex patterns matching column names,

  3. a Spark SQL datatype (as in pyspark.sql.types)

Parameters:
  • patterns – a single pattern as a string, or a list of patterns, matched against the column names. May be omitted.

  • fields – a string specifying an explicit field to match, or a list of strings specifying explicit fields to match. May be omitted.

  • matchTypes – a single Spark SQL datatype or list of Spark SQL data types to match. May be omitted.

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

Note

matchTypes may also take SQL type strings or a list of SQL type strings such as "array<integer>". However, you may not use INFER_DATATYPE as part of the matchTypes list.

You may also add a variety of options to further control the test data generation process. For full list of options, see dbldatagen.column_spec_options module.
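For example, after populating columns from a schema, a spec could be applied to all string columns whose names end in _id. This is a sketch only - the pattern and option values are illustrative, and mySchema is assumed to be a previously defined Spark SQL schema:

    from pyspark.sql.types import StringType

    testDataSpec = (
        dg.DataGenerator(spark, rows=100000)
        .withSchema(mySchema)
        .withColumnSpecs(patterns=".*_id", matchTypes=StringType(),
                         minValue=1, maxValue=1000000, random=True)
    )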

withConstraint(constraint)[source]

Add a constraint to control the data generation

Parameters:

constraint – a constraint object to apply to the data generation

Returns:

reference to the data generator spec allowing calls to be chained

Note: Irrespective of where the constraint has been added, the constraints are applied at the end of the data generation. Depending on the type of the constraint, the constraint may also affect other aspects of the data generation

withConstraints(constraints)[source]

Add multiple constraints to control the data generation

Parameters:

constraints – a list of constraint objects to apply to the data generation

Returns:

reference to the data generator spec allowing calls to be chained

Note: Irrespective of where the constraint has been added, the constraints are applied at the end of the data generation. Depending on the type of the constraint, the constraint may also affect other aspects of the data generation

withIdOutput()[source]

output the seed column field (defaults to id) as a column in the generated data set

If this is not called, the seed column field is omitted from the final generated data set

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

withRowCount(rc)[source]

Modify the row count - useful when starting a new spec from a clone

Parameters:

rc – The count of rows to generate

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern

withSchema(sch)[source]

populate column definitions and specifications for each of the columns in the schema

Parameters:

sch – Spark SQL schema, from which fields are added

Returns:

modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
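For example, columns can be populated from an existing Spark SQL schema and then refined with withColumnSpec. The schema and option values shown are illustrative only:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    mySchema = StructType([
        StructField("customer_id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("region", StringType(), True),
    ])

    testDataSpec = (
        dg.DataGenerator(spark, rows=50000)
        .withSchema(mySchema)
        .withColumnSpec("region", values=["east", "west", "north", "south"], random=True)
    )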

withSqlConstraint(sqlExpression: str)[source]

Add a SQL expression as a constraint

Parameters:

sqlExpression – SQL expression for the constraint. Rows will only be returned where the SQL expression evaluates to true

Returns:

reference to the data generator spec allowing calls to be chained

Note

In the current implementation, this may be equivalent to adding where clauses to the generated dataframe, but in future releases it may be optimized to affect the underlying data generation so that constraints are satisfied more efficiently.
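For example (the expression and column names are illustrative only):

    # only rows satisfying the SQL expression are returned
    testDataSpec = testDataSpec.withSqlConstraint("qty > 0 and unit_price < 400.0")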

withStructColumn(colName, fields=None, asJson=False, **kwargs)[source]

Add a struct column to the synthetic data generation specification. This will add a new column composed of a struct of the specified fields.

Parameters:
  • colName – name of column

  • fields – list of elements to compose as a struct valued column (each being a string or tuple), or a dict outlining the structure of the struct column

  • asJson – If False, generate a struct valued column. If True, generate a JSON string column

  • kwargs – keyword arguments to pass to the underlying column generators as per withColumn

Returns:

A modified in-place instance of data generator allowing for chaining of calls following the Builder pattern

Note

Additional options for the field specification may be specified as keyword arguments.

The fields specification supplied via the fields argument may be:

  • A list of field references (strings) which will be used as both the field name and the SQL expression

  • A list of tuples of the form (field_name, field_expression) where field_name is the name of the field. In that case, the field_expression string should be a SQL expression to generate the field value

  • A Python dict outlining the structure of the struct column. The keys of the dict are the field names

When using the dict form of the field specifications, a field whose value is a list will be treated as creating a SQL array literal.
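For example, the following sketch shows both the tuple-list and dict forms. Field names and expressions are illustrative only, and testDataSpec is assumed to be an existing specification defining the referenced columns:

    testDataSpec = (
        testDataSpec
        # struct column built from (field_name, field_expression) tuples
        .withStructColumn("pricing", fields=[("qty", "qty"),
                                             ("net_price", "qty * unit_price")])
        # dict form rendered as a JSON string column
        .withStructColumn("payload", fields={"id": "customer_id",
                                             "location": "region"},
                          asJson=True)
    )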