dbldatagen.data_generator module
This file defines the DataGenError and DataGenerator classes
- class DataGenerator(sparkSession=None, name=None, randomSeedMethod=None, rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False, batchSize=None, debug=False, seedColumnName='id', random=False, **kwargs)[source]
Bases: object
Main Class for test data set generation
This class acts as the entry point to all test data generation activities.
- Parameters:
sparkSession – Spark session object to use
name – name of the data set
randomSeedMethod – seed method for random numbers - either None, 'fixed' or 'hash_fieldname'
rows – number of rows to generate
startingId – starting value for the generated seed column
randomSeed – seed for the random number generator
partitions – number of partitions to generate; if not provided, uses spark.sparkContext.defaultParallelism
verbose – if True, generate verbose output
batchSize – UDF batch number of rows to pass via Apache Arrow to Pandas UDFs
debug – if True, output debug level information
seedColumnName – if set, the name of the seed or logical id column. Defaults to id
random – if set, specifies the default value of the random attribute for all columns where it is not explicitly set
By default the seed column is named id. If you need to use this column name in your generated data, it is recommended that you use a different name for the seed column - for example _id.
This may be specified by setting the seedColumnName attribute to _id
Note: in a shared spark session, the sparkContext is not available, so the default parallelism is set to 200. We recommend passing an explicit value for partitions in this case.
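For example, a minimal construction sketch (the SparkSession name spark, the dataset name and the row count are illustrative; assumes dbldatagen is imported as dg):
import dbldatagen as dg

# Illustrative sketch: create a generation spec with an explicit partition count
# and a renamed seed column; `spark` is assumed to be an existing SparkSession
testDataSpec = dg.DataGenerator(sparkSession=spark,
                                name="example_data",
                                rows=100000,
                                partitions=4,
                                randomSeedMethod="hash_fieldname",
                                seedColumnName="_id")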
- build(withTempView=False, withView=False, withStreaming=False, options=None)[source]
build the test data set from the column definitions and return a dataframe for it
if withStreaming is True, generates a streaming data set. Use options to control the rate of generation of test data if streaming is used.
For example:
dfTestData = testDataSpec.build(withStreaming=True, options={'rowsPerSecond': 5000})
- Parameters:
withTempView – if True, automatically creates temporary view for generated data set
withView – If True, automatically creates global view for data set
withStreaming – If True, generates data using Spark Structured Streaming Rate source suitable for writing with writeStream
options – optional Dict of options to control generating of streaming data
- Returns:
Spark SQL dataframe of generated test data
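A short usage sketch (assumes testDataSpec is a previously constructed DataGenerator instance; the rowsPerSecond value is illustrative):
# Batch build, also registering a temporary view over the generated data
dfTestData = testDataSpec.build(withTempView=True)
dfTestData.show(5)

# Streaming build - generation rate controlled via the options dict
dfStreamingData = testDataSpec.build(withStreaming=True,
                                     options={'rowsPerSecond': 5000})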
- property build_order
return the build order minus the seed column (which defaults to id)
The build order will be a list of lists - each list specifying columns that can be built at the same time
- clone()[source]
Make a clone of the data spec via deep copy preserving same spark session
- Returns:
deep copy of test data generator definition
- computeBuildPlan()[source]
prepare for building by computing a pseudo build plan
The build plan is not a true build plan - it is used only for debugging purposes and does not actually drive the column generation order.
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
- describe()[source]
return description of the dataset generation spec
- Returns:
Dict object containing key attributes of test data generator instance
- explain(suppressOutput=False)[source]
Explain the test data generation process
- Parameters:
suppressOutput – If True, suppress display of build plan
- Returns:
String containing explanation of test data generation for this specification
- classmethod generateName()[source]
get a name for the data set
Uses the untitled name prefix and nextNameIndex to generate a dummy dataset name
- Returns:
string containing generated name
- getColumnSpec(name)[source]
get the column spec for the column with the supplied name
- Parameters:
name – name of column to find spec for
- Returns:
column spec for named column if any
- getColumnType(colName)[source]
Get column Spark SQL datatype for specified column
- Parameters:
colName – name of column as string
- Returns:
Spark SQL datatype for named column
- getOutputColumnNames()[source]
get the list of output columns by flattening the list of lists of column names. Normal columns will have a single column name, but column definitions that result in multiple columns will produce a list of multiple names
- Returns:
list of column names to be output in generated data set
- getOutputColumnNamesAndTypes()[source]
get the list of output columns and their types by flattening the list of lists of column names and types. Normal columns will have a single column name, but column definitions that result in multiple columns will produce a list of multiple names
- hasColumnSpec(colName)[source]
returns true if there is a column spec for the column
- Parameters:
colName – name of column to check for
- Returns:
True if column has spec, False otherwise
- property inferredSchema
infer spark interim schema definition from the field specifications
Note
If the data generation specification contains columns for which the datatype is inferred, the schema type for inferred columns may not be correct until the build command has completed.
- isFieldExplicitlyDefined(colName)[source]
return True if a column generation spec has been explicitly defined for the column, else False
Note
A column is not considered explicitly defined if it was inferred from a schema or added with a wildcard statement. This impacts whether the column can be redefined.
- option(optionKey, optionValue)[source]
set option to option value for later processing
- Parameters:
optionKey – key for option
optionValue – value for option
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
- options(**kwargs)[source]
set options in bulk
Allows for multiple options with option=optionValue style of option passing
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
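For example, a hedged sketch of option setting (the option key startingId used here is assumed to be an accepted key; testDataSpec is a previously constructed DataGenerator):
# Sketch only: set options individually or in bulk; the key name used here
# is an assumption - consult the library for the supported option keys
testDataSpec = (testDataSpec
                .option('startingId', 1000000)
                .options(startingId=1000000))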
- property random
return the data generation spec's default random setting for columns, used when an explicit random attribute setting is not supplied
- property randomSeed
return the data generation spec random seed
- property rowCount
Return the row count
This may differ from the originally specified row count if the count needs to be adjusted to keep the ratio of rows to unique keys correct, or for other heuristics
- property schema
infer spark output schema definition from the field specifications
- Returns:
Spark SQL StructType for schema
Note
If the data generation specification contains columns for which the datatype is inferred, the schema type for inferred columns may not be correct until the build command has completed.
- property schemaFields
get list of schema fields for final output schema
- Returns:
list of fields in schema
- scriptMerge(tgtName=None, srcName=None, updateExpr=None, delExpr=None, joinExpr=None, timeExpr=None, insertExpr=None, useExplicitNames=True, updateColumns=None, updateColumnExprs=None, insertColumns=None, insertColumnExprs=None, srcAlias='src', tgtAlias='tgt', asHtml=False)[source]
generate merge table script suitable for format of test data set
- Parameters:
tgtName – name of target table to use in generated script
tgtAlias – alias for target table - defaults to tgt
srcName – name of source table to use in generated script
srcAlias – alias for source table - defaults to src
updateExpr – optional string representing the update condition. If not present, then any row that does not match the join condition is considered an update
delExpr – optional string representing delete condition - For example src.action=’DEL’. If not present, no delete clause is generated
insertExpr – optional string representing insert condition - If not present, there is no condition on insert other than no match
joinExpr – string representing join condition. For example, tgt.id=src.id
timeExpr – optional time travel expression - for example : TIMESTAMP AS OF timestamp_expression or VERSION AS OF version
insertColumns – Optional list of strings designating columns to insert. If not supplied, uses all columns defined in spec
insertColumnExprs – Optional list of strings designating column expressions for insert. By default, will use the src column as the insert value into the target table. This should have the form [ ("insert_column_name", "insert column expr"), …]
updateColumns – List of strings designating columns to update. If not supplied, uses all columns defined in spec
updateColumnExprs – Optional list of strings designating column expressions for update. By default, will use the src column as the update value for the target table. This should have the form [ ("update_column_name", "update column expr"), …]
useExplicitNames – If True, generate explicit column names in insert and update statements
asHtml – if True, generate output suitable for use with the displayHTML method in a notebook environment
- Returns:
SQL string for scripted merge statement
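For example, an illustrative sketch (table names, join condition and delete condition are placeholders; testDataSpec is a previously constructed DataGenerator):
# Sketch: generate a MERGE statement matching the generated dataset's columns
mergeSql = testDataSpec.scriptMerge(tgtName='customers_tgt',
                                    srcName='customers_src',
                                    joinExpr='tgt.id = src.id',
                                    delExpr="src.action = 'DEL'")
print(mergeSql)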
- scriptTable(name=None, location=None, tableFormat='delta', asHtml=False)[source]
generate create table script suitable for format of test data set
- Parameters:
name – name of table to use in generated script
location – path to location of data. If specified (default is None), will generate an external table definition.
tableFormat – table format for table
asHtml – if True, generate output suitable for use with the displayHTML method in a notebook environment
- Returns:
SQL string for scripted table
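For example, an illustrative sketch (the table name and location are placeholders):
# Sketch: generate a CREATE TABLE statement matching the generated dataset
createSql = testDataSpec.scriptTable(name='test_customers',
                                     location='/tmp/test_customers',
                                     tableFormat='delta')
print(createSql)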
- property seedColumnName
return the name of data generation seed column
- setRowCount(rc)[source]
Modify the row count - useful when starting a new spec from a clone
Warning
Method is deprecated - use withRowCount instead
- Parameters:
rc – The count of rows to generate
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
- classmethod useSeed(seedVal)[source]
set seed for random number generation
- Parameters:
seedVal – new value for the random number seed
- use_seed()
- withColumn(colName, colType=StringType, minValue=None, maxValue=None, step=1, dataRange=None, prefix=None, random=None, distribution=None, baseColumn=None, nullable=True, omit=False, implicit=False, noWarn=False, **kwargs)[source]
add a new column to the synthetic data generation specification
- Parameters:
colName – Name of column to add. If this conflicts with the underlying seed column (id), it is recommended that the seed column name is customized during the construction of the data generator spec.
colType – Data type for column. This may be specified either as a type from pyspark.sql.types (e.g. StringType, DecimalType(10,3) etc.) or as a string containing a Spark SQL type definition (e.g. String, array<Integer>, map<String, Float>)
omit – if True, the column will be omitted from the final set of columns in the generated data. Used to create columns that are used by other columns as intermediate results. Defaults to False
expr – Specifies SQL expression used to create column value. If specified, overrides the default rules for creating column value. Defaults to None
baseColumn – String or list of columns to control order of generation of columns. If not specified, column is dependent on base seed column (which defaults to id)
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
Note
If the value None is used for the colType parameter, the method will try to use the underlying datatype derived from the base columns.
If the value INFER_DATATYPE is used for the colType parameter and a SQL expression has been supplied via the expr parameter, the method will try to infer the column datatype from the SQL expression when the build() method is called.
Inferred data types can only be used if the expr parameter is specified.
Note that properties which return a schema based on the specification may not be accurate until the build() method is called. Prior to this, the schema may indicate a default column type for those fields.
You may also add a variety of additional options to further control the test data generation process. For a full list of options, see the dbldatagen.column_spec_options module.
- withColumnSpec(colName, minValue=None, maxValue=None, step=1, prefix=None, random=None, distribution=None, implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs)[source]
add a column specification for an existing column
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
You may also add a variety of options to further control the test data generation process. For a full list of options, see the dbldatagen.column_spec_options module.
- withColumnSpecs(patterns=None, fields=None, matchTypes=None, **kwargs)[source]
- Add column specs for columns matching one or more of:
a list of field names,
one or more regex patterns,
a type (as in pyspark.sql.types)
- Parameters:
patterns – a single pattern as a string, or a list of patterns that match the column names. May be omitted.
fields – a string specifying an explicit field to match, or a list of strings specifying explicit fields to match. May be omitted.
matchTypes – a single Spark SQL datatype or list of Spark SQL data types to match. May be omitted.
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
Note
matchTypes may also take SQL type strings or a list of SQL type strings such as "array<integer>". However, you may not use INFER_DATATYPE as part of the matchTypes list.
You may also add a variety of options to further control the test data generation process. For a full list of options, see the dbldatagen.column_spec_options module.
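For example, an illustrative sketch (the pattern and option values are placeholders; assumes the matching columns were previously added via withSchema or withColumn):
from pyspark.sql.types import StringType

# Sketch: apply a common spec to all string columns whose names end in `_id`
testDataSpec = (testDataSpec
                .withColumnSpecs(patterns=".*_id",
                                 matchTypes=StringType(),
                                 prefix="id",
                                 random=True))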
- withConstraint(constraint)[source]
Add a constraint to control the data generation
- Parameters:
constraint – a constraint object to apply to the data generation
- Returns:
reference to the data generator spec allowing calls to be chained
Note: Irrespective of where the constraint has been added, the constraints are applied at the end of the data generation. Depending on the type of the constraint, the constraint may also affect other aspects of the data generation
- withConstraints(constraints)[source]
Add multiple constraints to control the data generation
- Parameters:
constraints – a list of constraint objects to apply to the data generation
- Returns:
reference to the data generator spec allowing calls to be chained
Note: Irrespective of where the constraint has been added, the constraints are applied at the end of the data generation. Depending on the type of the constraint, the constraint may also affect other aspects of the data generation
- withIdOutput()[source]
output the seed column field (defaults to id) as a column in the generated data set
If this is not called, the seed column field is omitted from the final generated data set
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
- withRowCount(rc)[source]
Modify the row count - useful when starting a new spec from a clone
- Parameters:
rc – The count of rows to generate
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
- withSchema(sch)[source]
populate column definitions and specifications for each of the columns in the schema
- Parameters:
sch – Spark SQL schema, from which fields are added
- Returns:
modified in-place instance of test data generator allowing for chaining of calls following Builder pattern
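For example, an illustrative sketch (the schema fields and option values are placeholders; assumes a SparkSession named spark):
import dbldatagen as dg
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sketch: seed the spec from an existing Spark schema, then refine one column
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("age", IntegerType(), True),
])

testDataSpec = (dg.DataGenerator(spark, name="from_schema", rows=10000)
                .withSchema(schema)
                .withColumnSpec("age", minValue=18, maxValue=90, random=True))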
- withSqlConstraint(sqlExpression: str)[source]
Add a sql expression as a constraint
- Parameters:
sqlExpression – SQL expression for the constraint. Only rows where the SQL expression evaluates to true will be returned
- Returns:
reference to the data generator spec allowing calls to be chained
Note
In the current implementation, this may be equivalent to adding where clauses to the generated dataframe, but in future releases it may be optimized to affect the underlying data generation so that constraints are satisfied more efficiently.
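For example, an illustrative sketch (the column name and predicate are placeholders; testDataSpec is a previously constructed DataGenerator):
# Sketch: only rows satisfying the SQL predicate are returned
df = (testDataSpec
      .withSqlConstraint("age >= 21")
      .build())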
- withStructColumn(colName, fields=None, asJson=False, **kwargs)[source]
Add a struct column to the synthetic data generation specification. This will add a new column composed of a struct of the specified fields.
- Parameters:
colName – name of column
fields – list of elements to compose as a struct valued column (each being a string or tuple), or a dict outlining the structure of the struct column
asJson – If False, generate a struct valued column. If True, generate a JSON string column
kwargs – keyword arguments to pass to the underlying column generators as per withColumn
- Returns:
A modified in-place instance of data generator allowing for chaining of calls following the Builder pattern
Note
Additional options for the field specification may be specified as keyword arguments.
The field specification supplied via the fields argument may be:
A list of field references (strings) which will be used as both the field name and the SQL expression
A list of tuples of the form (field_name, field_expression) where field_name is the name of the field. In that case, the field_expression string should be a SQL expression to generate the field value
A Python dict outlining the structure of the struct column. The keys of the dict are the field names
When using the dict form of the field specifications, a field whose value is a list will be treated as creating a SQL array literal.
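For example, an illustrative sketch (assumes city and state were previously defined as columns on testDataSpec; the expressions are placeholders):
# Sketch: compose existing columns into a struct column, plus a JSON variant
testDataSpec = (testDataSpec
                .withStructColumn("address", fields=["city", "state"])
                .withStructColumn("address_json",
                                  fields=[("city", "city"),
                                          ("state", "upper(state)")],
                                  asJson=True))
df = testDataSpec.build()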