dbldatagen.column_generation_spec module

This file defines the ColumnGenerationSpec class

class ColumnGenerationSpec(name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False, distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None, implicit=False, omit=False, nullable=True, debug=False, verbose=False, seedColumnName='id', **kwargs)[source]

Bases: object

Column generation spec object - specifies how column is to be generated

Each column to be output will have a corresponding ColumnGenerationSpec object. This is added explicitly using the DataGenerator's withColumnSpec or withColumn methods.

If none is explicitly added, a default one will be generated.

The class accepts more arguments than the parameters named in the signature: any additional options can still be passed through **kwargs.

This class is meant for internal use only.

Parameters:

name – Name of column (string).
colType – Spark SQL datatype instance, representing the type of the column.
minValue – minimum value of the column
maxValue – maximum value of the column
step – numeric step used in column data generation
prefix – string used as prefix to the column underlying value to produce a string value
random – Boolean, if True, will generate random values
distribution – Instance of distribution, that will control the distribution of the generated values
baseColumn – String or list of strings representing columns used as basis for generating the column data
randomSeed – random seed value used to generate the random value, if column data is random
randomSeedMethod – method for computing random values from the random seed. It may take on the values fixed, hash_fieldname or None
implicit – If True, the specification for the column can be replaced by a later definition. If not, a later attempt to replace the definition will flag an error. Typically used when generating definitions automatically from a schema, or when using wildcards in the specification
omit – if True, omit from the final output.
nullable – If True, column may be null - defaults to True.
debug – If True, output debugging log statements. Defaults to False.
verbose – If True, output logging statements at the info level. If False (the default), only output warning and error logging statements.
seedColumnName – if supplied, specifies seed column name

For full list of options, see dbldatagen.column_spec_options module.
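As a rough illustration of how these parameters are usually supplied, the sketch below passes them as keyword arguments to DataGenerator.withColumn, which creates the corresponding column specs. The column names, the row and partition counts, and the helper build_example_dataset are hypothetical. The function assumes an active SparkSession and that dbldatagen and pyspark are installed.

```python
# Hypothetical column definitions; the option names come from the parameter list above.
COLUMN_OPTIONS = {
    "code": {"minValue": 1, "maxValue": 20, "step": 1},
    "score": {"minValue": 0, "maxValue": 100, "random": True},
    "item_name": {"prefix": "item", "baseColumn": "code"},
}


def build_example_dataset(spark, rows=1000):
    """Build a small DataFrame; each withColumn call defines one column spec."""
    # Imported lazily so this module can be loaded without Spark installed.
    import dbldatagen as dg
    from pyspark.sql.types import IntegerType, StringType

    data_spec = (
        dg.DataGenerator(spark, name="example_data", rows=rows, partitions=4)
        .withColumn("code", IntegerType(), **COLUMN_OPTIONS["code"])
        .withColumn("score", IntegerType(), **COLUMN_OPTIONS["score"])
        .withColumn("item_name", StringType(), **COLUMN_OPTIONS["item_name"])
    )
    return data_spec.build()
```

Calling build_example_dataset(spark) with a live session should return a DataFrame containing the three columns alongside the seed column.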

property baseColumn: get the base column used to generate values for this column

property baseColumns: Return base columns as list of strings

property begin

get the begin attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.
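The two kinds of range can be pictured with a small plain-Python model. This is only a sketch of the idea, not the library's implementation: the helper range_values is hypothetical, and it assumes both ends of the range are inclusive, which the description above does not state.

```python
from datetime import datetime, timedelta


def range_values(start, stop, step):
    """List start, start + step, ... without going past stop (inclusive end assumed)."""
    if start + step <= start:
        raise ValueError("step must move the value forward")
    values = []
    current = start
    while current <= stop:
        values.append(current)
        current = current + step
    return values


# Numeric column: (minValue, maxValue, step)
numbers = range_values(0, 10, 5)

# Date/time column: (begin, end, interval)
days = range_values(datetime(2024, 1, 1), datetime(2024, 1, 3), timedelta(days=1))
```

The same helper serves both cases because integers and datetimes both support addition and comparison.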

property datatype: get the Spark SQL data type used to generate values for this column

property end

get the end attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.

property expr: get the expr attributed used to generate values for this column

property exprs: get the column generation exprs attribute used to generate values for this column.

getNames()[source]: get column names as list of strings

getNamesAndTypes()[source]: get column names as list of tuples (name, datatype)

getOrElse(key, default=None)[source]

Get value for option key if it exists or else return default

Parameters:

key – key name for option
default – default value if option was not provided

Returns:

option value or default
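The lookup semantics resemble a dictionary get with a fallback. The class below is a hypothetical stand-in written for illustration. It is not the library's option store, and how the real method treats an option explicitly set to None is not covered here.

```python
class OptionsSketch:
    """Hypothetical stand-in for the options held by a column spec."""

    def __init__(self, **options):
        self._options = dict(options)

    def getOrElse(self, key, default=None):
        # Return the stored value when the option was provided, otherwise the default.
        return self._options.get(key, default)


opts = OptionsSketch(prefix="item", step=2)
step = opts.getOrElse("step", 1)          # option was provided
suffix = opts.getOrElse("suffix")         # option missing, default is None
separator = opts.getOrElse("text_separator", "_")  # option missing, explicit default
```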

getPlanEntry()[source]

Get execution plan entry for object

Returns:

String representation of plan entry

property inferDatatype: If True, indicates that the datatype should be inferred from the result of computing the SQL expression

property interval

get the interval attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.

property isFieldOmitted

check if this field should be omitted from the output

If the field is omitted from the output, the field is available for use in expressions etc. but dropped from the final set of fields
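A plain-Python model of this behaviour: an omitted field still feeds a derived field, then is dropped before the rows are returned. The field names and the helper generate_rows are hypothetical, and the sketch stands in for the concept only, not for how the library builds Spark columns.

```python
def generate_rows(num_rows):
    """Model rows where 'base_code' is omitted from the output but feeds 'label'."""
    omitted_fields = {"base_code"}
    output = []
    for row_id in range(num_rows):
        row = {"id": row_id, "base_code": row_id % 3}
        # The omitted field is still available to later expressions.
        row["label"] = f"code_{row['base_code']}"
        # Drop omitted fields from the final set of fields.
        output.append({k: v for k, v in row.items() if k not in omitted_fields})
    return output


rows = generate_rows(4)
```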

property isRandom: returns True if column will be randomly generated

property isWeightedValuesColumn: check if column is a weighted values column

keys()[source]: Get the keys as list of strings

makeGenerationExpressions()[source]

Generate structured column if multiple columns or features are specified

If multiple columns or features are specified using a single definition, this generates a set of columns conforming to the same definition, renames them as appropriate, and combines them into an array if necessary (depending on the structure combination instructions).

Parameters:

self – the ColumnGenerationSpec for the column

Returns:

Spark SQL column or expression that can be used to generate a column

property max: get the column generation maxValue value used to generate values for this column

property min: get the column generation minValue value used to generate values for this column

property numColumns

get the numColumns attribute used to generate values for this column

if a column is specified with the numColumns attribute, this is used to create multiple copies of the column, named colName1 .. colNameN

property numFeatures

get the numFeatures attribute used to generate values for this column

if a column is specified with the numFeatures attribute, this is used to create multiple copies of the column, combined into an array or feature vector

property prefix

get the string prefix used to generate values for this column

When a string field is generated from this spec, the prefix is prepended to the generated string

property randomSeed: get random seed for column spec

setBaseColumnDatatypes(columnDatatypes)[source]

Set the data types for the base columns

Parameters:

columnDatatypes – list of data types for the base columns

property specOptions

get column spec options for spec

Note

This is intended for testing use only. Option values set directly through the options dict are not supported.

Returns:

underlying options object

property step: get the column generation step value used to generate values for this column

structType()[source]

get the structType attribute used to generate values for this column

When a column spec is specified to generate multiple copies of the column, this controls whether these are combined into an array etc

property suffix

get the string suffix used to generate values for this column

When a string field is generated from this spec, the suffix is appended to the generated string
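The prefix and suffix descriptions can be pictured with the following sketch. The helper decorate_value is hypothetical, and joining the parts with a separator that defaults to "_" is an assumption made for illustration. The exact output format should be checked against the library.

```python
def decorate_value(value, prefix=None, suffix=None, separator="_"):
    """Join an optional prefix, the underlying value, and an optional suffix."""
    parts = []
    if prefix:
        parts.append(prefix)
    parts.append(str(value))
    if suffix:
        parts.append(suffix)
    # The separator default is an assumption, not confirmed by the text above.
    return separator.join(parts)


with_prefix = decorate_value(7, prefix="item")
with_both = decorate_value(7, prefix="item", suffix="x", separator="-")
plain = decorate_value(7)
```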

property textGenerator: Get the text generator for the column spec

property text_separator: get the text separator attribute used to generate values for this column