dbldatagen.column_generation_spec module

This file defines the ColumnGenerationSpec class

class ColumnGenerationSpec(name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False, distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None, implicit=False, omit=False, nullable=True, debug=False, verbose=False, seedColumnName='id', **kwargs)

Bases: object

Column generation spec object - specifies how column is to be generated

Each column to be output will have a corresponding ColumnGenerationSpec object. This is added explicitly using the DataGenerator's withColumnSpec or withColumn methods.

If none is explicitly added, a default one will be generated.

The class accepts more arguments than the parameters listed explicitly: any other option can still be passed through **kwargs.

This class is meant for internal use only.

Parameters:
  • name – Name of column (string).

  • colType – Spark SQL datatype instance, representing the type of the column.

  • minValue – minimum value of the column

  • maxValue – maximum value of the column

  • step – numeric step used in column data generation

  • prefix – string used as prefix to the column underlying value to produce a string value

  • random – Boolean, if True, will generate random values

  • distribution – Instance of distribution, that will control the distribution of the generated values

  • baseColumn – String or list of strings representing columns used as basis for generating the column data

  • randomSeed – random seed value used to generate the random value, if column data is random

  • randomSeedMethod – method for computing random values from the random seed. It may take on the values fixed, hash_fieldname or None

  • implicit – If True, the specification for the column can be replaced by a later definition. If not, a later attempt to replace the definition will flag an error. Typically used when generating definitions automatically from a schema, or when using wildcards in the specification

  • omit – if True, omit from the final output.

  • nullable – If True, column may be null - defaults to True.

  • debug – If True, output debugging log statements. Defaults to False.

  • verbose – If True, output logging statements at the info level. If False (the default), only output warning and error logging statements.

  • seedColumnName – if supplied, specifies seed column name

For full list of options, see dbldatagen.column_spec_options module.
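The effect of the implicit flag described above can be sketched in plain Python. This is an illustrative stand-in only, not the dbldatagen implementation: the SpecRegistry class and its add and get methods are hypothetical names introduced for the example, and the only behaviour modelled is that an implicit definition may be replaced while an explicit one may not.

```python
# Illustrative stand-in only: this is NOT the dbldatagen implementation.
class SpecRegistry:
    """Hypothetical registry mapping column names to spec dictionaries."""

    def __init__(self):
        self._specs = {}

    def add(self, name, implicit=False, **options):
        existing = self._specs.get(name)
        # An explicit (non-implicit) definition may not be replaced later.
        if existing is not None and not existing["implicit"]:
            raise ValueError(f"column '{name}' is already explicitly defined")
        self._specs[name] = {"implicit": implicit, **options}

    def get(self, name):
        return self._specs[name]


registry = SpecRegistry()
registry.add("code", implicit=True, minValue=0)    # e.g. derived from a schema
registry.add("code", minValue=100, maxValue=200)   # replaces the implicit spec
try:
    registry.add("code", minValue=1)               # second explicit definition
    replaced = True
except ValueError:
    replaced = False
```

In this sketch the second explicit definition is rejected, which mirrors the "flag an error" behaviour described for implicit=False.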

property baseColumn

get the base column used to generate values for this column

property baseColumns

Return base columns as list of strings

property begin

get the begin attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.
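As a rough illustration of how a (minValue, maxValue, step) range can drive generation, the sketch below maps a seed value (such as the id seed column) onto the allowed values by cycling through them. The function name ranged_value and the cycling rule are assumptions made for this example; they are not taken from the library source.

```python
def ranged_value(seed, minValue, maxValue, step):
    """Map a seed onto minValue, minValue + step, ... without exceeding maxValue.

    Assumption for illustration: seeds cycle through the range in order.
    """
    num_values = (maxValue - minValue) // step + 1  # count of distinct values
    return minValue + (seed % num_values) * step


# Seeds 0..6 cycle through the three values allowed by the range.
values = [ranged_value(i, minValue=10, maxValue=20, step=5) for i in range(7)]
```

The same idea would apply to date and time columns, with begin, end and interval taking the place of minValue, maxValue and step.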

property datatype

get the Spark SQL data type used to generate values for this column

property end

get the end attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.

property expr

get the expr attribute used to generate values for this column

property exprs

get the column generation exprs attribute used to generate values for this column.

getNames()

get column names as list of strings

getNamesAndTypes()

get column names as list of tuples (name, datatype)

getOrElse(key, default=None)

Get value for option key if it exists or else return default

Parameters:
  • key – key name for option

  • default – default value if option was not provided

Returns:

option value or default
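The lookup-with-default behaviour of getOrElse can be pictured with a small stand-in class. The Options class below is hypothetical and exists only for this example; it assumes the options are held in a dictionary, which the reference above does not state.

```python
class Options:
    """Hypothetical stand-in holding the keyword options for one column."""

    def __init__(self, **kwargs):
        self._options = dict(kwargs)

    def getOrElse(self, key, default=None):
        # Return the stored value when the option was provided, else the default.
        return self._options.get(key, default)

    def keys(self):
        # Option names as a list of strings.
        return list(self._options.keys())


opts = Options(minValue=1, random=True)
provided = opts.getOrElse("minValue")       # option was supplied
fallback = opts.getOrElse("step", 1)        # option missing, default returned
missing = opts.getOrElse("distribution")    # option missing, no default given
```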

getPlanEntry()

Get execution plan entry for object

Returns:

String representation of plan entry

property inferDatatype

If True indicates that datatype should be inferred to be result of computing SQL expression

property interval

get the interval attribute used to generate values for this column

For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.

property isFieldOmitted

check if this field should be omitted from the output

If the field is omitted from the output, the field is available for use in expressions etc. but dropped from the final set of fields
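The behaviour of an omitted field can be illustrated without Spark. In the sketch below, rows are plain dictionaries and the column names base and total are hypothetical: base is computed, used by a later expression, and then dropped from the final output.

```python
# Plain-Python illustration; rows are dictionaries rather than Spark rows.
rows = [{"id": i} for i in range(3)]
omitted = {"base"}  # hypothetical intermediate column marked omit=True

for row in rows:
    row["base"] = row["id"] * 10     # intermediate value, dropped later
    row["total"] = row["base"] + 1   # an expression may still use the omitted field

# The final output keeps every field except the omitted ones.
output = [{k: v for k, v in row.items() if k not in omitted} for row in rows]
```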

property isRandom

returns True if column will be randomly generated

property isWeightedValuesColumn

check if column is a weighted values column

keys()

Get the keys as list of strings

makeGenerationExpressions()

Generate structured column if multiple columns or features are specified

If multiple columns or features are specified using a single definition, this generates a set of columns conforming to the same definition, renames them as appropriate, and combines them into an array if necessary (depending on the structure combination instructions).

Parameters:
  • self – the ColumnGenerationSpec for the column

Returns:

Spark SQL column or expression that can be used to generate a column

property max

get the column generation maxValue value used to generate values for this column

property min

get the column generation minValue value used to generate values for this column

property numColumns

get the numColumns attribute used to generate values for this column

if a column is specified with the numColumns attribute, this is used to create multiple copies of the column, named colName1 .. colNameN

property numFeatures

get the numFeatures attribute used to generate values for this column

if a column is specified with the numFeatures attribute, this is used to create multiple copies of the column, combined into an array or feature vector

property prefix

get the string prefix used to generate values for this column

When a string field is generated from this spec, the prefix is prepended to the generated string

property randomSeed

get random seed for column spec

setBaseColumnDatatypes(columnDatatypes)

Set the data types for the base columns

Parameters:
  • columnDatatypes – list of data types for the base columns

property specOptions

get column spec options for spec

Note

This is intended for testing use only. Option values set directly through the options dict are not supported.

Returns:

underlying options object

property step

get the column generation step value used to generate values for this column

structType()

get the structType attribute used to generate values for this column

When a column spec is specified to generate multiple copies of the column, this controls whether these are combined into an array etc

property suffix

get the string suffix used to generate values for this column

When a string field is generated from this spec, the suffix is appended to the generated string
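How a prefix and suffix might be attached to an underlying value can be sketched as follows. The decorate function is a hypothetical helper, and the "_" separator is an assumption made for this example rather than a documented default.

```python
def decorate(value, prefix="", suffix="", separator="_"):
    """Join prefix, value and suffix, skipping whichever parts are empty.

    Assumption for illustration: parts are joined with a single separator.
    """
    parts = [p for p in (prefix, str(value), suffix) if p != ""]
    return separator.join(parts)


with_prefix = decorate(7, prefix="item")
with_suffix = decorate(7, suffix="kg")
with_both = decorate(7, prefix="item", suffix="kg")
plain = decorate(7)
```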

property textGenerator

Get the text generator for the column spec

property text_separator

get the text separator attribute used to generate values for this column