dbldatagen.column_generation_spec module
This file defines the ColumnGenerationSpec class
- class ColumnGenerationSpec(name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False, distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None, implicit=False, omit=False, nullable=True, debug=False, verbose=False, seedColumnName='id', **kwargs)
- Bases: object
- Column generation spec object: specifies how a column is to be generated.
- Each column to be output will have a corresponding ColumnGenerationSpec object. This is added explicitly using the DataGenerator's withColumnSpec or withColumn methods. If none is explicitly added, a default one will be generated.
- The full set of arguments to the class is larger than the explicitly named parameters, because any other arguments can still be passed through the **kwargs expression.
- This class is meant for internal use only.
- Parameters:
- name – Name of column (string). 
- colType – Spark SQL datatype instance, representing the type of the column. 
- minValue – minimum value of the column 
- maxValue – maximum value of the column 
- step – numeric step used in column data generation 
- prefix – string used as prefix to the column underlying value to produce a string value 
- random – Boolean, if True, will generate random values 
- distribution – Instance of distribution, that will control the distribution of the generated values 
- baseColumn – String or list of strings representing columns used as basis for generating the column data 
- randomSeed – random seed value used to generate the random value, if column data is random 
- randomSeedMethod – method for computing random values from the random seed. It may take on the values fixed, hash_fieldname or None 
- implicit – If True, the specification for the column can be replaced by a later definition. If not, a later attempt to replace the definition will flag an error. Typically used when generating definitions automatically from a schema, or when using wildcards in the specification 
- omit – if True, omit from the final output. 
- nullable – If True, column may be null - defaults to True. 
- debug – If True, output debugging log statements. Defaults to False. 
- verbose – If True, output logging statements at the info level. If False (the default), only output warning and error logging statements. 
- seedColumnName – if supplied, specifies seed column name 
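
The replacement rule described for the implicit parameter can be illustrated with a small stand-alone sketch. It assumes a hypothetical registry class (`SpecRegistrySketch` is not part of dbldatagen) and only models the documented behavior: an implicit definition may be replaced later, while replacing an explicit one raises an error.

```python
class SpecRegistrySketch:
    """Hypothetical stand-in for a collection of column specs (not dbldatagen code)."""

    def __init__(self):
        self._specs = {}  # column name -> (options dict, implicit flag)

    def add(self, name, implicit=False, **options):
        existing = self._specs.get(name)
        # Only an implicit (automatically generated) definition may be replaced.
        if existing is not None and not existing[1]:
            raise ValueError(f"column '{name}' already has an explicit definition")
        self._specs[name] = (options, implicit)

    def options_for(self, name):
        return self._specs[name][0]


registry = SpecRegistrySketch()
registry.add("code", implicit=True, minValue=0)   # e.g. derived from a schema
registry.add("code", minValue=100, maxValue=200)  # allowed: replaces the implicit spec
```

A further `registry.add("code", ...)` call would now raise `ValueError`, because the current definition is explicit.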
 
 - For the full list of options, see the dbldatagen.column_spec_options module.
 - property baseColumn
- get the base column used to generate values for this column 
 - property baseColumns
- Return base columns as list of strings 
 - property begin
- get the begin attribute used to generate values for this column. For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation. 
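
As a rough illustration of how a numeric (minValue, maxValue, step) range constrains generated values, the sketch below enumerates the values in an inclusive range and maps a seed value onto them by modulo. The function names and the modulo mapping are assumptions made for illustration; the library's actual computation may differ.

```python
def range_values(min_value, max_value, step=1):
    """Enumerate the distinct values of an inclusive (min, max, step) integer range."""
    count = (max_value - min_value) // step + 1
    return [min_value + i * step for i in range(count)]


def value_for_seed(seed, min_value, max_value, step=1):
    """Map a seed (for example the value of the id column) onto the range.

    Assumption: values cycle through the range in order as the seed grows.
    """
    values = range_values(min_value, max_value, step)
    return values[seed % len(values)]


values = range_values(10, 20, 5)
mapped = [value_for_seed(s, 10, 20, 5) for s in range(5)]
```

Here `values` holds the three permitted values 10, 15 and 20, and `mapped` cycles through them as the seed increases.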
 - property datatype
- get the Spark SQL data type used to generate values for this column 
 - property end
- get the end attribute used to generate values for this column. For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation. 
 - property expr
- get the expr attribute used to generate values for this column 
 - property exprs
- get the column generation exprs attribute used to generate values for this column. 
 - getOrElse(key, default=None)
- Get value for option key if it exists or else return default - Parameters:
- key – key name for option 
- default – default value if option was not provided 
 
- Returns:
- option value or default 
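
The lookup-with-default behavior described above can be sketched with a plain dictionary. `OptionsSketch` is a hypothetical stand-in written for this example, not the library's class; it only mirrors the documented contract of returning the option value if present and the default otherwise.

```python
class OptionsSketch:
    """Hypothetical options holder illustrating getOrElse semantics."""

    def __init__(self, **kwargs):
        # Keep whatever keyword options were supplied.
        self._options = dict(kwargs)

    def getOrElse(self, key, default=None):
        # Return the stored option, or the default when the key was not provided.
        return self._options.get(key, default)


opts = OptionsSketch(step=2, prefix="item")
step = opts.getOrElse("step")            # provided, so the stored value is returned
suffix = opts.getOrElse("suffix", "")    # not provided, so the default is returned
```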
 
 - getPlanEntry()
- Get execution plan entry for object - Returns:
- String representation of plan entry 
 
 - property inferDatatype
- If True, indicates that the datatype should be inferred from the result of computing the SQL expression 
 - property interval
- get the interval attribute used to generate values for this column. For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation. 
 - property isFieldOmitted
- check if this field should be omitted from the output - If the field is omitted from the output, the field is available for use in expressions etc. but dropped from the final set of fields 
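
The effect of omitting a field can be pictured with plain dictionaries: the omitted field is computed and used by a later expression, then dropped before the row is emitted. This is a conceptual sketch only; `build_rows` and the column names are invented for the example and do not reflect how the library builds its Spark plan.

```python
def build_rows(ids, omitted=("base_code",)):
    """Build rows where an intermediate field feeds another field, then is dropped."""
    rows = []
    for i in ids:
        row = {"id": i}
        row["base_code"] = i % 3                    # intermediate field, marked as omitted
        row["label"] = f"code_{row['base_code']}"   # expression that uses the omitted field
        # Drop omitted fields from the final output only.
        rows.append({k: v for k, v in row.items() if k not in omitted})
    return rows


rows = build_rows([4, 5])
```

Each resulting row keeps `id` and `label`, while `base_code` is absent even though `label` was derived from it.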
 - property isRandom
- returns True if column will be randomly generated 
 - property isWeightedValuesColumn
- check if column is a weighted values column 
 - makeGenerationExpressions()
- Generate a structured column if multiple columns or features are specified. If multiple columns or features are specified using a single definition, this generates a set of columns conforming to the same definition, renames them as appropriate, and combines them into an array if necessary (depending on the structure combination instructions). - Parameters:
- self – the ColumnGenerationSpec for the column 
- Returns:
- Spark SQL column or expression that can be used to generate a column 
 
 - property max
- get the column generation maxValue value used to generate values for this column 
 - property min
- get the column generation minValue value used to generate values for this column 
 - property numColumns
- get the numColumns attribute used to generate values for this column - if a column is specified with the numColumns attribute, this is used to create multiple copies of the column, named colName1 .. colNameN 
 - property numFeatures
- get the numFeatures attribute used to generate values for this column - if a column is specified with the numFeatures attribute, this is used to create multiple copies of the column, combined into an array or feature vector 
 - property prefix
- get the string prefix used to generate values for this column - When a string field is generated from this spec, the prefix is prepended to the generated string 
 - property randomSeed
- get random seed for column spec 
 - setBaseColumnDatatypes(columnDatatypes)
- Set the data types for the base columns - Parameters:
- columnDatatypes – list of data types for the base columns 
 
 - property specOptions
- get column spec options for spec - Note - This is intended for testing use only. Option values set directly through the options dict are not supported. - Returns:
- underlying options object 
 
 - property step
- get the column generation step value used to generate values for this column 
 - structType()
- get the structType attribute used to generate values for this column - When a column spec is specified to generate multiple copies of the column, this controls whether these are combined into an array etc 
 - property suffix
- get the string suffix used to generate values for this column - When a string field is generated from this spec, the suffix is appended to the generated string 
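
A minimal sketch of how a prefix and suffix might wrap a generated value follows. The helper name `format_text`, the default separator `"_"`, and the exact layout of the resulting string are assumptions for illustration; the source only states that the prefix is prepended and the suffix appended.

```python
def format_text(value, prefix="", suffix="", separator="_"):
    """Join prefix, value and suffix with a separator, skipping empty parts."""
    parts = [p for p in (prefix, str(value), suffix) if p != ""]
    return separator.join(parts)


with_prefix = format_text(7, prefix="item")
with_both = format_text(7, prefix="item", suffix="x")
plain = format_text(7)
```

With no prefix or suffix, the value is simply rendered as a string.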
 - property textGenerator
- Get the text generator for the column spec 
 - property text_separator
- get the text_separator attribute used to generate values for this column