dbldatagen.column_generation_spec module
This file defines the ColumnGenerationSpec class
- class ColumnGenerationSpec(name, colType=None, minValue=0, maxValue=None, step=1, prefix='', random=False, distribution=None, baseColumn=None, randomSeed=None, randomSeedMethod=None, implicit=False, omit=False, nullable=True, debug=False, verbose=False, seedColumnName='id', **kwargs)
Bases: object
Column generation spec object. Specifies how a column is to be generated.
Each column to be output will have a corresponding ColumnGenerationSpec object. This is added explicitly using the DataGenerator's withColumnSpec or withColumn methods.
If none is explicitly added, a default one will be generated.
The class accepts more arguments than the parameters listed explicitly: any other option can still be passed through **kwargs.
This class is meant for internal use only.
- Parameters:
name – Name of column (string).
colType – Spark SQL datatype instance, representing the type of the column.
minValue – minimum value of the column
maxValue – maximum value of the column
step – numeric step used in column data generation
prefix – string prepended to the column's underlying value to produce a string value
random – Boolean, if True, will generate random values
distribution – Instance of a distribution that controls the distribution of the generated values
baseColumn – String or list of strings representing columns used as basis for generating the column data
randomSeed – random seed value used to generate the random value, if column data is random
randomSeedMethod – method for computing random values from the random seed. It may take on the values fixed, hash_fieldname or None
implicit – If True, the specification for the column can be replaced by a later definition. If not, a later attempt to replace the definition will flag an error. Typically used when generating definitions automatically from a schema, or when using wildcards in the specification
omit – if True, omit from the final output.
nullable – If True, column may be null - defaults to True.
debug – If True, output debugging log statements. Defaults to False.
verbose – If True, output logging statements at the info level. If False (the default), only output warning and error logging statements.
seedColumnName – if supplied, specifies the name of the seed column. Defaults to 'id'.
For the full list of options, see the dbldatagen.column_spec_options module.
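The replacement rule described for the implicit parameter can be illustrated with a small standalone sketch. The SpecRegistry class and its add and options methods are hypothetical names invented for this example. They are not part of dbldatagen and only model the replace-or-error behaviour in plain Python.

```python
class SpecRegistry:
    """Hypothetical registry modelling the `implicit` replacement rule."""

    def __init__(self):
        # column name -> (options dict, implicit flag)
        self._specs = {}

    def add(self, name, implicit=False, **options):
        existing = self._specs.get(name)
        # An explicit (non-implicit) definition may not be replaced.
        if existing is not None and not existing[1]:
            raise ValueError(f"column '{name}' is already explicitly defined")
        self._specs[name] = (options, implicit)

    def options(self, name):
        return self._specs[name][0]


registry = SpecRegistry()
# Definition generated automatically (for example from a schema): replaceable.
registry.add("price", implicit=True, minValue=0, maxValue=10)
# A later explicit definition replaces the implicit one.
registry.add("price", minValue=100, maxValue=200)

# A further attempt to replace the explicit definition is flagged as an error.
try:
    registry.add("price", minValue=1, maxValue=2)
    replaced_explicit = True
except ValueError:
    replaced_explicit = False
```

In this sketch the error is a ValueError. How the library itself reports the condition is not stated in this section.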
- property baseColumn
get the base column used to generate values for this column
- property baseColumns
Return base columns as list of strings
- property begin
get the begin attribute used to generate values for this column
For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.
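As a rough illustration of how a numeric range constrains generated values, the sketch below enumerates the values implied by (minValue, maxValue, step) and maps a sequential seed value onto them. The function names and the modulo mapping are assumptions made for illustration. The Spark SQL expressions the library actually builds may differ.

```python
def range_values(min_value, max_value, step):
    """List the distinct values implied by an inclusive numeric range."""
    count = (max_value - min_value) // step + 1
    return [min_value + i * step for i in range(count)]


def value_for_seed(seed, min_value, max_value, step):
    """Map a sequential seed (such as a row id) onto the range, wrapping around."""
    count = (max_value - min_value) // step + 1
    return min_value + (seed % count) * step


# Range 10..20 with step 5 allows three distinct values.
values = range_values(10, 20, 5)
# Seeds 0..4 cycle through those values in order.
generated = [value_for_seed(i, 10, 20, 5) for i in range(5)]
```

The same idea carries over to date and time columns, with begin, end and interval taking the place of minValue, maxValue and step.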
- property datatype
get the Spark SQL data type used to generate values for this column
- property end
get the end attribute used to generate values for this column
For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.
- property expr
get the expr attribute used to generate values for this column
- property exprs
get the column generation exprs attribute used to generate values for this column.
- getOrElse(key, default=None)
Get value for option key if it exists or else return default
- Parameters:
key – key name for option
default – default value if option was not provided
- Returns:
option value or default
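The lookup semantics of getOrElse can be sketched with a plain dictionary. The OptionsHolder class below is a hypothetical stand-in, not the library's class. It assumes that options are stored as key/value pairs, which this section does not confirm.

```python
class OptionsHolder:
    """Hypothetical stand-in that keeps column options in a dict."""

    def __init__(self, **options):
        self._options = dict(options)

    def getOrElse(self, key, default=None):
        # Return the stored value if the option was supplied, else the default.
        return self._options.get(key, default)


holder = OptionsHolder(step=2)
step = holder.getOrElse("step", 1)             # option was provided
prefix = holder.getOrElse("prefix", "item")    # option missing, default used
missing = holder.getOrElse("weights")          # option missing, no default given
```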
- getPlanEntry()
Get execution plan entry for object
- Returns:
String representation of plan entry
- property inferDatatype
If True, indicates that the datatype should be inferred from the result of computing the SQL expression
- property interval
get the interval attribute used to generate values for this column
For numeric columns, the range (minValue, maxValue, step) is used to control data generation. For date and time columns, the range (begin, end, interval) is used to control data generation.
- property isFieldOmitted
check if this field should be omitted from the output
If the field is omitted from the output, the field is available for use in expressions etc. but dropped from the final set of fields
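The effect of omitting a field can be sketched in plain Python: an omitted column is still computed and can feed later expressions, but it is dropped from the final rows. The generate_rows function and the (name, function, omit) tuples are hypothetical constructs for this illustration and do not reflect how the library builds Spark DataFrames.

```python
def generate_rows(num_rows, specs):
    """Build rows from (name, function, omit) specs, dropping omitted columns."""
    output = []
    for row_id in range(num_rows):
        working = {"id": row_id}  # seed value available to every expression
        for name, compute, _omit in specs:
            working[name] = compute(working)
        # Only columns not flagged as omitted reach the final output.
        output.append({name: working[name] for name, _c, omit in specs if not omit})
    return output


specs = [
    ("base", lambda row: row["id"] * 10, True),     # intermediate column, omitted
    ("total", lambda row: row["base"] + 1, False),  # uses the omitted column
]
result = generate_rows(3, specs)
```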
- property isRandom
returns True if column will be randomly generated
- property isWeightedValuesColumn
check if column is a weighted values column
- makeGenerationExpressions()
Generate structured column if multiple columns or features are specified
If multiple columns or features are specified using a single definition, this generates a set of columns conforming to the same definition, renames them as appropriate, and combines them into an array if necessary (depending on the structure combination instructions).
- Parameters:
self – the ColumnGenerationSpec for the column
- Returns:
Spark SQL column or expression that can be used to generate a column
- property max
get the column generation maxValue value used to generate values for this column
- property min
get the column generation minValue value used to generate values for this column
- property numColumns
get the numColumns attribute used to generate values for this column
if a column is specified with the numColumns attribute, this is used to create multiple copies of the column, named colName1 .. colNameN
- property numFeatures
get the numFeatures attribute used to generate values for this column
if a column is specified with the numFeatures attribute, this is used to create multiple copies of the column, combined into an array or feature vector
- property prefix
get the string prefix used to generate values for this column
When a string field is generated from this spec, the prefix is prepended to the generated string
- property randomSeed
get random seed for column spec
- setBaseColumnDatatypes(columnDatatypes)
Set the data types for the base columns
- Parameters:
columnDatatypes – list of data types for the base columns
- property specOptions
get column spec options for spec
Note
This is intended for testing use only. Option values set directly through the options dict are not supported.
- Returns:
underlying options object
- property step
get the column generation step value used to generate values for this column
- structType()
get the structType attribute used to generate values for this column
When a column spec is specified to generate multiple copies of the column, this controls whether these are combined into an array etc
- property suffix
get the string suffix used to generate values for this column
When a string field is generated from this spec, the suffix is appended to the generated string
- property textGenerator
Get the text generator for the column spec
- property text_separator
get the text_separator attribute used to generate values for this column
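The way the prefix, suffix and text_separator properties might combine into a string value can be sketched as follows. The format_text function is hypothetical, and the "_" default separator is an assumption made for this example. This section does not state the library's default or its exact formatting rules.

```python
def format_text(value, prefix="", suffix="", separator="_"):
    """Join an optional prefix, the underlying value and an optional suffix."""
    # Empty prefix or suffix strings are skipped so no stray separator appears.
    parts = [part for part in (prefix, str(value), suffix) if part != ""]
    return separator.join(parts)


with_prefix = format_text(5, prefix="item")
with_suffix = format_text(5, suffix="kg")
with_both = format_text(5, prefix="a", suffix="b", separator="-")
```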