dbldatagen.column_spec_options module
This file defines the ColumnSpecOptions class
- class ColumnSpecOptions(props, aliases=None)
Bases: object
Column spec options object - manages options for column specs.
This class has limited functionality: it is mainly used to validate and document the options, and is meant for internal use only.
- Parameters:
props – Used to pass list of properties for column generation spec property checking.
The following options are permitted on data generator withColumn, withColumnSpec and withColumnSpecs methods:
- Parameters:
name – Column name
type – Data type of column. Can be either an instance of a Spark SQL data type, such as IntegerType(), or a string containing the SQL name of the type
minValue – Minimum value for range of generated value. As an alternative, you may use the dataRange parameter
maxValue – Maximum value for range of generated value. As an alternative, you may use the dataRange parameter
step – Step to use for range of generated value. As an alternative, you may use the dataRange parameter
numColumns – generate n columns numbered from 0 .. n-1 with same definition
numFeatures – generate n columns numbered from 0 .. n-1 with same definition. Alias for numColumns
structType – If specified as “array” and used with numColumns / numFeatures, will combine columns as array
random – If True, will generate random values for column value. Defaults to False
baseColumn – Either the string name of the base column, or a list of columns to use to control data generation. The option baseColumns is an alias for baseColumn.
values – List of discrete values for the column. Discrete values for the column can be strings, numbers or constants conforming to the type of the column
weights – List of discrete weights for the column. Should be integer values. For example, you might declare a column for status values with a weighted distribution with the following statement: withColumn("status", StringType(), values=['online', 'offline', 'unknown'], weights=[3,2,1])
percentNulls – Specifies the fraction of generated values to be populated with SQL null, as a value between 0.0 and 1.0. For example, percentNulls=0.12 will give approximately 12% nulls for this field in the output.
uniqueValues – Number of unique values for column. If the unique values are specified for a timestamp or date field, the values will be chosen working back from the end of the previous month, unless begin, end and interval parameters are specified
begin – Beginning of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
end – End of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
interval – Interval of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
dataRange – An instance of an NRange or DateRange object. This can be used in place of minValue, maxValue, step or begin, end, interval.
template – template controlling how text should be generated
textSeparator – string specifying separator to be used when constructing strings with prefix and suffix
prefix – string specifying prefix text to construct field from prefix and numeric value. Both prefix and suffix can be used together
suffix – string specifying suffix text to construct field from suffix and numeric value. Both prefix and suffix can be used together
omit – if True, column is omitted from the output. Use this for columns that are needed only as intermediate values.
expr – SQL expression to control data generation. Ignores column base value if present.
implicit – Used by system to mark that column has been inferred from a schema. Allows definition to be explicitly overridden.
precision – Used for rounding to specific decimal layout.
scale – Used for rounding to specific decimal layout.
distribution – Distribution for random number. Ignored if column is not random.
escapeSpecialChars – if True, require escape for all special chars in template
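To make the values, weights and percentNulls descriptions concrete, the following standard-library sketch shows the proportions they imply. It does not use dbldatagen or Spark. The helper names expected_shares and sample_column are hypothetical, and treating weights as relative frequencies is an assumption based on the status example above.

```python
import random
from fractions import Fraction


def expected_shares(values, weights):
    """Return the share of rows each value should receive, as exact fractions."""
    total = sum(weights)
    return {value: Fraction(weight, total) for value, weight in zip(values, weights)}


def sample_column(values, weights, percent_nulls=0.0, rows=10, seed=42):
    """Draw `rows` values; each one is None with probability `percent_nulls`."""
    rng = random.Random(seed)
    column = []
    for _ in range(rows):
        if rng.random() < percent_nulls:
            column.append(None)  # stands in for SQL null
        else:
            column.append(rng.choices(values, weights=weights, k=1)[0])
    return column


# weights=[3,2,1] means 3 parts 'online', 2 parts 'offline', 1 part 'unknown'
shares = expected_shares(["online", "offline", "unknown"], [3, 2, 1])
print(shares["online"], shares["offline"], shares["unknown"])  # 1/2 1/3 1/6
```

With percent_nulls=0.12, roughly 12 of every 100 sampled entries would be None, which mirrors the "approximately 12% nulls" wording of the percentNulls option.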
Note
If the dataRange parameter is specified as well as the minValue, maxValue or step, the results are undefined.
For more information, see dbldatagen.daterange module or dbldatagen.nrange module.
- checkBoolOption(v, name=None, optional=True)
Check that option is either not specified or of type boolean
- Parameters:
v – value to test
name – name of value to use in any reported errors or exceptions
optional – If True (default), indicates that value is optional and that None is a valid value for the option
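The check described above can be sketched as a standalone function. This is not the library's source: check_bool_option is a hypothetical name, and raising ValueError is an assumption (the library may report failures differently).

```python
def check_bool_option(v, name=None, optional=True):
    """Raise ValueError unless `v` is a bool (or None when `optional` is True)."""
    if optional and v is None:
        return  # option not specified, which is allowed for optional values
    if not isinstance(v, bool):
        raise ValueError(f"Option `{name}` must be boolean, got {v!r}")


check_bool_option(True, name="random")  # accepted
check_bool_option(None, name="omit")    # accepted: option not specified
```

Note that a truthy non-boolean such as 1 or "yes" is rejected in this sketch, since the description requires the value to be of type boolean.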
- checkExclusiveOptions(options)
Check that the options are exclusive, i.e. only one is not None
- Parameters:
options – list of options that will be mutually exclusive
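A minimal sketch of an exclusivity check follows, written against a plain dictionary rather than the class itself. The name check_exclusive_options and the all_options argument are hypothetical. The sketch also accepts the case where none of the listed options is set, which is an assumption.

```python
def check_exclusive_options(all_options, exclusive_names):
    """Raise ValueError if more than one of `exclusive_names` has a non-None value."""
    specified = [n for n in exclusive_names if all_options.get(n) is not None]
    if len(specified) > 1:
        raise ValueError(
            f"Only one of {exclusive_names} may be specified, got {specified}"
        )
    return specified


column_options = {"minValue": 1, "maxValue": 10, "dataRange": None}
check_exclusive_options(column_options, ["minValue", "dataRange"])  # passes
```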
- checkOptionValues(option, option_values)
Check that the option value is in the list of permitted values
- Parameters:
option – name of the option whose value will be checked
option_values – list of permitted values for the option
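A minimal sketch of this kind of membership check, assuming option names a single option and option_values lists its permitted values. The function name, the all_options dictionary and the permitted list used in the example are illustrative assumptions, not taken from the library.

```python
def check_option_values(all_options, option, option_values):
    """Raise ValueError if the value stored for `option` is not in `option_values`."""
    value = all_options.get(option)
    if value not in option_values:
        raise ValueError(
            f"Option `{option}` must have one of the values {option_values}, got {value!r}"
        )


# Illustrative only: allow structType to be unset or "array"
check_option_values({"structType": "array"}, "structType", [None, "array"])
```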
- checkValidColumnProperties(columnProps)
Check that column definition properties are recognized and that the column definition has the required properties
- Parameters:
columnProps – column definition properties to check
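The two-part check (recognized properties, then required properties) might look like the sketch below. The ALLOWED and REQUIRED sets are assumptions chosen for illustration; the library's actual lists are longer and may differ, and the function name is hypothetical.

```python
# Assumed for illustration only; not the library's real property lists.
ALLOWED = {"name", "type", "minValue", "maxValue", "step", "random", "values", "weights"}
REQUIRED = {"name", "type"}


def check_valid_column_properties(column_props, allowed=ALLOWED, required=REQUIRED):
    """Raise ValueError on unrecognized or missing column definition properties."""
    unknown = sorted(set(column_props) - set(allowed))
    if unknown:
        raise ValueError(f"Unrecognized column properties: {unknown}")
    missing = sorted(set(required) - set(column_props))
    if missing:
        raise ValueError(f"Missing required column properties: {missing}")


check_valid_column_properties({"name": "id", "type": "int", "minValue": 1})  # passes
```

A check like this catches misspelled option names such as minvalue early, before any data is generated.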
- property options
Get options dictionary for object
- Returns:
options dictionary for object