dbldatagen.column_spec_options module

This file defines the ColumnSpecOptions class

class ColumnSpecOptions(props, aliases=None)

Bases: object

Column spec options object - manages options for column specs.

This class has limited functionality - mainly used to validate and document the options, and the class is meant for internal use only.

Parameters: props – Used to pass the list of properties for column generation spec property checking.

The following options are permitted on data generator withColumn, withColumnSpec and withColumnSpecs methods:

Parameters:

name – Column name
type – Data type of column. Can be either an instance of a Spark SQL data type such as IntegerType(), or a string containing the SQL name of the type
minValue – Minimum value for range of generated value. As an alternative, you may use the dataRange parameter
maxValue – Maximum value for range of generated value. As an alternative, you may use the dataRange parameter
step – Step to use for range of generated value. As an alternative, you may use the dataRange parameter
numColumns – generate n columns numbered from 0 .. n-1 with same definition
numFeatures – generate n columns numbered from 0 .. n-1 with same definition. Alias for numColumns
structType – If specified as “array” and used with numColumns / numFeatures, will combine columns as array
random – If True, will generate random values for column value. Defaults to False
baseColumn – Either the string name of the base column, or a list of columns to use to control data generation. The option baseColumns is an alias for baseColumn.
baseColumnType – Determines how the value is derived from the base column. Possible values are ‘auto’, ‘hash’, ‘raw_values’, ‘values’
values – List of discrete values for the column. Discrete values for the column can be strings, numbers or constants conforming to the type of the column
weights – List of discrete weights for the column. Should be integer values. For example, you might declare a column for status values with a weighted distribution with the following statement: withColumn("status", StringType(), values=['online', 'offline', 'unknown'], weights=[3,2,1])
percentNulls – Fraction of generated values to be populated with SQL null, expressed as a value between 0.0 and 1.0. For example, percentNulls=0.12 will give approximately 12% nulls for this field in the output.
uniqueValues – Number of unique values for column. If the unique values are specified for a timestamp or date field, the values will be chosen working back from the end of the previous month, unless begin, end and interval parameters are specified
begin – Beginning of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
end – End of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
interval – Interval of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step
dataRange – An instance of an NRange or DateRange object. This can be used in place of minValue, maxValue, step or begin, end, interval.
template – template controlling how text should be generated
textSeparator – string specifying separator to be used when constructing strings with prefix and suffix
prefix – string specifying prefix text to construct field from prefix and numeric value. Both prefix and suffix can be used together
suffix – string specifying suffix text to construct field from suffix and numeric value. Both prefix and suffix can be used together
omit – If True, the column is omitted from the output. Use this for columns needed only as intermediate values.
expr – SQL expression to control data generation. Ignores column base value if present.
implicit – Used by system to mark that column has been inferred from a schema. Allows definition to be explicitly overridden.
precision – Used for rounding to specific decimal layout.
scale – Used for rounding to specific decimal layout.
distribution – Distribution for random number. Ignored if column is not random.
escapeSpecialChars – if True, require escape for all special chars in template
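To make the values and weights options concrete, the plain-Python sketch below computes the expected share of rows for each value in the status example above. It does not call dbldatagen: the helper name expected_shares is hypothetical, and it assumes that weights act as relative frequencies.

```python
from fractions import Fraction


def expected_shares(values, weights):
    """Return the expected fraction of rows for each value.

    ASSUMPTION: weights are relative frequencies, so each share is
    weight / sum(weights). This is an illustration, not library code.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    return {value: Fraction(weight, total) for value, weight in zip(values, weights)}


# Same values and weights as the status column example in the parameter list.
shares = expected_shares(["online", "offline", "unknown"], [3, 2, 1])
for value, share in shares.items():
    print(value, share)
```

Under that assumption, a weight of 3 out of a total of 6 means roughly half of the generated rows carry the first value.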

When a column’s value is derived from the value of another column, the baseColumn and baseColumnType options can be used to control how the value is derived. The baseColumn option can be used to specify the name of the base column, and the baseColumnType option can be used to specify how the value is derived from the base column.

The following values are permitted for the baseColumnType option:

‘auto’: Automatically determine the base column type based on the column type of the base column.
‘hash’: Use a hash of the base column(s) value to derive the value of the new column.
‘raw_values’: Use the raw values of the base column to derive the value of the new column.
‘values’: Use the values of the base column, scaled to the range or implied range of the new column, to derive the value of the new column.

If the baseColumn option is not specified, the value of the new column will be derived from the seed or id column.

The baseColumnType option is optional. If it is not specified, the value of the new column will be derived based on the column type of the base column.

The raw_values and values derivations differ only in scaling: raw_values uses the values of the base column as they are, while values first scales them to the range or implied range of the new column.

For example, a column with four categorical values, ‘A’, ‘B’, ‘C’, ‘D’, has an implied range of 0 .. 3.
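The difference between raw_values and values can be sketched in plain Python for the four-category example. The function names derive_raw and derive_scaled are hypothetical, and the use of a modulo to scale into the implied range is an assumption made for illustration; the source text does not state how the scaling is performed.

```python
def derive_raw(base_value):
    """'raw_values': the base column value is used without scaling."""
    return base_value


def derive_scaled(base_value, values):
    """'values': the base value is first mapped into the implied range.

    ASSUMPTION: the mapping into 0 .. len(values) - 1 is done with a
    modulo. This is only a stand-in for whatever scaling is actually used.
    """
    index = base_value % len(values)
    return values[index]


categories = ["A", "B", "C", "D"]  # implied range 0 .. 3
base_ids = [0, 1, 5, 10]           # example values of a base (id) column

raw = [derive_raw(b) for b in base_ids]
scaled = [derive_scaled(b, categories) for b in base_ids]
print(raw)
print(scaled)
```

The raw derivation leaves values such as 5 and 10 outside the implied range, while the scaled derivation always lands on one of the four categories.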

Note

If the dataRange parameter is specified as well as the minValue, maxValue or step, the results are undefined.

For more information, see dbldatagen.daterange module or dbldatagen.nrange module.

checkBoolOption(v, name=None, optional=True)

Check that option is either not specified or of type boolean

Parameters:

v – value to test
name – name of value to use in any reported errors or exceptions
optional – If True (default), indicates that value is optional and that None is a valid value for the option
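The check described above can be approximated by a standalone function. This is a minimal sketch, not the library's code: check_bool_option is a hypothetical name, and raising ValueError on failure is an assumption about how errors are reported.

```python
def check_bool_option(v, name=None, optional=True):
    """Accept a bool, or None when the option is optional.

    ASSUMPTION: failures are reported with ValueError; the real method
    may report them differently.
    """
    if v is None:
        if optional:
            return
        raise ValueError(f"Option `{name}` must be specified")
    if not isinstance(v, bool):
        raise ValueError(
            f"Option `{name}` must be boolean, got {type(v).__name__}"
        )


check_bool_option(True, name="random")  # a bool passes
check_bool_option(None, name="omit")    # None passes because optional=True
```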

checkExclusiveOptions(options)

Check that the options are mutually exclusive, i.e. only one is not None

Parameters: options – list of options that will be mutually exclusive
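An exclusivity check of this kind could look like the following sketch, which operates on a plain dictionary of column properties. The function name check_exclusive_options and the props argument are hypothetical, and the sketch assumes that having none of the options set is also acceptable.

```python
def check_exclusive_options(props, options):
    """Return the names in `options` that are set in `props`.

    ASSUMPTION: zero or one option set is valid; two or more is an error.
    """
    set_options = [name for name in options if props.get(name) is not None]
    if len(set_options) > 1:
        raise ValueError(f"Options {set_options} are mutually exclusive")
    return set_options


# Only one of the two options is set, so the check passes.
found = check_exclusive_options(
    {"name": "amount", "minValue": 1}, ["minValue", "dataRange"]
)
print(found)
```

A check like this is one way to catch a column that sets both dataRange and minValue before any data is generated.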

checkOptionValues(option, option_values)

check if option value is in list of values

Parameters:

option – name of the option whose value is to be checked
option_values – list of permitted values for the option
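As an illustration, the sketch below checks a baseColumnType setting against the four values listed earlier in this section. The function name check_option_values and the props dictionary are hypothetical stand-ins, and the use of ValueError is an assumption.

```python
def check_option_values(props, option, option_values):
    """Raise ValueError unless props[option] is one of option_values."""
    value = props.get(option)
    if value not in option_values:
        raise ValueError(
            f"Option `{option}` must have one of the values {option_values}, got {value!r}"
        )
    return value


BASE_COLUMN_TYPES = ["auto", "hash", "raw_values", "values"]

checked = check_option_values(
    {"name": "code", "baseColumnType": "hash"}, "baseColumnType", BASE_COLUMN_TYPES
)
print(checked)
```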

checkValidColumnProperties(columnProps)

check that column definition properties are recognized and that the column definition has required properties

Parameters: columnProps – column definition properties to check
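A validation of this shape might be sketched as follows. The allowed names are copied from the parameter list in this section; the function name check_valid_column_properties is hypothetical, and treating only name as a required property is an assumption, since the source does not say which properties are required.

```python
# Option names taken from the parameter list documented in this section.
ALLOWED_PROPERTIES = {
    "name", "type", "minValue", "maxValue", "step", "numColumns",
    "numFeatures", "structType", "random", "baseColumn", "baseColumns",
    "baseColumnType", "values", "weights", "percentNulls", "uniqueValues",
    "begin", "end", "interval", "dataRange", "template", "textSeparator",
    "prefix", "suffix", "omit", "expr", "implicit", "precision", "scale",
    "distribution", "escapeSpecialChars",
}

# ASSUMPTION: only "name" is treated as required in this sketch.
REQUIRED_PROPERTIES = {"name"}


def check_valid_column_properties(column_props):
    """Reject unrecognized property names and missing required ones."""
    unknown = sorted(set(column_props) - ALLOWED_PROPERTIES)
    missing = sorted(REQUIRED_PROPERTIES - set(column_props))
    if unknown:
        raise ValueError(f"Unrecognized column properties: {unknown}")
    if missing:
        raise ValueError(f"Missing required column properties: {missing}")


check_valid_column_properties(
    {"name": "status", "type": "string", "values": ["a", "b"]}
)
```

A misspelled option such as minvalue would be rejected by the first check instead of being silently ignored.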

getOrElse(key, default=None)

Get the value for key if it exists, otherwise return default

property options

Get options dictionary for object

Returns: options dictionary for object
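The lookup behaviour of getOrElse and the options property can be mimicked with a small stand-in class. OptionsSketch is hypothetical and is not the real ColumnSpecOptions; it only assumes that the options are held in a dictionary, as the description of the options property suggests.

```python
class OptionsSketch:
    """Hypothetical stand-in that stores column properties in a dict."""

    def __init__(self, props):
        self._options = dict(props)

    def getOrElse(self, key, default=None):
        # Return the stored value, or the default when the key is absent.
        return self._options.get(key, default)

    @property
    def options(self):
        # Expose the underlying options dictionary.
        return self._options


opts = OptionsSketch({"name": "status", "random": True})
print(opts.getOrElse("random", False))
print(opts.getOrElse("omit", False))
print(opts.options)
```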