dbldatagen.column_spec_options module

This file defines the ColumnSpecOptions class

class ColumnSpecOptions(props, aliases=None)

Bases: object

Column spec options object - manages options for column specs.

This class has limited functionality - mainly used to validate and document the options, and the class is meant for internal use only.

Parameters:

props – List of properties for the column generation spec, used for property checking.

The following options are permitted on data generator withColumn, withColumnSpec and withColumnSpecs methods:

Parameters:
  • name – Column name

  • type – Data type of column. Can be either an instance of a Spark SQL DataType such as IntegerType() or a string containing the SQL name of the type

  • minValue – Minimum value for range of generated value. As an alternative, you may use the dataRange parameter

  • maxValue – Maximum value for range of generated value. As an alternative, you may use the dataRange parameter

  • step – Step to use for range of generated value. As an alternative, you may use the dataRange parameter

  • numColumns – generate n columns numbered from 0 .. n-1 with same definition

  • numFeatures – generate n columns numbered from 0 .. n-1 with same definition. Alias for numColumns

  • structType – If specified as “array” and used with numColumns / numFeatures, will combine columns as array

  • random – If True, will generate random values for column value. Defaults to False

  • baseColumn – Either the string name of the base column, or a list of columns to use to control data generation. The option baseColumns is an alias for baseColumn.

  • baseColumnType – Determines how the value is derived from the base column. Possible values are ‘auto’, ‘hash’, ‘raw_values’, ‘values’

  • values – List of discrete values for the column. Discrete values for the column can be strings, numbers or constants conforming to the type of the column

  • weights – List of discrete weights for the column. Should be integer values. For example, you might declare a column for status values with a weighted distribution with the following statement: withColumn("status", StringType(), values=['online', 'offline', 'unknown'], weights=[3,2,1])

  • percentNulls – Fraction of generated values to be populated with SQL null, expressed as a value between 0.0 and 1.0. For example, percentNulls=0.12 will give approximately 12% nulls for this field in the output.

  • uniqueValues – Number of unique values for column. If the unique values are specified for a timestamp or date field, the values will be chosen working back from the end of the previous month, unless begin, end and interval parameters are specified

  • begin – Beginning of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step

  • end – End of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step

  • interval – Interval of range for date and timestamp fields. For dates and timestamp fields, use the begin, end and interval or dataRange parameters instead of minValue, maxValue and step

  • dataRange – An instance of an NRange or DateRange object. This can be used in place of minValue, maxValue, step or begin, end, interval.

  • template – template controlling how text should be generated

  • textSeparator – string specifying separator to be used when constructing strings with prefix and suffix

  • prefix – string specifying prefix text to construct field from prefix and numeric value. Both prefix and suffix can be used together

  • suffix – string specifying suffix text to construct field from suffix and numeric value. Both prefix and suffix can be used together

  • omit – if True, column is omitted from the output. Use this for columns that are needed only as intermediate values.

  • expr – SQL expression to control data generation. Ignores column base value if present.

  • implicit – Used by system to mark that column has been inferred from a schema. Allows definition to be explicitly overridden.

  • precision – Used for rounding to specific decimal layout.

  • scale – Used for rounding to specific decimal layout.

  • distribution – Distribution for random number. Ignored if column is not random.

  • escapeSpecialChars – if True, require escape for all special chars in template
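
As a rough illustration of the weights and percentNulls options in the list above, the following standalone sketch shows the row proportions implied by weights=[3,2,1] and the approximate null count implied by percentNulls=0.12. It is plain Python, not dbldatagen code; the helper names expected_proportions and expected_null_count are hypothetical and do not exist in the library.

```python
from fractions import Fraction


def expected_proportions(values, weights):
    """Map each discrete value to its expected share of rows.

    Hypothetical helper for illustration only; not part of dbldatagen.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    return {value: Fraction(weight, total) for value, weight in zip(values, weights)}


def expected_null_count(rows, percent_nulls):
    """Approximate number of null rows for a fraction between 0.0 and 1.0."""
    if not 0.0 <= percent_nulls <= 1.0:
        raise ValueError("percent_nulls must be between 0.0 and 1.0")
    return round(rows * percent_nulls)


# Weights from the status example in the option list
shares = expected_proportions(['online', 'offline', 'unknown'], [3, 2, 1])
for value, share in shares.items():
    print(value, share)

# percentNulls=0.12 over 1000 rows
print(expected_null_count(1000, 0.12))
```

Generated data is only expected to approximate these proportions, since the doc describes the null percentage as approximate.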

When a column’s value is derived from the value of another column, the baseColumn and baseColumnType options can be used to control how the value is derived. The baseColumn option can be used to specify the name of the base column, and the baseColumnType option can be used to specify how the value is derived from the base column.

The following values are permitted for the baseColumnType option:

  • ‘auto’: Automatically determine the base column type based on the column type of the base column.

  • ‘hash’: Use a hash of the base column(s) value to derive the value of the new column.

  • ‘raw_values’: Use the raw values of the base column, without scaling, to derive the value of the new column.

  • ‘values’: Use the values of the base column, scaled to the range or implied range of the new column, to derive the value of the new column.

If the baseColumn option is not specified, the value of the new column will be derived from the seed or id column.

The baseColumnType option is optional. If it is not specified, the value of the new column will be derived based on the column type of the base column.

The raw_values and values derivations differ in scaling: raw_values uses the raw values of the base column as they are, while values uses the values of the base column after scaling them to the range or implied range of the new column.

For example, a column with four categorical values, ‘A’, ‘B’, ‘C’, ‘D’, has an implied range of 0 .. 3.
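
The difference between the two derivations can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the use of modulo to scale a base value into the implied range is an assumption made for the example.

```python
def implied_range(values):
    """Return the (low, high) index range implied by a list of discrete values."""
    return (0, len(values) - 1)


def derive_from_raw_values(base_value):
    """'raw_values' style: pass the base value through without scaling."""
    return base_value


def derive_from_values(base_value, values):
    """'values' style: scale the base value into the implied range first.

    Modulo scaling is an assumption for this sketch.
    """
    low, high = implied_range(values)
    index = low + base_value % (high - low + 1)
    return values[index]


categories = ['A', 'B', 'C', 'D']
print(implied_range(categories))            # (0, 3)
print(derive_from_raw_values(6))            # 6
print(derive_from_values(6, categories))    # C
```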

Note

If the dataRange parameter is specified together with minValue, maxValue or step, the results are undefined.

For more information, see dbldatagen.daterange module or dbldatagen.nrange module.

checkBoolOption(v, name=None, optional=True)

Check that option is either not specified or of type boolean

Parameters:
  • v – value to test

  • name – name of value to use in any reported errors or exceptions

  • optional – If True (default), indicates that value is optional and that None is a valid value for the option
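
The check described above can be sketched as a standalone function. This is an illustrative approximation, not the library's code: the name check_bool_option is hypothetical, and raising ValueError is an assumption about how a failed check is reported.

```python
def check_bool_option(v, name=None, optional=True):
    """Accept a boolean, or None when the option is optional.

    Hypothetical stand-in for illustration; error type is an assumption.
    """
    if optional and v is None:
        return
    if not isinstance(v, bool):
        raise ValueError(f"Option `{name}` must be boolean if specified, got {v!r}")


check_bool_option(True, name="random")   # passes: a boolean
check_bool_option(None, name="omit")     # passes: optional and not specified
```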

checkExclusiveOptions(options)

Check that the options are mutually exclusive, i.e. only one is not None

Parameters:

options – list of options that will be mutually exclusive
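
A minimal sketch of an exclusivity check follows, using a plain dictionary in place of the options object. The function name and the settings argument are hypothetical. Treating the case where none of the options is set as valid is an assumption, as is the use of ValueError.

```python
def check_exclusive_options(settings, options):
    """Raise ValueError if more than one of the named options is not None.

    Hypothetical stand-in for illustration; `settings` is a plain dict.
    """
    specified = [name for name in options if settings.get(name) is not None]
    if len(specified) > 1:
        raise ValueError(
            f"Only one of the options {options} may be specified, got {specified}"
        )
    return specified


settings = {"minValue": 1, "dataRange": None}
print(check_exclusive_options(settings, ["minValue", "dataRange"]))
```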

checkOptionValues(option, option_values)

Check that the option value is in the list of permitted values

Parameters:
  • option – name of the option whose value is checked

  • option_values – list of permitted values for the option
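
As an illustration, a membership check of this kind might look like the sketch below, applied to the baseColumnType values listed earlier on this page. The function name, the settings dictionary and the ValueError are assumptions made for the example and are not taken from the library.

```python
def check_option_value(settings, option, option_values):
    """Raise ValueError unless settings[option] is one of option_values.

    Hypothetical stand-in for illustration; `settings` is a plain dict.
    """
    value = settings.get(option)
    if value not in option_values:
        raise ValueError(
            f"Option `{option}` must have one of the values {option_values}, got {value!r}"
        )
    return value


permitted = ['auto', 'hash', 'raw_values', 'values']
print(check_option_value({"baseColumnType": "hash"}, "baseColumnType", permitted))
```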

checkValidColumnProperties(columnProps)

Check that column definition properties are recognized and that the column definition has the required properties

Parameters:

columnProps – column definition properties to check

getOrElse(key, default=None)

Get the value for key if it exists, otherwise return default

property options

Get options dictionary for object

Returns:

options dictionary for object