dbldatagen.utils module

This file defines the DataGenError exception class and utility functions.

These are meant for internal use only.

exception DataGenError(msg, baseException=None)

Bases: Exception

Exception class to represent data generation errors

Parameters:
  • msg – message related to error

  • baseException – underlying exception, if any, that caused the issue

coalesce_values(*args)

For a supplied list of arguments, returns the first argument that does not have the value None

Parameters:

args – variable list of arguments which are evaluated

Returns:

First argument in list that evaluates to a non-None value
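As a minimal sketch of the behavior described above (not the library's implementation), the function below returns the first argument that is not None. The name `coalesce_values_sketch` is hypothetical, and returning None when every argument is None is an assumption.

```python
def coalesce_values_sketch(*args):
    """Return the first argument that is not None.

    Assumption: returns None when no argument qualifies.
    """
    for value in args:
        if value is not None:
            return value
    return None


# Falsy values such as 0 or "" still count, because only None is skipped.
print(coalesce_values_sketch(None, 0, 5))
```

The test is `is not None` rather than truthiness, so a legitimate value such as 0 is not passed over.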

deprecated(message='')

Define a deprecated decorator without dependencies on 3rd party libraries

Note that there is a 3rd party library called deprecated that provides this feature, but the goal is to depend only on packages already used in the Databricks runtime.
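One way such a decorator could be written using only the standard library is sketched below. The names `deprecated_sketch`, `old_total` and `new_total` are hypothetical, and the warning text and the use of DeprecationWarning are assumptions rather than the library's actual behavior.

```python
import functools
import warnings


def deprecated_sketch(message=""):
    """Decorator factory that emits a DeprecationWarning on each call."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"`{func.__name__}` is deprecated. {message}",
                category=DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated_sketch("Use `new_total` instead.")
def old_total(values):
    return sum(values)
```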

ensure(cond, msg='condition does not hold true')

ensure(cond, msg) => raises DataGenError(msg) if cond is not true

Parameters:
  • cond – condition to test

  • msg – Message to add to exception if exception is raised

Raises:

DataGenError exception if condition does not hold true

Returns:

Does not return anything but raises exception if condition does not hold
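The contract above can be sketched as follows. Both `DataGenErrorSketch` and `ensure_sketch` are hypothetical stand-ins written for illustration; how the real exception class stores baseException is an assumption.

```python
class DataGenErrorSketch(Exception):
    """Stand-in for DataGenError: carries a message and an optional cause."""

    def __init__(self, msg, baseException=None):
        super().__init__(msg)
        self.baseException = baseException  # assumption: kept as an attribute


def ensure_sketch(cond, msg="condition does not hold true"):
    """Raise DataGenErrorSketch(msg) when cond is falsy; otherwise return None."""
    if not cond:
        raise DataGenErrorSketch(msg)


rows = 100
ensure_sketch(rows > 0, "rows must be positive")  # condition holds, no exception

try:
    ensure_sketch(rows < 10, "rows must be less than 10")
except DataGenErrorSketch as err:
    print(f"caught: {err}")
```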

json_value_from_path(searchPath, jsonData, defaultValue)

Get JSON value from JSON data referenced by searchPath

searchPath should be a JSON path as supported by the jmespath package (see https://jmespath.org/)

Parameters:
  • searchPath – A jmespath compatible JSON search path

  • jsonData – The json data to search (string representation of the JSON data)

  • defaultValue – The default value to be returned if the value was not found

Returns:

Returns the json value if present, otherwise returns the default value

mkBoundsList(x, default)

make a bounds list from supplied parameter - otherwise use default

Parameters:
  • x – integer or list of 2 values that define bounds list

  • default – default value if x is None

Returns:

list of form [x,y]

parse_time_interval(spec)

parse time interval from string

split_list_matching_condition(lst, cond)

Split a list on elements that match a condition

This will find all matches of a specific condition in the list and split the list into sub lists around the element that matches this condition.

It will handle multiple matches performing splits on each match.

For example, the following code will produce the results below:

x = ['id', 'city_name', 'id', 'city_id', 'city_pop', 'id', 'city_id', 'city_pop', 'city_id', 'city_pop', 'id']
split_list_matching_condition(x, lambda el: el == 'id')

Result: `[['id'], ['city_name'], ['id'], ['city_id', 'city_pop'], ['id'], ['city_id', 'city_pop', 'city_id', 'city_pop'], ['id']]`

Parameters:
  • lst – list of items to perform condition matches against

  • cond – lambda function or function taking single argument and returning True or False

Returns:

list of sublists
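The splitting behavior described above can be reproduced with a single pass over the list. This is a sketch under the assumption that empty sublists are dropped, as the documented result suggests; `split_on_matches_sketch` is a hypothetical name, not the library function.

```python
def split_on_matches_sketch(lst, cond):
    """Split lst into sublists around every element for which cond is true.

    Each matching element becomes its own single-item sublist; runs of
    non-matching elements between matches are grouped together.
    """
    result = []
    current = []  # run of non-matching elements collected so far
    for item in lst:
        if cond(item):
            if current:  # flush the pending run, skipping empty ones
                result.append(current)
                current = []
            result.append([item])
        else:
            current.append(item)
    if current:
        result.append(current)
    return result


x = ['id', 'city_name', 'id', 'city_id', 'city_pop', 'id',
     'city_id', 'city_pop', 'city_id', 'city_pop', 'id']
print(split_on_matches_sketch(x, lambda el: el == 'id'))
```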

strip_margins(s, marginChar)

Python equivalent of the Scala stripMargin method

Takes a string (potentially multiline) and, on each line, strips all characters up to and including the first occurrence of marginChar. Used to control the formatting of generated text.

strip_margins("one\n |two\n |three", '|')

will produce

``
one
two
three
``

Parameters:
  • s – string to strip margins from

  • marginChar – character to strip

Returns:

modified string
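A minimal sketch of line-by-line margin stripping is shown below. It assumes the string is split on newline characters and that lines without the margin character are left unchanged; `strip_margins_sketch` is a hypothetical name used for illustration.

```python
def strip_margins_sketch(s, margin_char):
    """On each line, drop everything up to and including the first margin_char."""
    stripped_lines = []
    for line in s.split("\n"):
        if margin_char in line:
            line = line[line.index(margin_char) + 1:]
        stripped_lines.append(line)  # lines without the marker pass through
    return "\n".join(stripped_lines)


text = "one\n    |two\n    |three"
print(strip_margins_sketch(text, "|"))
```

This lets multiline text be indented to match the surrounding code while the generated output stays flush left.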

system_time_millis()

return system time as milliseconds since start of epoch

Returns:

system time millis as long
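For illustration, milliseconds since the epoch can be derived from the standard library clock as sketched below. `system_time_millis_sketch` is a hypothetical name, and the library may compute its value differently.

```python
import time


def system_time_millis_sketch():
    """Return the current wall-clock time as whole milliseconds since the epoch."""
    return time.time_ns() // 1_000_000  # nanoseconds to milliseconds


start = system_time_millis_sketch()
print(f"current time in millis: {start}")
```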

topologicalSort(sources, initial_columns=None, flatten=True)

Perform a topological sort over sources

Used to compute the order in which columns are generated, based on the column generation dependencies.

The column generation dependencies are based on the value of the baseColumn attribute for withColumn or withColumnSpec statements in the data generator specification.

Parameters:
  • sources – list of (name, set(names of dependencies)) pairs

  • initial_columns – force initial_columns to be computed first

  • flatten – if true, flatten output list

Returns:

list of names in dependency order separated into build phases

Note

The algorithm will give preference to retaining the order of the inbound sequence over modifying the order to produce a lower number of build phases.

Overall, the effect is that the input build order should be retained unless there are forward references.
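The idea of ordering columns by dependency and grouping them into build phases can be illustrated with a plain phase-based topological sort. This sketch is not the library's algorithm: it groups names into as few phases as possible and does not reproduce the preference for retaining inbound order described in the note. The name `topological_sort_sketch` and the ValueError raised on unresolvable dependencies are assumptions.

```python
def topological_sort_sketch(sources, initial_columns=None, flatten=True):
    """Order names so that each appears after all of its dependencies.

    sources: list of (name, set of dependency names) pairs.
    initial_columns: names treated as already available, emitted first.
    flatten: return one flat list if true, else a list of phases.
    """
    resolved = set(initial_columns or [])
    phases = [list(initial_columns)] if initial_columns else []
    pending = [(name, set(deps)) for name, deps in sources
               if name not in resolved]

    while pending:
        # Names whose dependencies are all resolved form the next phase;
        # input order is kept within a phase.
        ready = [name for name, deps in pending if deps <= resolved]
        if not ready:
            unresolved = ", ".join(name for name, _ in pending)
            raise ValueError(f"cyclic or missing dependency among: {unresolved}")
        phases.append(ready)
        resolved.update(ready)
        pending = [(name, deps) for name, deps in pending
                   if name not in resolved]

    if flatten:
        return [name for phase in phases for name in phase]
    return phases


columns = [
    ("id", set()),
    ("city_pop", {"city_id"}),   # forward reference to city_id
    ("city_id", {"id"}),
    ("city_name", {"city_id"}),
]
print(topological_sort_sketch(columns))
print(topological_sort_sketch(columns, flatten=False))
```

In this example `city_pop` is listed before the `city_id` column it depends on, so the sort moves it to a later phase.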