dbldatagen.text_generators module

This file defines various text generation classes and methods.

class ILText(paragraphs=None, sentences=None, words=None, extendedWordList=None)[source]

Bases: TextGenerator

Class to generate Ipsum Lorem text as paragraphs, sentences and words.

Parameters:
  • paragraphs – Number of paragraphs to generate. If a tuple, a random number in the tuple range is generated.

  • sentences – Number of sentences to generate. If a tuple, a random number in the tuple range is generated.

  • words – Number of words per sentence to generate. If a tuple, a random number in the tuple range is generated.
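
To illustrate how an int-or-tuple count parameter can be interpreted, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names `pick_count` and `lorem_text`, the small word list, the inclusive treatment of the tuple range, and the use of the standard `random` module are all assumptions made for illustration.

```python
import random

# Small illustrative word list (an assumption; not the library's word set).
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]


def pick_count(rng, spec):
    """Return spec if it is a single number, else a random int from the tuple range.

    Treating the range as inclusive at both ends is an assumption of this sketch.
    """
    if isinstance(spec, tuple):
        low, high = spec
        return rng.randint(low, high)
    return spec


def lorem_text(seed, paragraphs=(1, 2), sentences=(2, 4), words=(3, 6)):
    """Build text with the requested numbers of paragraphs, sentences and words."""
    rng = random.Random(seed)
    paras = []
    for _ in range(pick_count(rng, paragraphs)):
        sents = []
        for _ in range(pick_count(rng, sentences)):
            chosen = [rng.choice(WORDS) for _ in range(pick_count(rng, words))]
            sents.append(" ".join(chosen).capitalize() + ".")
        paras.append(" ".join(sents))
    return "\n\n".join(paras)


# Exactly 2 paragraphs, 1 to 3 sentences each, exactly 4 words per sentence.
print(lorem_text(seed=42, paragraphs=2, sentences=(1, 3), words=4))
```

Because the sketch seeds its own generator, calling `lorem_text` twice with the same seed and parameters yields the same text.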

classicGenerateText(v)[source]

Classic UDF entry point for text generation.

Parameters:

v – base value to control generation of random numbers

generateText(baseValues, rowCount=1)[source]

Generate text for each base value, according to the configuration parameters.

Because it uses numpy, repeatability of the output may depend on the version of the runtime.

Parameters:
  • baseValues – list or array-like of base values

  • rowCount – number of rows

Returns:

list or Pandas series of generated strings, the same size as the input base values

pandasGenerateText(v)[source]

Pandas UDF entry point for text generation.

Parameters:

v – pandas series of base values for random text generation

Returns:

Pandas series of generated strings

class TemplateGenerator(template, escapeSpecialChars=False, extendedWordList=None)[source]

Bases: TextGenerator

This class handles the generation of text from templates

Parameters:
  • template – template string to use in text generation

  • escapeSpecialChars – By default, special chars in the template have their special meaning when unescaped. If set to True, the special meaning requires the escape char \

  • extendedWordList – if provided, use specified word list instead of default word list

The template generator generates text from a template to allow for generation of synthetic account card numbers, VINs, IBANs and many other structured codes.

The base value is passed to the template generation and may be used in the generated text. The base value is the value the column would have if the template generation had not been applied.

It uses the following special chars:

Chars         Meaning
------------  ------------------------------------------------------------
\             Apply escape to next char.
v0,v1,..v9    Use base value as an array of values and substitute the nth element (0 .. 9). Always escaped.
x             Insert a random lowercase hex digit.
X             Insert a random uppercase hex digit.
d             Insert a random decimal digit.
D             Insert a random decimal digit.
a             Insert a random lowercase alphabetical character.
A             Insert a random uppercase alphabetical character.
k             Insert a random lowercase alphanumeric character.
K             Insert a random uppercase alphanumeric character.
n             Insert a random number between 0 .. 255 inclusive. This option must always be escaped.
N             Insert a random number between 0 .. 65535 inclusive. This option must always be escaped.
w             Insert a random lowercase word from the ipsum lorem word set. Always escaped.
W             Insert a random uppercase word from the ipsum lorem word set. Always escaped.

Note

If escape is used and escapeSpecialChars is False, then the following char is assumed to have no special meaning.

If the escapeSpecialChars option is set to True, then the following char only has its special meaning when preceded by an escape.

Some options must always be escaped, for example \v, \n and \w.

A special case exists for \v - if immediately followed by a digit 0 - 9, the underlying base value is interpreted as an array of values and the nth element is retrieved where n is the digit specified.

In all other cases, the char itself is used.

The setting of the escapeSpecialChars determines how templates generate data.

If set to False, then the template r"\dr_\v" will generate the values "dr_0" through "dr_999" when applied to the values zero to 999. This conforms to earlier implementations for backwards compatibility.

If set to True, then the template r"dr_\v" will generate the values "dr_0" through "dr_999" when applied to the values zero to 999. This conforms to the preferred style going forward.
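
The escaping rules above can be made concrete with a small pure-Python sketch. This is an illustration written from the rules in this section, not the library's code: the function name `expand_template`, its `seed` argument, the short word list, and the use of the standard `random` module are assumptions.

```python
import random
import string

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # illustrative word list (assumption)

# Chars whose special meaning depends on the escapeSpecialChars setting.
PLAIN_SPECIALS = {
    "x": "0123456789abcdef",
    "X": "0123456789ABCDEF",
    "d": string.digits,
    "D": string.digits,
    "a": string.ascii_lowercase,
    "A": string.ascii_uppercase,
    "k": string.ascii_lowercase + string.digits,
    "K": string.ascii_uppercase + string.digits,
}
# Options that only have a special meaning when escaped, in either mode.
ALWAYS_ESCAPED = set("vnNwW")


def expand_template(template, base_value, escape_special_chars=False, seed=None):
    """Expand one template string for one base value (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        escaped = ch == "\\" and i + 1 < len(template)
        if escaped:
            i += 1
            ch = template[i]
        if escaped and ch in ALWAYS_ESCAPED:
            if ch == "v":
                # \v followed by a digit indexes into the base value.
                if i + 1 < len(template) and template[i + 1] in string.digits:
                    i += 1
                    out.append(str(base_value[int(template[i])]))
                else:
                    out.append(str(base_value))
            elif ch == "n":
                out.append(str(rng.randint(0, 255)))
            elif ch == "N":
                out.append(str(rng.randint(0, 65535)))
            else:  # "w" or "W"
                word = rng.choice(WORDS)
                out.append(word.upper() if ch == "W" else word)
        elif ch in PLAIN_SPECIALS and escaped == escape_special_chars:
            # Special when unescaped (False mode) or when escaped (True mode).
            out.append(rng.choice(PLAIN_SPECIALS[ch]))
        else:
            # In all other cases, the char itself is used.
            out.append(ch)
        i += 1
    return "".join(out)


print(expand_template(r"\dr_\v", 5, escape_special_chars=False))  # dr_5
print(expand_template(r"dr_\v", 5, escape_special_chars=True))    # dr_5
print(expand_template(r"\v1-\v0", [10, 20]))                      # 20-10
print(expand_template("ddd-XXXX", 0, seed=1))  # three digits, a dash, four hex digits
```

The single comparison `escaped == escape_special_chars` captures the difference between the two modes: with False, a special char acts only when unescaped; with True, only when escaped.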

classicGenerateText(v)[source]

Entry point to use for classic UDFs.

pandasGenerateText(v)[source]

Entry point to use for pandas UDFs.

The implementation is vectorized.

Parameters:

v – Pandas series of values passed as base values

Returns:

Pandas series of expanded templates

property templates

Get effective templates for text generator

class TextGenerator[source]

Bases: object

Base class for text generation classes

static compactNumpyTypeForValues(listValues)[source]

Determine the smallest numpy type that can represent the values.

Parameters:

listValues – list or np.ndarray of values to get np.dtype for

Returns:

np.dtype that is most compact representation for values provided
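
As a rough illustration of what "most compact representation" can mean for integers, the sketch below picks the narrowest integer type by range and returns its name as a string. This is an assumption-laden stand-in: the documented method returns an np.dtype, and its actual selection rule is not described here and may differ. The function name `compact_int_type_name` is hypothetical.

```python
def compact_int_type_name(values):
    """Return the name of the narrowest integer type that can hold all values."""
    low, high = min(values), max(values)
    if low >= 0:
        # Non-negative data fits in an unsigned type.
        candidates = [
            ("uint8", 0, 2**8 - 1),
            ("uint16", 0, 2**16 - 1),
            ("uint32", 0, 2**32 - 1),
            ("uint64", 0, 2**64 - 1),
        ]
    else:
        # Any negative value requires a signed type.
        candidates = [
            ("int8", -(2**7), 2**7 - 1),
            ("int16", -(2**15), 2**15 - 1),
            ("int32", -(2**31), 2**31 - 1),
            ("int64", -(2**63), 2**63 - 1),
        ]
    for name, type_min, type_max in candidates:
        if type_min <= low and high <= type_max:
            return name
    raise ValueError("values do not fit in a 64-bit integer type")


print(compact_int_type_name([0, 17, 255]))   # uint8
print(compact_int_type_name([0, 256]))       # uint16
print(compact_int_type_name([-1, 100]))      # int8
```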

static getAsTupleOrElse(v, defaultValue, valueName)[source]

Get value v as a tuple, or return the default value.

Parameters:
  • v – value to test

  • defaultValue – value to use as a default if value of v is None. Must be a tuple.

  • valueName – name of value for debugging and logging purposes

Returns:

Returns v as a tuple if it is not None, or defaultValue if v is None. If v is a single value, returns the tuple (v, v).
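
The documented return behavior can be sketched in a few lines of plain Python. The function below, `get_as_tuple_or_else`, is a hypothetical stand-in written from the description above, not the library's source; raising `TypeError` for a non-tuple default is an assumption of the sketch.

```python
def get_as_tuple_or_else(v, default_value, value_name):
    """Normalize v to a tuple, falling back to default_value when v is None."""
    if not isinstance(default_value, tuple):
        # The default must already be a tuple; value_name is used only in the message.
        raise TypeError(f"default for '{value_name}' must be a tuple")
    if v is None:
        return default_value
    if isinstance(v, tuple):
        return v
    # A single value becomes a degenerate range (v, v).
    return (v, v)


print(get_as_tuple_or_else(None, (1, 3), "sentences"))    # (1, 3)
print(get_as_tuple_or_else((2, 6), (1, 3), "sentences"))  # (2, 6)
print(get_as_tuple_or_else(4, (1, 3), "sentences"))       # (4, 4)
```

Normalizing to a tuple in one place lets the rest of a generator treat fixed counts and ranges uniformly.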

getNPRandomGenerator(forceNewInstance=False)[source]

Get the numpy random number generator.

Returns:

random number generator initialized from the previously supplied random seed

property randomSeed

Get the random seed for the text generator.

withRandomSeed(seed)[source]

Set the random seed for the text generator.

Parameters:

seed – seed value to set

Returns:

self
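
Because withRandomSeed returns self, it can be chained with other calls. The sketch below shows that fluent pattern with a hypothetical class, `SeededTextGenerator`. It uses the standard `random` module in place of numpy, and the caching behavior attached to `forceNewInstance` is an assumption, since this section does not document that parameter.

```python
import random


class SeededTextGenerator:
    """Hypothetical stand-in showing a fluent seed setter; not the library class."""

    def __init__(self):
        self._random_seed = 42  # arbitrary default chosen for this sketch
        self._rng = None

    @property
    def randomSeed(self):
        """Seed currently configured for this generator."""
        return self._random_seed

    def withRandomSeed(self, seed):
        """Store the seed, drop any cached generator, and return self for chaining."""
        self._random_seed = seed
        self._rng = None
        return self

    def getRandomGenerator(self, forceNewInstance=False):
        """Return a generator seeded from the stored seed (cached unless forced)."""
        if self._rng is None or forceNewInstance:
            self._rng = random.Random(self._random_seed)
        return self._rng


gen_a = SeededTextGenerator().withRandomSeed(1234)
gen_b = SeededTextGenerator().withRandomSeed(1234)

# Two generators configured with the same seed produce the same sequence.
seq_a = [gen_a.getRandomGenerator().randint(0, 9) for _ in range(5)]
seq_b = [gen_b.getRandomGenerator().randint(0, 9) for _ in range(5)]
print(seq_a == seq_b)  # True
```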