dbldatagen.text_generators module
This module defines various text generation classes and methods.
- class ILText(paragraphs=None, sentences=None, words=None, extendedWordList=None)
Bases: TextGenerator
Class to generate Ipsum Lorem text paragraphs, sentences and words.
- Parameters:
paragraphs – Number of paragraphs to generate. If a tuple, a random number in the tuple range is generated.
sentences – Number of sentences to generate. If a tuple, a random number in the tuple range is generated.
words – Number of words per sentence to generate. If a tuple, a random number in the tuple range is generated.
- classicGenerateText(v)
Classic UDF entry point for text generation.
- Parameters:
v – base value to control generation of random numbers
- generateText(baseValues, rowCount=1)
Generate text for each seed (base value) according to the configuration parameters.
Because it uses NumPy, repeatability depends on the version of the runtime.
- Parameters:
baseValues – list or array-like of base values
rowCount – number of rows
- Returns:
list or Pandas series of generated strings, of the same size as the input seed
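The tuple-range behaviour of the paragraphs, sentences and words parameters can be illustrated with a small standalone sketch. This is not the library's implementation: it uses the standard library `random` module rather than NumPy, and the names `WORDS`, `pick_count`, `lorem_text` and `generate_text` are hypothetical. It also assumes that tuple ranges are inclusive at both ends and that each base value seeds its own generator.

```python
import random

# Hypothetical stand-in for the library's word list.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"]


def pick_count(spec, rng):
    """Return spec for a plain int, or a random count in [low, high] for a tuple."""
    if isinstance(spec, tuple):
        low, high = spec
        return rng.randint(low, high)  # assumption: both bounds are inclusive
    return spec


def lorem_text(base_value, paragraphs=1, sentences=3, words=5):
    """Build text for one base value; the base value seeds the generator."""
    rng = random.Random(base_value)
    paras = []
    for _ in range(pick_count(paragraphs, rng)):
        sents = []
        for _ in range(pick_count(sentences, rng)):
            chosen = [rng.choice(WORDS) for _ in range(pick_count(words, rng))]
            sents.append(" ".join(chosen).capitalize() + ".")
        paras.append(" ".join(sents))
    return "\n\n".join(paras)


def generate_text(base_values, **kwargs):
    """Return one generated string per base value, in input order."""
    return [lorem_text(v, **kwargs) for v in base_values]


# One to two paragraphs, exactly two sentences each, three to five words per sentence.
sample = generate_text([1, 2, 3], paragraphs=(1, 2), sentences=2, words=(3, 5))
```

Seeding from the base value makes the output for a given input repeatable within one Python version, which mirrors the repeatability caveat noted for generateText.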
- class TemplateGenerator(template, escapeSpecialChars=False, extendedWordList=None)
Bases: TextGenerator
This class handles the generation of text from templates.
- Parameters:
template – template string to use in text generation
escapeSpecialChars – By default, special chars in the template have their special meaning when unescaped. If set to True, the special meaning requires the escape char \.
extendedWordList – if provided, use specified word list instead of default word list
The template generator generates text from a template to allow for generation of synthetic account card numbers, VINs, IBANs and many other structured codes.
The base value is passed to the template generation and may be used in the generated text. The base value is the value the column would have if the template generation had not been applied.
It uses the following special chars:
Chars         Meaning
\             Apply escape to next char.
v0,v1,..v9    Use base value as an array of values and substitute the nth element (0 .. 9). Always escaped.
x             Insert a random lowercase hex digit.
X             Insert a random uppercase hex digit.
d             Insert a random lowercase decimal digit.
D             Insert a random uppercase decimal digit.
a             Insert a random lowercase alphabetical character.
A             Insert a random uppercase alphabetical character.
k             Insert a random lowercase alphanumeric character.
K             Insert a random uppercase alphanumeric character.
n             Insert a random number between 0 .. 255 inclusive. This option must always be escaped.
N             Insert a random number between 0 .. 65535 inclusive. This option must always be escaped.
w             Insert a random lowercase word from the ipsum lorem word set. Always escaped.
W             Insert a random uppercase word from the ipsum lorem word set. Always escaped.
Note
If escape is used and escapeSpecialChars is False, then the following char is assumed to have no special meaning.
If the escapeSpecialChars option is set to True, then the following char only has its special meaning when preceded by an escape.
Some options must always be escaped, for example \v, \n and \w. A special case exists for \v: if immediately followed by a digit 0 - 9, the underlying base value is interpreted as an array of values and the nth element is retrieved, where n is the digit specified. In all other cases, the char itself is used.
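The special chars and escape rules can be made concrete with a standalone sketch of a template expander. This is an illustration under stated assumptions, not the library's code: `expand_template`, `WORDS` and the `rng` parameter are hypothetical names, the standard library `random` module stands in for NumPy, and both d and D are treated as a digit 0 to 9 because the table distinguishes them only by case.

```python
import random
import string

# Hypothetical stand-in for the ipsum lorem word set.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]


def expand_template(template, base_value, escape_special_chars=False, rng=None):
    """Expand one template for one base value (illustrative only)."""
    rng = rng or random.Random()
    # Chars whose special meaning depends on the escapeSpecialChars setting.
    simple = {
        "x": lambda: rng.choice("0123456789abcdef"),
        "X": lambda: rng.choice("0123456789ABCDEF"),
        "d": lambda: rng.choice(string.digits),
        "D": lambda: rng.choice(string.digits),  # assumption: same as d
        "a": lambda: rng.choice(string.ascii_lowercase),
        "A": lambda: rng.choice(string.ascii_uppercase),
        "k": lambda: rng.choice(string.ascii_lowercase + string.digits),
        "K": lambda: rng.choice(string.ascii_uppercase + string.digits),
    }
    # Chars that are special only when escaped, whatever the setting.
    always_escaped = {
        "n": lambda: str(rng.randint(0, 255)),
        "N": lambda: str(rng.randint(0, 65535)),
        "w": lambda: rng.choice(WORDS),
        "W": lambda: rng.choice(WORDS).upper(),
    }
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        escaped = False
        if ch == "\\" and i + 1 < len(template):
            escaped = True
            i += 1
            ch = template[i]
        if escaped and ch == "v":
            nxt = template[i + 1] if i + 1 < len(template) else ""
            if nxt and nxt in string.digits:
                # \v0 .. \v9: treat the base value as an array and take element n.
                out.append(str(base_value[int(nxt)]))
                i += 1
            else:
                out.append(str(base_value))
        elif escaped and ch in always_escaped:
            out.append(always_escaped[ch]())
        elif ch in simple and escaped == escape_special_chars:
            out.append(simple[ch]())
        else:
            out.append(ch)  # no special meaning: use the char itself
        i += 1
    return "".join(out)


# With escape_special_chars=False, the escaped "d" is literal and "\v" is the base value.
legacy = expand_template(r"\dr_\v", 5)
# With escape_special_chars=True, the unescaped "d" is literal.
preferred = expand_template(r"dr_\v", 5, escape_special_chars=True)
```

The single test `escaped == escape_special_chars` captures both modes: with the setting off, a char is special only when unescaped, and with it on, only when escaped.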
The setting of escapeSpecialChars determines how templates generate data.
If set to False, then the template r"\dr_\v" will generate the values "dr_0" … "dr_999" when applied to the values zero to 999. This conforms to earlier implementations for backwards compatibility.
If set to True, then the template r"dr_\v" will generate the values "dr_0" … "dr_999" when applied to the values zero to 999. This conforms to the preferred style going forward.
- pandasGenerateText(v)
Entry point to use for pandas UDFs.
The implementation processes the base values in vectorized form.
- Parameters:
v – Pandas series of values passed as base values
- Returns:
Pandas series of expanded templates
- property templates
Get effective templates for text generator
- class TextGenerator
Bases: object
Base class for text generation classes.
- static compactNumpyTypeForValues(listValues)
Determine the smallest numpy type that can represent the values.
- Parameters:
listValues – list or np.ndarray of values to get np.dtype for
- Returns:
np.dtype that is most compact representation for values provided
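The idea of choosing the most compact type can be sketched without NumPy by returning a dtype name as a string. This is only an illustration: `compact_int_type_name` is a hypothetical helper, it assumes integer values, and it assumes unsigned types are preferred when no value is negative. The library's own selection rules may differ.

```python
def compact_int_type_name(values):
    """Return the name of the smallest integer dtype that can hold all values."""
    lo, hi = min(values), max(values)
    if lo >= 0:
        # Assumption: prefer unsigned types when no value is negative.
        candidates = [
            ("uint8", 0, 2**8 - 1),
            ("uint16", 0, 2**16 - 1),
            ("uint32", 0, 2**32 - 1),
            ("uint64", 0, 2**64 - 1),
        ]
    else:
        candidates = [
            ("int8", -(2**7), 2**7 - 1),
            ("int16", -(2**15), 2**15 - 1),
            ("int32", -(2**31), 2**31 - 1),
            ("int64", -(2**63), 2**63 - 1),
        ]
    for name, type_min, type_max in candidates:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in a 64-bit integer type")


name = compact_int_type_name([0, 17, 255])
```

The returned name is a valid NumPy dtype string, so it could be passed to `np.dtype(name)` where NumPy is available.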
- static getAsTupleOrElse(v, defaultValue, valueName)
Get value v as a tuple, or return the default value.
- Parameters:
v – value to test
defaultValue – value to use as a default if value of v is None. Must be a tuple.
valueName – name of value for debugging and logging purposes
- Returns:
v as a tuple if it is not None, or defaultValue if v is None. If v is a single value, returns the tuple (v, v).
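The documented behaviour can be restated as a short standalone sketch. The function `get_as_tuple_or_else` below is a hypothetical re-implementation based only on the description above, not the library's code, and the use of the value name in an assertion message is an assumption.

```python
def get_as_tuple_or_else(v, default_value, value_name):
    """Normalize v to a tuple as described for getAsTupleOrElse (illustrative)."""
    # The default must already be a tuple; value_name is used only for diagnostics.
    assert isinstance(default_value, tuple), f"default for {value_name} must be a tuple"
    if v is None:
        return default_value
    if isinstance(v, tuple):
        return v
    # A single value becomes a degenerate range (v, v).
    return (v, v)


words_range = get_as_tuple_or_else(4, (2, 6), "words")
```

Normalizing every setting to a (low, high) tuple means downstream code can treat fixed counts and ranges the same way.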
- getNPRandomGenerator(forceNewInstance=False)
Get the numpy random number generator.
- Returns:
random number generator initialized from the previously supplied random seed
- property randomSeed
Get random seed for text generator