Generating and Manipulating Text Data

There are a number of ways to generate and manipulate text data.

  • Generating values from a specific set of values

  • Formatting text based on an existing value

  • Using a SQL expression to transform an existing or random value

  • Using the Ipsum Lorem text generator

  • Using the general purpose text generator

Generating data from a specific set of values

You can specify a specific set of values for a column - these can be of the same type as the column data type, or if not, at runtime, they will be cast to the column data type automatically.

This is the simplest way to specify a small set of discrete values for a column.

The following example illustrates generating data for specific ranges of values:

import dbldatagen as dg
df_spec = (
    dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                   partitions=4, randomSeedMethod="hash_fieldname")
   .withIdOutput()
   .withColumn("code3", "string", values=['online', 'offline', 'unknown'])
   .withColumn("code4", "string", values=['a', 'b', 'c'], random=True, percentNulls=0.05)
   .withColumn("code5", "string", values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
)

Generating text from existing values

We can also generate text data from existing values whether numeric, or other data types via a set of transformations:

This can include:
  • adding a prefix

  • adding a suffix

  • formatting an existing value as a string

  • using a custom SQL expression to generate a string

The root value in these cases is taken from the base_column or in simple cases, may be specified as part of the basic column generation.

Formatting text based on an existing value

Often we want to generate a text value based on some numeric value
  • by combining a prefix

  • by adding a suffix

  • by addeding arbitrary formatting or other transformations.

See the online documentation for the ColumnSpecOptions class for more details.

Using a SQL expression to transform existing or random values

The expr attribute can be used to generate data values from arbitrary SQL expressions. These can include expressions such as concat that generate text results.

See the online documentation for the ColumnSpecOptions class for more details.

Using Text Generators

The Data Generation framework provides a number of classes for general purpose text generation

### Using the Ipsum Lorem text generator The Ipsum lorem text generator generates sequences of words, sentances, and paragraphs following the Ipsum Lorem convention used in UI mockups. It originates from a technique used in type setting.

See Wikipedia article on Lorem Ipsum

The following example illustrates its use:

import dbldatagen as dg
df_spec = (
   dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                  partitions=4, randomSeedMethod="hash_fieldname")
   .withIdOutput()
   .withColumn("sample_text", "string", text=dg.ILText(paragraphs=(1, 4),
                   sentences=(2, 6)))
)

df = df_spec.build()
num_rows=df.count()

Using the general purpose text generator

The template attribute allows specification of templated text generation.

Note

The template option is shorthand for text=dg.TemplateGenerator(template=...)

This can be specified with different options covering how escapes are handled and customizing the word list - see the TemplateGenerator documentation for more details.

import dbldatagen as dg
df_spec = (
     dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                      partitions=4, randomSeedMethod="hash_fieldname")
    .withIdOutput()
    .withColumn("email", "string",
                    template=r'\w.\w@\w.com|\w@\w.co.u\k')
    .withColumn("ip_addr", "string",
                     template=r'\n.\n.\n.\n')
    .withColumn("phone", "string",
                     template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')

    # the following implements the same pattern as for `phone` but using the `TemplateGenerator` class
    .withColumn("phone2", "string",
                     text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))
    )

df = df_spec.build()
num_rows=df.count()

The implementation of the template expansion uses the underlying TemplateGenerator class.

TemplateGenerator options

The template generator generates text from a template to allow for generation of synthetic credit card numbers, VINs, IBANs and many other structured codes.

The base value is passed to the template generation and may be used in the generated text. The base value is the value the column would have if the template generation had not been applied.

It uses the following special chars:

Chars

Meaning

\

Apply escape to next char.

v0,..v9

Use base value as an array of values and substitute the nth element ( 0 .. 9). Always escaped.

x

Insert a random lowercase hex digit

X

Insert an uppercase random hex digit

d

Insert a random lowercase decimal digit

D

Insert an uppercase random decimal digit

a

Insert a random lowercase alphabetical character

A

Insert a random uppercase alphabetical character

k

Insert a random lowercase alphanumeric character

K

Insert a random uppercase alphanumeric character

n

Insert a random number between 0 .. 255 inclusive. This option must always be escaped

N

Insert a random number between 0 .. 65535 inclusive. This option must always be escaped

w

Insert a random lowercase word from the ipsum lorem word set. Always escaped

W

Insert a random uppercase word from the ipsum lorem word set. Always escaped

In all other cases, the char itself is used.

The setting of the escapeSpecialChars determines how the template generate interprets the special chars.

If set to False, which defaults to False, then the special char does not need to be escaped to have its special meaning. But the special char must be escaped to be treated as a literal char.

So the template r"\dr_\v" will generate the values "dr_0""dr_999" when used via the template option and applied to the values zero to 999. Here the the character d is escaped to avoid interpretation as a special character.

If set to True, then the special char only has its special meaning when preceded by an escape.

So the option text=dg.TemplateGenerator(r’dr_v’, escapeSpecialChars=True) will generate the values "dr_0""dr_999" when applied to the values zero to 999.

This conforms to earlier implementations for backwards compatibility.

Note

Setting the argument escapeSpecialChars=False means that the special char does not need to be escaped to be treated as a special char. But it must be escaped to be treated as a literal char.

If the escapeSpecialChars option is set to True, then the following char only has its special meaning when preceded by an escape.

Some options must be always escaped for example \\v, \\n and \\w.

A special case exists for \\v - if immediately followed by a digit 0 - 9, the underlying base value is interpreted as an array of values and the nth element is retrieved where n is the digit specified.

The escapeSpecialChars is set to False by default for backwards compatibility.

To use the escapeSpecialChars option, use the variant text=dg.TemplateGenerator(template=..., escapeSpecialChars=True)

Using a custom word list

The template generator allows specification of a custom word list also. This is a list of words that can be used in the template generation. The default word list is the ipsum lorem word list.

While the values option allows for the specification of a list of categorical values, this is transmitted as part of the generated SQL. The use of the TemplateGenerator object with a custom word list allows for specification of much larger lists of possible values without the need to transmit them as part of the generated SQL.

For example the following code snippet illustrates the use of a custom word list:

import dbldatagen as dg

names = ['alpha', 'beta', 'gamma', 'lambda', 'theta']

df_spec = (
     dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                      partitions=4, randomSeedMethod="hash_fieldname")
    .withIdOutput()
    .withColumn("email", "string",
                    template=r'\w.\w@\w.com|\w@\w.co.u\k')
    .withColumn("ip_addr", "string",
                     template=r'\n.\n.\n.\n')
    .withColumn("phone", "string",
                     template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')

    # implements the same pattern as for `phone` but using the `TemplateGenerator` class
    .withColumn("phone2", "string",
                     text=dg.TemplateGenerator(r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd'))

    # uses a custom word list
    .withColumn("name", "string",
                     text=dg.TemplateGenerator(r'\w \w|\w \w \w|\w \a. \w',
                                               escapeSpecialChars=True,
                                               extendedWordList=names))
    )

df = df_spec.build()
display(df)

Here the names variable is a list of names that can be used in the template generation.

While this is short list in this case, it could be a much larger list of names either specified as a literal, or read from another dataframe, file, table or produced from another source.

As this is not transmitted as part of the generated SQL, it allows for much larger lists of possible values.

Other forms of text value lookup

The use of the values option and the template option with a TemplateGenerator instance allow for generation of data when the range of possible values is known.

But what about scenarios when the list of data is read from a different table or some other form of lookup?

As the output of the data generation build() method is a regular PySpark DataFrame, it is possible to join the generated data with other data sources to generate the required data.

In these cases, the generator can be specified to produce lookup keys that can be used to join with the other data sources.