Generating and Manipulating Text Data

There are several ways to generate and manipulate text data:

  • Generating values from a specific set of values

  • Formatting text based on an existing value

  • Using a SQL expression to transform an existing or random value

  • Using the Ipsum Lorem text generator

  • Using the general purpose text generator

Generating data from a specific set of values

You can specify an explicit set of values for a column. The values can be of the same type as the column data type; if they are not, they are automatically cast to the column data type at runtime.

This is the simplest way to specify a small set of discrete values for a column.

The following example illustrates generating data from specific sets of values:

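import dbldatagen as dg
from pyspark.sql.types import StringType

# Assumes an active SparkSession is available as `spark`
df_spec = (
    dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                     partitions=4, randomSeedMethod="hash_fieldname")
    .withIdOutput()
    # random is not set here, so values are not chosen at random
    .withColumn("code3", StringType(), values=['online', 'offline', 'unknown'])
    # values chosen at random, with about 5% of rows set to null
    .withColumn("code4", StringType(), values=['a', 'b', 'c'], random=True, percentNulls=0.05)
    # values chosen at random, weighted 9:1:1
    .withColumn("code5", StringType(), values=['a', 'b', 'c'], random=True, weights=[9, 1, 1])
)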
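To get a feel for what weights=[9, 1, 1] implies for the code5 column, the following plain-Python sketch draws a weighted sample with the standard library. It does not use dbldatagen and assumes the weights are read as relative frequencies; the variable names are illustrative only.

```python
import random
from collections import Counter

values = ["a", "b", "c"]
weights = [9, 1, 1]

# Normalise the relative weights into probabilities (9/11, 1/11, 1/11).
total = sum(weights)
probabilities = {v: w / total for v, w in zip(values, weights)}

# Draw a weighted sample with a fixed seed so the run is repeatable.
rng = random.Random(42)
sample = rng.choices(values, weights=weights, k=10_000)
counts = Counter(sample)

for value in values:
    print(value, round(probabilities[value], 3), counts[value])
```

Under this reading, about nine rows in eleven would carry the value 'a'.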

Generating text from existing values

We can also generate text data from existing values, whether numeric or of other data types, via a set of transformations. These can include:
  • adding a prefix

  • adding a suffix

  • formatting an existing value as a string

  • using a custom SQL expression to generate a string

The root value in these cases is taken from the base column (the baseColumn option) or, in simple cases, may be specified as part of the basic column generation.

Formatting text based on an existing value

Often we want to generate a text value based on some numeric value:
  • by adding a prefix

  • by adding a suffix

  • by adding arbitrary formatting or other transformations.

See the online documentation for the ColumnSpecOptions class for more details.
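The options above can be sketched as follows. This is a minimal sketch, assuming an active SparkSession is passed in as spark; the function name, column names and the HEX_FORMAT constant are illustrative and not part of the library. The imports sit inside the function so that it can be defined without Spark installed.

```python
# Illustrative printf-style format string (an assumption for this sketch).
HEX_FORMAT = "0x%06x"


def build_formatted_text_spec(spark, rows=1000):
    """Return a generator spec whose text columns derive from a numeric column."""
    import dbldatagen as dg
    from pyspark.sql.types import IntegerType, StringType

    return (
        dg.DataGenerator(sparkSession=spark, name="formatted_text",
                         rows=rows, partitions=4)
        .withIdOutput()
        # numeric value that the text columns below are based on
        .withColumn("item_num", IntegerType(), minValue=1, maxValue=500, random=True)
        # prefix: text placed before the base value
        .withColumn("item_code", StringType(), prefix="item", baseColumn="item_num")
        # suffix: text placed after the base value
        .withColumn("item_tag", StringType(), suffix="tag", baseColumn="item_num")
        # format: format string applied to the base value
        .withColumn("item_hex", StringType(), format=HEX_FORMAT, baseColumn="item_num")
    )


# Plain-Python preview of the format string; Python's % operator is used
# here only as an approximation of how the format option renders a value.
print(HEX_FORMAT % 255)
```

Calling build_formatted_text_spec(spark).build() would then return a DataFrame with the derived text columns.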

Using a SQL expression to transform existing or random values

The expr attribute can be used to generate data values from arbitrary SQL expressions. These can include expressions such as concat that generate text results.

See the online documentation for the ColumnSpecOptions class for more details.
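As an illustration of the expr attribute, the sketch below builds text columns from SQL expressions. It assumes an active SparkSession is passed in as spark and that the seed column has its default name id; the function name, column names and EMAIL_EXPR are hypothetical. baseColumn is given so that each column is built after the column its expression refers to.

```python
# Illustrative SQL expression; the example.com address pattern is made up.
EMAIL_EXPR = "concat('user_', cast(id as string), '@example.com')"


def build_expr_text_spec(spark, rows=1000):
    """Return a generator spec whose text columns come from SQL expressions."""
    # Imported lazily so the function can be defined without Spark installed.
    import dbldatagen as dg
    from pyspark.sql.types import StringType

    return (
        dg.DataGenerator(sparkSession=spark, name="expr_text",
                         rows=rows, partitions=4)
        .withIdOutput()
        # text built from the id column with a SQL concat expression
        .withColumn("email", StringType(), expr=EMAIL_EXPR, baseColumn="id")
        # text derived from a previously defined column
        .withColumn("email_initial", StringType(),
                    expr="upper(substring(email, 1, 1))", baseColumn="email")
    )
```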

Using Text Generators

The Data Generation framework provides a number of classes for general purpose text generation.

Using the Ipsum Lorem text generator

The Ipsum Lorem text generator generates sequences of words, sentences, and paragraphs following the Lorem Ipsum convention used in UI mockups. It originates from a technique used in typesetting.

See the Wikipedia article on Lorem Ipsum.

The following example illustrates its use:

import dbldatagen as dg
from pyspark.sql.types import StringType

df_spec = (
    dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                     partitions=4, randomSeedMethod="hash_fieldname")
    .withIdOutput()
    # withColumn defines a new column: 1 to 4 paragraphs of 2 to 6 sentences each
    .withColumn("sample_text", StringType(),
                text=dg.ILText(paragraphs=(1, 4), sentences=(2, 6)))
)

df = df_spec.build()
num_rows = df.count()

Using the general purpose text generator

The template attribute allows specification of templated text generation.

Here are some examples of its use to generate dummy email addresses, IP addresses and phone numbers:

import dbldatagen as dg
from pyspark.sql.types import StringType

df_spec = (
    dg.DataGenerator(sparkSession=spark, name="test_data_set1", rows=100000,
                     partitions=4, randomSeedMethod="hash_fieldname")
    .withIdOutput()
    # withColumn defines each new column; template describes its text
    .withColumn("email", StringType(),
                template=r'\w.\w@\w.com|\w@\w.co.u\k')
    .withColumn("ip_addr", StringType(),
                template=r'\n.\n.\n.\n')
    .withColumn("phone", StringType(),
                template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd')
)

df = df_spec.build()
num_rows = df.count()

The implementation of the template expansion uses the underlying TemplateGenerator class.

Note

The template option is shorthand for text=dg.TemplateGenerator(template=...)

The template generator can be specified in multiple modes. See the TemplateGenerator documentation for more details.

TemplateGenerator options

The template generator generates text from a template to allow for generation of synthetic credit card numbers, VINs, IBANs and many other structured codes.

The base value is passed to the template generation and may be used in the generated text. The base value is the value the column would have if the template generation had not been applied.

It uses the following special chars:

| Chars | Meaning |
|-------|---------|
| \ | Apply escape to next char. |
| v0 .. v9 | Use base value as an array of values and substitute the nth element (0 .. 9). Always escaped. |
| x | Insert a random lowercase hex digit |
| X | Insert an uppercase random hex digit |
| d | Insert a random lowercase decimal digit |
| D | Insert an uppercase random decimal digit |
| a | Insert a random lowercase alphabetical character |
| A | Insert a random uppercase alphabetical character |
| k | Insert a random lowercase alphanumeric character |
| K | Insert a random uppercase alphanumeric character |
| n | Insert a random number between 0 .. 255 inclusive. This option must always be escaped |
| N | Insert a random number between 0 .. 65535 inclusive. This option must always be escaped |
| w | Insert a random lowercase word from the ipsum lorem word set. Always escaped |
| W | Insert a random uppercase word from the ipsum lorem word set. Always escaped |
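To make the rules for these special chars concrete, here is a simplified plain-Python stand-in for template expansion. It is not the library's implementation: the expand function, its parameters and the short WORDS list are hypothetical, the D char is left out because its exact behaviour is not spelled out here, and alternatives separated by | are not handled. The escape_special_chars flag mirrors the escapeSpecialChars option.

```python
import random
import string

# Hypothetical stand-in word list; the library has its own ipsum lorem word set.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]

# Chars that are special when unescaped (default mode) or when escaped
# (escape_special_chars=True).
PLAIN = {
    "x": lambda r: r.choice("0123456789abcdef"),
    "X": lambda r: r.choice("0123456789ABCDEF"),
    "d": lambda r: r.choice(string.digits),
    "a": lambda r: r.choice(string.ascii_lowercase),
    "A": lambda r: r.choice(string.ascii_uppercase),
    "k": lambda r: r.choice(string.ascii_lowercase + string.digits),
    "K": lambda r: r.choice(string.ascii_uppercase + string.digits),
}

# Chars that are special only when escaped, in either mode.
ESCAPED_ONLY = {
    "n": lambda r: str(r.randint(0, 255)),
    "N": lambda r: str(r.randint(0, 65535)),
    "w": lambda r: r.choice(WORDS),
    "W": lambda r: r.choice(WORDS).upper(),
}


def expand(template, base_value=0, escape_special_chars=False, rng=None):
    """Expand one template string into generated text."""
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        escaped = ch == "\\" and i + 1 < len(template)
        if escaped:
            i += 1
            ch = template[i]
        if escaped and ch == "v":
            # \v inserts the base value; \v0 .. \v9 index into a list base value
            if i + 1 < len(template) and template[i + 1].isdigit():
                i += 1
                out.append(str(base_value[int(template[i])]))
            else:
                out.append(str(base_value))
        elif escaped and ch in ESCAPED_ONLY:
            out.append(ESCAPED_ONLY[ch](rng))
        elif ch in PLAIN and escaped == escape_special_chars:
            out.append(PLAIN[ch](rng))
        else:
            # every other char, including an escaped plain char in the
            # default mode, is copied through unchanged
            out.append(ch)
        i += 1
    return "".join(out)


print(expand("ddd-dddd"))        # seven random digits with a dash
print(expand(r"\n.\n.\n.\n"))    # four numbers in 0 .. 255 joined by dots
```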

Note

If escape is used and escapeSpecialChars is False, then the following char is assumed to have no special meaning.

If the escapeSpecialChars option is set to True, then the following char only has its special meaning when preceded by an escape.

Some options must always be escaped, for example \v, \n and \w.

A special case exists for \v: if it is immediately followed by a digit 0 - 9, the underlying base value is interpreted as an array of values and the nth element is retrieved, where n is the digit specified.

The escapeSpecialChars option is set to False by default for backwards compatibility.

To use the escapeSpecialChars option, use the variant text=dg.TemplateGenerator(template=..., escapeSpecialChars=True)

In all other cases, the char itself is used.

The setting of the escapeSpecialChars option determines how templates generate data.

If set to False, then the template r"\dr_\v" will generate the values "dr_0" ... "dr_999" when applied to the values zero to 999. This conforms to earlier implementations for backwards compatibility.

If set to True, then the template r"dr_\v" will generate the values "dr_0" ... "dr_999" when applied to the values zero to 999. This conforms to the preferred style going forward.
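The two styles can be put side by side in a data generation spec. This is a minimal sketch, assuming an active SparkSession is passed in as spark and that the seed column has its default name id; the function name, column names and the two template constants are illustrative.

```python
# Preferred style: only escaped chars are special, so "dr_" is plain text.
PREFERRED_TEMPLATE = r"dr_\v"
# Legacy style: "d" is special unless escaped, so it is written as \d.
LEGACY_TEMPLATE = r"\dr_\v"


def build_template_spec(spark, rows=1000):
    """Return a generator spec using both escape styles for the same output."""
    # Imported lazily so the function can be defined without Spark installed.
    import dbldatagen as dg
    from pyspark.sql.types import StringType

    return (
        dg.DataGenerator(sparkSession=spark, name="template_text",
                         rows=rows, partitions=4)
        .withIdOutput()
        # explicit TemplateGenerator with escapeSpecialChars enabled
        .withColumn("doctor", StringType(), baseColumn="id",
                    text=dg.TemplateGenerator(template=PREFERRED_TEMPLATE,
                                              escapeSpecialChars=True))
        # shorthand template option, which uses the default (False) setting
        .withColumn("doctor_legacy", StringType(), baseColumn="id",
                    template=LEGACY_TEMPLATE)
    )
```

Both columns are intended to produce values of the form dr_ followed by the base value.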