dbldatagen.text_generator_plugins module

This file defines the text generator plugin class PyfuncText

class FakerTextFactory(locale=None, providers=None, name='FakerText', lib=None, rootClass=None)[source]

Bases: PyfuncTextFactory

Factory object for Faker text generator flavored PyfuncText objects

Parameters:
  • locale – list of locales. If empty, defaults to en-US

  • providers – list of providers

  • name – name of generated objects. Defaults to FakerText

  • lib – library import name of Faker library. If none passed, uses faker

  • rootClass – name of root object class. If none passed, uses Faker

Note

Both the library name and root object class can be overridden - this is primarily for internal testing purposes.
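As a minimal sketch of how such a factory might be built and used, the function below constructs a FakerTextFactory with a locale and a provider, then creates text generators by Faker method name. It assumes the dbldatagen and faker packages are installed. The function name build_faker_generators and the choice of Faker methods are illustrative only. The imports sit inside the function so that defining it does not require either package.

```python
def build_faker_generators():
    """Illustrative helper (hypothetical name): build Faker-backed text generators."""
    # Assumption: dbldatagen and faker are both installed in the environment.
    from dbldatagen.text_generator_plugins import FakerTextFactory
    from faker.providers import internet

    # One factory per locale / provider combination.
    FakerText = FakerTextFactory(locale=["en_US"], providers=[internet])

    # Each call names the Faker method to invoke for every generated row.
    return {
        "name": FakerText("name"),
        "ip_address": FakerText("ipv4_private"),
    }
```

Each returned object could then be passed as the text argument of a withColumn call on a data generator.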

class PyfuncText(fn, init=None, initPerBatch=False, name=None, rootProperty=None)[source]

Bases: TextGenerator

Text generator that supports generating text from an arbitrary Python function

Parameters:
  • fn – function to call to generate text.

  • init – function to call to initialize context

  • initPerBatch – if init per batch is set to True, initialization of context is performed on every Pandas udf call. Default is False.

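  • name – string representing the name of the text generator when converted to string via repr or str

  • rootProperty – if supplied, name of the context property that is passed to fn in place of the context itself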

The two functions define the plugin model

The first function, fn, is called whenever text should be generated for a single column of a single row.

It is called with the signature fn(context, value) unless a root property is set, in which case the signature is fn(rootProperty), with rootProperty having the value of the root property of the context.

The context is the stored context containing instances of random number generators, 3rd party client library objects, etc.

The second function, init, is called to initialize the function call context. The plugin code can store arbitrary properties in the context following normal Python object rules.

Before the init function is called, the context is given the property textGenerator, which is a reference to the enclosing text generator.
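The two-function model described above can be sketched in plain Python. This is only an illustration of the calling convention, not the library's implementation: the SimpleNamespace object stands in for the context that the text generator would supply, and the names init_context and generate_text are hypothetical.

```python
import random
from types import SimpleNamespace


def init_context(context):
    # Store arbitrary state on the context, following normal Python object rules.
    context.words = ["alpha", "beta", "gamma"]
    context.rng = random.Random(42)


def generate_text(context, value):
    # Called for a single column of a single row; `value` is the base column value.
    word = context.rng.choice(context.words)
    return f"{word}-{value}"


# Stand-in for the context (assumption: the real context is created by the library).
ctx = SimpleNamespace(textGenerator=None)
init_context(ctx)
sample = [generate_text(ctx, v) for v in range(3)]
```

Under these assumptions, the same pair of functions would be supplied as PyfuncText(generate_text, init=init_context).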

Note

There are no expectations of repeatability of data generation when using external code or external libraries to generate text.

However, custom code can call the base class method to get a NumPy random number generator instance. This will have been seeded using the dbldatagen random number seed if one was specified, so random numbers generated from it will be repeatable.

The custom code may read the property randomSeed on the text generator object to get the random seed, which may be used to seed library-specific initialization.

This random seed property may have the values None or -1, both of which should be treated as meaning "do not use a random seed".

The code does not guarantee thread or cross-process safety. If a new instance of the random number generator is needed, you may call the base class method with the argument forceNewInstance set to True.
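The seed handling described in this note might look like the following sketch. It uses Python's standard random module as a stand-in for library-specific seeding, and SimpleNamespace objects as stand-ins for the context and the text generator. The helper name make_rng is hypothetical.

```python
import random
from types import SimpleNamespace


def make_rng(random_seed):
    # None and -1 both mean "do not use a random seed".
    if random_seed is None or random_seed == -1:
        return random.Random()
    return random.Random(random_seed)


def init_context(context):
    # The context carries a reference to the enclosing text generator.
    context.rng = make_rng(context.textGenerator.randomSeed)


# Stand-in objects (assumption: real ones are supplied by the library).
seeded_a = SimpleNamespace(textGenerator=SimpleNamespace(randomSeed=42))
seeded_b = SimpleNamespace(textGenerator=SimpleNamespace(randomSeed=42))
unseeded = SimpleNamespace(textGenerator=SimpleNamespace(randomSeed=-1))
for ctx in (seeded_a, seeded_b, unseeded):
    init_context(ctx)
```

With the same seed, seeded_a.rng and seeded_b.rng yield the same sequence, while unseeded.rng is seeded from system entropy.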

pandasGenerateText(v)[source]

Called to generate text via Pandas UDF mechanism

Parameters:

v – base value of column as Pandas Series

class PyfuncTextFactory(name=None)[source]

Bases: object

PyfuncTextFactory applies syntactic wrapping around creation of PyfuncText objects

Parameters:

name – name of generated object (when converted to string via str)

It allows the use of the following constructs:

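# initialization (for Faker for example)
from dbldatagen.text_generator_plugins import PyfuncText, PyfuncTextFactory
from faker import Faker
from faker.providers import internet

# setup use of Faker
def initFaker(ctx):
  ctx.faker = Faker(locale="en_US")
  ctx.faker.add_provider(internet)
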
FakerText = (PyfuncTextFactory(name="FakerText")
            .withInit(initFaker)        # determines how context should be initialized
            .withRootProperty("faker")  # determines what context property is passed to fn
            )

# later use ...
.withColumn("fake_name", text=FakerText("name") )
.withColumn("fake_sentence", text=FakerText("sentence", ext_word_list=my_word_list) )

# translates to generation of lambda function with keyword arguments
# or without as needed
.withColumn("fake_name",
            text=PyfuncText( (lambda faker: faker.name( )),
                            init=initFaker,
                            rootProperty="faker",
                            name="FakerText"))
.withColumn("fake_sentence",
            text=PyfuncText( (lambda faker:
                                faker.sentence( **{ "ext_word_list" : my_word_list} )),
                            init=initFaker,
                            rootProperty="faker",
                            name="FakerText"))

withInit(fn)[source]

Specifies context initialization function

Parameters:

fn – function or lambda function used for initialization. Its signature should be initFunction(context)

Note

This variation initializes the context once per worker process per text generator instance.

withInitPerBatch(fn)[source]

Specifies context initialization function, to be called once per batch

Parameters:

fn – function or lambda function used for initialization. Its signature should be initFunction(context)

Note

This variation initializes the context once per internal pandas UDF call. With default settings, the UDF is called once per 10,000 rows. Setting the pandas batch size as an argument to the DataSpec creation will change the default batch size.
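The difference between the two initialization variations can be illustrated with a small simulation. This is not the library's code: it only counts how often an init function would run in a single worker process, assuming rows arrive in batches of 10,000. The function name count_init_calls is hypothetical.

```python
import math


def count_init_calls(total_rows, batch_size=10_000, init_per_batch=False):
    """Simulate how many times init would run in one worker process."""
    calls = 0
    initialized = False
    batches = math.ceil(total_rows / batch_size)
    for _ in range(batches):
        # Per-batch mode re-initializes every time; otherwise only the first time.
        if init_per_batch or not initialized:
            calls += 1
            initialized = True
    return calls


once = count_init_calls(25_000)                            # 1
per_batch = count_init_calls(25_000, init_per_batch=True)  # 3
```

Per-batch initialization costs more but may suit client objects that should not be held for the lifetime of a worker process.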

withRootProperty(prop)[source]

If called, specifies the property of the context to be passed to the text generation function. If not called, the context object itself will be passed to the text generation function.
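The effect of the root property on the call signature can be sketched as follows. This is a simplified illustration, not the library's dispatch code. The helper call_text_function and the greeter property are hypothetical, and SimpleNamespace stands in for the real context.

```python
from types import SimpleNamespace


def call_text_function(fn, context, value, root_property=None):
    # Without a root property, fn receives the context and the base value.
    if root_property is None:
        return fn(context, value)
    # With a root property, fn receives only that property of the context.
    return fn(getattr(context, root_property))


ctx = SimpleNamespace(greeter=SimpleNamespace(hello=lambda: "hello"))

with_context = call_text_function(lambda c, v: f"{c.greeter.hello()} {v}", ctx, 7)
with_root = call_text_function(lambda g: g.hello(), ctx, 7, root_property="greeter")
```

Here with_context is "hello 7" and with_root is "hello". The root-property form keeps generation functions short when they only need one client object.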

fakerText(mname, *args, _lib=None, _rootClass=None, **kwargs)[source]

Generate faker text generator object using default FakerTextFactory instance

Parameters:
  • mname – method name to invoke

  • args – positional args to be passed to underlying Faker instance

  • kwargs – keyword args to be passed to underlying Faker instance

  • _lib – internal only param - library to load

  • _rootClass – internal only param - root class to create

Returns:

instance of PyfuncText for use with Faker

fakerText("sentence") is the same as FakerTextFactory()("sentence")
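The equivalence above might be used as in the following sketch, which assumes the dbldatagen and faker packages are installed. The function name build_sentence_generators is hypothetical, and the imports are placed inside it so that defining the function does not require either package.

```python
def build_sentence_generators(word_list):
    """Illustrative helper (hypothetical name): two routes to the same kind of generator."""
    # Assumption: dbldatagen and faker are both installed in the environment.
    from dbldatagen.text_generator_plugins import FakerTextFactory, fakerText

    # Route 1: module-level helper using the default factory instance.
    via_helper = fakerText("sentence", ext_word_list=word_list)

    # Route 2: explicit factory, called with the same method name and arguments.
    via_factory = FakerTextFactory()("sentence", ext_word_list=word_list)

    return via_helper, via_factory
```

An explicit factory is the better fit when a non-default locale or extra providers are needed.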