dbldatagen.text_generator_plugins module
This file defines the text generator plugin class PyfuncText
- class FakerTextFactory(locale=None, providers=None, name='FakerText', lib=None, rootClass=None)[source]
Bases:
PyfuncTextFactory
Factory object for Faker-flavored PyfuncText text generator objects
- Parameters:
locale – list of locales. If empty, defaults to en-US
providers – list of providers
name – name of generated objects. Defaults to FakerText
lib – library import name of Faker library. If none passed, uses faker
rootClass – name of root object class. If none passed, uses Faker
Note
Both the library name and root object class can be overridden. This is primarily for internal testing purposes.
- class PyfuncText(fn, init=None, initPerBatch=False, name=None, rootProperty=None)[source]
Bases:
TextGenerator
Text generator that supports generating text from an arbitrary Python function
- Parameters:
fn – function to call to generate text.
init – function to call to initialize context
initPerBatch – if set to True, initialization of the context is performed on every pandas UDF call. Default is False.
name – string representing the name of the text generator when converted to a string via repr or str
rootProperty – name of the context property whose value is passed to fn in place of the context, if set
The two functions define the plugin model.
The first function, fn, is called whenever text should be generated for a single column of a single row.
It is called with the signature fn(context, value) unless a root property is set, in which case the signature is fn(rootProperty), with rootProperty having the value of the root property of the context.
Context is the stored context containing instances of random number generators, third-party client library objects, etc.
The second function, init, is called to initialize the function call context. The plugin code can store arbitrary properties in the context following normal Python object rules.
Before the init function is called, the context is given the property textGenerator, which is a reference to the enclosing text generator.
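The two call signatures described above can be illustrated with a minimal, library-free sketch. It does not use dbldatagen: `SimpleNamespace` stands in for the context object, and `init_context`, `generate`, `generate_from_root`, and `call_plugin` are hypothetical names chosen for illustration only, not part of the library.

```python
from types import SimpleNamespace
import random


def init_context(context):
    # Store arbitrary properties on the context, as a plugin init function would.
    context.rng = random.Random(42)
    context.words = ["alpha", "beta", "gamma"]


def generate(context, value):
    # Signature used when no root property is set: fn(context, value).
    return f"{value}-{context.words[value % len(context.words)]}"


def generate_from_root(rng):
    # Signature used when a root property is set: fn(rootProperty).
    return rng.choice(["x", "y", "z"])


def call_plugin(fn, context, value, root_property=None):
    # Hypothetical dispatcher mimicking the two documented call signatures.
    if root_property is None:
        return fn(context, value)
    return fn(getattr(context, root_property))


ctx = SimpleNamespace()
init_context(ctx)

with_context = [call_plugin(generate, ctx, v) for v in range(4)]
from_root = call_plugin(generate_from_root, ctx, 0, root_property="rng")
```

Here `with_context` is built through the `fn(context, value)` form, while `from_root` receives only the `rng` property of the context, which is the behavior a root property selects.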
Note
There are no expectations of repeatability of data generation when using external code or external libraries to generate text.
However, custom code can call the base class method to get a NumPy random number generator instance. This will have been seeded using the dbldatagen random number seed if one was specified, so random numbers generated from it will be repeatable.
The custom code may read the property randomSeed on the text generator object to get the random seed, which may be used to seed library-specific initialization.
This random seed property may have the values None or -1, which should be treated as meaning don't use a random seed.
The code does not guarantee thread or cross-process safety. If a new instance of the random number generator is needed, you may call the base class method with the argument forceNewInstance set to True.
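The seed convention in the note can be handled with a small guard. The sketch below is an assumption-laden illustration: `make_rng` is a hypothetical helper, and Python's standard `random.Random` stands in for whatever third-party library the plugin would seed. In a real init function the seed would presumably be read from the text generator referenced by the context (the textGenerator property described above).

```python
import random


def make_rng(random_seed):
    # Treat None and -1 as "do not use a seed", as the note describes.
    if random_seed is None or random_seed == -1:
        return random.Random()
    # Any other value seeds the generator so its output is repeatable.
    return random.Random(random_seed)


seeded_a = make_rng(42)
seeded_b = make_rng(42)

# Two generators built from the same seed produce the same sequence.
same_sequence = (
    [seeded_a.random() for _ in range(3)]
    == [seeded_b.random() for _ in range(3)]
)

# With -1 (or None) the generator is left unseeded.
unseeded = make_rng(-1)
```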
- class PyfuncTextFactory(name=None)[source]
Bases:
object
PyfuncTextFactory applies syntactic wrapping around creation of PyfuncText objects
- Parameters:
name – name of generated object (when converted to string via str)
It allows the use of the following constructs:
    # initialization (for Faker, for example)
    from faker import Faker
    from faker.providers import internet
    from dbldatagen.text_generator_plugins import PyfuncText, PyfuncTextFactory

    # set up use of Faker
    def initFaker(ctx):
        ctx.faker = Faker(locale="en_US")
        ctx.faker.add_provider(internet)

    FakerText = (PyfuncTextFactory(name="FakerText")
                 .withInit(initFaker)        # determines how context should be initialized
                 .withRootProperty("faker")  # determines what context property is passed to fn
                 )

    # later use ...
    .withColumn("fake_name", text=FakerText("name"))
    .withColumn("fake_sentence", text=FakerText("sentence", ext_word_list=my_word_list))

    # translates to generation of a lambda function, with keyword arguments
    # or without as needed, wrapped in a PyfuncText object
    .withColumn("fake_name",
                text=PyfuncText((lambda faker: faker.name()),
                                init=initFaker,
                                rootProperty="faker",
                                name="FakerText"))
    .withColumn("fake_sentence",
                text=PyfuncText((lambda faker: faker.sentence(**{"ext_word_list": my_word_list})),
                                init=initFaker,
                                rootProperty="faker",
                                name="FakerText"))
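The translation from a method name plus arguments into a lambda over the root property can be sketched without Faker or dbldatagen. This is an illustration of the idea only, not the library's implementation: `StubGenerator` is a hypothetical stand-in for a Faker instance, and `make_call` is a hypothetical helper name.

```python
class StubGenerator:
    # Hypothetical stand-in for a Faker instance.
    def name(self):
        return "Jane Doe"

    def sentence(self, ext_word_list=None):
        words = ext_word_list or ["lorem", "ipsum"]
        return " ".join(words).capitalize() + "."


def make_call(method_name, *args, **kwargs):
    # Build a function of the root object that looks up the named method
    # and invokes it with the captured positional and keyword arguments.
    return lambda root: getattr(root, method_name)(*args, **kwargs)


name_fn = make_call("name")
sentence_fn = make_call("sentence", ext_word_list=["hello", "world"])

stub = StubGenerator()
name_result = name_fn(stub)          # calls stub.name()
sentence_result = sentence_fn(stub)  # calls stub.sentence(ext_word_list=[...])
```

Capturing the arguments in a closure keeps the per-row call down to a single argument, the root object, which matches the `fn(rootProperty)` signature used when a root property is set.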
- withInit(fn)[source]
Specifies context initialization function
- Parameters:
fn – function pointer or lambda function for initialization. The signature should be initFunction(context)
Note
This variation initializes the context once per worker process per text generator instance.
- withInitPerBatch(fn)[source]
Specifies context initialization function that is run once per batch
- Parameters:
fn – function pointer or lambda function for initialization. The signature should be initFunction(context)
Note
This variation initializes the context once per internal pandas UDF call. The UDF is called once per 10,000 rows if the system is configured using defaults. Setting the pandas batch size as an argument to the DataSpec creation will change the default batch size.
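The difference between the two initialization variations can be made concrete with a small simulation. This sketch models a single worker processing rows in fixed-size batches and does not call Spark or dbldatagen; `count_init_calls` is a hypothetical helper, and the row count of 25,000 is an arbitrary example value.

```python
def count_init_calls(total_rows, batch_size, init_per_batch):
    # Simulate one worker processing rows in fixed-size batches and count
    # how many times the init function would run.
    init_calls = 0
    initialized = False
    for _batch_start in range(0, total_rows, batch_size):
        if init_per_batch or not initialized:
            init_calls += 1
            initialized = True
        # The rows of this batch would be generated here.
    return init_calls


# 25,000 rows at the default batch size of 10,000 gives three batches.
once = count_init_calls(25_000, 10_000, init_per_batch=False)
per_batch = count_init_calls(25_000, 10_000, init_per_batch=True)
```

In this model `once` is 1 and `per_batch` is 3, so per-batch initialization suits cheap setup that must be refreshed frequently, while expensive setup is better done once.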
- fakerText(mname, *args, _lib=None, _rootClass=None, **kwargs)[source]
Generate faker text generator object using default FakerTextFactory instance
- Parameters:
mname – method name to invoke
args – positional args to be passed to underlying Faker instance
_lib – internal only param - library to load
_rootClass – internal only param - root class to create
- Returns:
instance of PyfuncText for use with Faker
fakerText("sentence") is the same as FakerTextFactory()("sentence")