.. Databricks Labs Data Generator documentation master file, created by
   sphinx-quickstart on Sun Jun 21 10:54:30 2020.

Extending Text Generation
=========================

This feature should be considered ``Experimental``.

The ``PyfuncText``, ``PyfuncTextFactory`` and ``FakerTextFactory`` classes provide a mechanism to extend
text generation to include the use of arbitrary Python functions and 3rd party data generation libraries.

The following example illustrates extension with the open source Faker library using the extended syntax.

.. code-block:: python

   from dbldatagen import DataGenerator, fakerText
   from faker.providers import internet

   shuffle_partitions_requested = 8
   partitions_requested = 8
   data_rows = 100000

   # partition parameters etc.
   spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

   my_word_list = [
       'danish', 'cheesecake', 'sugar',
       'Lollipop', 'wafer', 'Gummies',
       'sesame', 'Jelly', 'beans',
       'pie', 'bar', 'Ice', 'oat'
   ]

   fakerDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested)
                    .withColumn("name", percentNulls=0.1, text=fakerText("name"))
                    .withColumn("address", text=fakerText("address"))
                    .withColumn("email", text=fakerText("ascii_company_email"))
                    .withColumn("ip_address", text=fakerText("ipv4_private"))
                    .withColumn("faker_text", text=fakerText("sentence", ext_word_list=my_word_list))
                    )

   dfFakerOnly = fakerDataspec.build()

   dfFakerOnly.write.format("delta").mode("overwrite").save("/tmp/test-output")

Let's look at the various features provided to do this.

Extended text generation with Python functions
----------------------------------------------

The ``PyfuncText`` object supports extending text generation with Python functions. It allows specification
of two functions:

- a context initialization function to initialize shared state
- a text generation function to generate text for a specific column value

This allows integration of both arbitrary Python code and 3rd party libraries into the text generation
process.

For more information, see :data:`~dbldatagen.text_generator_plugins.PyfuncText`

.. note::

   The performance of text generation using external libraries or Python functions may be substantially
   slower than the base text generation capabilities. However, it should be sufficient for generation of
   tables of up to 100 million rows on a medium-sized cluster.

   Note that we do not test compatibility with specific libraries and no guarantees are made about the
   repeatability of data generated using external functions or libraries.

Example 1: Using a custom Python function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following code shows use of a custom Python function to generate text:

.. code-block:: python

   from dbldatagen import DataGenerator, PyfuncText

   partitions_requested = 4
   data_rows = 100 * 1000

   # the initialization function
   def initPluginContext(context):
       context.prefix = "testing"

   # the data generation function
   text_generator = (lambda context, value: context.prefix + str(value))

   pluginDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested,
                                   randomSeedMethod="hash_fieldname")
                     .withColumn("text", text=PyfuncText(text_generator, initFn=initPluginContext))
                     )

   dfPlugin = pluginDataspec.build()
   dfPlugin.show()

Extended text generation with 3rd party libraries
-------------------------------------------------

The same mechanism can be used to make use of the capabilities of 3rd party libraries.

The ``context`` object can be initialized with any arbitrary properties that may be referenced during the
execution of the text generation function. This can include use of session or connection objects, lookup
dictionaries and so on. As a separate context instance is created in each worker node process for each
``PyfuncText`` text generator, the object does not have to be pickled or serialized across process
boundaries.

By default, the context is shared across calls to the underlying Pandas UDF that generates the text. If the
context properties cannot be shared across multiple calls, you can specify that the context is recreated
for each Pandas UDF call.
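For example, if the context wraps state that cannot safely be reused, such as a connection object, the
context can be rebuilt for every batch. The following minimal sketch is based on Example 1 and assumes an
``initPerBatch`` option on ``PyfuncText`` to request per-call initialization; see
:data:`~dbldatagen.text_generator_plugins.PyfuncText` for the exact parameter name and semantics.

.. code-block:: python

   from dbldatagen import DataGenerator, PyfuncText

   # initialization function; rerun before each Pandas UDF call when
   # per-batch initialization is requested
   def initBatchContext(context):
       context.prefix = "batch"   # hypothetical state that cannot be shared across calls

   batch_text_generator = (lambda context, value: context.prefix + str(value))

   # `initPerBatch=True` is assumed here to request that the context is recreated
   # for each Pandas UDF call instead of being shared across calls
   batchDataspec = (DataGenerator(spark, rows=1000, partitions=4)
                    .withColumn("text", text=PyfuncText(batch_text_generator,
                                                        initFn=initBatchContext,
                                                        initPerBatch=True))
                    )

   dfBatch = batchDataspec.build()

Recreating the context on every call adds initialization overhead, so it should only be used when the
context state is not safe to reuse across calls.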
Example 2: Using an external text data generation library
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following code shows use of an external text generation library to generate text. In this case, the
example uses the ``Faker`` library.

.. note::

   The ``Faker`` library is not shipped as part of the data generator, and the user is responsible for
   installing it on a cluster or workspace if it is used. There is no testing of specific 3rd party
   libraries for compatibility, and some features may not function correctly or at scale.

To install ``Faker`` in a Databricks notebook, you can use the ``%pip`` magic command in a notebook cell.
For example:

.. code-block::

   %pip install Faker

The following code makes use of the ``Faker`` library to generate synthetic names, addresses, email
addresses and IP addresses.

.. code-block:: python

   from dbldatagen import DataGenerator, PyfuncText
   from faker import Faker
   from faker.providers import internet

   shuffle_partitions_requested = 36
   partitions_requested = 96
   data_rows = 10 * 1000 * 1000

   spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

   def initFaker(context):
       context.faker = Faker(locale="en_US")
       context.faker.add_provider(internet)

   ip_address_generator = (lambda context, v: context.faker.ipv4_private())
   name_generator = (lambda context, v: context.faker.name())
   address_generator = (lambda context, v: context.faker.address())
   email_generator = (lambda context, v: context.faker.ascii_company_email())

   fakerDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested)
                    .withColumn("name", percentNulls=0.1, text=PyfuncText(name_generator, initFn=initFaker))
                    .withColumn("address", text=PyfuncText(address_generator, initFn=initFaker))
                    .withColumn("email", text=PyfuncText(email_generator, initFn=initFaker))
                    .withColumn("ip_address", text=PyfuncText(ip_address_generator, initFn=initFaker))
                    )

   df1 = fakerDataspec.build()

   df1.write.format("delta").mode("overwrite").save("/tmp/dbldatagen/fakerData")

Supporting extended syntax for 3rd party library integration
------------------------------------------------------------

Use of the ``PyfuncTextFactory`` class allows the use of the following constructs:

.. code-block:: python

   # initialization (for Faker, for example)

   # setup use of Faker
   def initFaker(ctx):
       ctx.faker = Faker(locale="en_US")
       ctx.faker.add_provider(internet)

   FakerText = (PyfuncTextFactory(name="FakerText")
                .withInit(initFaker)          # determines how the context should be initialized
                .withRootProperty("faker")    # determines what context property is passed to the function
                )

   # later use ...
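   # each use of FakerText("method_name") below creates a text generator
   # that calls the named method on the `faker` root property of the context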
.withColumn("fake_name", text=FakerText("name") ) .withColumn("fake_sentence", text=FakerText("sentence", ext_word_list=my_word_list) ) # translates to generation of lambda function with keyword arguments # or without as needed .withColumn("fake_name", text=FakerText( (lambda faker: faker.name( )), init=initFaker, rootProperty="faker", name="FakerText")) .withColumn("fake_sentence", text=FakerText( (lambda faker: faker.sentence( **{ "ext_word_list" : my_word_list} )), init=initFaker, rootProperty="faker", name="FakerText")) By default, when the text generation function is called, the context object is passed to the text generation function. However, if a root property is specified, it is interpreted the name of a property of the context to be passed to the text generation function. How does the string based access work? If a string is specified to the PyfuncTextFactory in place of a text generation function or lambda function, it is interpreted as the name of a method or property to access on the root object. By default, the string is interpreted as the name of a method. But if you need to access a property of the root object, you can use the syntax below (example is hypothetical and does not refer to any specific library). .. code-block:: python .withColumn("my_property", text=MyLibraryText("myCustomProperty", isProperty=True) ) For more information, see :data:`~dbldatagen.text_generator_plugins.PyfuncTextFactory` Faker specific library integration ---------------------------------- Finally, the ``FakerTextFactory`` provides a Faker specific version of the ``PyfuncTextFactory`` class that initializes the Faker library and allows specification of locales and providers. You will still need to install Faker as it is not included in the binaries. If you are not customizing the FakerTextFactory, you can use ``fakerText`` to get the default faker text factory. The following example will generate Italian localized text (where the underlying Faker provider supports it) interspersed with use of the default faker text factory. .. code-block:: python from dbldatagen import FakerTextFactory, DataGenerator, fakerText from faker.providers import internet shuffle_partitions_requested = 8 partitions_requested = 8 data_rows = 100000 # setup use of Faker FakerTextIT = FakerTextFactory(locale=['it_IT'], providers=[internet]) # partition parameters etc. spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested) my_word_list = [ 'danish','cheesecake','sugar', 'Lollipop','wafer','Gummies', 'sesame','Jelly','beans', 'pie','bar','Ice','oat' ] fakerDataspec = (DataGenerator(spark, rows=data_rows, partitions=partitions_requested) .withColumn("italian_name", percentNulls=0.1, text=FakerTextIT("name") ) .withColumn("name", percentNulls=0.1, text=fakerText("name") ) # uses default .withColumn("address", text=FakerTextIT("address" )) .withColumn("email", text=FakerTextIT("ascii_company_email") ) .withColumn("ip_address", text=FakerTextIT("ipv4_private" )) .withColumn("faker_text", text=FakerTextIT("sentence") ) ) dfFakerOnly = fakerDataspec.build() dfFakerOnly.write.format("delta").mode("overwrite").save("/tmp/test-output-IT") For more information, see :data:`~dbldatagen.text_generator_plugins.FakerTextFactory`