dbldatagen.datasets_object module

This module defines the Datasets class.

This module supports the addition of standard datasets to the Synthetic Data Generator.

These are standard datasets that can be synthesized with minimal coding to handle a variety of situations for testing, benchmarking and other uses.

As the APIs return a data generation specification rather than a dataframe, additional columns can be added and further manipulation can be performed before generation of actual data.
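A minimal sketch of that workflow follows. It assumes an active Spark session is passed in and that a dataset named 'basic/user' is registered; that dataset name and the added column are illustrative and do not come from this page.

```python
def build_user_table(spark, rows=1000):
    """Return a DataFrame from a standard dataset with one extra column."""
    # Imported lazily so the sketch can be defined without Spark installed.
    import dbldatagen as dg

    # get() returns a data generation specification, not a DataFrame.
    spec = dg.Datasets(spark, "basic/user").get(rows=rows)

    # Because it is still a specification, more columns can be added.
    # "loyalty_tier" and its values are hypothetical.
    spec = spec.withColumn("loyalty_tier", "string",
                           values=["bronze", "silver", "gold"], random=True)

    # Data is only generated when build() is called.
    return spec.build()
```

Calling build_user_table(spark) would then produce the DataFrame; until build() runs, no rows are generated.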

class Datasets(sparkSession, name=None, streaming=False)[source]

Bases: object

This class is used to generate standard data sets based on a plugin provider model.

It allows for quick generation of data for common scenarios.

Parameters:

sparkSession – Spark session instance to use when performing spark operations
name – Dataset name to use

Dataset names are registered with the DatasetProvider class. By convention, dataset names should be hierarchical and separated by slashes (‘/’).

For example, the dataset name ‘sales/retail’ would indicate that the dataset is a retail dataset within the sales category.

The dataset name is used to look up the provider class that will be used to generate the data.

If a dataset provider supports multiple tables, the name of the table to retrieve is passed to the get method, along with any parameters that are required to generate the data.
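The name-based lookup described above can be pictured with a toy registry. This is only an illustration of the idea, not the DatasetProvider implementation; register, lookup, and RetailProvider are hypothetical names.

```python
# Toy registry mapping hierarchical dataset names to provider classes.
# Illustration only; this is not dbldatagen's DatasetProvider.
_REGISTRY = {}


def register(name):
    """Class decorator that records a provider under a slash-separated name."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap


@register("sales/retail")
class RetailProvider:
    # Hypothetical multi-table provider.
    tables = ("customers", "sales")


def lookup(name):
    """Return the provider class registered under the given name."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(f"no provider registered for dataset {name!r}") from None


# The name splits into a category and a dataset within that category.
category, dataset = "sales/retail".split("/")
provider_cls = lookup("sales/retail")
```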

class NavigatorNode(datasets, providerName=None, tableName=None, location=None)[source]

Bases: object

Dataset Navigator class for navigating datasets

This class is used to navigate datasets and their tables via dotted notation.

For example, X.dataset_grouping.dataset.table, where X is an instance of the dataset navigator.

The navigator is initialized with a set of paths and objects (usually providers) that are registered with the DatasetProvider class.

When accessed via dotted notation, the navigator will use the pathSegment to locate the provider and create it.

Any remaining pathSegment traversed will be used to locate the table within the provider.

Overall, this just provides a syntactic layering over the creation of the provider instance and table generation.
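The dotted-notation idea can be sketched with a small tree of nodes that resolves attribute access one path segment at a time. The Navigator class below is a hypothetical stand-in written for illustration; it does not reproduce NavigatorNode and only mirrors the addEntry and isFinal concepts.

```python
class Navigator:
    """Toy dotted-path navigator; not the library's NavigatorNode."""

    def __init__(self, path=()):
        self._path = tuple(path)
        self._children = {}
        self._target = None  # (provider_name, table_name) when the node is final

    def add_entry(self, provider_name, table_name):
        """Register provider 'a/b' and a table under the path a.b.table."""
        node = self
        for segment in provider_name.split("/") + [table_name]:
            node = node._children.setdefault(
                segment, Navigator(node._path + (segment,)))
        node._target = (provider_name, table_name)

    def is_final(self):
        """True when this node identifies a provider and a table."""
        return self._target is not None

    def __getattr__(self, segment):
        # Only called for names not found normally, i.e. path segments.
        children = self.__dict__.get("_children", {})
        if segment in children:
            return children[segment]
        raise AttributeError(segment)


X = Navigator()
X.add_entry("sales/retail", "customers")
node = X.sales.retail.customers
```

Here X.sales and X.sales.retail are intermediate nodes, while node is final and carries the provider name and table name needed to create the provider and request the table.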

addEntry(datasets, providerName, tableName)[source]

find(attributePath)[source]

isFinal()[source]

classmethod describe(name)[source]

This method lists the registered datasets. It filters the list by a regular expression pattern if provided.

Parameters:

name – name of dataset to describe

get(table=None, rows=-1, partitions=-1, **kwargs)[source]

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
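How a provider might turn the -1 sentinels into concrete values can be sketched as follows. The defaults table, the choice of primary table, and the partition rule are all assumptions made for illustration; actual providers define their own defaults.

```python
# Hypothetical per-table defaults, echoing the row counts mentioned above.
DEFAULT_ROWS = {"customers": 100_000, "sales": 1_000_000}
PRIMARY_TABLE = "customers"  # assumed primary table for this toy provider


def resolve_request(table=None, rows=-1, partitions=-1):
    """Fill in the table name and replace -1 sentinels with defaults."""
    table = table or PRIMARY_TABLE
    if table not in DEFAULT_ROWS:
        raise ValueError(f"unsupported table {table!r}")
    if rows == -1:
        rows = DEFAULT_ROWS[table]
    if partitions == -1:
        # Illustrative rule only: one partition per 250,000 rows, at least 1.
        partitions = max(1, rows // 250_000)
    return table, rows, partitions
```

Explicit values always win over the defaults, so resolve_request("sales", rows=500) keeps the requested 500 rows.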

getAssociatedDataset(*, table, rows=-1, partitions=-1, **kwargs)[source]

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.

Note

This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
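The signature and the aliasing can be illustrated with a stand-in class. ToyDatasets is hypothetical and returns a plain dictionary rather than a DataGenerator; it only shows how a bare * forces keyword arguments and how extra names can be bound to one method.

```python
class ToyDatasets:
    """Stand-in class showing keyword-only arguments and method aliases."""

    def getAssociatedDataset(self, *, table, rows=-1, partitions=-1, **kwargs):
        # The bare * makes every argument keyword-only.
        return {"table": table, "rows": rows, "partitions": partitions, **kwargs}

    # Aliases: extra names bound to the same function object.
    getSupportingDataset = getAssociatedDataset
    getCombinedDataset = getAssociatedDataset


ds = ToyDatasets()
a = ds.getAssociatedDataset(table="summary", rows=10)
b = ds.getSupportingDataset(table="summary", rows=10)
```

Both calls run the same code, so a and b are equal. Passing the table positionally, as in ds.getAssociatedDataset("summary"), raises a TypeError.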

getCombinedDataset(*, table, rows=-1, partitions=-1, **kwargs)

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.

Note

This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.

getEnrichedDataset(*, table, rows=-1, partitions=-1, **kwargs)

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.

Note

This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.

classmethod getProviderDefinitions(name=None, pattern=None, supportsStreaming=False)[source]

Get provider definitions for one or more datasets

Parameters:

name – name of the dataset to get the provider for. If None, all providers are returned.
pattern – pattern to match against dataset names. If None, all providers are returned, optionally restricted to those matching name.
supportsStreaming – if True, filters out dataset providers that don’t support streaming.

Returns:

list of provider definitions matching name and pattern

Each entry will be of the form DatasetProvider.DatasetProviderDefinition
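The combined effect of the three filters can be sketched with a toy list of definitions. ProviderDefinition, the sample dataset names, and the use of re.match are assumptions for illustration; the real entries are DatasetProvider.DatasetProviderDefinition objects and the library's matching rules may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderDefinition:
    """Toy stand-in for DatasetProvider.DatasetProviderDefinition."""
    name: str
    supportsStreaming: bool


# Hypothetical registered datasets.
DEFINITIONS = [
    ProviderDefinition("sales/retail", supportsStreaming=False),
    ProviderDefinition("sales/online", supportsStreaming=True),
    ProviderDefinition("iot/sensors", supportsStreaming=True),
]


def get_definitions(name=None, pattern=None, supportsStreaming=False):
    """Filter definitions by exact name, regex pattern, and streaming support."""
    result = []
    for d in DEFINITIONS:
        if name is not None and d.name != name:
            continue
        if pattern is not None and not re.match(pattern, d.name):
            continue
        if supportsStreaming and not d.supportsStreaming:
            continue
        result.append(d)
    return result
```

With no arguments every definition is returned; each argument that is supplied narrows the list further.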

getSummaryDataset(*, table, rows=-1, partitions=-1, **kwargs)

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.

Note

This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.

getSupportingDataset(*, table, rows=-1, partitions=-1, **kwargs)

Get a table generator from the dataset provider

These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are tables computed from parameters. These are retrieved using the getAssociatedDataset method.

If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.

Parameters:

table – name of table to retrieve
rows – number of rows to generate. If -1, the provider computes a default.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically from the table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider

If rows or partitions are not specified, default values are supplied by the provider.

For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.

Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. Default row counts may differ between the tables of a multi-table dataset; for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.

Note

This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.

classmethod list(pattern=None, supportsStreaming=False)[source]

This method lists the registered datasets. It filters the list by a regular expression pattern if provided.

Parameters:

pattern – pattern to match dataset names. If None, all datasets are listed.
supportsStreaming – if True, only return provider definitions that support streaming.