dbldatagen.datasets_object module
This module defines the Datasets class.
It supports the addition of standard datasets to the Synthetic Data Generator.
These are standard datasets that can be synthesized with minimal coding to handle a variety of situations for testing, benchmarking and other uses.
As the APIs return a data generation specification rather than a dataframe, additional columns can be added and further manipulation can be performed before generation of actual data.
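The point about receiving a specification rather than a dataframe can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes a provider is registered under the name "basic/user" in the installed dbldatagen version, and the helper name `build_user_table` and the added column `signup_channel` are hypothetical.

```python
def build_user_table(spark, rows=1000):
    """Fetch a standard dataset spec, extend it, then build the dataframe.

    Assumes a provider named "basic/user" is registered; check
    Datasets.list() in your installed version before relying on it.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    import dbldatagen as dg

    # get() returns a data generation specification, not a dataframe.
    spec = dg.Datasets(spark, "basic/user").get(rows=rows)

    # Because it is still a specification, extra columns can be added.
    # "signup_channel" is a hypothetical column used only for illustration.
    spec = spec.withColumn(
        "signup_channel", "string",
        values=["web", "mobile", "store"], random=True,
    )

    # Data is only generated when build() is called.
    return spec.build()
```

Calling `build_user_table(spark)` with an active Spark session would return a dataframe containing the provider's columns plus the added one.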
- class Datasets(sparkSession, name=None, streaming=False)[source]
Bases:
object
This class is used to generate standard data sets based on a plugin provider model.
It allows for quick generation of data for common scenarios.
- Parameters:
sparkSession – Spark session instance to use when performing spark operations
name – Dataset name to use
Dataset names are registered with the DatasetProvider class. By convention, dataset names should be hierarchical, with levels separated by slashes (‘/’).
For example, the dataset name ‘sales/retail’ would indicate that the dataset is a retail dataset within the sales category.
The dataset name is used to look up the provider class that will be used to generate the data.
If a dataset provider supports multiple tables, the name of the table to retrieve is passed to the get method, along with any parameters that are required to generate the data.
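The name-based lookup described above can be pictured with a small pure-Python sketch. Everything in it is hypothetical (the `ProviderEntry` record, the `REGISTRY` contents, and the helper names); it only illustrates how a slash-separated name might map to a provider, and it is not the library's registration mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderEntry:
    """Hypothetical registry record, not the library's provider definition."""
    name: str
    tables: tuple


# Hypothetical registry keyed by hierarchical, slash-separated dataset names.
REGISTRY = {
    "sales/retail": ProviderEntry("sales/retail", ("customers", "sales")),
    "sales/wholesale": ProviderEntry("sales/wholesale", ("orders",)),
}


def split_name(name):
    """Split 'category/dataset' into its path segments."""
    return [segment for segment in name.split("/") if segment]


def lookup(name):
    """Return the registry entry for a dataset name, or raise ValueError."""
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown dataset name: {name!r}") from None
```

Here `split_name("sales/retail")` yields the category and dataset segments, and `lookup("sales/retail")` returns the entry whose `tables` attribute lists the tables a caller could then request.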
Bases:
object
Dataset Navigator class for navigating datasets
This class is used to navigate datasets and their tables via dotted notation.
That is, X.dataset_grouping.dataset.table, where X is an instance of the dataset navigator.
The navigator is initialized with a set of paths and objects (usually providers) that are registered with the DatasetProvider class.
When accessed via dotted notation, the navigator will use the pathSegment to locate the provider and create it.
Any remaining pathSegment traversed will be used to locate the table within the provider.
Overall, this just provides a syntactic layering over the creation of the provider instance and table generation.
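The dotted-notation traversal can be sketched with `__getattr__`. This is a toy model under stated assumptions: the class names `Navigator` and `ProviderNode`, the registry shape, and the tuple returned for a table are all hypothetical, and the real navigator creates provider instances and table generators instead.

```python
class ProviderNode:
    """Toy stand-in for a located provider; resolves table names."""

    def __init__(self, dataset, tables):
        self.dataset = dataset
        self._tables = tuple(tables)

    def __getattr__(self, table):
        # Only called for attributes not found normally; skip private names.
        if table.startswith("_"):
            raise AttributeError(table)
        if table not in self._tables:
            raise AttributeError(f"no table {table!r} in {self.dataset!r}")
        # A real navigator would return a table generator here.
        return (self.dataset, table)


class Navigator:
    """Toy navigator over slash-separated dataset paths."""

    def __init__(self, registry, prefix=()):
        self._registry = registry     # maps "group/dataset" -> table names
        self._prefix = tuple(prefix)  # path segments traversed so far

    def __getattr__(self, segment):
        if segment.startswith("_"):
            raise AttributeError(segment)
        path = self._prefix + (segment,)
        joined = "/".join(path)
        if joined in self._registry:
            # Full dataset path reached: hand over to the provider node.
            return ProviderNode(joined, self._registry[joined])
        if any(name.startswith(joined + "/") for name in self._registry):
            # Still inside a grouping: keep navigating.
            return Navigator(self._registry, path)
        raise AttributeError(f"no dataset under {joined!r}")
```

With `X = Navigator({"sales/retail": ["customers", "sales"]})`, the expression `X.sales.retail.customers` walks the grouping, then the dataset, then the table.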
- classmethod describe(name)[source]
This method lists the registered datasets. It filters the list by a regular expression pattern if provided.
- Parameters:
name – name of dataset to describe
- get(table=None, rows=-1, partitions=-1, **kwargs)[source]
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
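The handling of the -1 sentinel and per-table defaults can be illustrated with a small sketch. The table names and row counts mirror the example figures in the text; the choice of primary table, the `rows_per_partition` heuristic, and the function name `resolve_request` are assumptions made for illustration, since the reference does not say how providers compute these values.

```python
# Hypothetical per-table defaults mirroring the example figures in the text.
DEFAULT_ROWS = {"customers": 100_000, "sales": 1_000_000}
PRIMARY_TABLE = "sales"  # assumed primary table for this sketch


def resolve_request(table=None, rows=-1, partitions=-1,
                    rows_per_partition=250_000):
    """Fill in defaults for table, rows and partitions.

    rows_per_partition is an invented heuristic, not the library's rule.
    """
    if table is None:
        table = PRIMARY_TABLE  # fall back to the primary table
    if table not in DEFAULT_ROWS:
        raise ValueError(f"table {table!r} is not supported by this provider")
    if rows == -1:
        rows = DEFAULT_ROWS[table]  # provider-supplied default
    if partitions == -1:
        # Ceiling division, with at least one partition.
        partitions = max(1, -(-rows // rows_per_partition))
    return table, rows, partitions
```

Explicit values always win over defaults, so `resolve_request("customers", rows=10, partitions=2)` passes the caller's numbers through unchanged.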
- getAssociatedDataset(*, table, rows=-1, partitions=-1, **kwargs)[source]
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
Note
This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
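One common way to expose a method under aliased names in Python is to bind additional class attributes to the same function. The sketch below shows that pattern together with the keyword-only `table` argument from the signature; the class `ToyDatasets` and its return value are hypothetical, and this reference does not show how the library itself defines the aliases.

```python
class ToyDatasets:
    """Hypothetical class illustrating method aliases."""

    def getAssociatedDataset(self, *, table, rows=-1, partitions=-1, **kwargs):
        # The bare * makes table keyword-only, matching the documented signature.
        return {"table": table, "rows": rows,
                "partitions": partitions, "options": kwargs}

    # Aliases: extra names bound to the same function object.
    getSupportingDataset = getAssociatedDataset
    getCombinedDataset = getAssociatedDataset
```

Because `table` is keyword-only, `ToyDatasets().getSupportingDataset(table="x")` works, while passing the table name positionally raises a TypeError.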
- getCombinedDataset(*, table, rows=-1, partitions=-1, **kwargs)
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
Note
This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
- getEnrichedDataset(*, table, rows=-1, partitions=-1, **kwargs)
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
Note
This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
- classmethod getProviderDefinitions(name=None, pattern=None, supportsStreaming=False)[source]
Get provider definitions for one or more datasets
- Parameters:
name – name of dataset to get provider for, if None, returns all providers
pattern – pattern to match dataset name, if None, returns all providers optionally matching name
supportsStreaming – If true, filters out dataset providers that don’t support streaming
- Returns:
list of provider definitions matching name and pattern
Each entry will be of the form DatasetProvider.DatasetProviderDefinition
- getSummaryDataset(*, table, rows=-1, partitions=-1, **kwargs)
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
Note
This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
- getSupportingDataset(*, table, rows=-1, partitions=-1, **kwargs)
Get a table generator from the dataset provider
These are DataGenerator instances that can be used to generate the data. Dataset providers can also optionally provide supporting tables, which are computed tables based on parameters. These are retrieved using the getAssociatedDataset method.
If the dataset supports multiple tables, the table may be specified in the table parameter. If none is specified, the primary table is used.
- Parameters:
table – name of table to retrieve
rows – number of rows to generate. If -1, the provider should compute defaults.
partitions – number of partitions to use. If -1, the number of partitions is computed automatically based on table size and partitioning. If applied to a dataset with only a single table, this is ignored.
kwargs – additional keyword arguments to pass to the provider
If rows or partitions are not specified, default values are supplied by the provider.
For multi-table datasets, the table name must be specified. For single table datasets, the table name may be optionally supplied.
Additionally, for multi-table datasets, the table name must be one of the tables supported by the provider. The default number of rows may differ between the tables of a multi-table dataset: for example, a ‘customers’ table may have 100,000 rows while a ‘sales’ table may have 1,000,000 rows.
Note
This method may also be invoked via the aliased names getSupportingDataset and getCombinedDataset.
- classmethod list(pattern=None, supportsStreaming=False)[source]
This method lists the registered datasets. It filters the list by a regular expression pattern if provided.
- Parameters:
pattern – Pattern to match dataset names. If None, all datasets are listed
supportsStreaming – If True, only return provider definitions that support streaming
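The filtering behaviour described for list can be sketched in plain Python. The `PROVIDERS` records and the function name `list_names` are hypothetical, and the use of `re.search` (rather than a full or anchored match) is an assumption, since the reference does not say how the pattern is applied.

```python
import re

# Hypothetical provider records; the real definitions carry more fields.
PROVIDERS = [
    {"name": "sales/retail", "supportsStreaming": True},
    {"name": "sales/wholesale", "supportsStreaming": False},
    {"name": "iot/sensors", "supportsStreaming": True},
]


def list_names(pattern=None, supportsStreaming=False):
    """Return dataset names, optionally filtered by regex and streaming support."""
    compiled = re.compile(pattern) if pattern is not None else None
    names = []
    for provider in PROVIDERS:
        # When supportsStreaming is True, drop providers that cannot stream.
        if supportsStreaming and not provider["supportsStreaming"]:
            continue
        # When a pattern is given, keep only names it matches somewhere.
        if compiled is not None and not compiled.search(provider["name"]):
            continue
        names.append(provider["name"])
    return names
```

With no arguments every name is returned; `list_names(pattern="^sales/", supportsStreaming=True)` narrows the result to streaming-capable datasets in the sales category.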