dbldatagen.datasets.dataset_provider module
This module defines the DatasetProvider class.
- class DatasetProvider[source]
Bases:
ABC
The DatasetProvider class acts as a base class for all dataset providers.
Implementors should override name, summary, description, etc. using the dataset_definition decorator. Subclasses should not require the constructor to take any arguments - arguments for the dataset should be passed to the getTableGenerator method.
If no table name is specified, it defaults to a table name corresponding to the DEFAULT_TABLE_NAME constant.
Dataset providers that produce multiple tables or need to explicitly name the table should list which tables are provided and what the primary table to be retrieved would be if no table name is specified.
The intent of the supportsStreaming flag is to mark a dataset as streaming compatible. It does not require specific streaming support - only that the dataset provider performs no operations that would be disallowed or very inefficient for a streaming dataframe.
Examples of operations that might prevent a dataframe from being used in streaming include:
- use of some types of windowing expressions
- use of drop duplicates (not disallowed, but can only be applied efficiently at the streaming microbatch level)
Note that the DatasetDecoratorUtils inner class will be applied as a decorator to subclasses of this class and will overwrite the constants _DATASET_NAME, _DATASET_TABLES, _DATASET_DESCRIPTION, _DATASET_SUMMARY and _DATASET_SUPPORTS_STREAMING.
Derived DatasetProvider classes need to be registered with the DatasetProvider class to be used in the system (at least to be discoverable and creatable via the Datasets object). This is done by calling the registerDataset method with the provider class; the dataset definition is read from the metadata attached by the decorator.
The dataset provider is responsible for determining the default number of rows to generate when the caller passes None or -1 for the rows parameter.
Registration can be done manually or automatically by setting the autoRegister flag in the decorator to True.
By default, all DatasetProvider classes should support batch usage. If a dataset provider supports streaming usage, the flag supportsStreaming should be set to True in the decorator.
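Putting these rules together, a minimal provider might look like the sketch below. The dataset name, table contents, and column options are illustrative assumptions, not part of the library:

```python
import dbldatagen as dg
from dbldatagen.datasets.dataset_provider import DatasetProvider, dataset_definition


@dataset_definition(name="demo/example", summary="Illustrative demo dataset", autoRegister=True)
class ExampleDatasetProvider(DatasetProvider.NoAssociatedDatasetsMixin, DatasetProvider):
    """Illustrative provider generating a single table of synthetic scores."""

    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                          partitions=-1, **options):
        # the provider decides the defaults when the caller passes None or -1
        if rows is None or rows < 0:
            rows = self.DEFAULT_ROWS
        if partitions is None or partitions < 0:
            partitions = self.autoComputePartitions(rows, 2)

        return (dg.DataGenerator(sparkSession, rows=rows, partitions=partitions)
                .withColumn("id", "long", minValue=1, uniqueValues=rows)
                .withColumn("score", "double", minValue=0.0, maxValue=100.0, random=True))
```

With autoRegister=True, the decorator registers the provider automatically; otherwise registerDataset must be called explicitly (see registerDataset below).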
- DEFAULT_PARTITIONS = 4
- DEFAULT_ROWS = 100000
- DEFAULT_TABLE_NAME = 'main'
- class DatasetDecoratorUtils(cls=None, *, name=None, tables=None, primaryTable=None, summary=None, description=None, associatedDatasets=None, supportsStreaming=False)[source]
Bases:
object
Defines the dataset_definition decorator (a usage sketch follows the parameter list below)
- Parameters:
cls – target class to apply decorator to
name – name of the dataset
tables – list of tables produced by the dataset provider; if None, defaults to [DEFAULT_TABLE_NAME]
primaryTable – primary table provided by the dataset. Defaults to the first table in the table list
summary – summary information for the dataset. If None, it will be derived from the target class name
description – detailed description of the dataset. If None, the target class doc string will be used
associatedDatasets – list of associated datasets produced by the dataset provider
supportsStreaming – whether the dataset can be used in streaming scenarios
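As an illustration of these parameters, a hypothetical multi-table provider might be decorated as follows (all names are invented for the example):

```python
@dataset_definition(name="demo/sales",
                    tables=["customers", "orders"],
                    primaryTable="customers",
                    summary="Illustrative sales dataset",
                    associatedDatasets=["order_summary"],
                    supportsStreaming=False,
                    autoRegister=True)
class SalesDatasetProvider(DatasetProvider):
    ...
```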
- class DatasetDefinition(name: str, tables: list[str], primaryTable: str, summary: str, description: str, supportsStreaming: bool, providerClass: type, associatedDatasets: list[str])[source]
Bases:
object
Dataset Definition class - stores the attributes related to the dataset for use by the implementation of the decorator.
This stores the name of the dataset (e.g. basic/user), the list of tables provided by the dataset, the primary table, a summary of the dataset, a detailed description of the dataset, whether the dataset supports streaming, and the provider class.
It also allows specification of associated datasets - tables computed from existing dataframes that the dataset provider can supply.
- associatedDatasets: list[str]
- description: str
- name: str
- primaryTable: str
- providerClass: type
- summary: str
- supportsStreaming: bool
- tables: list[str]
- class NoAssociatedDatasetsMixin[source]
Bases:
ABC
Use this mixin to provide a default implementation for a dataset provider that does not provide any associated datasets.
- static allowed_options(options=None)[source]
Decorator to enforce allowed options
Used to document and enforce which options are allowed for each dataset provider implementation. If the signature of the getTableGenerator method changes, update the DEFAULT_OPTIONS constant to include options that are always allowed.
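A sketch of how this decorator might be applied, assuming hypothetical option names:

```python
class GuardedProvider(DatasetProvider):
    @DatasetProvider.allowed_options(options=["random", "dummyValues"])
    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                          partitions=-1, **options):
        # a caller passing an option outside this list (and outside the
        # always-allowed DEFAULT_OPTIONS) is expected to be rejected
        ...
```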
- autoComputePartitions(rows, columns)[source]
Compute the number of partitions based on rows and columns
- Parameters:
rows – number of rows
columns – number of columns
- Returns:
number of partitions
The equation is based on the number of rows and columns. It produces a minimum of 4 partitions, rising to 12 partitions at 5,000,000 rows and 100 columns.
For very large tables, such as 1 billion rows and 10 columns, it produces 18 partitions, increasing logarithmically with the number of rows and columns.
Implementors of standard datasets can choose to scale this value or use their own calculation.
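A brief usage sketch, using the reference points quoted above (the exact results depend on the library's internal formula):

```python
provider = ExampleDatasetProvider()   # any concrete DatasetProvider subclass

provider.autoComputePartitions(1_000, 10)            # small table -> minimum of 4 partitions
provider.autoComputePartitions(5_000_000, 100)       # -> 12 partitions per the documented scaling
provider.autoComputePartitions(1_000_000_000, 10)    # very large table -> 18 partitions
```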
- checkOptions(options, allowedOptions)[source]
Check that options are valid
- Parameters:
options – options to check as dict
allowedOptions – allowed options as list of strings
- Returns:
self
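A sketch of a typical call inside a provider implementation, with assumed option names:

```python
class CheckedProvider(DatasetProvider):
    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                          partitions=-1, **options):
        # validate caller-supplied options before using them; returns self on success
        self.checkOptions(options, allowedOptions=["random", "dummyValues"])
        ...
```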
- abstract getAssociatedDataset(sparkSession, *, tableName=None, rows=-1, partitions=-1, **options)[source]
Gets associated datasets that are used in conjunction with the provider datasets. These may be associated lookup tables, or tables that execute benchmarks or exercise key features as part of their use.
- Parameters:
sparkSession – Spark session to use
tableName – Name of table to provide
rows – Number of rows requested
partitions – Number of partitions requested
autoSizePartitions – Whether to automatically size the partitions from the number of rows
options – Options passed to generate the table
- Returns:
DataGenerator instance to generate the table if successful; throws an error otherwise
Implementors of the individual data providers are responsible for sizing partitions for the datasets based on the number of rows and columns; the autoComputePartitions method can be used to perform this computation.
- classmethod getRegisteredDatasets()[source]
Get the registered dataset definitions
- Returns:
A dictionary of registered dataset metadata objects
- classmethod getRegisteredDatasetsVersion()[source]
Get the registered datasets version indicator
- Returns:
A version indicator for the current set of registered datasets
- abstract getTableGenerator(sparkSession, *, tableName=None, rows=-1, partitions=-1, **options)[source]
Gets a data generation instance that will produce the named table
- Parameters:
sparkSession – Spark session to use
tableName – Name of table to provide
rows – Number of rows requested
partitions – Number of partitions requested
autoSizePartitions – Whether to automatically size the partitions from the number of rows
options – Options passed to generate the table
- Returns:
DataGenerator instance to generate the table if successful; throws an error otherwise
Implementors of the individual data providers are responsible for sizing partitions for the datasets based on the number of rows and columns; the autoComputePartitions method can be used to perform this computation.
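For a provider that declares several tables, the implementation typically dispatches on tableName and falls back to the primary table when none is given. A sketch reusing the hypothetical names from the decorator example above:

```python
import dbldatagen as dg


class SalesDatasetProvider(DatasetProvider):
    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                          partitions=-1, **options):
        tableName = tableName or "customers"      # fall back to the primary table
        if rows is None or rows < 0:
            rows = self.DEFAULT_ROWS
        if partitions is None or partitions < 0:
            partitions = self.autoComputePartitions(rows, 4)

        if tableName == "customers":
            return (dg.DataGenerator(sparkSession, rows=rows, partitions=partitions)
                    .withColumn("customer_id", "long", uniqueValues=rows)
                    .withColumn("name", "string", template=r"\w \w"))
        if tableName == "orders":
            return (dg.DataGenerator(sparkSession, rows=rows, partitions=partitions)
                    .withColumn("order_id", "long", uniqueValues=rows)
                    .withColumn("amount", "double", minValue=1.0, maxValue=1000.0, random=True))
        raise ValueError(f"Unrecognized table name: {tableName}")
```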
- classmethod isValidDataProviderType(candidateDataProvider)[source]
Check if object is a valid data provider type
- Parameters:
candidateDataProvider – potential Dataset provider class
- Returns:
True if valid DatasetProvider type, False otherwise
- classmethod registerDataset(datasetProvider)[source]
Register the dataset provider type using metadata defined in the dataset provider
- Parameters:
datasetProvider – Dataset provider class
- Returns:
None
The dataset provider argument should be a subclass of the DatasetProvider class.
It will retrieve the DatasetDefinition populated during creation by the decorator, which should contain the name of the dataset, the list of tables provided by the dataset, the primary table, a summary of the dataset, a detailed description of the dataset, whether the dataset supports streaming, and the provider class.
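A minimal sketch of explicit registration, reusing the hypothetical provider from above:

```python
# explicit registration - the alternative to autoRegister=True in the decorator;
# registerDataset reads the DatasetDefinition attached by the decorator
DatasetProvider.registerDataset(SalesDatasetProvider)

# inspect the registry (assumed to be keyed by dataset name)
print(DatasetProvider.getRegisteredDatasets())
```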
- dataset_definition(cls=None, *args, autoRegister=False, **kwargs)[source]
Decorator to define a standard dataset definition
This is intended to be applied to classes derived from DatasetProvider to simplify the implementation of the predefined datasets.
- Parameters:
cls – class object for subclass of DatasetProvider
args – positional args
autoRegister – whether to automatically register the dataset
kwargs – keyword args
- Returns:
either instance of DatasetDecoratorUtils or function which will produce instance of this when called
This function is intended to be used as a decorator.
When applied without arguments, it will return a nested wrapper function which will take the subsequent class object and apply the DatasetDecoratorUtils to it.
When applied with arguments, the arguments will be passed to the constructor of the DatasetDecoratorUtils.
This allows for the use of either syntax for decorators:

```python
@dataset_definition
class X(DatasetProvider):
    ...
```

or:

```python
@dataset_definition(name="basic/basic", tables=["primary"])
class X(DatasetProvider):
    ...
```