dbldatagen.datasets.dataset_provider module

This module defines the DatasetProvider class.

class DatasetProvider[source]

Bases: ABC

The DatasetProvider class acts as a base class for all dataset providers

Implementors should override name, summary, description, etc. using the dataset_definition decorator. Subclasses should not require the constructor to take any arguments; arguments for the dataset should be passed to the getTableGenerator method.

If no table name is specified, it defaults to a table name corresponding to the DEFAULT_TABLE_NAME constant.

Dataset providers that produce multiple tables or need to explicitly name the table should list which tables are provided and what the primary table to be retrieved would be if no table name is specified.

The intent of the supportsStreaming flag is to mark a dataset as streaming compatible. It does not require specific streaming support; it only indicates that the dataset provider does not use any operations that would be disallowed or very inefficient for a streaming dataframe.

Examples of operations that might prevent a dataframe from being used in streaming include:
  • use of some types of windowing expressions

  • use of drop duplicates (not disallowed, but can only be applied efficiently at the streaming microbatch level)

Note that the DatasetDecoratorUtils inner class is used as a decorator for subclasses of this class and will overwrite the constants _DATASET_NAME, _DATASET_TABLES, _DATASET_DESCRIPTION, _DATASET_SUMMARY and _DATASET_SUPPORTS_STREAMING.

Derived DatasetProvider classes need to be registered with the DatasetProvider class to be used in the system (at least to be discoverable and creatable via the Datasets object). This is done by calling the registerDataset method with the provider class; the dataset definition attached by the decorator is retrieved automatically.

The dataset provider is responsible for determining the default number of rows when the caller passes None or -1 for the rows parameter.

Registration can be done manually or automatically by setting the autoRegister flag in the decorator to True.

By default, all DatasetProvider classes should support batch usage. If a dataset provider supports streaming usage, the flag supportsStreaming should be set to True in the decorator.
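
For illustration, a minimal provider built on this API might look like the following sketch. It uses the module path shown in this page's heading; the dataset name "demo/users", the class DemoUserProvider, and the column definitions are hypothetical.

import dbldatagen as dg
from dbldatagen.datasets.dataset_provider import DatasetProvider, dataset_definition


@dataset_definition(name="demo/users", summary="Demo user dataset",
                    supportsStreaming=True, autoRegister=True)
class DemoUserProvider(DatasetProvider.NoAssociatedDatasetsMixin, DatasetProvider):
    """Provides a simple synthetic user table."""

    def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                          partitions=-1, **options):
        # the provider supplies defaults when the caller passes None or -1
        if rows is None or rows < 0:
            rows = self.DEFAULT_ROWS
        if partitions is None or partitions < 0:
            partitions = self.autoComputePartitions(rows, 3)

        # describe the table; data is produced only when build() is called on the result
        return (dg.DataGenerator(sparkSession, name="demo_users",
                                 rows=rows, partitions=partitions)
                .withColumn("user_id", "long", minValue=1, uniqueValues=rows)
                .withColumn("name", "string", template=r"\w \w")
                .withColumn("email", "string", template=r"\w.\w@\w.com"))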

DEFAULT_PARTITIONS = 4
DEFAULT_ROWS = 100000
DEFAULT_TABLE_NAME = 'main'
class DatasetDecoratorUtils(cls=None, *, name=None, tables=None, primaryTable=None, summary=None, description=None, associatedDatasets=None, supportsStreaming=False)[source]

Bases: object

Defines the dataset_definition decorator

Parameters:
  • cls – target class to apply decorator to

  • name – name of the dataset

  • tables – list of tables produced by the dataset provider. If None, defaults to [DEFAULT_TABLE_NAME]

  • primaryTable – primary table provided by dataset. Defaults to first table of table list

  • summary – Summary information for the dataset. If None, will be derived from target class name

  • description – Detailed description of the class. If None, will use the target class doc string

  • associatedDatasets – list of associated datasets produced by the dataset provider

  • supportsStreaming – Whether data set can be used in streaming scenarios

mkClass(autoRegister=False)[source]

Make the modified class for the data provider

Applies the decorator args as a metadata object on the class. This is done at the class level as there is no instance of the target class at this point.

Returns:

Returns the target class object
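
As an illustrative sketch only (not a documented usage pattern), applying the decorator with arguments behaves roughly like constructing the utility class around a provider class and calling mkClass; MyProvider below is a hypothetical subclass of DatasetProvider.

decorated_cls = DatasetProvider.DatasetDecoratorUtils(
    cls=MyProvider, name="demo/users", summary="Demo user dataset"
).mkClass(autoRegister=True)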

class DatasetDefinition(name: str, tables: list[str], primaryTable: str, summary: str, description: str, supportsStreaming: bool, providerClass: type, associatedDatasets: list[str])[source]

Bases: object

Dataset Definition class - stores the attributes related to the dataset for use by the implementation of the decorator.

This stores the name of the dataset (e.g. basic/user), the list of tables provided by the dataset, the primary table, a summary of the dataset, a detailed description of the dataset, whether the dataset supports streaming, and the provider class.

It also allows specification of associated datasets (tables computed from existing dataframes) that can be provided by the dataset provider.

associatedDatasets: list[str]
description: str
name: str
primaryTable: str
providerClass: type
summary: str
supportsStreaming: bool
tables: list[str]
class NoAssociatedDatasetsMixin[source]

Bases: ABC

Use this mixin to provide a default implementation of getAssociatedDataset for data providers that do not produce any associated datasets

getAssociatedDataset(sparkSession, *, tableName=None, rows=-1, partitions=-1, **options)[source]
static allowed_options(options=None)[source]

Decorator to enforce allowed options

Used to document and enforce which options are allowed for each dataset provider implementation. If the signature of the getTableGenerator method changes, update the DEFAULT_OPTIONS constant to include options that are always allowed.
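
For example, a provider might constrain and document its supported options as sketched below; the option names "random" and "dummyValues" are hypothetical.

@DatasetProvider.allowed_options(options=["random", "dummyValues"])
def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1,
                      partitions=-1, **options):
    ...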

autoComputePartitions(rows, columns)[source]

Compute the number of partitions based on rows and columns

Parameters:
  • rows – number of rows

  • columns – number of columns

Returns:

number of partitions

The equation is based on the number of rows and columns. It will produce a minimum of 4 partitions, rising to approximately 12 partitions for 5,000,000 rows and 100 columns.

For very large tables, such as 1 billion rows and 10 columns, it will produce around 18 partitions, increasing logarithmically with the number of rows and columns.

Implementors of standard datasets can choose to scale this value or use their own calculation.
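
As a rough illustration of this scaling (assuming provider is an instance of a concrete DatasetProvider subclass; the commented values are approximate):

provider.autoComputePartitions(100_000, 10)         # minimum of 4 partitions
provider.autoComputePartitions(5_000_000, 100)      # roughly 12 partitions
provider.autoComputePartitions(1_000_000_000, 10)   # roughly 18 partitions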

checkOptions(options, allowedOptions)[source]

Check that options are valid

Parameters:
  • options – options to check as dict

  • allowedOptions – allowed options as list of strings

Returns:

self
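
A provider implementation would typically call this from within getTableGenerator, for example (option names hypothetical):

self.checkOptions(options, ["random", "seed"])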

abstract getAssociatedDataset(sparkSession, *, tableName=None, rows=-1, partitions=-1, **options)[source]

Gets associated datasets that are used in conjunction with the provider datasets. These may be associated lookup tables, or tables that execute benchmarks or exercise key features as part of their use.

Parameters:
  • sparkSession – Spark session to use

  • tableName – Name of table to provide

  • rows – Number of rows requested

  • partitions – Number of partitions requested

  • autoSizePartitions – Whether to automatically size the partitions from the number of rows

  • options – Options passed to generate the table

Returns:

DataGenerator instance to generate table if successful, throws error otherwise

Implementors of the individual data providers are responsible for sizing partitions for the datasets based on the number of rows and columns. The number of partitions can be computed based on the number of rows and columns using the autoComputePartitions method.

classmethod getDatasetDefinition()[source]

Get the dataset definition for the class

classmethod getDatasetTables()[source]

Get the dataset tables list for the class

classmethod getRegisteredDatasets()[source]

Get the registered dataset definitions

Returns:

A dictionary of registered dataset metadata objects

classmethod getRegisteredDatasetsVersion()[source]

Get the registered datasets version indicator

Returns:

A version indicator for the registered datasets
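
For example, the registry can be inspected as sketched below (assuming the returned dictionary is keyed by dataset name):

registered = DatasetProvider.getRegisteredDatasets()
for name, definition in registered.items():
    print(name, definition.primaryTable, definition.tables)

version = DatasetProvider.getRegisteredDatasetsVersion()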

abstract getTableGenerator(sparkSession, *, tableName=None, rows=-1, partitions=-1, **options)[source]

Gets data generation instance that will produce table for named table

Parameters:
  • sparkSession – Spark session to use

  • tableName – Name of table to provide

  • rows – Number of rows requested

  • partitions – Number of partitions requested

  • autoSizePartitions – Whether to automatically size the partitions from the number of rows

  • options – Options passed to generate the table

Returns:

DataGenerator instance to generate table if successful, throws error otherwise

Implementors of the individual data providers are responsible for sizing partitions for the datasets based on the number of rows and columns. The number of partitions can be computed based on the number of rows and columns using the autoComputePartitions method.
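
On the caller side, a typical flow is sketched below, using the hypothetical DemoUserProvider from the earlier example and assuming spark is an active SparkSession.

provider = DemoUserProvider()                            # providers take no constructor arguments
gen = provider.getTableGenerator(spark, rows=100_000)    # tableName omitted; uses the default table
df = gen.build()                                         # materialize the generated data as a DataFrame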

classmethod isValidDataProviderType(candidateDataProvider)[source]

Check if object is a valid data provider type

Parameters:

candidateDataProvider – potential Dataset provider class

Returns:

True if valid DatasetProvider type, False otherwise
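
For example, using the hypothetical DemoUserProvider from the earlier sketch:

DatasetProvider.isValidDataProviderType(DemoUserProvider)   # True
DatasetProvider.isValidDataProviderType(str)                # False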

classmethod registerDataset(datasetProvider)[source]

Register the dataset provider type using metadata defined in the dataset provider

Parameters:

datasetProvider – Dataset provider class

Returns:

None

The dataset provider argument should be a subclass of the DatasetProvider class.

It will retrieve the DatasetDefinition populated by the decorator during class creation, which should contain the name of the dataset, the list of tables provided by the dataset, the primary table, a summary of the dataset, a detailed description of the dataset, whether the dataset supports streaming, and the provider class.

classmethod unregisterDataset(name)[source]

Unregister the dataset with the specified name

Parameters:

name – Name of the dataset to unregister
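
A manual registration and removal sequence might look like the sketch below, assuming the hypothetical DemoUserProvider was decorated without autoRegister=True and uses the dataset name "demo/users".

DatasetProvider.registerDataset(DemoUserProvider)
print(DatasetProvider.getRegisteredDatasets().keys())    # should now include "demo/users"

DatasetProvider.unregisterDataset("demo/users")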

dataset_definition(cls=None, *args, autoRegister=False, **kwargs)[source]

Decorator to define a standard dataset definition

This is intended to be applied to classes derived from DatasetProvider to simplify the implementation of predefined datasets.

Parameters:
  • cls – class object for subclass of DatasetProvider

  • args – positional args

  • autoRegister – whether to automatically register the dataset

  • kwargs – keyword args

Returns:

Either an instance of DatasetDecoratorUtils or a function which will produce an instance of it when called

This function is intended to be used as a decorator.

When applied without arguments, it will return a nested wrapper function which will take the subsequent class object and apply the DatasetDecoratorUtils to it.

When applied with arguments, the arguments will be passed to the constructor of the DatasetDecoratorUtils.

This allows for the use of either syntax for decorators:

@dataset_definition
class X(DatasetProvider):
    ...

or

@dataset_definition(name="basic/basic", tables=["primary"])
class X(DatasetProvider):
    ...