dbldatagen.data_analyzer module

This module defines the DataAnalyzer class.

This code is experimental; both the APIs and the generated code are liable to change in future versions.

class DataAnalyzer(df=None, sparkSession=None, debug=False, verbose=False)

Bases: object

This class analyzes an existing data set to assist in generating a test data set with similar characteristics, and generates code from existing schemas and data.

Parameters:

df – Spark dataframe to analyze
sparkSession – Spark session instance to use when performing spark operations
debug – If True, additional debug information is logged
verbose – If True, additional information is logged

Warning

Experimental

scriptDataGeneratorFromData(suppressOutput=False, name=None)

Generate outline data generator code from an existing dataframe

This will generate a data generator spec from an existing dataframe. The resulting code can be used to generate a data generation specification.

Note that, at this point in time, the generated code is stub code only. For most uses it will require further modification; however, it provides a starting point for generating the specification for a given data set.

The dataframe to be analyzed is the Spark dataframe passed to the constructor of the DataAnalyzer object

Parameters:

suppressOutput – Suppress printing of generated code if True
name – Optional name for data generator

Returns:

String containing skeleton code
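As a minimal sketch of the call sequence described above, the helper below builds a DataAnalyzer around a dataframe and returns the stub code as a string. The helper name `outline_generator_code` is hypothetical, and the sketch assumes that dbldatagen and PySpark are installed and that the caller already has a Spark session and a dataframe.

```python
def outline_generator_code(df, spark, name=None):
    """Return stub data generator code for `df` (hypothetical helper)."""
    # Imported lazily so the helper can be defined before Spark is available.
    from dbldatagen.data_analyzer import DataAnalyzer

    # The dataframe to analyze is the one passed to the constructor.
    analyzer = DataAnalyzer(df=df, sparkSession=spark)

    # suppressOutput=True suppresses printing; the code is returned as a string.
    return analyzer.scriptDataGeneratorFromData(suppressOutput=True, name=name)
```

The returned string is a starting point only and is expected to need manual editing before it is used as a data generation specification.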

classmethod scriptDataGeneratorFromSchema(schema, suppressOutput=False, name=None)

Generate outline data generator code from an existing schema

This will generate a data generator spec from an existing schema. The resulting code can be used to generate a data generation specification.

Note that, at this point in time, the generated code is stub code only. For most uses it will require further modification; however, it provides a starting point for generating the specification for a given schema.

Because this is a class method, no DataAnalyzer instance or dataframe is needed; the code is generated from the supplied schema alone.

Parameters:

schema – PySpark schema, i.e. a manually constructed StructType or the return value of dataframe.schema
suppressOutput – Suppress printing of generated code if True
name – Optional name for data generator

Returns:

String containing skeleton code
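The sketch below illustrates calling this method on the class itself with a manually constructed schema. Both helper names (`example_schema`, `outline_code_from_schema`) and the two example fields are hypothetical; the sketch assumes dbldatagen and PySpark are installed.

```python
def example_schema():
    """Build a small illustrative schema (field names are made up)."""
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    return StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ])


def outline_code_from_schema(schema, name=None):
    """Return stub data generator code for `schema` (hypothetical helper)."""
    # Imported lazily so the helper can be defined before Spark is available.
    from dbldatagen.data_analyzer import DataAnalyzer

    # Invoked on the class: no analyzer instance is constructed here.
    return DataAnalyzer.scriptDataGeneratorFromSchema(
        schema, suppressOutput=True, name=name
    )
```

A schema taken from an existing dataframe, such as `df.schema`, could be passed in place of `example_schema()`.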

summarize(suppressOutput=False)

Generate summary analysis of data set and return / print summary results

Parameters:

suppressOutput – If False, also prints the results to the console

Returns:

Summary results as string
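A short sketch of capturing the summary as a string without printing it, for example to write it to a log or a file. The helper name `summary_text` is hypothetical, and the sketch assumes dbldatagen and PySpark are installed and that a Spark session and dataframe already exist.

```python
def summary_text(df, spark):
    """Return the summary analysis of `df` as a string (hypothetical helper)."""
    # Imported lazily so the helper can be defined before Spark is available.
    from dbldatagen.data_analyzer import DataAnalyzer

    analyzer = DataAnalyzer(df=df, sparkSession=spark)

    # suppressOutput=True skips the console print; the summary is still returned.
    return analyzer.summarize(suppressOutput=True)
```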

summarizeToDF()

Generate summary analysis of data set as dataframe

Returns:

Summary results as dataframe

The resulting dataframe can be displayed with the display function in a notebook environment or with the show method.

The output is also used in code generation to generate more accurate code.
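As a rough illustration of the dataframe form, the sketch below requests the summary and displays it with the standard Spark `show` method. The helper name `show_summary` is hypothetical, and the sketch assumes dbldatagen and PySpark are installed and that a Spark session and dataframe already exist.

```python
def show_summary(df, spark):
    """Print and return the summary dataframe for `df` (hypothetical helper)."""
    # Imported lazily so the helper can be defined before Spark is available.
    from dbldatagen.data_analyzer import DataAnalyzer

    summary_df = DataAnalyzer(df=df, sparkSession=spark).summarizeToDF()

    # show() prints the dataframe to the console; in a notebook environment
    # the display function could be used on summary_df instead.
    summary_df.show()
    return summary_df
```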