dbldatagen.data_analyzer module
This module defines the DataAnalyzer class.
This code is experimental; both the APIs and the generated code are liable to change in future versions.
- class DataAnalyzer(df=None, sparkSession=None, debug=False, verbose=False)[source]
Bases: object
This class is used to analyze an existing data set to assist in generating a test data set with similar characteristics, and to generate code from existing schemas and data
- Parameters:
df – Spark dataframe to analyze
sparkSession – Spark session instance to use when performing spark operations
debug – If True, additional debug information is logged
verbose – If True, additional information is logged
Warning
Experimental
- scriptDataGeneratorFromData(suppressOutput=False, name=None)[source]
Generate outline data generator code from an existing dataframe
This will generate a data generator spec from an existing dataframe. The resulting code can be used to generate a data generation specification.
Note that the generated code is currently stub code only. For most uses, it will require further modification; however, it provides a starting point for generating the specification for a given data set.
The dataframe to be analyzed is the Spark dataframe passed to the constructor of the DataAnalyzer object.
- Parameters:
suppressOutput – Suppress printing of generated code if True
name – Optional name for data generator
- Returns:
String containing skeleton code
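As a minimal sketch of the call described above, the helper below wraps an existing dataframe in a DataAnalyzer and returns the stub code as a string. The helper name `script_generator_from_dataframe` and its `df` / `spark` arguments are hypothetical, introduced only for illustration; the constructor and method signatures are those listed in this reference.

```python
def script_generator_from_dataframe(df, spark, name=None):
    """Return stub data generator code for an existing Spark dataframe.

    Hypothetical helper: `df` is the dataframe to analyze and `spark`
    is an active Spark session supplied by the caller.
    """
    # Imported lazily so the helper can be defined even where
    # dbldatagen is not installed.
    import dbldatagen as dg

    analyzer = dg.DataAnalyzer(df=df, sparkSession=spark)
    # suppressOutput=True returns the generated code without printing it.
    return analyzer.scriptDataGeneratorFromData(suppressOutput=True, name=name)
```

The returned string is skeleton code, so it would normally be pasted into a notebook or script and edited by hand before use.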
- classmethod scriptDataGeneratorFromSchema(schema, suppressOutput=False, name=None)[source]
Generate outline data generator code from an existing schema
This will generate a data generator spec from an existing schema. The resulting code can be used to generate a data generation specification.
Note that the generated code is currently stub code only. For most uses, it will require further modification; however, it provides a starting point for generating the specification for a given data set.
The schema to be analyzed is the one passed as the schema argument. Because this is a class method, no DataAnalyzer instance or dataframe is needed.
- Parameters:
schema – PySpark schema, i.e. a manually constructed StructType or the return value of dataframe.schema
suppressOutput – Suppress printing of generated code if True
name – Optional name for data generator
- Returns:
String containing skeleton code
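The sketch below illustrates the schema-based variant: one hypothetical helper builds a small StructType by hand, and another passes a schema to the class method. The helper names and the `customer_id` / `name` columns are assumptions made up for the example. Whether the call works without an active Spark session is not stated in this reference, so that is left to the caller to verify.

```python
def example_schema():
    """Build a small, manually constructed schema (hypothetical columns)."""
    # Imported lazily so the helper can be defined without pyspark installed.
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )

    return StructType([
        StructField("customer_id", IntegerType(), True),
        StructField("name", StringType(), True),
    ])


def script_generator_from_schema(schema, name=None):
    """Return stub data generator code for the given PySpark schema."""
    import dbldatagen as dg

    # Called on the class itself; no DataAnalyzer instance is created.
    return dg.DataAnalyzer.scriptDataGeneratorFromSchema(
        schema, suppressOutput=True, name=name
    )
```

A typical call would be `script_generator_from_schema(example_schema())`, or `script_generator_from_schema(some_df.schema)` for the schema of an existing dataframe.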
- summarize(suppressOutput=False)[source]
Generate a summary analysis of the data set, return the summary results, and optionally print them
- Parameters:
suppressOutput – If False, also prints the results to the console
- Returns:
Summary results as string
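For cases where the summary should be captured rather than printed (for example, to write it to a log), a sketch along these lines may help. The helper name `summarize_quietly` and its arguments are hypothetical; only the `summarize` signature comes from this reference.

```python
def summarize_quietly(df, spark):
    """Return the summary of `df` as a string without printing it.

    Hypothetical helper: `df` is the dataframe to analyze and `spark`
    is an active Spark session supplied by the caller.
    """
    # Imported lazily so the helper can be defined even where
    # dbldatagen is not installed.
    import dbldatagen as dg

    analyzer = dg.DataAnalyzer(df=df, sparkSession=spark)
    # With suppressOutput=True the results are returned but not printed.
    return analyzer.summarize(suppressOutput=True)
```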
- summarizeToDF()[source]
Generate summary analysis of data set as dataframe
- Returns:
Summary results as dataframe
The resulting dataframe can be displayed with the display function in a notebook environment or with the show method.
The output is also used in code generation to generate more accurate code.
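The following sketch shows one way to obtain the summary as a dataframe and display it with the show method. The helper name `show_summary` and its arguments are hypothetical. In a notebook environment, the returned dataframe could be passed to the display function instead.

```python
def show_summary(df, spark):
    """Print the summary dataframe for `df` and return it.

    Hypothetical helper: `df` is the dataframe to analyze and `spark`
    is an active Spark session supplied by the caller.
    """
    # Imported lazily so the helper can be defined even where
    # dbldatagen is not installed.
    import dbldatagen as dg

    analyzer = dg.DataAnalyzer(df=df, sparkSession=spark)
    summary_df = analyzer.summarizeToDF()
    # show() is the standard Spark dataframe method for console output.
    summary_df.show()
    return summary_df
```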