dbldatagen.schema_parser module

This module defines the SchemaParser class.

class SchemaParser[source]

Bases: object

SchemaParser class

Creates PySpark SQL data types from strings

classmethod columnTypeFromString(type_string)[source]

Generate a Spark SQL data type from a string

Allowable options for the type_string parameter are:
  • string, varchar, char, nvarchar

  • int, integer

  • bigint, long

  • bool, boolean

  • smallint, short

  • binary

  • tinyint, byte

  • date

  • timestamp, datetime

  • double, float

  • decimal or decimal(p) or decimal(p, s) or number(p, s)

  • map<type1, type2> where type1 and type2 are type definitions of the form accepted by the parser

  • array<type1> where type1 is a type definition of the form accepted by the parser

  • struct<a:binary, b:int, c:float>

Type definitions may be nested recursively. For example, the following are valid type definitions:
  • array<array<int>>

  • struct<a:array<int>, b:int, c:float>

  • map<string, struct<a:array<int>, b:int, c:float>>

Parameters:

type_string – String representation of a SQL type, such as ‘integer’

Returns:

Spark SQL type
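
For illustration, a minimal usage sketch is shown below; the commented return values are assumptions based on the type names listed above rather than verified output.

    from dbldatagen.schema_parser import SchemaParser

    # Simple scalar types - expected results shown in comments are assumptions
    int_type = SchemaParser.columnTypeFromString("int")             # e.g. IntegerType()
    dec_type = SchemaParser.columnTypeFromString("decimal(10, 2)")  # e.g. DecimalType(10, 2)

    # Nested type definition
    nested_type = SchemaParser.columnTypeFromString("array<struct<a:int, b:string>>")
    print(int_type, dec_type, nested_type)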

classmethod columnsReferencesFromSQLString(sql_string, filterItems=None)[source]

Generate a list of possible column references from a SQL string

This method finds all candidate references to SQL column ids in the string.

To avoid the overhead of a full SQL parser, the implementation simply looks for possible field names.

Further improvements may eliminate some common syntax, but in the current form, reserved words will also be returned as possible column references.

Callers must therefore not assume that every returned reference is a valid column reference.

Parameters:
  • sql_string – String representation of SQL expression

  • filterItems – restrict the results to items appearing in the supplied list

Returns:

list of possible column references
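
A hypothetical usage sketch follows; the SQL expression and column names are invented for illustration, and, per the caveats above, non-column identifiers such as function names may also appear in the unfiltered results.

    from dbldatagen.schema_parser import SchemaParser

    sql_expr = "concat(first_name, ' ', last_name)"

    # Unfiltered - may include non-column identifiers such as function names
    candidates = SchemaParser.columnsReferencesFromSQLString(sql_expr)

    # Filtered - restrict candidates to a known set of column names
    known_columns = ["first_name", "last_name", "age"]
    column_refs = SchemaParser.columnsReferencesFromSQLString(sql_expr, filterItems=known_columns)
    print(candidates, column_refs)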

classmethod getTypeDefinitionParser()[source]

Define a pyparsing-based parser for Spark SQL type definitions

Allowable constructs for the generated type parser are:
  • string, varchar, char, nvarchar

  • int, integer

  • bigint, long

  • bool, boolean

  • smallint, short

  • binary

  • tinyint, byte

  • date

  • timestamp, datetime

  • double, float

  • decimal or decimal(p) or decimal(p, s) or number(p, s)

  • map<type1, type2> where type1 and type2 are type definitions of the form accepted by the parser

  • array<type1> where type1 is a type definition of the form accepted by the parser

  • struct<a:binary, b:int, c:float>

Type definitions may be nested recursively. For example, the following are valid type definitions:
  • array<array<int>>

  • struct<a:array<int>, b:int, c:float>

  • map<string, struct<a:array<int>, b:int, c:float>>

Returns:

parser

See the pyparsing package (https://pypi.org/project/pyparsing/) for details of how the parser mechanism works.
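
The sketch below assumes the returned object behaves as a standard pyparsing parser element; the structure of the parse results is not documented here and is an assumption.

    from dbldatagen.schema_parser import SchemaParser

    # Obtain the pyparsing-based parser and parse a nested type definition string
    type_parser = SchemaParser.getTypeDefinitionParser()
    parse_result = type_parser.parseString("map<string, array<int>>")
    print(parse_result)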

classmethod parseCreateTable(sparkSession, source_schema)[source]

Parse a schema from a schema string

Parameters:
  • sparkSession – spark session to use

  • source_schema – should be a table definition minus the create table statement

Returns:

Spark SQL schema instance
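
A minimal sketch follows, assuming source_schema is the table name and column list without the leading create table keywords; the exact expected format of the definition string is an assumption based on the parameter description above.

    from pyspark.sql import SparkSession
    from dbldatagen.schema_parser import SchemaParser

    spark = SparkSession.builder.getOrCreate()

    # Assumed format: table name plus column list, without "create table"
    table_definition = """test_table (
        id bigint,
        name string,
        amount decimal(10, 2)
    )"""

    schema = SchemaParser.parseCreateTable(spark, table_definition)
    print(schema)  # expected to be a pyspark.sql.types.StructType instance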