Extending BladeBridge Configurations

Overview

The BladeBridge transpiler relies heavily on rules defined in configuration files shipped with the converter. These configurations consist of a set of layered JSON files and code templates that drive the generation of output files and the application of conversion rules.

Similar configuration concepts apply across all BladeBridge conversion paths, although the structure of SQL-to-SQL configuration files differs somewhat from that of ETL-to-PySpark/SparkSQL/DBSQL files, since ETL conversions typically deal with both the translation of ETL logic and the translation of embedded SQL statements (data sourcing, pre/post/inline SQL statements).

In some migration projects, users may want to augment/override the conversion rules provided with the BladeBridge converter. For this reason, engineers should know how to:

  • Extend the converter logic
  • Provide your own conversion rules
  • Custom-control the output
  • Troubleshoot issues

This page provides the information needed to do just that.

Supplying a Custom Configuration File

When running the BladeBridge converter from Lakebridge, a custom configuration file can be supplied to the converter. To register a custom configuration file for transpilation, execute the install-transpile command and, at one of the prompts, specify the custom configuration file path:

databricks labs lakebridge install-transpile
Do you want to override the existing installation? (default: no): yes
Specify the config file to override the default[Bladebridge] config - press <enter> for none (default: <none>):
<local_full_path>/custom_<source>2databricks.json

Please read the subsequent sections of this document to learn how to create and extend BladeBridge configurations.

Creating the custom configuration file

If you want to start from scratch and use only your new custom file, without the provided configurations, create an empty .json file and specify the required configuration following the guidance in the sections below.

If instead you want to augment and/or override the existing configurations:

  • Take note of where the provided configurations are located: <user_home_directory>/.databricks/labs/remorph-transpilers/bladebridge/lib/.venv/lib/python3.10/site-packages/databricks/labs/bladebridge/Converter/Configs

  • In your new configuration file, you need to specify that the file inherits from the supplied configuration. This enables layered rule definitions and promotes reuse and consistency across configurations.

    inherit_from is an array of JSON filenames that the current file inherits from. Multiple file inheritances are allowed. If a full path (containing a forward slash) is supplied, the converter will try to read the file at that path; otherwise, it will look for the file in the same folder as the current JSON file.

    Example:

    "inherit_from":["/Users/user.name/.databricks/labs/remorph-transpilers/bladebridge/lib/.venv/lib/python3.10/site-packages/databricks/labs/bladebridge/Converter/Configs/base_oracle2databricks_sql.json"]
    Names of the base files per source you might want to inherit from:

      - INFAPC:
        - Target SPARKSQL: "base_infapc2databricks_sparksql.json"
        - Target PYSPARK: "base_infapc2databricks_pyspark.json"
      - DATASTAGE:
        - Target SPARKSQL: "base_datastage2databricks_sparksql.json"
        - Target PYSPARK: "base_datastage2databricks_pyspark.json"
      - SYNAPSE: "base_synapse2databricks_sql.json"
      - ORACLE: "base_oracle2databricks_sql.json"
      - MSSQL: "base_sqlserver2databricks_sql.json"
      - NETEZZA: "base_netezza2databricks_sql.json"
      - TERADATA: "base_teradata2databricks_sql.json"
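
Putting these pieces together, a minimal custom configuration file could look like the sketch below. The file name, the placeholder path and the extra substitution rule are purely illustrative and not part of the shipped configurations; <path_to_Configs> stands for the Configs directory noted above:

{
  "inherit_from": ["<path_to_Configs>/base_oracle2databricks_sql.json"],
  "line_subst": [
    { "from": "\\bMY_LEGACY_UDF\\b", "to": "my_catalog.my_schema.my_udf" }
  ]
}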

Basic Converter Rules

When it comes to converting individual SQL code snippets or ETL expressions, BladeBridge uses a basic set of rules to capture certain source patterns and produce converted output. The three main types of syntax manipulation rules are:

  • line_subst
  • block_subst
  • function_subst

They are executed in the order given above. Within each of these sections there is an array of rules that are executed in the order in which they are listed. Longer and more specific patterns should typically precede shorter and more generic patterns. E.g.: "from" : "varchar" should be listed before "from" : "char", since varchar is a longer pattern than char (see the ordering sketch after the sample snippet below).

Sample configuration file snippet:

{
  "line_subst": [
    { "from": "\\bvarchar\\b", "to": "string" },
    { "from": "\\bSYSDATE\\b", "to": "CURRENT_TIMESTAMP()" }
  ],
  "block_subst": [
    { "from": "\\bSET\\s+\\w+\\s+ON\\b", "to": "" },
    { "from": "\\bCREATE\\s+VIEW\\b", "to": "CREATE OR REPLACE VIEW" }
  ],
  "function_subst": [
    { "from": "CONVERT", "output_template": "CAST($2 AS $1)", "num_args": 2 }, // here, $1 and $2 refer to the 1st and 2nd arguments of the function call `convert`
    { "from": "ISNULL", "to": "COALESCE" }
  ]
}
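
To make the ordering guidance concrete, here is a minimal sketch of two line_subst rules without word-boundary anchors (illustrative only, not taken from the shipped rules):

"line_subst": [
  { "from": "varchar", "to": "string" },
  { "from": "char", "to": "string" }
]

If the shorter char rule were listed first, a column declared as varchar(10) would be rewritten to varstring(10) before the varchar rule ever ran; keeping the longer pattern first avoids this.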

line_subst

Points to an array of substitution instructions to be performed on a single line. When this directive is given, the converter will apply all substitutions for each line. Each element of the array is a structure with the following elements.

  • from: Specifies the pattern to capture. Use parentheses to capture tokens and substitute them. Example: "CREATE TABLE\s+(\w+)", where (\w+) represents the 1st token.
  • to: Specifies the pattern to replace with. Dollar variables from $1 to $9 are used to replace captured tokens. Example: "CREATE OR REPLACE TABLE $1", which plugs in the table name captured above.
  • statement_categories: Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. Example: ["TABLE_DDL", "VIEW_DDL"]
  • exact_match: If set to "1", performs a case-insensitive exact match instead of regex matching.
  • first_match: If set to "1" and a match occurs, stops checking subsequent patterns in the line_subst array.
  • exclude_categories: List of excluded categories. Reverse of statement_categories.
  • extension_call: Invokes an external routine instead of using the to pattern. Useful for complex logic. See the advanced section for details.
  • relative_fragment_pattern: Allows the pattern to be searched within specific code fragments supplied by relative_fragment_offset. Example: "relative_fragment_pattern": "ACTIVITYCOUNT = 0"
  • relative_fragment_offset: List of fragment offsets to search in when using relative_fragment_pattern. Example: "relative_fragment_offset": "1,2"
  • upcase_string: Upcases the output string. Example: {"from" : "#(\w+)#", "to" : "${$1}", "upcase_string" : true}

Note: In the to attribute, you can use tokens $1 to $9 to refer to regex match groups.

Example:

  • Source: #my_var# + 10 + p_curr_date_of_month

  • Rule:

    {"from": "#(\\w+)#", "to": "${$1}", "upcase_string": true}

  • Result: ${MY_VAR} + 10 + p_curr_date_of_month
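
Another hedged sketch combining several of the attributes above (the NUMBER type mapping shown is illustrative, not taken from the shipped rules):

"line_subst": [
  { "from": "\\bNUMBER\\b", "to": "DECIMAL(38,18)", "statement_categories": ["TABLE_DDL"], "first_match": "1" },
  { "from": "\\bNUMBER\\b", "to": "DOUBLE", "exclude_categories": ["TABLE_DDL"] }
]

Inside table DDL the first rule applies and, because of first_match, no later line_subst patterns are checked for that line; in all other statement categories the second rule applies instead.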

block_subst

Points to an array of substitution instructions to be performed on a statement block. When this directive is given, the converter will apply all substitutions to each statement block. This instruction is useful when there is a need to restructure a block of code (e.g. move around clauses of a statement) or when there is a high likelihood that a pattern will span multiple lines. Note that a block of code typically refers to an entire SQL statement, such as a DML statement inside a stored procedure, or a view or table definition. Prior to processing SQL content, the converter splits multi-statement content into a series of blocks and operates on each block of code individually.

Note: while block_subst rules operate in a very similar manner to line_subst rules, they are costlier in terms of compute time. For this reason, it is better not to overload the block_subst section with rules that are simple and typically defined on a single line. E.g.: converting varchar to string should be done in line_subst, and not in block_subst.

Each element of the rules array is a structure with the following elements.

  • from: Specifies the pattern to capture. Use parentheses to capture tokens and substitute them. Example: "CREATE TABLE\s+(\w+)", where (\w+) represents the 1st token.
  • to: Specifies the pattern to replace with. Dollar variables from $1 to $9 can be used to replace captured tokens. Example: "CREATE OR REPLACE TABLE $1", which plugs in the table name captured above.
  • statement_categories: Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. Example: ["TABLE_DDL", "VIEW_DDL"]
  • first_match: If set to "1" and a match occurs, stops checking subsequent patterns in the block_subst array.
  • extension_call: Invokes an external custom routine instead of substituting with the target pattern. Useful for complex logic. See the advanced section for more details.
  • force_alias_usage: Enforces usage of aliases in WHERE, SELECT, and JOIN clauses. Ensures references to base objects are replaced with aliases where required by the target database.
  • relative_fragment_pattern: Searches for the pattern within specific code fragments defined using relative_fragment_offset. Example: "relative_fragment_pattern": "ACTIVITYCOUNT = 0"
  • relative_fragment_offset: A list of fragment offsets to search when using relative_fragment_pattern. Example: "relative_fragment_offset": "1,2"
  • debug_tag: Specifies a debug tag, which will show in the log when running the converter in verbose mode. Example: {"from" : "stringA", "to" : "stringB", "debug_tag" : "RULE001"}

Example:

  • Source: SELECT V_TOTAL = COUNT(*) FROM orders;

  • Rule:

    {"from" : "\\bSELECT\\s+(V_\\w+)\\s*\\=(.*)\\;", "to" : "SET $1 = (SELECT $2 limit 1);"}

  • Result: SET V_TOTAL = (SELECT COUNT(*) FROM orders limit 1);
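
A further hedged sketch: block-level rules are convenient for stripping clauses that often span multiple lines. The rule below, which removes a Teradata-style UNIQUE PRIMARY INDEX clause from table DDL, is illustrative only and not part of the shipped rules:

"block_subst": [
  { "from": "\\bUNIQUE\\s+PRIMARY\\s+INDEX\\s*\\([^)]*\\)", "to": "", "statement_categories": ["TABLE_DDL"], "debug_tag": "DROP_UPI" }
]

Because block_subst operates on the whole statement block, the pattern still matches when the column list inside the parentheses wraps onto several lines.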

function_subst

Points to an array of instructions responsible for altering function calls. This section is used when function translations are required, and/or function arguments (function signature) have to be altered. Each element of the array is a structure with the following elements.

  • from: Specifies the name of the function in the source code. Example: "ISNULL"
  • to: Specifies the name of the function to change to. Use this if the function name changes but the signature does not (arguments stay as they are). Example: "COALESCE"
  • output_template: A template for constructing the output function. Can use $1, $2, %ALL_ARGS%, etc. This should be used when the to attribute cannot be used. Example: "CAST($2 AS $1)"
  • statement_categories: Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. Example: ["SELECT", "UPDATE"]
  • placement: Placement of the converted clause. Allowed values: append_inside_ddl, append_after_ddl. Example: "append_inside_ddl"
  • extension_call: Invokes an external routine to generate the new function string. Useful for complex cases. Example: "custom_subst_routine"
  • num_args: Only applies the rule when the number of arguments matches. Example: 2
  • date_format_arg: Specifies the argument position containing a date part keyword (e.g., "MM", "YYYY"). Works with datepart_translations. Example: 2
  • arg_pattern: Hash where keys are argument positions (1-based) and values are regex patterns they must match. Example: { 1: "^\d+$" }
  • upcase_args: List of argument positions to convert to uppercase. Example: [1, 2]
  • lowcase_args: List of argument positions to convert to lowercase. Example: [3]
  • skip_files: List of filenames to skip for this rule. This is the basename, as opposed to the full path. Example: ["legacy_job.sql"]
  • relative_fragment_pattern: Restricts rule application to when this pattern is found in a nearby fragment. Example: "ACTIVITYCOUNT = 0"
  • relative_fragment_offset: Comma-separated offsets (relative to the current line) to search for relative_fragment_pattern. Example: "1,2"
  • arg_placement: Hash to remap or reposition arguments; supports fallbacks like 2||default. NOTE: use output_template instead of this. Example: { 1: "2", 2: "1||NULL" }
  • full_subst: Full substitution template using placeholders like __ARG1__, __ARG2__. Overrides normal argument logic. Example: "IFNULL(__ARG1__, __ARG2__)"
  • arg_token_output: Specifies token positions to output when using token-based splitting. Example: "1,3"
  • split_string: Delimiter to split arguments when using arg_token_output. Example: ";"
  • new_arg_separator: Overrides the default comma separator when joining arguments in the output.
  • each_arg_routine: A Perl routine to apply to each argument (e.g., transformation, cleaning). Example: "uc"
  • ending: A string to append after the function call (e.g., for syntax closure). Example: ";"

Special Keywords:

__BLANK__ - blanks out the entire call including the function name.

Example (removing INDEX clause):

{"from": "INDEX", "to": "__BLANK__", "statement_categories" : ["TABLE_DDL","TABLE_DDL_LIKE","TABLE_DDL_AS_SELECT"]},

Note: In the output_template attribute, you can use tokens $1 to $9. These tokens represent the direct arguments of the function call being processed, which could include expressions or other nested function calls.

Example when processing function SUBSTR:

SELECT SUBSTR(
  UPPER( first_name || ' ' || last_name ), -- arg $1
  10,                                      -- arg $2
  20                                       -- arg $3
)

__ELIMINATE_CALL__ - gets rid of the function name and surrounding parenthesis, leaving only the arguments in place.

Example (removing TRANSLATE call and leaving args):

{"from": "TRANSLATE", "to" : "__ELIMINATE_CALL__"}, //get rid of the function call with the parens, but leave the inner part
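
num_args makes it possible to register several rules for the same source function and dispatch on the call signature. A hedged sketch (the TO_CHAR mappings are illustrative, not necessarily how the shipped configurations handle this function):

"function_subst": [
  { "from": "TO_CHAR", "output_template": "CAST($1 AS STRING)", "num_args": 1 },
  { "from": "TO_CHAR", "output_template": "DATE_FORMAT($1, $2)", "num_args": 2 }
]

A single-argument TO_CHAR(amount) call is rewritten as a cast, while the two-argument form is routed to DATE_FORMAT.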

Extended Converter Rules

stmt_categorization_patterns

As the converter processes statements, in some cases it needs to be aware of the type of statement being processed, so it can include or exclude certain rules or dispatch the processing to a custom routine.

stmt_categorization_patterns is an array of entries that associate coding patterns with a category.

Example:

"stmt_categorization_patterns": [
{"category": "TABLE_DDL_AS_SELECT", "patterns" : ["CREATE(.*?)TABLE(.*?)AS\s*(.*SELECT", "CREATE(.*?)TABLE(.*?)AS\s*SELECT"]},
{"category": "TABLE_DDL_LIKE", "patterns" : ["CREATE(.*?)TABLE(.*?)AS(.*?)WITH\s+NO\s+DATA", "CREATE(.*?)TABLE(.*?)LIKE(.*)"]},
{"category": "TABLE_DDL", "patterns" : ["CREATE(.*?)\sTABLE"]},
{"category": "TABLE_DROP", "patterns" : ["DROP(.*?)\sTABLE"]},
{"category": "VIEW_DDL", "patterns" : ["CREATE(.*?)VIEW", "REPLACE(.*?)VIEW"]}
]

Each category can support multiple patterns. stmt_categorization_patterns can be repeated and extended in inherited files.

This tag is currently provided in the base JSON file general_sql_specs.json.
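
Because the section can be extended in inheriting files, a custom configuration can introduce its own categories and reference them from substitution rules. A hedged sketch (the category name and the rule are illustrative):

"stmt_categorization_patterns": [
  {"category": "COLLECT_STATS", "patterns" : ["COLLECT\\s+STATISTICS"]}
],
"block_subst": [
  {"from": "\\bCOLLECT\\s+STATISTICS\\b(.*)", "to": "-- removed: COLLECT STATISTICS$1", "statement_categories": ["COLLECT_STATS"]}
]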

datepart_translations

Contains specifications on how to translate date part formatting specifications. Sample specifications:

"datepart_translations" : {
"YYYY" : "yyyy",
"mm" : "MM",
"DD" : "dd",
"hh24" : "hh",
"HH" : "hh",
"mi" : "mm",
"MI" : "mm",
"FF" : "SSSS",
"SS" : "ss",
"AM" : "a"
}

Note: this is a case-sensitive instruction. Lengthier patterns are processed first (e.g. YYYY is handled before DD). This section is meant to be used together with the function_subst modifier "date_format_arg".
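
As a hedged sketch of how the two pieces interact, a function_subst rule with date_format_arg can rely on the table above to rewrite the format string (the TO_CHAR mapping is illustrative, not necessarily how the shipped configurations handle it):

"function_subst": [
  { "from": "TO_CHAR", "output_template": "DATE_FORMAT($1, $2)", "num_args": 2, "date_format_arg": 2 }
]

With this rule, TO_CHAR(order_date, 'YYYY-MM-DD') would come out as something like DATE_FORMAT(order_date, 'yyyy-MM-dd'), with the format tokens in the second argument translated through datepart_translations.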

Advanced Conversion Rules

Besides operating with prebuilt core functionality, the converter can delegate conversion logic to externally defined subroutines. This is often needed when converting not only SQL dialects but also wrapper or flow-control elements. An example of such a requirement would be converting Netezza or Oracle procedures with conditional statements, loops, and variables to Snowflake's JavaScript procedures. The language supported by the callback subroutines is basic-level Perl, making it easy to perform string operations and to assemble and disassemble the code to be converted. The external subroutines are defined in a file or set of files outside of the converter and can be registered in the converter configuration file using this instruction:

"CUSTOM_CONVERTER_MODULES" : ["my_handlers.pl", "globals.pl"]

The callback mechanism uses the "fragment_handling" instruction in the configuration file, which tells the converter how to dispatch subroutine calls based on the category of a statement:

"fragment_handling": {
"PROGRAM_DECLARATION": "::create_procedure_from_oracle",
"CREATE_PROCEDURE": "::create_procedure_from_oracle",
"END_PROCEDURE": "::end_procedure",
"COMMENT": "::convert_comment",
"VAR_ASSIGNMENT": "::convert_assignment",
"EXECUTE_INTO": "::execute_into",
"READ_DML_INTO_VAR": "::convert_assignment",
"WRITE_DML": "::convert_dml",
"UTIL_CALL": "::convert_dml",
"TABLE_DDL": "::convert_dml",
"DEFAULT_HANDLER": "::oracle_default_statement_handler"
}

The names of the subroutines should be prefixed with two colons. This indicates to the processor that the subroutines live in the main namespace, as opposed to inside classes.

Note that you can extend the section stmt_categorization_patterns anywhere in the inherited files to support custom fragment categories.

The sections below list additional directives that are responsible for interacting with callback subs.

Hook and Extension Configuration

initialize_hooks_call

Calls the designated subroutine and passes two entries: the configuration structure and an instance of the converter class.

"initialize_hooks_call": "::init_hooks"

This will invoke the init_hooks subroutine and pass:

{
  CONFIG => $config_entries_pointer,
  CONVERTER => $converter_class_instance
}

Sample implementation:

sub init_hooks {
    my $param = shift;
    %CFG = %{$param->{CONFIG}};
    $CONVERTER = $param->{CONVERTER};
    print "INIT_HOOKS Called. config:\n" . Dumper(%CFG);
}

prescan_and_collect_info_hook

Used to pre-scan the input file before executing individual code fragment handling routines. This can be useful for extracting procedure parameters or other metadata from the file.

"prescan_and_collect_info_hook": "::prescan_code_oracle"

Sample implementation:

sub prescan_code_oracle {
    my $filename = shift;
    my $cf = shift;
    print "******** prescan_code_oracle $filename *********\n";
    # Open and analyze the file...
}

Fragment Handling Custom Subroutines

Each routine defined under the fragment_handling directive receives a pointer to an array of code lines relevant to a specific statement category.

Example:

For the statement:

UPDATE DIM_CUST
SET CUST_FULL_NAME = FIRSTNAME || ' ' || LASTNAME

Assuming it's categorized as WRITE_DML, the handler could be:

sub convert_dml {
    my $ar = shift;             # pointer to array of code lines
    my $sql = join("\n", @$ar); # full SQL block
    # Custom logic here
}

The SQL Converter comes with working extension samples that can serve as templates for writing your own.


pre_finalization_handler

Specifies a subroutine to run after all fragment-handling routines are complete.

"pre_finalization_handler": "::finalize_content"

post_conversion_adjustment_hook

Specifies a subroutine to run after pre_finalization_handler completes.

"post_conversion_adjustment_hook": "::post_conversion_adjustment"

preprocess_file

Enables pre-processing of the input file by triggering the routine defined in preprocess_routine.

"preprocess_file": "1"

preprocess_routine

Specifies the subroutine to be used when preprocess_file is enabled.

"preprocess_routine": "::mssql_preprocess"

ETL Configuration Files

ETL configuration files are typically more complex than plain SQL configuration files due to the nature of what they must control and generate.

These files may contain:

  • Instructions for output code generation, including support for multiple output languages such as Spark SQL or PySpark
  • Rules for styling the output, such as:
    • Including or omitting header comments
    • Replicating the ETL job's original folder or directory structure
  • Guidance for handling external sources and targets, such as flat files or external tables
  • Embedded logic that instructs the converter how to assemble the generated code

In addition to these structural and behavioral controls, ETL configuration files often contain pointers to additional JSON files that define how to process:

  • ETL expressions (e.g., transformation functions or variable assignments)
  • Embedded SQL inside the ETL components, such as:
    • SQL within a SELECT statement of a source component
    • pre-SQL or post-SQL snippets executed before or after data movement

These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to fragment-level transformations, function handling, and data manipulation tasks commonly found in visual ETL platforms such as Informatica PowerCenter or IBM DataStage.

ETL configuration thus serves as the orchestration layer that combines rule-based transformation with output formatting, system integration, and extensibility.

ETL Configuration Tags

  • code_generation_module: Specifies which module to use for code generation. Sample: "CodeGeneration::SQL" or "CodeGeneration::PySpark"
  • target_file_extension: Sets the extension of the generated code file. Sample: "py"
  • use_notebook_md: Indicates whether Databricks notebook markdown should be used. Sample: 1
  • script_header: Adds a code block at the start of the generated script, often for imports or metadata. Sample: # Databricks notebook source\nfrom datetime import datetime
  • script_footer: Adds a code block at the end of the generated script. Sample: quit()
  • rowid_expression: Specifies the expression used to compute a row ID. This is needed for InfaPC because of the way InfaPC links nodes in a mapping. Sample: xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME%
  • rowid_column_name: Name of the column containing the row ID. Sample: source_record_id
  • dataset_creation_method: Indicates whether datasets are created as CTEs or tables. "TABLE" is typically used for lift and shift, but "CTE" can be used for custom dbt outputs. Sample: TABLE or CTE
  • table_creation_statement: Template for creating a temporary table from a SQL block. Sample: %TABLE_NAME% = spark.sql(rf"""%INNER_SQL%"""%FORMAT_SPEC%)\n%TABLE_NAME%.createOrReplaceTempView("%TABLE_NAME%")
  • ddl_statement_wrap: Wraps DDL statements in a Spark SQL invocation. Sample: spark.sql(f"""%INNER_SQL%"""%FORMAT_SPEC%).display()
  • etl_converter_config_file: Points to a secondary config file for ETL expression conversion. Sample: infa2databricks.json
  • commands: Section describing how to generate the various read and write statements for different system types.
  • system_type_class: System type classifications.
  • conform_source_columns: Instructs the writer to generate a column-conforming statement for sources.
  • conform_columns_call_template: Template for conforming the column list. Sample: %DF%_conformed_cols = [%COLUMN_LIST%]\n%DF% = DatabricksConversionSupplements.conform_df_columns(%DF%, %DF%_conformed_cols)
  • mapplet_class_name: Class name used for mapplet functions. Sample: Mapplets
  • mapplet_function_name: Function name format for mapplets. Sample: %MAPPLET_NAME%
  • mapplet_code_indent: General code indentation for mapplets. Sample: (8 spaces)
  • mapplet_pyspark_code_indent: Indentation for multiline SQL inside mapplets. Sample: (4 spaces)
  • mapplet_header_template: Path to the file containing the header template for the mapplet. Sample: python_mapplet_header_template.py
  • mapplet_input_declaration: Format string to define the Python function declaration for the mapplet. Sample: \n def %MAPPLET_NAME%(%INPUT%):
  • mapplet_conclusion: Code snippet added to conclude the mapplet implementation. Sample: #Implementation %MAPPLET_NAME% concluded\n\n
  • mapplet_object_var_inject_format: Format for injecting an object variable, typically for wrapping dynamic names in code. Sample: """ + %OBJECT_NAME% + """
  • mapplet_function_invocation: Format used to invoke the mapplet function. Sample: Mapplets.%MAPPLET_NAME%(%INPUT%)
  • mapplet_instance_prefixes: List of instance name prefixes to help identify and generate mapplet connection info. Sample: ["sc_"]
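
A hedged sketch of how a few of these tags might sit together in a custom ETL configuration (the values are taken from the samples above; the placeholder path is not a real location):

{
  "inherit_from": ["<path_to_Configs>/base_infapc2databricks_pyspark.json"],
  "code_generation_module": "CodeGeneration::PySpark",
  "target_file_extension": "py",
  "use_notebook_md": 1,
  "script_header": "# Databricks notebook source\nfrom datetime import datetime",
  "dataset_creation_method": "TABLE",
  "etl_converter_config_file": "infa2databricks.json"
}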

system_type_class section

This section represents a mapping of a system type to a class name, which will later be used in the commands section:

"system_type_class" : {
"ORACLE" : "RELATIONAL",
"MySQL" : "RELATIONAL",
"HIVE" : "RELATIONAL",
"DB2" : "RELATIONAL",
"TERADATA" : "RELATIONAL",
"REDSHIFT" : "RELATIONAL",
"Salesforce" : "SALEFORCE",
"TOOLKIT" : "RELATIONAL",
"FLATFILE" : "FILE_DELIMITED",
"FLAT FILE":"FILE",
"FLAT_FILE":"FILE",
"FILE WRITER":"FILE",
"DEFAULT" : "FILE_DELIMITED"
}

commands section

This section holds templates for the read and write instructions of each system class:

"commands" : {
"READER_FILE_DELIMITED": "spark.read.format('csv').option('header', 'true').load(rf'''%PATH%''')",
"READER_FILE_DELIMITED_EXTERNAL": "%NODE_NAME%_External = spark.read.format('csv').option('header', 'true').load(%PATH%)",
"READER_FILE_FIXED_WIDTH" : "raw_%NODE_NAME% = spark.read.text(f\"%PATH%\")\n%NODE_NAME% = raw_%NODE_NAME%.select(%SUBSTRING_SPEC%)",
"READER_RELATIONAL": "%NODE_NAME% = %SQL%\n%NODE_NAME% = spark.sql(%NODE_NAME%)",
"WRITER_FILE_DELIMITED": "%DF%.write.format('csv').option('header','%HEADER%').mode('overwrite').option('sep','%DELIMITER%').csv('%PATH%')",
"WRITER_RELATIONAL": "my_end_point.write_to_db(%DF%, \"%TABLE_NAME%\", username=\"%LOGIN%\", password=\"%PASSWORD%\")"
},

Substitution Tokens

Templates above use the following placeholders that get substituted at conversion time:

  • %DF% - name of the dataframe
  • %SQL% - SQL content, such as the SELECT statement
  • %PATH% - Path of the file being processed
  • %SUBSTRING_SPEC% - Specifications generated by the converter for splitting a positional (fixed-width) string into columns
  • %HEADER% - Header specification
  • %LOGIN% - System login
  • %PASSWORD% - System password

Mapplet, Joblet and Shared Containers Handling

In ETL systems, it is common to encapsulate reusable logic into modular components such as mapplets, joblets, or shared containers. To preserve this reusability during conversion, the converter attempts to replicate the structure and behavior of these components in the generated code.

Configuration tags prefixed with mapplet define how this reusable logic should be represented and rendered in the output.

The converter will generate a single class, specified by the JSON tag mapplet_class_name, in which all reusable code is consolidated. This class serves as the container for all mapplet-level function definitions and logic.
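
For reference, the mapplet-related tags from the table above could be combined along these lines (an illustrative sketch, not the shipped configuration):

"mapplet_class_name": "Mapplets",
"mapplet_function_name": "%MAPPLET_NAME%",
"mapplet_header_template": "python_mapplet_header_template.py",
"mapplet_input_declaration": "\n def %MAPPLET_NAME%(%INPUT%):",
"mapplet_conclusion": "#Implementation %MAPPLET_NAME% concluded\n\n",
"mapplet_function_invocation": "Mapplets.%MAPPLET_NAME%(%INPUT%)",
"mapplet_instance_prefixes": ["sc_"]

With this setup, each reusable component is emitted as a function inside the Mapplets class, and the generated jobs invoke it through the mapplet_function_invocation template.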