Extending BladeBridge Configurations
Overview
The BladeBridge transpiler relies heavily on rules defined inside configuration files provided with the converter. These configurations consist of a set of layered JSON files and code templates that drive the generation of output files and the application of conversion rules.
Similar configuration concepts apply across all BladeBridge conversion paths, although the structure of SQL-to-SQL configuration files differs somewhat from that of ETL-to-PySpark/SparkSQL/DBSQL files, since ETL conversions typically deal with both the ETL logic translations and the translation of embedded SQL statements (sourcing data, pre/post/inline SQL statements).
In some migration projects, users may want to augment/override the conversion rules provided with the BladeBridge converter. For this reason, engineers should know how to:
- Extend the converter logic
- Provide your own conversion rules
- Custom-control the output
- Troubleshoot issues
This page provides the information necessary to do just that.
Supplying a Custom Configuration File
When running the BladeBridge converter from Lakebridge, a custom configuration file can be supplied to the converter.
To register a custom configuration file for transpilation, execute the install-transpile command and, at one of the prompts, specify the path to the custom configuration file:
databricks labs lakebridge install-transpile
Do you want to override the existing installation? (default: no): yes
Specify the config file to override the default[Bladebridge] config - press <enter> for none (default: <none>):
<local_full_path>/custom_<source>2databricks.json
Please read the subsequent sections of this document to learn how to create and extend BladeBridge configurations.
Creating the custom configuration file
If you want to start from scratch and use only your new custom file, rather than the provided configurations, create an empty .json file and specify the required settings following the guidance in the sections below.
If instead you want to augment and/or override the existing configurations:
- Take note of where the provided configurations are located:
<user_home_directory>/.databricks/labs/remorph-transpilers/bladebridge/lib/.venv/lib/python3.10/site-packages/databricks/labs/bladebridge/Converter/Configs
- In your new configuration file, specify that the file inherits from the supplied configuration. This enables layered rule definitions and promotes reuse and consistency across configurations.
inherit_from is an array of JSON filenames that the current file inherits from. Multiple inherited files are allowed. If the full path is supplied with forward slashes, the converter will try to read the file at that path. Otherwise, the converter will look for the file in the same folder as the current JSON file.
Example:
"inherit_from":["/Users/user.name/.databricks/labs/remorph-transpilers/bladebridge/lib/.venv/lib/python3.10/site-packages/databricks/labs/bladebridge/Converter/Configs/base_oracle2databricks_sql.json"]
Names of the various base files per source you might want to inherit from (a complete skeleton example follows this list):
- INFAPC:
  - Target SPARKSQL: "base_infapc2databricks_sparksql.json"
  - Target PYSPARK: "base_infapc2databricks_pyspark.json"
- DATASTAGE:
  - Target SPARKSQL: "base_datastage2databricks_sparksql.json"
  - Target PYSPARK: "base_datastage2databricks_pyspark.json"
- SYNAPSE: "base_synapse2databricks_sql.json"
- ORACLE: "base_oracle2databricks_sql.json"
- MSSQL: "base_sqlserver2databricks_sql.json"
- NETEZZA: "base_netezza2databricks_sql.json"
- TERADATA: "base_teradata2databricks_sql.json"
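For illustration, here is a minimal sketch of a custom configuration file that inherits from the Oracle base configuration and adds a single override rule. The schema names in the rule are placeholders, and the inherited path should match your local installation:
{
  "inherit_from": ["/Users/user.name/.databricks/labs/remorph-transpilers/bladebridge/lib/.venv/lib/python3.10/site-packages/databricks/labs/bladebridge/Converter/Configs/base_oracle2databricks_sql.json"],
  "line_subst": [
    { "from": "\\bLEGACY_SCHEMA\\.", "to": "main.migrated_schema." }
  ]
}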
Basic Converter Rules
When it comes to converting individual SQL code snippets or ETL expressions, BladeBridge uses a basic set of rules to capture certain source patterns and produce converted output. The three main types of syntax manipulation rules are:
- line_subst
- block_subst
- function_subst
They are executed in the order given above. Within each of these sections, there is an array of rules that are executed in the order they are listed. Longer and more specific patterns should typically precede shorter and more generic ones. For example, "from" : "varchar" should be listed before "from" : "char", since varchar is a longer pattern than char.
Sample configuration file snippet:
{
  "line_subst": [
    { "from": "\\bvarchar\\b", "to": "string" },
    { "from": "\\bSYSDATE\\b", "to": "CURRENT_TIMESTAMP()" }
  ],
  "block_subst": [
    { "from": "\\bSET\\s+\\w+\\s+ON\\b", "to": "" },
    { "from": "\\bCREATE\\s+VIEW\\b", "to": "CREATE OR REPLACE VIEW" }
  ],
  "function_subst": [
    { "from": "CONVERT", "output_template": "CAST($2 AS $1)", "num_args": 2 }, // here, $1 and $2 refer to the 1st and 2nd arguments of the function call `CONVERT`
    { "from": "ISNULL", "to": "COALESCE" }
  ]
}
line_subst
Points to an array of substitution instructions to be performed on a single line. When this directive is given, the converter will apply all substitutions for each line. Each element of the array is a structure with the following elements.
Attribute | Purpose | Example |
---|---|---|
from | Specifies the pattern to capture. Use parentheses to capture tokens and substitute them. | "CREATE TABLE\s+(\w+)" In this case (\w+) represents the 1st token |
to | Specifies the pattern to replace with. Dollar variables from $1 to $9 are used to replace captured tokens. | "CREATE OR REPLACE TABLE $1" Will plug in the table name from the example above |
statement_categories | Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. | ["TABLE_DDL", "VIEW_DDL"] |
exact_match | If set to "1", performs case-insensitive exact match instead of regex matching. | |
first_match | If set to "1" and a match occurs, stops checking for subsequent patterns in the line_subst array. | |
exclude_categories | List of excluded categories. Reverse of statement_categories . | |
extension_call | Invokes an external routine instead of using the to pattern. Useful for complex logic. See the advanced section for details. | |
relative_fragment_pattern | Allows the pattern to be searched within specific code fragments supplied by relative_fragment_offset. | "relative_fragment_pattern": "ACTIVITYCOUNT = 0" |
relative_fragment_offset | List of fragment offsets to search in when using relative_fragment_pattern. | "relative_fragment_offset": "1,2" |
upcase_string | Upcases the output string | {"from" : "#(\w+)#", "to" : "${$1}", "upcase_string" : true} |
Note: in the to attribute, you can use tokens $1 to $9 to refer to regex match groups.
Example:
- Source: #my_var# + 10 + p_curr_date_of_month
- Rule: {"from": "#(\\w+)#", "to": "${$1}", "upcase_string": true}
- Result: ${MY_VAR} + 10 + p_curr_date_of_month
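As an additional illustrative sketch (assuming an Oracle source), a rule can be narrowed to specific statement categories and marked as terminal with first_match:
{ "from": "\\bVARCHAR2\\b", "to": "STRING", "statement_categories": ["TABLE_DDL", "VIEW_DDL"], "first_match": "1" }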
block_subst
Points to an array of substitution instructions to be performed on a statement block. When this directive is given, the converter will apply all substitutions for each statement block. This instruction is useful when there is a need to restructure a block of code (e.g. move around clauses of a statement), or when there is a high likelihood that a pattern will span multiple lines. Note that a block of code typically refers to an entire SQL statement, such as a DML statement inside a stored procedure, or a view or table definition. Prior to processing SQL content, the converter splits multi-statement content into a series of blocks and operates on each block of code individually.
Note: while block_subst rules operate in a very similar manner to line_subst rules, they are costlier in terms of compute time. For this reason, it is better not to overload the block_subst section with rules that are simple and typically defined on a single line. E.g.: converting varchar to string should be done in line_subst, and not in block_subst.
Each element of the rules array is a structure with the following elements.
Attribute | Purpose | Example |
---|---|---|
from | Specifies the pattern to capture. Use parentheses to capture tokens and substitute them. | "CREATE TABLE\s+(\w+)" In this case (\w+) represents the 1st token |
to | Specifies the pattern to replace with. Dollar variables from $1 to $9 can be used to replace captured tokens. | "CREATE OR REPLACE TABLE $1" Will plug in the name of the table from the example |
statement_categories | Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. | ["TABLE_DDL", "VIEW_DDL"] |
first_match | If set to "1" and a match occurs, stops checking for subsequent patterns in the block_subst array. | |
extension_call | Invokes an external custom routine instead of substituting with the target pattern. Useful for complex logic. See the advanced section for more details. | |
force_alias_usage | Enforces usage of aliases in WHERE, SELECT, and JOIN clauses. Ensures references to base objects are replaced with aliases where required by the target database. | |
relative_fragment_pattern | Searches for the pattern within specific code fragments defined using relative_fragment_offset . | "relative_fragment_pattern": "ACTIVITYCOUNT = 0" |
relative_fragment_offset | A list of fragment offsets to search when using relative_fragment_pattern . | "relative_fragment_offset": "1,2" |
debug_tag | Specifies a debug tag, which will show in the log when running the converter in verbose mode. | {"from" : "stringA", "to" : "stringB", "debug_tag" : "RULE001"} |
Example:
- Source: SELECT V_TOTAL = COUNT(*) FROM orders;
- Rule: {"from" : "\\bSELECT\\s+(V_\\w+)\\s*=(.*);", "to" : "SET $1 = (SELECT $2 limit 1);"}
- Result: SET V_TOTAL = (SELECT COUNT(*) FROM orders limit 1);
function_subst
Points to an array of instructions responsible for altering function calls. This section is used when function translations are required, and/or function arguments (function signature) have to be altered. Each element of the array is a structure with the following elements.
Attribute | Purpose | Example |
---|---|---|
from | Specifies the name of the function in the source code. | "ISNULL" |
to | Specifies the name of the function to change to. Use this if function name changes but the signature does not (args stay as they are) | "COALESCE" |
output_template | A template for constructing the output function. Can use $1 , $2 , %ALL_ARGS% , etc. This should be used when the "to" attribute cannot be used. | "CAST($2 AS $1)" |
statement_categories | Array of statement categories the rule is applicable to. If omitted, the rule applies to all matching patterns. | ["SELECT", "UPDATE"] |
placement | Placement of the converted clause. Allowed values: append_inside_ddl , append_after_ddl . | "append_inside_ddl" |
extension_call | Invokes an external routine to generate the new function string. Useful for complex cases. | "custom_subst_routine" |
num_args | Only applies the rule when the number of arguments matches. | 2 |
date_format_arg | Specifies the argument position containing a date part keyword (e.g., "MM", "YYYY"). Works with datepart_translations . | 2 |
arg_pattern | Hash where keys are argument positions (1-based) and values are regex patterns they must match. | { 1: "^\d+$" } |
upcase_args | List of argument positions to convert to uppercase. | [1, 2] |
lowcase_args | List of argument positions to convert to lowercase. | [3] |
skip_files | List of filenames to skip for this rule. This is the basename, as opposed to the full path | ["legacy_job.sql"] |
relative_fragment_pattern | Restricts rule application to when this pattern is found in a nearby fragment. | "ACTIVITYCOUNT = 0" |
relative_fragment_offset | Comma-separated offsets (relative to the current line) to search for relative_fragment_pattern . | "1,2" |
arg_placement | Hash to remap or reposition arguments; supports fallbacks like 2||default . NOTE: Use output_template instead of this | { 1: "2", 2: "1||NULL" } |
full_subst | Full substitution template using placeholders like __ARG1__ , __ARG2__ . Overrides normal argument logic. | "IFNULL(__ARG1__, __ARG2__)" |
arg_token_output | Specifies token positions to output when using token-based splitting. | "1,3" |
split_string | Delimiter to split arguments when using arg_token_output . | ";" |
new_arg_separator | Overrides the default , separator when joining arguments in the output. | "\|" |
each_arg_routine | A Perl routine to apply to each argument (e.g., transformation, cleaning). | "uc" |
ending | A string to append after the function call (e.g., for syntax closure). | ";" |
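As a hedged illustration of combining these attributes (the TRUNC mapping shown assumes the two-argument, date-truncation form of the call), output_template can reorder arguments while num_args restricts the rule to calls with a matching argument count:
{ "from": "TRUNC", "output_template": "DATE_TRUNC($2, $1)", "num_args": 2 }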
Special Keywords:
__BLANK__ - blanks out the entire call including the function name.
Example (removing INDEX clause):
{"from": "INDEX", "to": "__BLANK__", "statement_categories" : ["TABLE_DDL","TABLE_DDL_LIKE","TABLE_DDL_AS_SELECT"]},
Note: in the output_template attribute, you can use tokens $1 to $9 to refer to regex match groups. These tokens represent the direct arguments of the function call being processed, which could include expressions or other nested function calls.
Example when processing function SUBSTR:
SELECT SUBSTR(
UPPER( first_name || ' ' || last_name ), -- arg $1
10, -- arg $2
20 -- arg $3
)
__ELIMINATE_CALL__ - gets rid of the function name and surrounding parenthesis, leaving only the arguments in place.
Example (removing TRANSLATE call and leaving args):
{"from": "TRANSLATE", "to" : "__ELIMINATE_CALL__"}, //get rid of the function call with the parens, but leave the inner part
Extended Converter Rules
stmt_categorization_patterns
As the converter processes statements, in some cases it needs to be aware of the type of the statement being processed, so that it can include or exclude certain rules or dispatch the processing to a custom routine.
stmt_categorization_patterns is an array of entries that associate a coding pattern with a category.
Example:
"stmt_categorization_patterns": [
{"category": "TABLE_DDL_AS_SELECT", "patterns" : ["CREATE(.*?)TABLE(.*?)AS\s*(.*SELECT", "CREATE(.*?)TABLE(.*?)AS\s*SELECT"]},
{"category": "TABLE_DDL_LIKE", "patterns" : ["CREATE(.*?)TABLE(.*?)AS(.*?)WITH\s+NO\s+DATA", "CREATE(.*?)TABLE(.*?)LIKE(.*)"]},
{"category": "TABLE_DDL", "patterns" : ["CREATE(.*?)\sTABLE"]},
{"category": "TABLE_DROP", "patterns" : ["DROP(.*?)\sTABLE"]},
{"category": "VIEW_DDL", "patterns" : ["CREATE(.*?)VIEW", "REPLACE(.*?)VIEW"]}
]
Each category can support multiple patterns. stmt_categorization_patterns can be repeated and extended in inherited files.
This tag is currently provided in the base JSON file general_sql_specs.json.
datepart_translations
Contains specifications on how to translate date part formatting specifications. Sample specifications:
"datepart_translations" : {
"YYYY" : "yyyy",
"mm" : "MM",
"DD" : "dd",
"hh24" : "hh",
"HH" : "hh",
"mi" : "mm",
"MI" : "mm",
"FF" : "SSSS",
"SS" : "ss",
"AM" : "a"
}
Note: this is a case-sensitive instruction. Lengthier patterns are processed first (YYYY goes before DD). This section is to be used with the function_subst modifier date_format_arg.
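For example (a sketch assuming an Oracle-style TO_CHAR call), a function_subst rule can point date_format_arg at the format argument so that the datepart_translations above are applied to it:
{ "from": "TO_CHAR", "output_template": "DATE_FORMAT($1, $2)", "num_args": 2, "date_format_arg": 2 }
With this rule, TO_CHAR(order_date, 'YYYY-mm-DD') would be rewritten as DATE_FORMAT(order_date, 'yyyy-MM-dd'), assuming the translations shown above are in effect.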
Advanced Conversion Rules
Besides operating with prebuilt core functionality, the converter can delegate conversion logic to externally defined subroutines. This is often needed when converting not only SQL dialects, but also wrapper or flow control elements. An example of such requirement would be converting Netezza or Oracle procedures with conditional statements, loops and variables to Snowflake’s Javascript procedures. The language supported by the callback subroutines is basic-level Perl, making it easy to perform string operations and assemble and disassemble the code to be converted. The external subroutines are defined in a file or sets of files outside of the converter and can be registered in the converter configuration file using this instruction:
"CUSTOM_CONVERTER_MODULES" : ["my_handlers.pl", "globals.pl"]
The callback mechanism uses the instruction fragment_handling in the configuration file, which tells the converter how to dispatch the subroutine calls based on the category of a statement:
"fragment_handling": {
"PROGRAM_DECLARATION": "::create_procedure_from_oracle",
"CREATE_PROCEDURE": "::create_procedure_from_oracle",
"END_PROCEDURE": "::end_procedure",
"COMMENT": "::convert_comment",
"VAR_ASSIGNMENT": "::convert_assignment",
"EXECUTE_INTO": "::execute_into",
"READ_DML_INTO_VAR": "::convert_assignment",
"WRITE_DML": "::convert_dml",
"UTIL_CALL": "::convert_dml",
"TABLE_DDL": "::convert_dml",
"DEFAULT_HANDLER": "::oracle_default_statement_handler"
}
The names of the subroutines should be prefixed with two colons. This indicates to the processor that the subroutines live in the main namespace, as opposed to inside classes.
Note that you can extend the section stmt_categorization_patterns anywhere in the inherited files to support custom fragment categories.
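For instance (an illustrative sketch using a hypothetical category name and handler routine), a Teradata COLLECT STATISTICS statement could be routed to a custom routine by pairing a new category with a fragment_handling entry:
"stmt_categorization_patterns": [
  {"category": "COLLECT_STATS", "patterns" : ["COLLECT\\s+STATISTICS"]}
],
"fragment_handling": {
  "COLLECT_STATS": "::convert_collect_stats"
}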
The sections below list additional directives that are responsible for interacting with callback subs.
Hook and Extension Configuration
initialize_hooks_call
Calls the designated subroutine and passes two entries: the configuration structure and an instance of the converter class.
"initialize_hooks_call": "::init_hooks"
This will invoke the init_hooks subroutine and pass:
{
CONFIG => $config_entries_pointer,
CONVERTER => $converter_class_instance
}
Sample implementation:
sub init_hooks {
my $param = shift;
%CFG = %{$param->{CONFIG}};
$CONVERTER = $param->{CONVERTER};
print "INIT_HOOKS Called. config:\n" . Dumper(%CFG);
}
prescan_and_collect_info_hook
Used to pre-scan the input file before executing individual code fragment handling routines. This can be useful for extracting procedure parameters or other metadata from the file.
"prescan_and_collect_info_hook": "::prescan_code_oracle"
Sample implementation:
sub prescan_code_oracle {
my $filename = shift;
my $cf = shift;
print "******** prescan_code_oracle $filename *********\n";
# Open and analyze the file...
}
Fragment Handling Custom Subroutines
Each routine defined under the fragment_handling directive receives a pointer to an array of code lines relevant to a specific statement category.
Example:
For the statement:
UPDATE DIM_CUST
SET CUST_FULL_NAME = FIRSTNAME || ' ' || LASTNAME
Assuming it's categorized as WRITE_DML, the handler could be:
sub convert_dml {
my $ar = shift; # pointer to array of code lines
my $sql = join("\n", @$ar); # full SQL block
# Custom logic here
}
The SQL Converter comes with working extension samples that can serve as templates for writing your own.
pre_finalization_handler
Specifies a subroutine to run after all fragment-handling routines are complete.
"pre_finalization_handler": "::finalize_content"
post_conversion_adjustment_hook
Specifies a subroutine to run after pre_finalization_handler completes.
"post_conversion_adjustment_hook": "::post_conversion_adjustment"
preprocess_file
Enables pre-processing of the input file by triggering the routine defined in preprocess_routine.
"preprocess_file": "1"
preprocess_routine
Specifies the subroutine to be used when preprocess_file is enabled.
"preprocess_routine": "::mssql_preprocess"
ETL Configuration Files
ETL configuration files are typically more complex than plain SQL configuration files due to the nature of what they must control and generate.
These files may contain:
- Instructions for output code generation, including support for multiple output languages such as Spark SQL or PySpark
- Rules for styling the output, such as:
- Including or omitting header comments
- Replicating the ETL job's original folder or directory structure
- Guidance for handling external sources and targets, such as flat files or external tables
- Embedded logic that instructs the converter how to assemble the generated code
In addition to these structural and behavioral controls, ETL configuration files often contain pointers to additional JSON files that define how to process:
- ETL expressions (e.g., transformation functions or variable assignments)
- Embedded SQL inside the ETL components, such as:
  - SQL within a SELECT statement of a source component
  - pre-SQL or post-SQL snippets executed before or after data movement
These supporting configuration files closely resemble the structure and purpose of SQL configuration files, but are scoped to fragment-level transformations, function handling, and data manipulation tasks commonly found in visual ETL platforms such as Informatica PowerCenter or IBM DataStage.
ETL configuration thus serves as the orchestration layer that combines rule-based transformation with output formatting, system integration, and extensibility.
ETL Configuration Tags
Attribute | Description | Sample |
---|---|---|
code_generation_module | Specifies which module to use for code generation. | "CodeGeneration::SQL" or "CodeGeneration::PySpark" |
target_file_extension | Sets the extension of the generated code file. | "py" |
use_notebook_md | Indicates whether Databricks notebook markdown should be used. | 1 |
script_header | Adds a code block at the start of the generated script, often for imports or metadata. | # Databricks notebook source\n from datetime import datetime |
script_footer | Adds a code block at the end of the generated script | quit() |
rowid_expression | Specifies the expression used to compute a row ID. This is needed for InfaPC because of the way InfaPC links nodes in a mapping | xxhash64(%DELIMITED_COLUMN_LIST%) as %ROWID_COL_NAME% |
rowid_column_name | Name of the column containing rowid | source_record_id |
dataset_creation_method | Indicates whether datasets are created as CTEs or tables. "TABLE" is typically used for lift and shift, but "CTE" can be used for custom dbt outputs | TABLE or CTE |
table_creation_statement | Template for creating a temporary table from a SQL block. | %TABLE_NAME% = spark.sql(rf"""%INNER_SQL%"""%FORMAT_SPEC%) %TABLE_NAME%.createOrReplaceTempView("%TABLE_NAME%") |
ddl_statement_wrap | Wraps DDL statements in a Spark SQL invocation. | spark.sql(f"""%INNER_SQL%"""%FORMAT_SPEC%).display() |
etl_converter_config_file | Points to a secondary config file for ETL expression conversion. | infa2databricks.json |
commands | Section holding templates on how to generate read and write statements for different system types. | |
system_type_class | System type classifications. | |
conform_source_columns | Instructs the writer to generate a column-conforming statement for sources. | |
conform_columns_call_template | Template for conforming the column list. | %DF%_conformed_cols = [%COLUMN_LIST%]\n%DF% = DatabricksConversionSupplements.conform_df_columns(%DF%,%DF%_conformed_cols) |
mapplet_class_name | Class name used for mapplet functions | Mapplets |
mapplet_function_name | Function name format for mapplets | %MAPPLET_NAME% |
mapplet_code_indent | General code indentation for mapplets | (8 spaces) |
mapplet_pyspark_code_indent | Indentation for multiline SQL inside mapplets | (4 spaces) |
mapplet_header_template | Path to the file containing the header template for the mapplet | python_mapplet_header_template.py |
mapplet_input_declaration | Format string to define the Python function declaration for the mapplet | \n def %MAPPLET_NAME%(%INPUT%): |
mapplet_conclusion | Code snippet added to conclude the mapplet implementation | #Implementation %MAPPLET_NAME% concluded\n\n |
mapplet_object_var_inject_format | Format for injecting object variable, typically for wrapping dynamic names in code | """ + %OBJECT_NAME% + """ |
mapplet_function_invocation | Format used to invoke the mapplet function | Mapplets.%MAPPLET_NAME%(%INPUT%) |
mapplet_instance_prefixes | List of instance name prefixes to help identify and generate mapplet connection info | ["sc_"] |
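As an illustrative sketch, a PySpark-targeting ETL configuration might combine several of the tags above (the values reuse the samples from the table):
{
  "code_generation_module": "CodeGeneration::PySpark",
  "target_file_extension": "py",
  "use_notebook_md": 1,
  "script_header": "# Databricks notebook source\nfrom datetime import datetime",
  "dataset_creation_method": "TABLE",
  "rowid_column_name": "source_record_id",
  "etl_converter_config_file": "infa2databricks.json"
}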
system_type_class section
This section represents a mapping of a system type to a class name, which is later used in the commands section:
"system_type_class" : {
"ORACLE" : "RELATIONAL",
"MySQL" : "RELATIONAL",
"HIVE" : "RELATIONAL",
"DB2" : "RELATIONAL",
"TERADATA" : "RELATIONAL",
"REDSHIFT" : "RELATIONAL",
"Salesforce" : "SALEFORCE",
"TOOLKIT" : "RELATIONAL",
"FLATFILE" : "FILE_DELIMITED",
"FLAT FILE":"FILE",
"FLAT_FILE":"FILE",
"FILE WRITER":"FILE",
"DEFAULT" : "FILE_DELIMITED"
}
commands section
This section holds templates for the read and write instructions of each system class:
"commands" : {
"READER_FILE_DELIMITED": "spark.read.format('csv').option('header', 'true').load(rf'''%PATH%''')",
"READER_FILE_DELIMITED_EXTERNAL": "%NODE_NAME%_External = spark.read.format('csv').option('header', 'true').load(%PATH%)",
"READER_FILE_FIXED_WIDTH" : "raw_%NODE_NAME% = spark.read.text(f\"%PATH%\")\n%NODE_NAME% = raw_%NODE_NAME%.select(%SUBSTRING_SPEC%)",
"READER_RELATIONAL": "%NODE_NAME% = %SQL%\n%NODE_NAME% = spark.sql(%NODE_NAME%)",
"WRITER_FILE_DELIMITED": "%DF%.write.format('csv').option('header','%HEADER%').mode('overwrite').option('sep','%DELIMITER%').csv('%PATH%')",
"WRITER_RELATIONAL": "my_end_point.write_to_db(%DF%, \"%TABLE_NAME%\", username=\"%LOGIN%\", password=\"%PASSWORD%\")"
},
Substitution Tokens
Templates above use the following placeholders that get substituted at conversion time:
- %DF% - name of the dataframe
- %SQL% - SQL content, such as the SELECT statement
- %PATH% - path of the file being processed
- %SUBSTRING_SPEC% - specifications generated by the converter for splitting up a positional string into columns
- %HEADER% - header specification
- %LOGIN% - system login
- %PASSWORD% - system password
Mapplet, Joblet and Shared Containers Handling
In ETL systems, it is common to encapsulate reusable logic into modular components such as mapplets, joblets, or shared containers. To preserve this reusability during conversion, the converter attempts to replicate the structure and behavior of these components in the generated code.
Configuration tags prefixed with mapplet define how this reusable logic should be represented and rendered in the output.
The converter will generate a single class, specified by the JSON tag mapplet_class_name, in which all reusable code will be consolidated. This class serves as the container for all mapplet-level function definitions and logic.
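As a sketch of how the mapplet-related tags from the table above fit together, a configuration fragment might contain:
"mapplet_class_name": "Mapplets",
"mapplet_function_name": "%MAPPLET_NAME%",
"mapplet_function_invocation": "Mapplets.%MAPPLET_NAME%(%INPUT%)",
"mapplet_input_declaration": "\n    def %MAPPLET_NAME%(%INPUT%):",
"mapplet_header_template": "python_mapplet_header_template.py",
"mapplet_instance_prefixes": ["sc_"]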