Customizing Switch

Customizable Prompts

You can create custom conversion prompts for Switch to handle new SQL dialects or specialized conversion requirements.

Creating Custom Conversion Prompts

To create a custom conversion prompt:

  1. Create a YAML file with the required structure
  2. Place it in your Databricks workspace
  3. Specify the full path in the conversion_prompt_yaml parameter (see the sketch after these steps)

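As a minimal sketch of steps 2 and 3, you can write the YAML file to a workspace path directly from a notebook (assuming your cluster can write workspace files through the /Workspace path); the user folder and file name below are placeholders:

# Minimal sketch: save a custom prompt YAML into the workspace from a notebook.
# The path is a placeholder; point conversion_prompt_yaml at the same location.
import os

custom_prompt_path = "/Workspace/Users/someone@example.com/switch_prompts/my_dialect.yaml"

yaml_text = """\
system_message: |
  Convert SQL code to Python code that runs on Databricks ...
few_shots:
  - role: user
    content: |
      SELECT 1;
  - role: assistant
    content: |
      display(spark.sql("SELECT 1"))
"""

os.makedirs(os.path.dirname(custom_prompt_path), exist_ok=True)
with open(custom_prompt_path, "w", encoding="utf-8") as f:
    f.write(yaml_text)
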
Custom conversion prompts are built from two main sections: a system_message and a few_shots list (optional but recommended).

Required Structure

Here's a simple example showing the basic structure of a custom conversion prompt:

system_message: |
  Convert SQL code to Python code that runs on Databricks according to the following instructions:

  # Input and Output
  - Input: A single SQL file containing one or multiple T-SQL statements
  - Output: Python code with Python comments (in {comment_lang}) explaining the code

  ${common_python_instructions_and_guidelines}

  # Additional Instructions
  1. Convert SQL queries to spark.sql() format
  2. Add clear Python comments explaining the code
  3. Use DataFrame operations instead of loops when possible
  4. Handle errors using try-except blocks

few_shots:
  - role: user
    content: |
      SELECT name, age
      FROM users
      WHERE active = 1;
  - role: assistant
    content: |
      # Get names and ages of active users
      active_users = spark.sql("""
      SELECT name, age
      FROM users
      WHERE active = 1
      """)
      display(active_users)

Key Elements

system_message Section:

  • Clear explanation of the conversion purpose
  • Definition of input and output formats
  • Additional instructions for specific conversions
  • (Optional) Comment language specification ({comment_lang} will be replaced automatically by Switch)
  • (Optional) Common instructions placeholder (${common_python_instructions_and_guidelines} will be replaced automatically by Switch with built-in conversion guidelines)

few_shots Section (Optional but recommended):

  • Include examples ranging from simple to complex cases
  • Each example demonstrates a specific pattern for the LLM to learn from
  • Examples show typical conversion patterns for your SQL dialect

Reference: Built-in YAML Files

Switch includes built-in YAML configuration files for supported conversion types, including SQL dialects and other source formats. When creating custom prompts, these built-in configurations serve as excellent starting points. Even for supported conversion types, customizing the default prompts based on your specific input patterns can significantly improve conversion accuracy.

Location: You can find built-in YAML files in the /Workspace/Users/{user}/.lakebridge/switch/resources/builtin_prompts/ directory within the Switch installation. These files demonstrate the proper structure and provide conversion-specific examples that you can adapt for your custom requirements.


LLM Configuration

Supported LLM Endpoints

Switch offers flexible model selection: you can choose any Model Serving endpoint through configuration parameters.

Default Setup: Switch comes pre-configured with the latest Claude model via the Foundation Model API for immediate use (no configuration required). This ensures strong code comprehension and large context handling out of the box.

Model Options: Switch works with Foundation Model API endpoints that have large context windows and strong code comprehension capabilities:

  • Pay-per-token pricing: For light to moderate usage
  • Provisioned Throughput: For high-volume workloads with predictable cost and guaranteed performance

Advanced Configuration: For complex code transformations and intricate business logic, Claude's extended thinking mode can significantly improve conversion accuracy. This mode allows the model to reason through complex transformations more thoroughly, though it increases processing time and token usage. Configure it via request_params:

{"max_tokens": 64000, "thinking": {"type": "enabled", "budget_tokens": 16000}}

Token Management

LLMs have limits on how much text they can process at once. Switch uses a configurable threshold approach to ensure stable processing.

Token Counting and Thresholds

Switch sets a default threshold of 20,000 tokens per file. Token estimation varies by model: Claude models use approximately 3.4 characters per token (meaning 20,000 tokens ≈ 68,000 characters), while other models use the tiktoken library with o200k_base encoding. For SQL sources, Switch removes comments and compresses whitespace before counting tokens to provide more accurate estimates.
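
As a rough illustration of the estimation described above (a simplified sketch, not Switch's internal implementation), a character-ratio heuristic for Claude and tiktoken's o200k_base encoding for other models might look like this:

# Illustrative token estimation mirroring the description above.
# The comment stripping is a simplified stand-in for Switch's SQL preprocessing.
import re
import tiktoken

CLAUDE_CHARS_PER_TOKEN = 3.4  # approximation noted above

def strip_sql_comments(sql: str) -> str:
    """Remove /* */ and -- comments, then compress whitespace (simplified)."""
    sql = re.sub(r"/\*.*?\*/", " ", sql, flags=re.DOTALL)
    sql = re.sub(r"--[^\n]*", " ", sql)
    return re.sub(r"\s+", " ", sql).strip()

def estimate_tokens(text: str, model_family: str = "claude") -> int:
    if model_family == "claude":
        return int(len(text) / CLAUDE_CHARS_PER_TOKEN)
    return len(tiktoken.get_encoding("o200k_base").encode(text))

sql = "/* load users */ SELECT name, age FROM users WHERE active = 1; -- active only"
print(estimate_tokens(strip_sql_comments(sql)))  # compare against the 20,000-token threshold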

Adjusting the Threshold

This base configuration of 20,000 tokens has been tested with Claude 3.7 Sonnet's 128k context window. However, you should adjust this threshold based on your specific needs:

  • Lower the threshold (e.g., 8,000 tokens) for complex transformations using extended thinking mode
  • Raise the threshold when using larger context models like Claude Sonnet 4 (200k context)

Handling Oversized Files

When files exceed the token threshold, Switch automatically excludes them during the analyze phase and marks them with status "Not converted". These files require manual splitting before processing.

Consider these logical splitting strategies (a splitting sketch follows the list):

  • Separate stored procedures into individual files
  • Split by functional modules or business domains
  • Maintain referential integrity across split files
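
The sketch below illustrates the first strategy: splitting a large T-SQL script into one file per stored procedure. The CREATE PROCEDURE pattern and output layout are assumptions to adapt to your own dialect and naming conventions.

# Sketch: split an oversized SQL script at stored procedure boundaries.
# The regex and file layout are illustrative, not Switch behavior.
import re
from pathlib import Path

source = Path("big_script.sql").read_text(encoding="utf-8")
out_dir = Path("split_sql")
out_dir.mkdir(exist_ok=True)

# Cut just before each CREATE [OR ALTER] PROCEDURE statement;
# anything before the first procedure is kept as its own part.
pattern = re.compile(r"(?im)^\s*CREATE\s+(?:OR\s+ALTER\s+)?PROCEDURE\b")
starts = [m.start() for m in pattern.finditer(source)]
boundaries = sorted({0, *starts, len(source)})

for i, (begin, end) in enumerate(zip(boundaries, boundaries[1:]), start=1):
    chunk = source[begin:end].strip()
    if chunk:
        (out_dir / f"part_{i:03d}.sql").write_text(chunk + "\n", encoding="utf-8")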

Performance Optimization

Concurrency Settings:

  • Default concurrency: Set to 4 based on testing with Claude models for stable operation
  • Model-specific considerations: Different foundation models have varying rate limits and optimal concurrency levels
  • Scaling for large workloads: For processing many files simultaneously, consider:
    • Increased concurrency: Test higher values with your chosen model to find optimal settings
    • Provisioned Throughput: Deploy dedicated Foundation Model API capacity with guaranteed throughput

Monitoring:

  • Watch for rate limiting or throttling responses from model endpoints
  • Consider enabling Inference Tables to automatically capture requests and responses for detailed monitoring and debugging

State Management

Switch uses a Delta table to track conversion progress and results. Each conversion job creates a timestamped table: {catalog}.{schema}.lakebridge_switch_{timestamp}_{random} (e.g., main.default.lakebridge_switch_20250115143022_7ka9)

The table stores input file information (path, content, token counts), conversion results (generated notebooks, token usage, processing time), error details when conversions fail, and syntax check results from validation stages. This allows you to monitor which files were processed successfully and investigate any issues that occurred during conversion.
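
For example, a quick way to review failed conversions is to query the result table directly; the table name below is a placeholder following the naming pattern above:

# Inspect failed conversions in a Switch result table.
# Replace the table name with the one generated by your conversion job.
result_table = "main.default.lakebridge_switch_20250115143022_7ka9"

failed = spark.sql(f"""
    SELECT input_file_path, result_error, result_python_parse_error
    FROM {result_table}
    WHERE result_error IS NOT NULL
       OR result_python_parse_error IS NOT NULL
    ORDER BY input_file_path
""")
display(failed)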

Conversion Result Table Schema

Switch creates Delta tables with the following complete schema:

Column | Type | Description
input_file_number | int | Unique integer identifier for each input file (starts from 1)
input_file_path | string | Full path to the input file
input_file_encoding | string | Detected encoding of the input file (e.g., UTF-8)
tokenizer_type | string | Type of tokenizer used (claude or openai)
tokenizer_model | string | Specific tokenizer model/encoding used
input_file_token_count | int | Total number of tokens in the input file
input_file_token_count_preprocessed | int | Token count of preprocessed content (SQL comments removed for SQL files, original count for generic files)
input_file_content | string | Entire content of the input file
input_file_content_preprocessed | string | Preprocessed content (SQL comments removed for SQL files, original content for generic files)
is_conversion_target | boolean | Whether the file should be processed (updated during conversion)
model_serving_endpoint_for_conversion | string | Model endpoint used for conversion
model_serving_endpoint_for_fix | string | Model endpoint used for syntax error fixing
request_params_for_conversion | string | Conversion request parameters in JSON format
request_params_for_fix | string | Fix request parameters in JSON format
result_content | string | Generated notebook content (initially null)
result_prompt_tokens | int | Number of prompt tokens used (initially null)
result_completion_tokens | int | Number of completion tokens generated (initially null)
result_total_tokens | int | Total tokens used (prompt + completion, initially null)
result_processing_time_seconds | float | Processing time in seconds (initially null)
result_timestamp | timestamp | UTC timestamp when processing completed (initially null)
result_error | string | Any conversion errors encountered (initially null)
result_python_parse_error | string | Python syntax errors found using ast.parse (initially null)
result_extracted_sqls | array<string> | SQL statements extracted from Python code (initially null)
result_sql_parse_errors | array<string> | SQL syntax errors found using EXPLAIN (initially null)
export_output_path | string | Path to the exported file (initially null)
export_status | string | Export processing status (initially null)
export_error | string | Export error information (initially null)
export_timestamp | timestamp | UTC timestamp when export completed (initially null)
export_content_size_bytes | long | Size of exported content in bytes (initially null)
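
The schema also makes it straightforward to summarize token usage and processing time for a run, for example:

# Summarize token usage and processing time from a Switch result table.
# Replace the table name with the one generated by your conversion job.
result_table = "main.default.lakebridge_switch_20250115143022_7ka9"

summary = spark.sql(f"""
    SELECT
        COUNT(*) AS files,
        SUM(result_total_tokens) AS total_tokens,
        AVG(result_processing_time_seconds) AS avg_seconds_per_file
    FROM {result_table}
    WHERE is_conversion_target = true
""")
display(summary)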