Amazon Redshift to Databricks

Conversion information

  • Transpiler: BladeBridge - Target: Databricks SQL (experimental)
  • Transpiler: Switch - Target: Databricks Notebook (experimental)
  • Input format: SQL files (.sql)

Supported Redshift Versions

  • Amazon Redshift (all versions)
  • Amazon Redshift Serverless

Input Requirements

Export your Redshift SQL scripts:

  1. Schema Scripts: DDL statements (CREATE TABLE, CREATE VIEW, etc.)
  2. Stored Procedures: Procedural SQL code
  3. Query Scripts: SELECT, INSERT, UPDATE, DELETE statements
  4. ETL Scripts: Data transformation logic

Export from Redshift:

-- Export table DDL (requires the v_generate_tbl_ddl admin view from the
-- AWS amazon-redshift-utils repository; SHOW TABLE schema_name.table_name also works)
SELECT
schemaname,
tablename,
ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'schema_name'
AND tablename = 'table_name';

-- Export view definitions
SELECT
schemaname,
viewname,
definition
FROM pg_views
WHERE schemaname = 'your_schema';
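
For stored procedures, Redshift's SHOW PROCEDURE command returns the procedure body; a minimal sketch, assuming a procedure named sp_load_customers (placeholder name):

-- Export a stored procedure definition (run once per procedure;
-- add the argument signature if the name is overloaded)
SHOW PROCEDURE sp_load_customers;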

SQL Features

Data Types - Supported

| Redshift Type | Databricks Type | Notes |
|---|---|---|
| SMALLINT / INT2 | INT | 2-byte integer |
| INTEGER / INT / INT4 | INT | 4-byte integer |
| BIGINT / INT8 | BIGINT | 8-byte integer |
| DECIMAL / NUMERIC | NUMERIC | Precision preserved |
| REAL / FLOAT4 | FLOAT | Single precision |
| DOUBLE PRECISION / FLOAT8 / FLOAT | FLOAT | Double precision |
| BOOLEAN / BOOL | BOOLEAN | True/false |
| CHAR / CHARACTER | STRING | Fixed-length string |
| VARCHAR / CHARACTER VARYING | STRING | Variable-length string |
| NCHAR / NVARCHAR | STRING | Unicode strings |
| TEXT | STRING | Unlimited length |
| DATE | DATE | Date only |
| TIMESTAMP | TIMESTAMP | Date and time |
| TIMESTAMPTZ | TIMESTAMP | Timestamp with timezone converted |
| TIMETZ | TIMESTAMP | Time with timezone converted |
| VARBYTE / VARBINARY | BINARY | Binary data |
| IDENTITY columns | GENERATED AS IDENTITY | Auto-incrementing columns |

Data Types - Unsupported

| Redshift Type | Reason | Workaround |
|---|---|---|
| SUPER | Semi-structured data type | Use STRING or VARIANT with manual parsing |
| HLLSKETCH | HyperLogLog sketches | Use approx_count_distinct() or custom UDFs |
| GEOMETRY | Spatial data | Migrate to Databricks spatial functions with STRING representation |
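
As an illustration of the SUPER workaround, a payload exported as a JSON string can be queried in Databricks with the standard JSON functions. A sketch; the raw_events table and event_payload column are placeholders:

-- SUPER payload landed as a STRING column containing JSON
SELECT
GET_JSON_OBJECT(event_payload, '$.device.os') AS device_os,
FROM_JSON(event_payload, 'device STRUCT<os: STRING, version: STRING>') AS parsed_payload
FROM raw_events;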

SQL Functions

Date & Time Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| GETDATE() | CURRENT_TIMESTAMP | Current timestamp |
| SYSDATE | CURRENT_TIMESTAMP() | Current timestamp |
| DATE_PART('year', date) | YEAR(date) | Extract year |
| DATE_PART('month', date) | MONTH(date) | Extract month |
| DATE_PART('day', date) | DAY(date) | Extract day |
| DATE_PART('hour', date) | HOUR(date) | Extract hour |
| DATE_PART('minute', date) | MINUTE(date) | Extract minute |
| DATE_PART('second', date) | SECOND(date) | Extract second |
| DATE_PART('quarter', date) | QUARTER(date) | Extract quarter |
| DATE_PART('week', date) | WEEKOFYEAR(date) | Week number |
| DATE_PART('dow', date) | DAYOFWEEK(date) | Day of week |
| DATE_PART('doy', date) | DAYOFYEAR(date) | Day of year |
| DATE_PART('epoch', date) | UNIX_TIMESTAMP(date) | Seconds since epoch |
| DATEDIFF('day', start, end) | DATEDIFF(DAY, start, end) | Difference in days |
| DATEDIFF('month', start, end) | MONTH(end) - MONTH(start) + 12 * (YEAR(end) - YEAR(start)) | Difference in months |
| DATEDIFF('year', start, end) | YEAR(end) - YEAR(start) | Difference in years |
| DATEDIFF('hour', start, end) | (UNIX_TIMESTAMP(end) - UNIX_TIMESTAMP(start)) / 3600 | Difference in hours |
| DATEDIFF('second', start, end) | UNIX_TIMESTAMP(end) - UNIX_TIMESTAMP(start) | Difference in seconds |
| DATEADD('day', n, date) | DATE_ADD(date, n) | Add days |
| DATEADD('month', n, date) | ADD_MONTHS(date, n) | Add months |
| DATEADD('year', n, date) | ADD_MONTHS(date, n * 12) | Add years |
| DATEADD('hour', n, date) | TIMESTAMPADD(HOUR, n, date) | Add hours |
| EXTRACT(YEAR FROM date) | EXTRACT(YEAR FROM date) | Extract year |
| TRUNC(date) | TRUNC(date, 'DD') | Truncate to day |
| CONVERT_TIMEZONE(tz, ts) | FROM_UTC_TIMESTAMP(ts, tz) | Timezone conversion (2 args) |
| CONVERT_TIMEZONE(src_tz, tgt_tz, ts) | FROM_UTC_TIMESTAMP(TO_UTC_TIMESTAMP(ts, src_tz), tgt_tz) | Timezone conversion (3 args) |
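
For example, applying the mappings above by hand (illustrative, not transpiler output):

-- Redshift
SELECT DATE_PART('year', order_date) AS order_year, DATEADD('month', 3, order_date) AS follow_up_date
FROM orders;

-- Databricks SQL
SELECT YEAR(order_date) AS order_year, ADD_MONTHS(order_date, 3) AS follow_up_date
FROM orders;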

String Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| LEN(str) | CHAR_LENGTH(str) | String length |
| CHAR_LENGTH(str) | CHAR_LENGTH(str) | String length |
| TEXTLEN(str) | CHAR_LENGTH(str) | String length |
| STRPOS(str, substr) | POSITION(substr IN str) | Find substring position |
| CHARINDEX(substr, str) | INSTR(str, substr) | Find substring position |
| REPLICATE(str, n) | REPEAT(str, n) | Repeat string |
| REPLACE_CHARS(str, old, new) | REPLACE(str, old, new) | Replace characters |
| SPLIT_PART(str, delim, part) | SPLIT(str, delim)[part-1] | Split and get part |
| REGEXP_SUBSTR(str, pattern) | REGEXP_EXTRACT(str, pattern, 0) | Extract using regex |
| REGEXP_COUNT(str, pattern) | LENGTH(str) - LENGTH(REGEXP_REPLACE(str, pattern, '')) | Count regex matches |
| CHR(n) | CHAR(n) | ASCII code to character |
| QUOTE_LITERAL(str) | CONCAT(''', REGEXP_REPLACE(str, ''', ''''), ''') | Quote string literal |
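
The SPLIT_PART mapping is worth calling out because Redshift's part index is 1-based while Databricks arrays are 0-based. An illustrative example (the url_path column and page_views table are placeholders):

-- Redshift: third '/'-delimited segment
SELECT SPLIT_PART(url_path, '/', 3) FROM page_views;

-- Databricks SQL: part 3 becomes array index 2
SELECT SPLIT(url_path, '/')[2] FROM page_views;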

Aggregate Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| LISTAGG(col, delim) | ARRAY_JOIN(COLLECT_LIST(col), delim) | Concatenate strings |
| LISTAGG(DISTINCT col, delim) WITHIN GROUP (ORDER BY x) | ARRAY_JOIN(ARRAY_DISTINCT(...), delim) | Distinct concatenation with ordering |
| MEDIAN(col) | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) | Median value |
| STDDEV_POP(col) | STDDEV_POP(col) | Population standard deviation |
| STDDEV_SAMP(col) | STDDEV_SAMP(col) | Sample standard deviation |
| VAR_POP(col) | VAR_POP(col) | Population variance |
| VAR_SAMP(col) | VAR_SAMP(col) | Sample variance |

JSON Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| JSON_EXTRACT_PATH_TEXT(json, 'key') | GET_JSON_OBJECT(json, '$.key') | Extract JSON value (2 args) |
| JSON_EXTRACT_PATH_TEXT(json, 'k1', 'k2') | GET_JSON_OBJECT(json, '$.k1.k2') | Extract nested JSON value |
| JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json, idx) | GET_JSON_OBJECT(json, '$[idx]') | Extract array element |
| JSON_PARSE(str) | FROM_JSON(str, schema) | Parse JSON string |
| JSON_QUERY(json, path) | JSON_EXTRACT_SCALAR(json, path) | Query JSON |
| IS_VALID_JSON(str) | CASE WHEN TRY_PARSE_JSON(str) IS NOT NULL THEN TRUE ELSE FALSE END | Validate JSON |
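
As an example of the path translation, nested key arguments collapse into a single JSONPath expression (illustrative; the events table and payload column are placeholders):

-- Redshift
SELECT JSON_EXTRACT_PATH_TEXT(payload, 'customer', 'id') FROM events;

-- Databricks SQL
SELECT GET_JSON_OBJECT(payload, '$.customer.id') FROM events;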

Conditional Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| NVL(expr1, expr2) | COALESCE(expr1, expr2) | Null value replacement |
| ISNULL(expr1, expr2) | COALESCE(expr1, expr2) | Null value replacement |
| DECODE(expr, val1, res1, def) | CASE WHEN expr=val1 THEN res1 ELSE def END | Conditional expression |
| DECODE(expr, v1, r1, v2, r2, def) | CASE WHEN expr=v1 THEN r1 WHEN expr=v2 THEN r2 ELSE def END | Multiple conditions |

Mathematical Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| DLOG1(x) | LN(x) | Natural logarithm |
| CEILING(x) | CEILING(x) | Round up |
| FLOOR(x) | FLOOR(x) | Round down |
| ROUND(x, n) | ROUND(x, n) | Round to n decimal places |
| TRUNC(x, n) | TRUNC(x, n) | Truncate to n decimal places |
| ABS(x) | ABS(x) | Absolute value |
| SIGN(x) | SIGN(x) | Sign of number |
| MOD(x, y) | MOD(x, y) | Modulo |
| POWER(x, y) | POWER(x, y) | Exponentiation |

Array Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| ARRAY[1,2,3] | ARRAY(1,2,3) | Array literal |
| ARRAY_TO_STRING(arr, delim) | ARRAY_JOIN(arr, delim) | Array to string |
| ARRAY_UPPER(arr, dim) | SIZE(arr) | Array size |
| GET_ARRAY_LENGTH(arr) | SIZE(arr) | Array length |
| SPLIT_TO_ARRAY(str, delim) | SPLIT(str, delim) | String to array |

Window Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| ROW_NUMBER() OVER (...) | ROW_NUMBER() OVER (...) | Row number |
| RANK() OVER (...) | RANK() OVER (...) | Rank |
| DENSE_RANK() OVER (...) | DENSE_RANK() OVER (...) | Dense rank |
| FIRST_VALUE(col) IGNORE NULLS | FIRST_VALUE(col) IGNORE NULLS | First value ignoring nulls |
| LAST_VALUE(col) IGNORE NULLS | LAST_VALUE(col) IGNORE NULLS | Last value ignoring nulls |
| LAG(col, offset) OVER (...) | LAG(col, offset) OVER (...) | Previous row value |
| LEAD(col, offset) OVER (...) | LEAD(col, offset) OVER (...) | Next row value |

Utility Functions - Supported

| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| GENERATE_SERIES(start, end) | EXPLODE(SEQUENCE(start, end, 1)) | Generate integer series |
| GENERATE_SERIES(start, end, step) | EXPLODE(SEQUENCE(start, end, step)) | Generate integer series with step |
| URLPARSE(url, part) | PARSE_URL(url, part) | Parse URL |
| PG_BACKEND_PID() | CURRENT_USER() | |
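
For instance, a simple integer series converts as follows (illustrative only):

-- Redshift: integers 1 through 10
SELECT * FROM GENERATE_SERIES(1, 10);

-- Databricks SQL
SELECT EXPLODE(SEQUENCE(1, 10, 1)) AS n;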

DDL Statements

CREATE TABLE - Supported Features

Redshift:

CREATE TABLE customers (
customer_id INT IDENTITY(1,1) PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE,
birth_date DATE,
registration_date TIMESTAMP DEFAULT GETDATE(),
loyalty_points DECIMAL(10,2) DEFAULT 0,
status VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (registration_date, customer_id);

Converted Databricks SQL:

CREATE OR REPLACE TABLE customers (
customer_id INT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
-- FIXME databricks.migration.unsupported.feature PRIMARY KEY
first_name STRING NOT NULL,
last_name STRING NOT NULL,
email STRING,
-- FIXME databricks.migration.unsupported.feature UNIQUE Constraint
birth_date DATE,
registration_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
loyalty_points NUMERIC(10,2) DEFAULT 0,
status STRING
)
ZORDER BY (registration_date, customer_id);

CREATE TABLE - Conversion Notes

| Redshift Feature | Databricks Conversion | Status |
|---|---|---|
| IDENTITY(start, increment) | GENERATED ALWAYS AS IDENTITY | ✅ Supported |
| PRIMARY KEY | Commented with FIXME | ⚠️ Not enforced (commented) |
| FOREIGN KEY | Commented with FIXME | ⚠️ Not enforced (commented) |
| UNIQUE constraints | Commented with FIXME | ⚠️ Not enforced (commented) |
| CHECK constraints | Commented with FIXME | ⚠️ Not enforced (commented) |
| DEFAULT values | DEFAULT | ✅ Supported |
| NOT NULL | NOT NULL | ✅ Supported |
| DISTSTYLE KEY/EVEN/ALL | Removed | ℹ️ Not applicable in Databricks |
| DISTKEY (column) | Removed | ℹ️ Not applicable in Databricks |
| SORTKEY (columns) | ZORDER BY (columns) | ✅ Converted to Z-ordering |
| COMPOUND SORTKEY | ZORDER BY | ✅ Converted to Z-ordering |
| INTERLEAVED SORTKEY | ZORDER BY | ✅ Converted to Z-ordering |
| ENCODE (compression) | Removed | ℹ️ Databricks handles compression automatically |
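
If constraints need to be reinstated after conversion, Delta tables support enforced CHECK constraints and Unity Catalog supports informational primary and foreign keys. A sketch, assuming Unity Catalog and the customers table from the example above:

-- Enforced CHECK constraint on a Delta table
ALTER TABLE customers ADD CONSTRAINT valid_status CHECK (status IN ('active', 'inactive', 'deleted'));

-- Informational primary key (Unity Catalog; not enforced, key column must be NOT NULL)
ALTER TABLE customers ADD CONSTRAINT customers_pk PRIMARY KEY (customer_id);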

CREATE VIEW

Redshift:

CREATE VIEW active_customers AS
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
email,
DATEDIFF(year, birth_date, CURRENT_DATE) AS age
FROM customers
WHERE status = 'active';

Converted Databricks SQL:

CREATE OR REPLACE VIEW active_customers AS
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
email,
YEAR(CURRENT_DATE) - YEAR(birth_date) AS age
FROM customers
WHERE status = 'active';

CREATE TABLE AS SELECT (CTAS)

Redshift:

CREATE TABLE customer_summary AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;

Converted Databricks SQL:

CREATE OR REPLACE TABLE customer_summary AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;

CREATE TABLE LIKE

Redshift:

CREATE TABLE customers_backup (LIKE customers INCLUDING DEFAULTS);

Converted Databricks SQL:

CREATE OR REPLACE TABLE customers_backup AS
SELECT * FROM customers;
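
Note that Redshift's CREATE TABLE ... LIKE creates an empty table, while the CTAS conversion above also copies the data. If only the structure is needed, a schema-only variant (a manual adjustment, not transpiler output) is:

CREATE OR REPLACE TABLE customers_backup AS
SELECT * FROM customers WHERE 1 = 0;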

DML Statements

UPDATE with JOIN - Converted to MERGE

Redshift:

WITH customer_segments AS (
SELECT
customer_id,
UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)) AS full_name,
DATEDIFF(year, date_of_birth, CURRENT_DATE) AS age,
CASE
WHEN DATEDIFF(year, date_of_birth, CURRENT_DATE) < 25 THEN 'Young'
WHEN DATEDIFF(year, date_of_birth, CURRENT_DATE) BETWEEN 25 AND 45 THEN 'Adult'
ELSE 'Senior'
END AS age_segment,
DATE_PART(month, registration_date) AS registration_month
FROM customers
WHERE status = 'active'
)
UPDATE customers
SET
first_name = cs.full_name,
updated_at = CURRENT_TIMESTAMP
FROM customer_segments cs
WHERE customers.customer_id = cs.customer_id
AND cs.age_segment = 'Senior'
AND cs.registration_month IN (1, 12);

Converted Databricks SQL:

WITH customer_segments AS (
SELECT
customer_id,
UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)) AS full_name,
YEAR(CURRENT_DATE) - YEAR(date_of_birth) AS age,
CASE
WHEN YEAR(CURRENT_DATE) - YEAR(date_of_birth) < 25 THEN 'Young'
WHEN YEAR(CURRENT_DATE) - YEAR(date_of_birth) BETWEEN 25 AND 45 THEN 'Adult'
ELSE 'Senior'
END AS age_segment,
DATE_PART(month, registration_date) AS registration_month
FROM customers
WHERE status = 'active'
)
MERGE INTO customers
USING customer_segments cs
ON customers.customer_id = cs.customer_id
AND cs.age_segment = 'Senior'
AND cs.registration_month IN (1, 12)
WHEN MATCHED THEN UPDATE SET
first_name = cs.full_name,
updated_at = CURRENT_TIMESTAMP;

DELETE with JOIN - Converted to MERGE

Redshift:

DELETE FROM orders
USING customers
WHERE orders.customer_id = customers.customer_id
AND customers.status = 'inactive'
AND orders.order_date < CURRENT_DATE - 365;

Converted Databricks SQL:

MERGE INTO orders
USING customers
ON orders.customer_id = customers.customer_id
AND customers.status = 'inactive'
AND orders.order_date < CURRENT_DATE - INTERVAL 365 DAYS
WHEN MATCHED THEN DELETE;

INSERT INTO SELECT

Redshift:

INSERT INTO customer_archive
SELECT * FROM customers
WHERE status = 'deleted'
AND updated_at < CURRENT_DATE - 730;

Converted Databricks SQL:

INSERT INTO customer_archive
SELECT * FROM customers
WHERE status = 'deleted'
AND updated_at < DATE_ADD(current_date, -730);

Verified Conversion Examples

The following examples are taken directly from the functional test suite and show actual input/output conversions.

Example 1: Date Functions

Source Redshift SQL:

SELECT
o.order_number,
CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
TO_CHAR(o.order_date, 'YYYY-MM-DD') AS formatted_order_date,
DATEDIFF(day, o.order_date, o.estimated_delivery) AS delivery_days
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATEADD(month, -1, CURRENT_DATE)
ORDER BY o.order_date DESC;

Converted Databricks SQL:

SELECT
o.order_number,
CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
DATE_FORMAT(o.order_date,'y-MM-dd') AS formatted_order_date,
DATEDIFF(DAY, o.order_date, o.estimated_delivery) AS delivery_days
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= ADD_MONTHS(CURRENT_DATE, -1)
ORDER BY o.order_date DESC;

Key Conversions:

  • TO_CHAR(date, format) → DATE_FORMAT(date, format) with format translation
  • DATEDIFF(day, ...) → DATEDIFF(DAY, ...)
  • DATEADD(month, n, date) → ADD_MONTHS(date, n)

Example 2: Simple UPDATE (No MERGE Needed)

Source Redshift SQL:

UPDATE customers
SET last_login = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE UPPER(email) LIKE '%@GMAIL.COM'
AND last_login < DATEADD(month, -3, CURRENT_DATE);

Converted Databricks SQL:

UPDATE customers
SET last_login = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE UPPER(email) LIKE '%@GMAIL.COM'
AND last_login < ADD_MONTHS(CURRENT_DATE, -3);

Key Conversions:

  • DATEADD(month, -3, date) → ADD_MONTHS(date, -3)
  • Simple UPDATE statements remain as UPDATE (no MERGE needed)

Example 3: CTEs with UPDATE FROM → MERGE

Source Redshift SQL:

WITH recent_orders AS (
SELECT
o.customer_id,
COUNT(*) AS order_count,
SUM(o.total_amount) AS total_spent,
MAX(o.order_date) AS last_order_date,
SUBSTRING(o.order_number, 1, 4) AS order_prefix
FROM orders o
WHERE o.order_date >= DATEADD(month, -6, CURRENT_DATE)
AND o.status IN ('completed', 'shipped')
GROUP BY o.customer_id, SUBSTRING(o.order_number, 1, 4)
HAVING COUNT(*) >= 3
),
loyalty_updates AS (
SELECT
ro.customer_id,
FLOOR(ro.total_spent / 100) * 10 AS bonus_points,
CONCAT('VIP-', LPAD(ro.customer_id::VARCHAR, 6, '0')) AS new_customer_code
FROM recent_orders ro
)
UPDATE customers
SET
loyalty_points = loyalty_points + lu.bonus_points,
customer_code = lu.new_customer_code,
updated_at = CURRENT_TIMESTAMP
FROM loyalty_updates lu
WHERE customers.customer_id = lu.customer_id;

Converted Databricks SQL:

WITH recent_orders AS (
SELECT
o.customer_id,
COUNT(*) AS order_count,
SUM(o.total_amount) AS total_spent,
MAX(o.order_date) AS last_order_date,
SUBSTRING(o.order_number, 1, 4) AS order_prefix
FROM orders o
WHERE o.order_date >= ADD_MONTHS(CURRENT_DATE, -6)
AND o.status IN ('completed', 'shipped')
GROUP BY o.customer_id, SUBSTRING(o.order_number, 1, 4)
HAVING COUNT(*) >= 3
),
loyalty_updates AS (
SELECT
ro.customer_id,
FLOOR(ro.total_spent / 100) * 10 AS bonus_points,
CONCAT('VIP-', LPAD(ro.customer_id::STRING, 6, '0')) AS new_customer_code
FROM recent_orders ro
)
MERGE INTO customers
USING loyalty_updates lu
ON customers.customer_id = lu.customer_id
WHEN MATCHED THEN UPDATE SET
loyalty_points = loyalty_points + lu.bonus_points,
customer_code = lu.new_customer_code,
updated_at = CURRENT_TIMESTAMP;

Key Conversions:

  • UPDATE ... FROM ... → MERGE INTO ... USING ... WHEN MATCHED
  • ::VARCHAR → ::STRING
  • DATEADD(month, -6, date) → ADD_MONTHS(date, -6)
  • CTEs preserved with updated syntax

Example 4: REGEXP_COUNT Function

Source Redshift SQL:

UPDATE customers
SET loyalty_points = loyalty_points +
CASE
WHEN REGEXP_COUNT(email, '[A-Z]') > 3 THEN 100
WHEN REGEXP_COUNT(email, '[0-9]') > 2 THEN 50
ELSE 25
END,
status = DECODE(
REGEXP_COUNT(phone, '[0-9]'),
10, 'VERIFIED',
11, 'VERIFIED',
'PENDING'
)
WHERE customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
AND o.total_amount > 500
);

Converted Databricks SQL:

UPDATE customers
SET loyalty_points = loyalty_points +
CASE
WHEN LENGTH(email) - LENGTH(REGEXP_REPLACE(email, '[A-Z]', '')) > 3 THEN 100
WHEN LENGTH(email) - LENGTH(REGEXP_REPLACE(email, '[0-9]', '')) > 2 THEN 50
ELSE 25
END,
status = case when LENGTH(phone) - LENGTH(REGEXP_REPLACE(phone, '[0-9]', ''))=10 then 'VERIFIED' when LENGTH(phone) - LENGTH(REGEXP_REPLACE(phone, '[0-9]', ''))=11 then 'VERIFIED' else 'PENDING' END
WHERE customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
AND o.total_amount > 500
);

Key Conversions:

  • REGEXP_COUNT(str, pattern) → LENGTH(str) - LENGTH(REGEXP_REPLACE(str, pattern, ''))
  • DECODE(expr, v1, r1, v2, r2, default) → CASE WHEN expr=v1 THEN r1 WHEN expr=v2 THEN r2 ELSE default END

Example 5: INSERT with REGEXP_COUNT and Complex Functions

Source Redshift SQL:

INSERT INTO addresses (customer_id, address_type, street_address, city, state_province, postal_code, country, is_primary, created_at)
SELECT
c.customer_id,
'BILLING',
TRANSLATE(c.first_name || ' ' || c.last_name, 'AEIOU', '12345') || ' Street',
CASE
WHEN REGEXP_COUNT(c.email, '@gmail') > 0 THEN 'New York'
WHEN REGEXP_COUNT(c.email, '@yahoo') > 0 THEN 'Los Angeles'
ELSE 'Chicago'
END,
DECODE(
REGEXP_COUNT(c.phone, '^1'),
1, 'NY',
'CA'
),
LPAD(ABS(RANDOM() * 99999)::INT, 5, '0'),
'USA',
1,
CURRENT_TIMESTAMP
FROM customers c
WHERE c.customer_id NOT IN (
SELECT DISTINCT customer_id
FROM addresses
WHERE address_type = 'BILLING'
)
AND c.registration_date >= CURRENT_DATE - INTERVAL '90 days';

Converted Databricks SQL:

INSERT INTO addresses (customer_id, address_type, street_address, city, state_province, postal_code, country, is_primary, created_at)
SELECT
c.customer_id,
'BILLING',
TRANSLATE(c.first_name || ' ' || c.last_name, 'AEIOU', '12345') || ' Street',
CASE
WHEN LENGTH(c.email) - LENGTH(REGEXP_REPLACE(c.email, '@gmail', '')) > 0 THEN 'New York'
WHEN LENGTH(c.email) - LENGTH(REGEXP_REPLACE(c.email, '@yahoo', '')) > 0 THEN 'Los Angeles'
ELSE 'Chicago'
END,
case when LENGTH(c.phone) - LENGTH(REGEXP_REPLACE(c.phone, '^1', ''))=1 then 'NY' else 'CA' END,
LPAD(ABS(RANDOM() * 99999)::INT, 5, '0'),
'USA',
1,
CURRENT_TIMESTAMP
FROM customers c
WHERE c.customer_id NOT IN (
SELECT DISTINCT customer_id
FROM addresses
WHERE address_type = 'BILLING'
)
AND c.registration_date >= CURRENT_DATE - INTERVAL '90 days';

Key Conversions:

  • Multiple REGEXP_COUNT() conversions in CASE statements
  • DECODE() with 3 parameters → inline CASE
  • TRANSLATE() function preserved (supported in Databricks)
  • LPAD(), ABS(), RANDOM() preserved

Example 6: DELETE with CTEs

Source Redshift SQL:

WITH expired_products AS (
SELECT
p.product_id,
p.product_name,
REPLACE(LOWER(p.product_name), ' ', '_') AS slug_name,
DATEDIFF(day, p.launch_date, CURRENT_DATE) AS days_since_launch,
DATE_TRUNC('quarter', p.launch_date) AS launch_quarter
FROM products p
WHERE p.launch_date < DATEADD(year, -2, CURRENT_DATE)
AND p.stock_quantity = 0
AND p.is_active = 1
),
category_info AS (
SELECT
c.category_id,
INITCAP(c.category_name) AS formatted_name
FROM categories c
WHERE c.is_active = 1
)
DELETE FROM products
WHERE product_id IN (
SELECT ep.product_id
FROM expired_products ep
JOIN category_info ci ON ep.product_id IN (
SELECT product_id FROM products WHERE category_id = ci.category_id
)
WHERE ep.days_since_launch > 730
AND CHARINDEX('discontinued', LOWER(ep.product_name)) > 0
);

Converted Databricks SQL:

WITH expired_products AS (
SELECT
p.product_id,
p.product_name,
REPLACE(LOWER(p.product_name), ' ', '_') AS slug_name,
DATEDIFF(DAY, p.launch_date, CURRENT_DATE) AS days_since_launch,
DATE_TRUNC('quarter', p.launch_date) AS launch_quarter
FROM products p
WHERE p.launch_date < DATEADD(year, -2, CURRENT_DATE)
AND p.stock_quantity = 0
AND p.is_active = 1
),
category_info AS (
SELECT
c.category_id,
INITCAP(c.category_name) AS formatted_name
FROM categories c
WHERE c.is_active = 1
)
DELETE FROM products
WHERE product_id IN (
SELECT ep.product_id
FROM expired_products ep
JOIN category_info ci ON ep.product_id IN (
SELECT product_id FROM products WHERE category_id = ci.category_id
)
WHERE ep.days_since_launch > 730
AND INSTR(LOWER(ep.product_name), 'discontinued') > 0
);

Key Conversions:

  • DATEDIFF(day, ...) → DATEDIFF(DAY, ...)
  • CHARINDEX(substr, str) → INSTR(str, substr) (note: argument order swapped)
  • INITCAP() preserved (supported in Databricks)
  • DATE_TRUNC() preserved
  • CTEs with DELETE preserved

Advanced Features

Common Table Expressions (CTEs)

Redshift:

WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY 1
),
customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY customer_id
)
SELECT
ms.month,
ms.monthly_total,
COUNT(DISTINCT co.customer_id) AS active_customers,
AVG(co.avg_order_value) AS avg_customer_value
FROM monthly_sales ms
CROSS JOIN customer_orders co
GROUP BY ms.month, ms.monthly_total;

Converted Databricks SQL:

WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY 1
),
customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY customer_id
)
SELECT
ms.month,
ms.monthly_total,
COUNT(DISTINCT co.customer_id) AS active_customers,
AVG(co.avg_order_value) AS avg_customer_value
FROM monthly_sales ms
CROSS JOIN customer_orders co
GROUP BY ms.month, ms.monthly_total;

Window Functions with IGNORE NULLS

Redshift:

SELECT
customer_id,
order_date,
total_amount,
FIRST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date) AS first_payment_method,
LAST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_payment_method
FROM orders;

Converted Databricks SQL:

SELECT
customer_id,
order_date,
total_amount,
FIRST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date) AS first_payment_method,
LAST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_payment_method
FROM orders;

LISTAGG with DISTINCT

Redshift:

SELECT
customer_id,
LISTAGG(DISTINCT product_category, ', ')
WITHIN GROUP (ORDER BY product_category) AS categories
FROM order_items
GROUP BY customer_id;

Converted Databricks SQL:

SELECT
customer_id,
ARRAY_JOIN(
ARRAY_DISTINCT(
SPLIT(
LISTAGG(product_category, ', ')
WITHIN GROUP (ORDER BY product_category),
', '
)
),
', '
) AS categories
FROM order_items
GROUP BY customer_id;
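
When hand-tuning the generated SQL, a more direct native form is available (a sketch; ordering is handled by SORT_ARRAY rather than WITHIN GROUP):

SELECT
customer_id,
ARRAY_JOIN(SORT_ARRAY(COLLECT_SET(product_category)), ', ') AS categories
FROM order_items
GROUP BY customer_id;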

Timezone Conversions

Redshift:

SELECT
order_id,
order_timestamp,
order_timestamp AT TIME ZONE 'America/New_York' AS ny_time,
CONVERT_TIMEZONE('America/New_York', order_timestamp) AS ny_converted,
CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', order_timestamp) AS la_time
FROM orders;

Converted Databricks SQL:

SELECT
order_id,
order_timestamp,
FROM_UTC_TIMESTAMP(order_timestamp, 'America/New_York') AS ny_time,
FROM_UTC_TIMESTAMP(order_timestamp, 'America/New_York') AS ny_converted,
FROM_UTC_TIMESTAMP(TO_UTC_TIMESTAMP(order_timestamp, 'UTC'), 'America/Los_Angeles') AS la_time
FROM orders;

Lateral Column References

Redshift allows a SELECT list item to reference a column alias defined earlier in the same SELECT clause. The conversion expands such lateral references into their underlying expressions.

Redshift:

SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
LENGTH(full_name) AS name_length, -- Lateral reference to full_name
UPPER(full_name) AS full_name_upper -- Another lateral reference
FROM customers;

Converted Databricks SQL:

SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
LENGTH(first_name || ' ' || last_name) AS name_length,
UPPER(first_name || ' ' || last_name) AS full_name_upper
FROM customers;

Redshift-Specific Features

Distribution Styles

Redshift distribution styles (DISTSTYLE, DISTKEY) are removed during conversion as Databricks handles data distribution automatically.

Redshift:

CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

Converted Databricks SQL:

CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount NUMERIC(10,2)
);

Sort Keys

Redshift SORTKEY is converted to Databricks ZORDER BY for optimized query performance.

Redshift:

CREATE TABLE events (
event_id BIGINT,
event_date DATE,
user_id INT,
event_type VARCHAR(50)
)
COMPOUND SORTKEY (event_date, user_id);

Converted Databricks SQL:

CREATE OR REPLACE TABLE events (
event_id BIGINT,
event_date DATE,
user_id INT,
event_type STRING
)
ZORDER BY (event_date, user_id);
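
Note that in Databricks, Z-ordering is applied as a table maintenance operation rather than at creation time; once the table holds data, it is typically applied with:

OPTIMIZE events ZORDER BY (event_date, user_id);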

Compression Encoding

Redshift ENCODE clauses for column compression are removed as Databricks manages compression automatically.

Redshift:

CREATE TABLE products (
product_id INT ENCODE az64,
product_name VARCHAR(200) ENCODE lzo,
description TEXT ENCODE zstd,
price DECIMAL(10,2) ENCODE raw
);

Converted Databricks SQL:

CREATE OR REPLACE TABLE products (
product_id INT,
product_name STRING,
description STRING,
price NUMERIC(10,2)
);

Vacuum and Analyze

Redshift VACUUM and ANALYZE commands are removed as Databricks handles optimization automatically.

Redshift:

VACUUM orders;
ANALYZE customers;
VACUUM FULL products;

Converted Databricks SQL:

-- VACUUM and ANALYZE commands are not needed in Databricks
-- Databricks performs automatic optimization
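
If explicit maintenance is still desired, Delta Lake provides its own commands with different semantics (a sketch; the retention window is illustrative):

-- Compact small files and refresh statistics
OPTIMIZE orders;
ANALYZE TABLE customers COMPUTE STATISTICS;

-- Remove data files no longer referenced by the table (not equivalent to Redshift VACUUM)
VACUUM products RETAIN 168 HOURS;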

Known Limitations

Features Not Supported

  1. SUPER Data Type

    • Redshift's semi-structured SUPER type
    • Workaround: Use STRING type and parse with JSON functions, or use VARIANT type
  2. HLLSKETCH Data Type

    • HyperLogLog sketches for approximate counting
    • Workaround: Use APPROX_COUNT_DISTINCT() or custom UDFs
  3. GEOMETRY Data Type

    • Spatial/geographic data
    • Workaround: Store as STRING and use Databricks spatial functions
  4. Referential Integrity Constraints

    • PRIMARY KEY, FOREIGN KEY constraints are commented out
    • Workaround: Implement validation logic in application or ETL code
  5. CHECK Constraints

    • Column-level CHECK constraints are commented out
    • Workaround: Implement validation in application code or use Delta table constraints where applicable
  6. UNIQUE Constraints

    • UNIQUE constraints are commented out
    • Workaround: Use MERGE operations or application-level validation
  7. Cursor Operations

    • DECLARE CURSOR, OPEN, FETCH, CLOSE
    • Workaround: Rewrite using set-based operations or Databricks SQL stored procedures
  8. UNLOAD Command

    • Redshift UNLOAD to S3
    • Workaround: Use Databricks COPY INTO or DataFrame write operations
  9. COPY Command

    • Redshift COPY from S3/files
    • Workaround: Use Databricks COPY INTO or DataFrame read operations (see the sketch after this list)
  10. Redshift Spectrum External Tables

    • Querying data directly in S3
    • Workaround: Use Databricks External Tables or Delta Lake
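
As an example of the COPY workaround noted above, a Redshift COPY from S3 can usually be rewritten with COPY INTO (a sketch; the bucket path and format options are placeholders):

COPY INTO orders
FROM 's3://your-bucket/exports/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');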

Next Steps

  1. Export Redshift SQL scripts to .sql files
  2. Run conversion using Lakebridge CLI:
    databricks labs lakebridge transpile \
    --source-dialect redshift \
    --input-source /path/to/redshift/scripts \
    --output-folder /output/databricks-sql \
    --target-technology databricks-sql
  3. Review generated SQL for FIXME comments
  4. Address unsupported features (constraints, SUPER types, etc.)
  5. Test converted SQL in Databricks workspace
  6. Optimize with Delta Lake features (OPTIMIZE, Z-ORDER)
  7. Deploy to production

For more information, see: