Amazon Redshift to Databricks
Conversion Information
- Transpiler: BladeBridge - Target: Databricks SQL (experimental)
- Transpiler: Switch - Target: Databricks Notebook (experimental)
- Input format: SQL files (.sql)
Supported Redshift Versions
- Amazon Redshift (all versions)
- Amazon Redshift Serverless
Input Requirements
Export your Redshift SQL scripts:
- Schema Scripts: DDL statements (CREATE TABLE, CREATE VIEW, etc.)
- Stored Procedures: Procedural SQL code
- Query Scripts: SELECT, INSERT, UPDATE, DELETE statements
- ETL Scripts: Data transformation logic
Export from Redshift:
-- Export table DDL (uses the admin.v_generate_tbl_ddl view from the
-- AWS amazon-redshift-utils repository; SHOW TABLE schema_name.table_name
-- also returns the DDL for a single table)
SELECT
    schemaname,
    tablename,
    ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'schema_name'
  AND tablename = 'table_name';
-- Export view definitions
SELECT
schemaname,
viewname,
definition
FROM pg_views
WHERE schemaname = 'your_schema';
SQL Features
Data Types - Supported
| Redshift Type | Databricks Type | Notes |
|---|---|---|
| SMALLINT / INT2 | INT | 2-byte integer |
| INTEGER / INT / INT4 | INT | 4-byte integer |
| BIGINT / INT8 | BIGINT | 8-byte integer |
| DECIMAL / NUMERIC | NUMERIC | Precision preserved |
| REAL / FLOAT4 | FLOAT | Single precision |
| DOUBLE PRECISION / FLOAT8 / FLOAT | FLOAT | Double precision |
| BOOLEAN / BOOL | BOOLEAN | True/false |
| CHAR / CHARACTER | STRING | Fixed-length string |
| VARCHAR / CHARACTER VARYING | STRING | Variable-length string |
| NCHAR / NVARCHAR | STRING | Unicode strings |
| TEXT | STRING | Unlimited length |
| DATE | DATE | Date only |
| TIMESTAMP | TIMESTAMP | Date and time |
| TIMESTAMPTZ | TIMESTAMP | Timestamp with timezone converted |
| TIMETZ | TIMESTAMP | Time with timezone converted |
| VARBYTE / VARBINARY | BINARY | Binary data |
| IDENTITY columns | GENERATED AS IDENTITY | Auto-incrementing columns |
Data Types - Unsupported
| Redshift Type | Reason | Workaround |
|---|---|---|
| SUPER | Semi-structured data type | Use STRING or VARIANT with manual parsing |
| HLLSKETCH | HyperLogLog sketches | Use approx_count_distinct() or custom UDFs |
| GEOMETRY | Spatial data | Migrate to Databricks spatial functions with STRING representation |
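For example, a SUPER column that holds JSON documents can be landed as STRING and parsed with the JSON functions listed later in this guide. A minimal sketch, assuming an illustrative events table with a payload column:
-- Redshift: payload is a SUPER column queried as payload.device.id
-- Databricks workaround: keep payload as STRING and extract values explicitly
SELECT
    GET_JSON_OBJECT(payload, '$.device.id') AS device_id
FROM events;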
SQL Functions
Date & Time Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| GETDATE() | CURRENT_TIMESTAMP | Current timestamp |
| SYSDATE | CURRENT_TIMESTAMP() | Current timestamp |
| DATE_PART('year', date) | YEAR(date) | Extract year |
| DATE_PART('month', date) | MONTH(date) | Extract month |
| DATE_PART('day', date) | DAY(date) | Extract day |
| DATE_PART('hour', date) | HOUR(date) | Extract hour |
| DATE_PART('minute', date) | MINUTE(date) | Extract minute |
| DATE_PART('second', date) | SECOND(date) | Extract second |
| DATE_PART('quarter', date) | QUARTER(date) | Extract quarter |
| DATE_PART('week', date) | WEEKOFYEAR(date) | Week number |
| DATE_PART('dow', date) | DAYOFWEEK(date) | Day of week (note: Redshift returns 0-6, Databricks returns 1-7) |
| DATE_PART('doy', date) | DAYOFYEAR(date) | Day of year |
| DATE_PART('epoch', date) | UNIX_TIMESTAMP(date) | Seconds since epoch |
| DATEDIFF('day', start, end) | DATEDIFF(DAY, start, end) | Difference in days |
| DATEDIFF('month', start, end) | MONTH(end) - MONTH(start) + 12 * (YEAR(end) - YEAR(start)) | Difference in months |
| DATEDIFF('year', start, end) | YEAR(end) - YEAR(start) | Difference in years |
| DATEDIFF('hour', start, end) | (UNIX_TIMESTAMP(end) - UNIX_TIMESTAMP(start)) / 3600 | Difference in hours |
| DATEDIFF('second', start, end) | UNIX_TIMESTAMP(end) - UNIX_TIMESTAMP(start) | Difference in seconds |
| DATEADD('day', n, date) | DATE_ADD(date, n) | Add days |
| DATEADD('month', n, date) | ADD_MONTHS(date, n) | Add months |
| DATEADD('year', n, date) | ADD_MONTHS(date, n * 12) | Add years |
| DATEADD('hour', n, date) | TIMESTAMPADD(HOUR, n, date) | Add hours |
| EXTRACT(YEAR FROM date) | EXTRACT(YEAR FROM date) | Extract year |
| TRUNC(date) | TRUNC(date, 'DD') | Truncate to day |
| CONVERT_TIMEZONE(tz, ts) | FROM_UTC_TIMESTAMP(ts, tz) | Timezone conversion (2 args) |
| CONVERT_TIMEZONE(src_tz, tgt_tz, ts) | FROM_UTC_TIMESTAMP(TO_UTC_TIMESTAMP(ts, src_tz), tgt_tz) | Timezone conversion (3 args) |
String Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| LEN(str) | CHAR_LENGTH(str) | String length |
| CHAR_LENGTH(str) | CHAR_LENGTH(str) | String length |
| TEXTLEN(str) | CHAR_LENGTH(str) | String length |
| STRPOS(str, substr) | POSITION(substr IN str) | Find substring position |
| CHARINDEX(substr, str) | INSTR(str, substr) | Find substring position |
| REPLICATE(str, n) | REPEAT(str, n) | Repeat string |
| REPLACE_CHARS(str, old, new) | REPLACE(str, old, new) | Replace characters |
| SPLIT_PART(str, delim, part) | SPLIT(str, delim)[part-1] | Split and get part |
| REGEXP_SUBSTR(str, pattern) | REGEXP_EXTRACT(str, pattern, 0) | Extract using regex |
| REGEXP_COUNT(str, pattern) | LENGTH(str) - LENGTH(REGEXP_REPLACE(str, pattern, '')) | Count regex matches |
| CHR(n) | CHAR(n) | ASCII code to character |
| QUOTE_LITERAL(str) | CONCAT("'", REPLACE(str, "'", "''"), "'") | Quote string literal |
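Two of these mappings change more than the function name: Databricks arrays are zero-indexed, and INSTR takes its arguments in the opposite order from CHARINDEX. A minimal sketch with illustrative column names:
Redshift:
SELECT
    SPLIT_PART(email, '@', 2) AS email_domain,
    CHARINDEX('@', email) AS at_position
FROM customers;
Databricks SQL:
SELECT
    SPLIT(email, '@')[1] AS email_domain,
    INSTR(email, '@') AS at_position
FROM customers;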
Aggregate Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| LISTAGG(col, delim) | ARRAY_JOIN(COLLECT_LIST(col), delim) | Concatenate strings |
| LISTAGG(DISTINCT col, delim) WITHIN GROUP (ORDER BY x) | ARRAY_JOIN(ARRAY_DISTINCT(...), delim) | Distinct concatenation with ordering |
| MEDIAN(col) | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) | Median value |
| STDDEV_POP(col) | STDDEV_POP(col) | Population standard deviation |
| STDDEV_SAMP(col) | STDDEV_SAMP(col) | Sample standard deviation |
| VAR_POP(col) | VAR_POP(col) | Population variance |
| VAR_SAMP(col) | VAR_SAMP(col) | Sample variance |
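As a quick illustration of the MEDIAN mapping (a minimal sketch; table and column names are illustrative):
Redshift:
SELECT customer_id, MEDIAN(total_amount) AS median_order FROM orders GROUP BY customer_id;
Databricks SQL:
SELECT customer_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_amount) AS median_order FROM orders GROUP BY customer_id;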
JSON Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| JSON_EXTRACT_PATH_TEXT(json, 'key') | GET_JSON_OBJECT(json, '$.key') | Extract JSON value (2 args) |
| JSON_EXTRACT_PATH_TEXT(json, 'k1', 'k2') | GET_JSON_OBJECT(json, '$.k1.k2') | Extract nested JSON value |
| JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json, idx) | GET_JSON_OBJECT(json, '$[idx]') | Extract array element |
| JSON_PARSE(str) | FROM_JSON(str, schema) | Parse JSON string |
| JSON_QUERY(json, path) | JSON_EXTRACT_SCALAR(json, path) | Query JSON |
| IS_VALID_JSON(str) | CASE WHEN TRY_PARSE_JSON(str) IS NOT NULL THEN TRUE ELSE FALSE END | Validate JSON |
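For example, nested key access and array indexing both collapse into a single JSON path expression (a minimal sketch; table and column names are illustrative):
Redshift:
SELECT
    JSON_EXTRACT_PATH_TEXT(payload, 'customer', 'email') AS customer_email,
    JSON_EXTRACT_ARRAY_ELEMENT_TEXT(items, 0) AS first_item
FROM raw_events;
Databricks SQL:
SELECT
    GET_JSON_OBJECT(payload, '$.customer.email') AS customer_email,
    GET_JSON_OBJECT(items, '$[0]') AS first_item
FROM raw_events;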
Conditional Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| NVL(expr1, expr2) | COALESCE(expr1, expr2) | Null value replacement |
| ISNULL(expr1, expr2) | COALESCE(expr1, expr2) | Null value replacement |
| DECODE(expr, val1, res1, def) | CASE WHEN expr=val1 THEN res1 ELSE def END | Conditional expression |
| DECODE(expr, v1, r1, v2, r2, def) | CASE WHEN expr=v1 THEN r1 WHEN expr=v2 THEN r2 ELSE def END | Multiple conditions |
Mathematical Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| DLOG1(x) | LN(x) | Natural logarithm |
| CEILING(x) | CEILING(x) | Round up |
| FLOOR(x) | FLOOR(x) | Round down |
| ROUND(x, n) | ROUND(x, n) | Round to n decimal places |
| TRUNC(x, n) | TRUNC(x, n) | Truncate to n decimal places |
| ABS(x) | ABS(x) | Absolute value |
| SIGN(x) | SIGN(x) | Sign of number |
| MOD(x, y) | MOD(x, y) | Modulo |
| POWER(x, y) | POWER(x, y) | Exponentiation |
Array Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| ARRAY[1,2,3] | ARRAY(1,2,3) | Array literal |
| ARRAY_TO_STRING(arr, delim) | ARRAY_JOIN(arr, delim) | Array to string |
| ARRAY_UPPER(arr, dim) | SIZE(arr) | Array size |
| GET_ARRAY_LENGTH(arr) | SIZE(arr) | Array length |
| SPLIT_TO_ARRAY(str, delim) | SPLIT(str, delim) | String to array |
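For example, splitting a delimited string, counting its elements, and re-joining it map directly onto the table above (a minimal sketch; table and column names are illustrative):
Redshift:
SELECT
    ARRAY_TO_STRING(SPLIT_TO_ARRAY(tags, ','), ' | ') AS tag_list,
    GET_ARRAY_LENGTH(SPLIT_TO_ARRAY(tags, ',')) AS tag_count
FROM products;
Databricks SQL:
SELECT
    ARRAY_JOIN(SPLIT(tags, ','), ' | ') AS tag_list,
    SIZE(SPLIT(tags, ',')) AS tag_count
FROM products;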
Window Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| ROW_NUMBER() OVER (...) | ROW_NUMBER() OVER (...) | Row number |
| RANK() OVER (...) | RANK() OVER (...) | Rank |
| DENSE_RANK() OVER (...) | DENSE_RANK() OVER (...) | Dense rank |
| FIRST_VALUE(col) IGNORE NULLS | FIRST_VALUE(col) IGNORE NULLS | First value ignoring nulls |
| LAST_VALUE(col) IGNORE NULLS | LAST_VALUE(col) IGNORE NULLS | Last value ignoring nulls |
| LAG(col, offset) OVER (...) | LAG(col, offset) OVER (...) | Previous row value |
| LEAD(col, offset) OVER (...) | LEAD(col, offset) OVER (...) | Next row value |
Utility Functions - Supported
| Redshift Function | Databricks Equivalent | Notes |
|---|---|---|
| GENERATE_SERIES(start, end) | EXPLODE(SEQUENCE(start, end, 1)) | Generate integer series |
| GENERATE_SERIES(start, end, step) | EXPLODE(SEQUENCE(start, end, step)) | Generate integer series with step |
| URLPARSE(url, part) | PARSE_URL(url, part) | Parse URL |
| PG_BACKEND_PID() | CURRENT_USER() | No direct equivalent in Databricks |
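For example, generating a small integer series (a minimal sketch; the date-series variant works the same way with DATE arguments and an INTERVAL step):
Redshift:
SELECT generate_series(1, 10, 2) AS n;
Databricks SQL:
SELECT EXPLODE(SEQUENCE(1, 10, 2)) AS n;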
DDL Statements
CREATE TABLE - Supported Features
Redshift:
CREATE TABLE customers (
customer_id INT IDENTITY(1,1) PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE,
birth_date DATE,
registration_date TIMESTAMP DEFAULT GETDATE(),
loyalty_points DECIMAL(10,2) DEFAULT 0,
status VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (registration_date, customer_id);
Converted Databricks SQL:
CREATE OR REPLACE TABLE customers (
customer_id INT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
-- FIXME databricks.migration.unsupported.feature PRIMARY KEY
first_name STRING NOT NULL,
last_name STRING NOT NULL,
email STRING,
-- FIXME databricks.migration.unsupported.feature UNIQUE Constraint
birth_date DATE,
registration_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
loyalty_points NUMERIC(10,2) DEFAULT 0,
status STRING
)
ZORDER BY (registration_date, customer_id);
CREATE TABLE - Conversion Notes
| Redshift Feature | Databricks Conversion | Status |
|---|---|---|
| IDENTITY(start, increment) | GENERATED ALWAYS AS IDENTITY | ✅ Supported |
| PRIMARY KEY | Commented with FIXME | ⚠️ Not enforced (commented) |
| FOREIGN KEY | Commented with FIXME | ⚠️ Not enforced (commented) |
| UNIQUE constraints | Commented with FIXME | ⚠️ Not enforced (commented) |
| CHECK constraints | Commented with FIXME | ⚠️ Not enforced (commented) |
| DEFAULT values | DEFAULT | ✅ Supported |
| NOT NULL | NOT NULL | ✅ Supported |
| DISTSTYLE KEY/EVEN/ALL | Removed | ℹ️ Not applicable in Databricks |
| DISTKEY (column) | Removed | ℹ️ Not applicable in Databricks |
| SORTKEY (columns) | ZORDER BY (columns) | ✅ Converted to Z-ordering |
| COMPOUND SORTKEY | ZORDER BY | ✅ Converted to Z-ordering |
| INTERLEAVED SORTKEY | ZORDER BY | ✅ Converted to Z-ordering |
| ENCODE (compression) | Removed | ℹ️ Databricks handles compression automatically |
CREATE VIEW
Redshift:
CREATE VIEW active_customers AS
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
email,
DATEDIFF(year, birth_date, CURRENT_DATE) AS age
FROM customers
WHERE status = 'active';
Converted Databricks SQL:
CREATE OR REPLACE VIEW active_customers AS
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
email,
YEAR(CURRENT_DATE) - YEAR(birth_date) AS age
FROM customers
WHERE status = 'active';
CREATE TABLE AS SELECT (CTAS)
Redshift:
CREATE TABLE customer_summary AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;
Converted Databricks SQL:
CREATE OR REPLACE TABLE customer_summary AS
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent
FROM orders
GROUP BY customer_id;
CREATE TABLE LIKE
Redshift:
CREATE TABLE customers_backup (LIKE customers INCLUDING DEFAULTS);
Converted Databricks SQL:
CREATE OR REPLACE TABLE customers_backup AS
SELECT * FROM customers;
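Note that the AS SELECT * form copies the rows as well as the table structure. If the original LIKE statement was intended to create an empty copy of the schema, a predicate that filters out every row restores that behavior (a minimal sketch):
CREATE OR REPLACE TABLE customers_backup AS
SELECT * FROM customers WHERE 1 = 0;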
DML Statements
UPDATE with JOIN - Converted to MERGE
Redshift:
WITH customer_segments AS (
SELECT
customer_id,
UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)) AS full_name,
DATEDIFF(year, date_of_birth, CURRENT_DATE) AS age,
CASE
WHEN DATEDIFF(year, date_of_birth, CURRENT_DATE) < 25 THEN 'Young'
WHEN DATEDIFF(year, date_of_birth, CURRENT_DATE) BETWEEN 25 AND 45 THEN 'Adult'
ELSE 'Senior'
END AS age_segment,
DATE_PART(month, registration_date) AS registration_month
FROM customers
WHERE status = 'active'
)
UPDATE customers
SET
first_name = cs.full_name,
updated_at = CURRENT_TIMESTAMP
FROM customer_segments cs
WHERE customers.customer_id = cs.customer_id
AND cs.age_segment = 'Senior'
AND cs.registration_month IN (1, 12);
Converted Databricks SQL:
WITH customer_segments AS (
SELECT
customer_id,
UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)) AS full_name,
YEAR(CURRENT_DATE) - YEAR(date_of_birth) AS age,
CASE
WHEN YEAR(CURRENT_DATE) - YEAR(date_of_birth) < 25 THEN 'Young'
WHEN YEAR(CURRENT_DATE) - YEAR(date_of_birth) BETWEEN 25 AND 45 THEN 'Adult'
ELSE 'Senior'
END AS age_segment,
DATE_PART(month, registration_date) AS registration_month
FROM customers
WHERE status = 'active'
)
MERGE INTO customers
USING customer_segments cs
ON customers.customer_id = cs.customer_id
AND cs.age_segment = 'Senior'
AND cs.registration_month IN (1, 12)
WHEN MATCHED THEN UPDATE SET
first_name = cs.full_name,
updated_at = CURRENT_TIMESTAMP;
DELETE with JOIN - Converted to MERGE
Redshift:
DELETE FROM orders
USING customers
WHERE orders.customer_id = customers.customer_id
AND customers.status = 'inactive'
AND orders.order_date < CURRENT_DATE - 365;
Converted Databricks SQL:
MERGE INTO orders
USING customers
ON orders.customer_id = customers.customer_id
AND customers.status = 'inactive'
AND orders.order_date < CURRENT_DATE - INTERVAL 365 DAYS
WHEN MATCHED THEN DELETE;
INSERT INTO SELECT
Redshift:
INSERT INTO customer_archive
SELECT * FROM customers
WHERE status = 'deleted'
AND updated_at < CURRENT_DATE - 730;
Converted Databricks SQL:
INSERT INTO customer_archive
SELECT * FROM customers
WHERE status = 'deleted'
AND updated_at < DATE_ADD(current_date, -730);
Verified Conversion Examples
The following examples are taken directly from the functional test suite and show actual input/output conversions.
Example 1: Date Functions
Source Redshift SQL:
SELECT
o.order_number,
CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
TO_CHAR(o.order_date, 'YYYY-MM-DD') AS formatted_order_date,
DATEDIFF(day, o.order_date, o.estimated_delivery) AS delivery_days
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATEADD(month, -1, CURRENT_DATE)
ORDER BY o.order_date DESC;
Converted Databricks SQL:
SELECT
o.order_number,
CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
DATE_FORMAT(o.order_date,'y-MM-dd') AS formatted_order_date,
DATEDIFF(DAY, o.order_date, o.estimated_delivery) AS delivery_days
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= ADD_MONTHS(CURRENT_DATE, -1)
ORDER BY o.order_date DESC;
Key Conversions:
- TO_CHAR(date, format) → DATE_FORMAT(date, format) with format translation
- DATEDIFF(day, ...) → DATEDIFF(DAY, ...)
- DATEADD(month, n, date) → ADD_MONTHS(date, n)
Example 2: UPDATE with FROM → MERGE
Source Redshift SQL:
UPDATE customers
SET last_login = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE UPPER(email) LIKE '%@GMAIL.COM'
AND last_login < DATEADD(month, -3, CURRENT_DATE);
Converted Databricks SQL:
UPDATE customers
SET last_login = CURRENT_TIMESTAMP,
updated_at = CURRENT_TIMESTAMP
WHERE UPPER(email) LIKE '%@GMAIL.COM'
AND last_login < ADD_MONTHS(CURRENT_DATE, -3);
Key Conversions:
- DATEADD(month, -3, date) → ADD_MONTHS(date, -3)
- Simple UPDATE statements remain as UPDATE (no MERGE needed)
Example 3: CTEs with UPDATE FROM → MERGE
Source Redshift SQL:
WITH recent_orders AS (
SELECT
o.customer_id,
COUNT(*) AS order_count,
SUM(o.total_amount) AS total_spent,
MAX(o.order_date) AS last_order_date,
SUBSTRING(o.order_number, 1, 4) AS order_prefix
FROM orders o
WHERE o.order_date >= DATEADD(month, -6, CURRENT_DATE)
AND o.status IN ('completed', 'shipped')
GROUP BY o.customer_id, SUBSTRING(o.order_number, 1, 4)
HAVING COUNT(*) >= 3
),
loyalty_updates AS (
SELECT
ro.customer_id,
FLOOR(ro.total_spent / 100) * 10 AS bonus_points,
CONCAT('VIP-', LPAD(ro.customer_id::VARCHAR, 6, '0')) AS new_customer_code
FROM recent_orders ro
)
UPDATE customers
SET
loyalty_points = loyalty_points + lu.bonus_points,
customer_code = lu.new_customer_code,
updated_at = CURRENT_TIMESTAMP
FROM loyalty_updates lu
WHERE customers.customer_id = lu.customer_id;
Converted Databricks SQL:
WITH recent_orders AS (
SELECT
o.customer_id,
COUNT(*) AS order_count,
SUM(o.total_amount) AS total_spent,
MAX(o.order_date) AS last_order_date,
SUBSTRING(o.order_number, 1, 4) AS order_prefix
FROM orders o
WHERE o.order_date >= ADD_MONTHS(CURRENT_DATE, -6)
AND o.status IN ('completed', 'shipped')
GROUP BY o.customer_id, SUBSTRING(o.order_number, 1, 4)
HAVING COUNT(*) >= 3
),
loyalty_updates AS (
SELECT
ro.customer_id,
FLOOR(ro.total_spent / 100) * 10 AS bonus_points,
CONCAT('VIP-', LPAD(ro.customer_id::STRING, 6, '0')) AS new_customer_code
FROM recent_orders ro
)
MERGE INTO customers
USING loyalty_updates lu
ON customers.customer_id = lu.customer_id
WHEN MATCHED THEN UPDATE SET
loyalty_points = loyalty_points + lu.bonus_points,
customer_code = lu.new_customer_code,
updated_at = CURRENT_TIMESTAMP;
Key Conversions:
- UPDATE ... FROM ... → MERGE INTO ... USING ... WHEN MATCHED
- ::VARCHAR → ::STRING
- DATEADD(month, -6, date) → ADD_MONTHS(date, -6)
- CTEs preserved with updated syntax
Example 4: REGEXP_COUNT Function
Source Redshift SQL:
UPDATE customers
SET loyalty_points = loyalty_points +
CASE
WHEN REGEXP_COUNT(email, '[A-Z]') > 3 THEN 100
WHEN REGEXP_COUNT(email, '[0-9]') > 2 THEN 50
ELSE 25
END,
status = DECODE(
REGEXP_COUNT(phone, '[0-9]'),
10, 'VERIFIED',
11, 'VERIFIED',
'PENDING'
)
WHERE customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
AND o.total_amount > 500
);
Converted Databricks SQL:
UPDATE customers
SET loyalty_points = loyalty_points +
CASE
WHEN LENGTH(email) - LENGTH(REGEXP_REPLACE(email, '[A-Z]', '')) > 3 THEN 100
WHEN LENGTH(email) - LENGTH(REGEXP_REPLACE(email, '[0-9]', '')) > 2 THEN 50
ELSE 25
END,
status = case when LENGTH(phone) - LENGTH(REGEXP_REPLACE(phone, '[0-9]', ''))=10 then 'VERIFIED' when LENGTH(phone) - LENGTH(REGEXP_REPLACE(phone, '[0-9]', ''))=11 then 'VERIFIED' else 'PENDING' END
WHERE customer_id IN (
SELECT DISTINCT o.customer_id
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
AND o.total_amount > 500
);
Key Conversions:
- REGEXP_COUNT(str, pattern) → LENGTH(str) - LENGTH(REGEXP_REPLACE(str, pattern, ''))
- DECODE(expr, v1, r1, v2, r2, default) → CASE WHEN expr=v1 THEN r1 WHEN expr=v2 THEN r2 ELSE default END
Example 5: INSERT with REGEXP_COUNT and Complex Functions
Source Redshift SQL:
INSERT INTO addresses (customer_id, address_type, street_address, city, state_province, postal_code, country, is_primary, created_at)
SELECT
c.customer_id,
'BILLING',
TRANSLATE(c.first_name || ' ' || c.last_name, 'AEIOU', '12345') || ' Street',
CASE
WHEN REGEXP_COUNT(c.email, '@gmail') > 0 THEN 'New York'
WHEN REGEXP_COUNT(c.email, '@yahoo') > 0 THEN 'Los Angeles'
ELSE 'Chicago'
END,
DECODE(
REGEXP_COUNT(c.phone, '^1'),
1, 'NY',
'CA'
),
LPAD(ABS(RANDOM() * 99999)::INT, 5, '0'),
'USA',
1,
CURRENT_TIMESTAMP
FROM customers c
WHERE c.customer_id NOT IN (
SELECT DISTINCT customer_id
FROM addresses
WHERE address_type = 'BILLING'
)
AND c.registration_date >= CURRENT_DATE - INTERVAL '90 days';
Converted Databricks SQL:
INSERT INTO addresses (customer_id, address_type, street_address, city, state_province, postal_code, country, is_primary, created_at)
SELECT
c.customer_id,
'BILLING',
TRANSLATE(c.first_name || ' ' || c.last_name, 'AEIOU', '12345') || ' Street',
CASE
WHEN LENGTH(c.email) - LENGTH(REGEXP_REPLACE(c.email, '@gmail', '')) > 0 THEN 'New York'
WHEN LENGTH(c.email) - LENGTH(REGEXP_REPLACE(c.email, '@yahoo', '')) > 0 THEN 'Los Angeles'
ELSE 'Chicago'
END,
case when LENGTH(c.phone) - LENGTH(REGEXP_REPLACE(c.phone, '^1', ''))=1 then 'NY' else 'CA' END,
LPAD(ABS(RANDOM() * 99999)::INT, 5, '0'),
'USA',
1,
CURRENT_TIMESTAMP
FROM customers c
WHERE c.customer_id NOT IN (
SELECT DISTINCT customer_id
FROM addresses
WHERE address_type = 'BILLING'
)
AND c.registration_date >= CURRENT_DATE - INTERVAL '90 days';
Key Conversions:
- Multiple REGEXP_COUNT() conversions in CASE statements
- DECODE() with 3 parameters → inline CASE
- TRANSLATE() function preserved (supported in Databricks)
- LPAD(), ABS(), RANDOM() preserved
Example 6: DELETE with CTEs
Source Redshift SQL:
WITH expired_products AS (
SELECT
p.product_id,
p.product_name,
REPLACE(LOWER(p.product_name), ' ', '_') AS slug_name,
DATEDIFF(day, p.launch_date, CURRENT_DATE) AS days_since_launch,
DATE_TRUNC('quarter', p.launch_date) AS launch_quarter
FROM products p
WHERE p.launch_date < DATEADD(year, -2, CURRENT_DATE)
AND p.stock_quantity = 0
AND p.is_active = 1
),
category_info AS (
SELECT
c.category_id,
INITCAP(c.category_name) AS formatted_name
FROM categories c
WHERE c.is_active = 1
)
DELETE FROM products
WHERE product_id IN (
SELECT ep.product_id
FROM expired_products ep
JOIN category_info ci ON ep.product_id IN (
SELECT product_id FROM products WHERE category_id = ci.category_id
)
WHERE ep.days_since_launch > 730
AND CHARINDEX('discontinued', LOWER(ep.product_name)) > 0
);
Converted Databricks SQL:
WITH expired_products AS (
SELECT
p.product_id,
p.product_name,
REPLACE(LOWER(p.product_name), ' ', '_') AS slug_name,
DATEDIFF(DAY, p.launch_date, CURRENT_DATE) AS days_since_launch,
DATE_TRUNC('quarter', p.launch_date) AS launch_quarter
FROM products p
WHERE p.launch_date < DATEADD(year, -2, CURRENT_DATE)
AND p.stock_quantity = 0
AND p.is_active = 1
),
category_info AS (
SELECT
c.category_id,
INITCAP(c.category_name) AS formatted_name
FROM categories c
WHERE c.is_active = 1
)
DELETE FROM products
WHERE product_id IN (
SELECT ep.product_id
FROM expired_products ep
JOIN category_info ci ON ep.product_id IN (
SELECT product_id FROM products WHERE category_id = ci.category_id
)
WHERE ep.days_since_launch > 730
AND INSTR(LOWER(ep.product_name), 'discontinued') > 0
);
Key Conversions:
- DATEDIFF(day, ...) → DATEDIFF(DAY, ...)
- CHARINDEX(substr, str) → INSTR(str, substr) (note: argument order swapped)
- INITCAP() preserved (supported in Databricks)
- DATE_TRUNC() preserved
- CTEs with DELETE preserved
Advanced Features
Common Table Expressions (CTEs)
Redshift:
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY 1
),
customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY customer_id
)
SELECT
ms.month,
ms.monthly_total,
COUNT(DISTINCT co.customer_id) AS active_customers,
AVG(co.avg_order_value) AS avg_customer_value
FROM monthly_sales ms
CROSS JOIN customer_orders co
GROUP BY ms.month, ms.monthly_total;
Converted Databricks SQL:
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_total
FROM orders
GROUP BY 1
),
customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
AVG(total_amount) AS avg_order_value
FROM orders
GROUP BY customer_id
)
SELECT
ms.month,
ms.monthly_total,
COUNT(DISTINCT co.customer_id) AS active_customers,
  AVG(co.avg_order_value) AS avg_customer_value
FROM monthly_sales ms
CROSS JOIN customer_orders co
GROUP BY ms.month, ms.monthly_total;
Window Functions with IGNORE NULLS
Redshift:
SELECT
customer_id,
order_date,
total_amount,
FIRST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date) AS first_payment_method,
LAST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_payment_method
FROM orders;
Converted Databricks SQL:
SELECT
customer_id,
order_date,
total_amount,
FIRST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date) AS first_payment_method,
LAST_VALUE(payment_method) IGNORE NULLS
OVER (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_payment_method
FROM orders;
LISTAGG with DISTINCT
Redshift:
SELECT
customer_id,
LISTAGG(DISTINCT product_category, ', ')
WITHIN GROUP (ORDER BY product_category) AS categories
FROM order_items
GROUP BY customer_id;
Converted Databricks SQL:
SELECT
customer_id,
ARRAY_JOIN(
ARRAY_DISTINCT(
SPLIT(
LISTAGG(product_category, ', ')
WITHIN GROUP (ORDER BY product_category),
', '
)
),
', '
) AS categories
FROM order_items
GROUP BY customer_id;
Timezone Conversions
Redshift:
SELECT
order_id,
order_timestamp,
order_timestamp AT TIME ZONE 'America/New_York' AS ny_time,
CONVERT_TIMEZONE('America/New_York', order_timestamp) AS ny_converted,
CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', order_timestamp) AS la_time
FROM orders;
Converted Databricks SQL:
SELECT
order_id,
order_timestamp,
FROM_UTC_TIMESTAMP(order_timestamp, 'America/New_York') AS ny_time,
FROM_UTC_TIMESTAMP(order_timestamp, 'America/New_York') AS ny_converted,
FROM_UTC_TIMESTAMP(TO_UTC_TIMESTAMP(order_timestamp, 'UTC'), 'America/Los_Angeles') AS la_time
FROM orders;
Lateral Column References
Redshift allows referencing previously defined columns in the same SELECT clause. Databricks conversion handles this automatically.
Redshift:
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
LENGTH(full_name) AS name_length, -- Lateral reference to full_name
UPPER(full_name) AS full_name_upper -- Another lateral reference
FROM customers;
Converted Databricks SQL:
SELECT
customer_id,
first_name || ' ' || last_name AS full_name,
LENGTH(first_name || ' ' || last_name) AS name_length,
UPPER(first_name || ' ' || last_name) AS full_name_upper
FROM customers;
Redshift-Specific Features
Distribution Styles
Redshift distribution styles (DISTSTYLE, DISTKEY) are removed during conversion as Databricks handles data distribution automatically.
Redshift:
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
Converted Databricks SQL:
CREATE OR REPLACE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
total_amount NUMERIC(10,2)
);
Sort Keys
Redshift SORTKEY is converted to Databricks ZORDER BY for optimized query performance.
Redshift:
CREATE TABLE events (
event_id BIGINT,
event_date DATE,
user_id INT,
event_type VARCHAR(50)
)
COMPOUND SORTKEY (event_date, user_id);
Converted Databricks SQL:
CREATE OR REPLACE TABLE events (
event_id BIGINT,
event_date DATE,
user_id INT,
event_type STRING
)
ZORDER BY (event_date, user_id);
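In Delta Lake, Z-ordering takes effect when the table is optimized, so once data is loaded the former sort-key columns are typically used in an OPTIMIZE command (a minimal sketch):
OPTIMIZE events ZORDER BY (event_date, user_id);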
Compression Encoding
Redshift ENCODE clauses for column compression are removed as Databricks manages compression automatically.
Redshift:
CREATE TABLE products (
product_id INT ENCODE az64,
product_name VARCHAR(200) ENCODE lzo,
description TEXT ENCODE zstd,
price DECIMAL(10,2) ENCODE raw
);
Converted Databricks SQL:
CREATE OR REPLACE TABLE products (
product_id INT,
product_name STRING,
description STRING,
price NUMERIC(10,2)
);
Vacuum and Analyze
Redshift VACUUM and ANALYZE commands are removed as Databricks handles optimization automatically.
Redshift:
VACUUM orders;
ANALYZE customers;
VACUUM FULL products;
Converted Databricks SQL:
-- VACUUM and ANALYZE commands are not needed in Databricks
-- Databricks performs automatic optimization
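If explicit maintenance is still wanted, the closest Databricks equivalents are OPTIMIZE for compaction, ANALYZE TABLE for statistics, and Delta VACUUM for removing stale data files (a minimal sketch; retention behavior follows your workspace defaults):
OPTIMIZE orders;
ANALYZE TABLE customers COMPUTE STATISTICS;
VACUUM products;  -- removes unreferenced data files, unlike Redshift VACUUM, which re-sorts rows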
Known Limitations
Features Not Supported
- SUPER Data Type
  - Redshift's semi-structured SUPER type
  - Workaround: Use STRING type and parse with JSON functions, or use the VARIANT type
- HLLSKETCH Data Type
  - HyperLogLog sketches for approximate counting
  - Workaround: Use APPROX_COUNT_DISTINCT() or custom UDFs
- GEOMETRY Data Type
  - Spatial/geographic data
  - Workaround: Store as STRING and use Databricks spatial functions
- Referential Integrity Constraints
  - PRIMARY KEY and FOREIGN KEY constraints are commented out
  - Workaround: Implement validation logic in application or ETL code
- CHECK Constraints
  - Column-level CHECK constraints are commented out
  - Workaround: Implement validation in application code or use Delta table constraints where applicable
- UNIQUE Constraints
  - UNIQUE constraints are commented out
  - Workaround: Use MERGE operations or application-level validation
- Cursor Operations
  - DECLARE CURSOR, OPEN, FETCH, CLOSE
  - Workaround: Rewrite using set-based operations or Databricks SQL stored procedures
- UNLOAD Command
  - Redshift UNLOAD to S3
  - Workaround: Use Databricks COPY INTO or DataFrame write operations
- COPY Command
  - Redshift COPY from S3/files
  - Workaround: Use Databricks COPY INTO or DataFrame read operations
- Redshift Spectrum External Tables
  - Querying data directly in S3
  - Workaround: Use Databricks external tables or Delta Lake
Next Steps
- Export Redshift SQL scripts to .sql files
- Run conversion using Lakebridge CLI:
databricks labs lakebridge transpile \
--source-dialect redshift \
--input-source /path/to/redshift/scripts \
--output-folder /output/databricks-sql \
--target-technology databricks-sql
- Review generated SQL for FIXME comments
- Address unsupported features (constraints, SUPER types, etc.)
- Test converted SQL in Databricks workspace
- Optimize with Delta Lake features (OPTIMIZE, Z-ORDER)
- Deploy to production
For more information, see: