Data transformation

Structure all transformation layers following the medallion architecture (bronze, silver, gold). Perform all transformations within Databricks to ensure optimal performance and data security, using the following patterns.

Incremental transformations (streaming)

Use incremental transformations when each run processes only new or changed data since the last run, maintaining state via checkpoints (e.g., silver-layer cleaning, incremental aggregates).

  1. Use Lakeflow Spark Declarative Pipelines (SDP) for incremental ETL transformations (SQL or Python). SDP is declarative, scalable, and automates orchestration.
    • Use Pipelines on Databricks SQL for the simplest SQL-only approach (streaming tables/materialized views).
    • When you need Python-specific transformation capabilities (for example, AUTO CDC FROM SNAPSHOT or programmatic flow creation), author the pipeline in Python and manage its lifecycle with the Pipelines API, which is also available through the SDKs; a sketch follows this list.
  2. For procedural incremental logic beyond SDP's declarative operators, use Structured Streaming APIs with Lakeflow Jobs (continuous or periodic trigger).
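
A minimal Python sketch of the SDP pattern in item 1, assuming a pipeline that incrementally cleans a bronze table into a silver streaming table; the catalog, table, and column names are placeholders.

```python
# Declarative pipeline source file (runs inside a Lakeflow pipeline, not as a standalone script).
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="silver_orders",  # hypothetical target table
    comment="Cleaned orders, incrementally updated from the bronze layer.",
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows that fail the expectation
def silver_orders():
    # A streaming read keeps the transformation incremental: the pipeline tracks
    # state and processes only new bronze rows on each update.
    return (
        spark.readStream.table("catalog.schema.bronze_orders")  # hypothetical source
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .dropDuplicates(["order_id"])
    )
```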

Documentation: Pipelines on Databricks SQL | Python Development | Pipelines API | Structured Streaming | Continuous Jobs
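
For the procedural pattern in item 2, a sketch of a Structured Streaming task that could run under Lakeflow Jobs; the tables, checkpoint path, and windowing logic are illustrative assumptions.

```python
# Structured Streaming task: procedural incremental logic with an explicit checkpoint.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.table("catalog.schema.bronze_events")  # hypothetical source

# Logic beyond simple declarative operators: windowed counts per user with a watermark.
aggregated = (
    events
    .withWatermark("event_ts", "30 minutes")
    .groupBy(F.window("event_ts", "10 minutes"), "user_id")
    .count()
)

(
    aggregated.writeStream
    .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/event_counts")  # placeholder path
    .trigger(availableNow=True)  # periodic; use processingTime="1 minute" for continuous-style runs
    .toTable("catalog.schema.silver_event_counts")  # hypothetical target
)
```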

Full refresh transformations (batch)

Use batch transformations when each run independently reprocesses a complete, bounded dataset (for example, recomputing gold-layer aggregates for a specific date) and produces the same result regardless of prior runs.

  • Recommended: From your product, submit these batch workloads to Databricks and let Lakeflow Jobs handle orchestration, scheduling, dependencies, monitoring, and alerting for the underlying SQL, notebooks, or Python scripts; a sketch follows this list.
  • Optional, product-orchestrated pattern: If your product must remain the primary orchestrator, keep workflow control in your system and invoke Databricks to execute parameterized SQL or other workloads, with Databricks reporting execution status and results back to your system; a second sketch follows the documentation link below.
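
A sketch of the recommended pattern using the Databricks SDK for Python, assuming a job has already been defined in Lakeflow Jobs; the job ID and parameters are placeholders.

```python
# Submit a pre-defined batch job from your product and wait for it to finish.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads host and credentials from the environment or a config profile

run = w.jobs.run_now(
    job_id=123456789,                          # hypothetical job ID
    job_parameters={"run_date": "2024-01-31"}, # hypothetical parameter
).result()  # blocks until the run reaches a terminal state

print(run.state.result_state)  # e.g. SUCCESS or FAILED; surface this in your product
```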

Documentation: Lakeflow Jobs
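
For the product-orchestrated pattern, a sketch that executes parameterized SQL through the SQL Statement Execution API (via the SDK) and reads the status back; the warehouse ID, statement, and parameter are placeholders.

```python
# Execute parameterized SQL on a SQL warehouse and report status back to your product.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

w = WorkspaceClient()

resp = w.statement_execution.execute_statement(
    warehouse_id="1234567890abcdef",  # placeholder warehouse ID
    statement="""
        INSERT INTO catalog.schema.gold_daily_sales
        SELECT order_date, SUM(amount)
        FROM catalog.schema.silver_orders
        WHERE order_date = :run_date
        GROUP BY order_date
    """,
    parameters=[StatementParameterListItem(name="run_date", type="DATE", value="2024-01-31")],
    wait_timeout="30s",  # wait synchronously up to 30 seconds; poll afterwards if still running
)

print(resp.status.state)  # e.g. SUCCEEDED, FAILED, or PENDING
```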

General transformation patterns

These patterns apply to both incremental and batch transformations:

  1. Run SQL transformations using Databricks SQL. See the SQL language reference.
  2. Use UDFs for reusable custom logic (scalar, batch, and table functions) when built-ins aren't sufficient; prefer built-ins for performance. A sketch follows this list.
  3. Use task-specific AI functions for unstructured data extraction (e.g., ai_parse_document). See AI functions.
  4. Use notebooks (PySpark/Scala) for advanced logic too complex for declarative pipelines.
  5. Use Databricks Connect for interactive, low-latency transformations driven by UI flows.
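
A minimal sketch for item 2, assuming a PySpark scalar UDF; the normalization rule and table are hypothetical, and a built-in function should be preferred whenever one covers the logic.

```python
# Reusable custom logic as a scalar PySpark UDF, used only where built-ins fall short.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=T.StringType())
def normalize_sku(raw: str) -> str:
    # Hypothetical product-specific rule with no built-in equivalent.
    return raw.strip().upper().replace(" ", "-") if raw else None

df = spark.table("catalog.schema.silver_products")   # hypothetical table
df = df.withColumn("sku", normalize_sku("raw_sku"))  # apply the UDF column-wise
```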

Documentation: Medallion Architecture | UDFs | ai_parse_document | Databricks Connect
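
For item 5, a sketch using Databricks Connect from your product's backend, assuming connection details are supplied via environment variables or a config profile; the table and filter value are placeholders.

```python
# Databricks Connect: run Spark transformations from product code against remote compute.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()  # picks up workspace config from the environment

# Low-latency, UI-driven transformation: filter and aggregate on demand.
summary = (
    spark.table("catalog.schema.silver_orders")  # hypothetical table
    .where("region = 'EMEA'")                    # value selected in the product UI
    .groupBy("order_date")
    .agg({"amount": "sum"})
)
summary.show()
```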

For orchestration patterns, see the Orchestration section below.

Orchestration

Partner integrations can orchestrate Databricks workloads programmatically through APIs or by using Lakeflow Jobs for multi-step workflows.

  1. Use the Databricks REST API (and SDKs/CLI built on it) to programmatically orchestrate resources and runs.

    • Jobs API for batch and scheduled workloads
    • SQL Statement Execution API for executing SQL against warehouses
    • Command Execution API for interactive commands (classic clusters only, not recommended for orchestration)
  2. Use Lakeflow Jobs for multi-step workflows (notebooks, scripts, pipelines) with triggers, dependencies, and alerts. Jobs can be managed programmatically and integrated with external orchestrators like Airflow.
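
A hedged sketch of programmatic orchestration with the SDK: creating a two-task job where the second task depends on the first, then triggering a run. The job name, notebook paths, and compute configuration are assumptions (compute is omitted here; configure serverless or a cluster per task as required).

```python
# Create a multi-step Lakeflow job programmatically, then trigger it on demand.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-gold-refresh",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="clean_silver",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/clean_silver"),  # placeholder
        ),
        jobs.Task(
            task_key="build_gold",
            depends_on=[jobs.TaskDependency(task_key="clean_silver")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/build_gold"),    # placeholder
        ),
    ],
)

# Your product or an external orchestrator can then trigger and monitor runs by job ID.
w.jobs.run_now(job_id=job.job_id)
```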

Documentation: REST API | Jobs API | Lakeflow Jobs

What's next