
Data engineering

Data Engineering partners build ingestion, ETL/ELT, transformation, and orchestration tools. See the data ingestion and transformation patterns for foundational context.

Data ingestion products

Requirements

  • Use Unity Catalog volumes as the default governed landing/staging zone for file drops. Volumes provide unified pathing and access control across clouds.
  • See common patterns for file-based ingestion and streaming/CDC.
  • Default targets must be Unity Catalog managed tables. Managed tables are governed and benefit from predictive optimization and automatic maintenance, delivering lower cost, faster performance, and broad interoperability. A short ingestion sketch follows the documentation links below.
    • You may support external tables as an option when required, but managed tables should remain your default.

Documentation: Unity Catalog Volumes | Working with Volume Files | Managed Tables
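
To make the pattern concrete, here is a minimal Python sketch, assuming a hypothetical main.partner.landing volume and main.partner.orders_raw managed table: the partner product lands a file through the Files API, and a Databricks notebook or job loads it into the managed table.

```python
# Ingestion sketch. Catalog/schema/volume/table names and the local file are hypothetical.
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession

# Part 1 - runs from your product: land the raw file in a Unity Catalog volume
# via the Files API (auth comes from env vars or a Databricks config profile).
w = WorkspaceClient()
with open("orders_2024_01.csv", "rb") as f:
    w.files.upload(
        "/Volumes/main/partner/landing/orders/orders_2024_01.csv", f, overwrite=True
    )

# Part 2 - runs in a Databricks notebook or job: load the file drop into a managed table.
# On Databricks, getOrCreate() returns the session provided by the runtime.
spark = SparkSession.builder.getOrCreate()
(
    spark.read.format("csv")
    .option("header", "true")
    .load("/Volumes/main/partner/landing/orders/")
    .write.mode("append")
    .saveAsTable("main.partner.orders_raw")  # managed table: no explicit LOCATION
)
```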

Data transformation products

Requirements

  • Perform all transformations within Databricks to ensure optimal performance and governance. Structure layers following the medallion architecture (bronze, silver, gold).
  • For incremental transformations, use Lakeflow SDP (Spark Declarative Pipelines) as the default; see the pipeline sketch below. For procedural logic beyond SDP, use Structured Streaming with Lakeflow Jobs.
  • For batch transformations, let Lakeflow Jobs handle orchestration, or orchestrate from your product if required.
  • Use SQL, UDFs, AI functions, notebooks, or Databricks Connect based on your transformation needs; a Databricks Connect example follows the documentation links below.

See data transformation patterns for detailed implementation guidance.
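
As an illustration of the incremental path, the following is a minimal declarative pipeline sketch using the dlt Python module; the volume path and table names are hypothetical, and Auto Loader (cloudFiles) handles incremental file discovery.

```python
# Lakeflow SDP sketch (Python API via the `dlt` module). The volume path and
# table names are hypothetical; `spark` is provided by the pipeline runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw order files ingested incrementally with Auto Loader")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader for incremental file discovery
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/main/partner/landing/orders/")
    )

@dlt.table(comment="Silver: typed and filtered orders")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")        # incremental read of the bronze table
        .withColumn("order_ts", F.to_timestamp("order_ts"))
        .filter(F.col("order_id").isNotNull())
    )
```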

Documentation: Medallion Architecture | Lakeflow SDP | Lakeflow Jobs
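
For batch transformations driven from your own product, a sketch along these lines using Databricks Connect is one option; the table names are hypothetical and authentication is assumed to come from the environment or a Databricks config profile.

```python
# Batch transformation sketch driven from an external product via Databricks Connect.
# Table names are hypothetical; auth comes from the environment or a config profile.
from databricks.connect import DatabricksSession
from pyspark.sql import functions as F

spark = DatabricksSession.builder.getOrCreate()

# Aggregate silver data into a gold managed table (medallion architecture).
gold = (
    spark.read.table("main.partner.silver_orders")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("lifetime_value"))
)
gold.write.mode("overwrite").saveAsTable("main.partner.gold_customer_ltv")
```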

Reverse ETL products

Requirements

  • Use SQL Warehouses to query data from Databricks. Recommend Serverless SQL Warehouses in your documentation for better performance; see the connector example after the documentation links.
  • Deliver curated datasets via Structured Streaming or SDP Sinks, or schedule Lakeflow Jobs for API pushes. Alternatively, expose governed datasets via managed tables and Unity Catalog's open APIs.

Documentation: Structured Streaming | SDP Sinks | Lakeflow Jobs | Catalogs API
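
As an example of the query path, here is a minimal sketch using the Databricks SQL Connector for Python against a (preferably serverless) SQL warehouse; the connection variables, table name, and push_to_destination helper are placeholders.

```python
# Reverse ETL sketch: read a curated gold table through a SQL warehouse, then push rows
# to your product's API. Hostname, HTTP path, token, and table name are placeholders.
import os
from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],   # HTTP path of the SQL warehouse
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT customer_id, lifetime_value FROM main.partner.gold_customer_ltv"
        )
        for customer_id, lifetime_value in cursor.fetchall():
            # push_to_destination is a stand-in for your product's delivery logic
            push_to_destination(customer_id, lifetime_value)
```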

Orchestration products

Requirements

  • Use the Databricks REST API (and the SDKs and CLI built on it) to programmatically orchestrate Databricks resources and runs; see the SDK example below.

Documentation: REST API
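
For instance, a minimal sketch with the Databricks SDK for Python, which wraps the REST API, could trigger an existing Lakeflow Job and wait for the run to finish; the job ID is a placeholder and authentication comes from the environment or a config profile.

```python
# Orchestration sketch: trigger an existing Lakeflow Job and block until it finishes.
# The job ID is a placeholder; auth comes from the environment or a config profile.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

waiter = w.jobs.run_now(job_id=123456789)   # calls the Jobs run-now REST endpoint
run = waiter.result()                       # poll until the run reaches a terminal state

print(run.state.result_state)               # e.g. SUCCESS or FAILED
```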

What's next