
Architecture

This page covers what partners need to architect when building a product on top of Clean Rooms — not how Clean Rooms work internally. For a product overview (central clean room model, Delta Sharing data flow, serverless isolation), see What is Databricks Clean Rooms?

Partner integration patterns

Programmatic vs. customer-setup

The first decision when building a Clean Rooms-based product is whether your application creates and manages clean rooms on behalf of your customers, or whether customers set them up themselves with your guidance.

Programmatic (partner-managed): Your application uses the Databricks REST API to create rooms, attach assets, configure approvals, and trigger runs — all from your backend. The customer experience is entirely within your product; they may not interact with the Databricks UI at all. This is the right pattern for productized, repeatable offerings like identity resolution services, measurement subscriptions, or analytics-as-a-service.

Customer-setup: Customers create and configure their own clean rooms with your documentation and notebooks as guidance. Your role is to provide well-structured assets (tables, views, notebooks) and a clear setup guide. This is appropriate for partnerships where the customer's technical team leads the implementation, or for one-off engagements.

Most scaled partner products end up combining both: the partner provisions rooms and publishes assets programmatically, while the customer connects their own data and triggers runs.
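As a rough illustration of the programmatic pattern, the sketch below builds (but does not send) the HTTP request a backend might issue to create a room. The endpoint path `/api/2.0/clean-rooms`, the payload field names, the host, and the token placeholder are all assumptions made for illustration; check them against the current Clean Rooms REST API reference before relying on them.

```python
import json
import urllib.request


def build_create_room_request(host, token, name, cloud_vendor, region, collaborators):
    """Build, but do not send, an HTTP request that would create a clean room.

    ASSUMPTION: the endpoint path and field names below are illustrative;
    confirm them against the current Clean Rooms REST API reference.
    """
    payload = {
        "name": name,
        "remote_detailed_info": {
            "cloud_vendor": cloud_vendor,
            "region": region,
            "collaborators": collaborators,
        },
    }
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/clean-rooms",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_create_room_request(
    host="https://example-workspace.cloud.databricks.com",
    token="<service-principal-token>",
    name="partner-acme-measurement-dev",
    cloud_vendor="AWS",
    region="us-west-2",
    collaborators=[{"collaborator_alias": "acme"}],
)
# A backend would send this with urllib.request.urlopen(request).
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to unit test and to log during the self-collaboration dry run.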

Recommended onboarding flow

Before you touch customer data, run a self-collaboration dry run in your own workspace — create a clean room where your organization is both creator and collaborator. Use representative synthetic data and your production notebooks. Validate permissions, execution behavior, output visibility, and cost before onboarding your first customer.

For customer onboarding, the recommended sequence is:

  1. Verify prerequisites — confirm the customer has Unity Catalog enabled, serverless compute enabled, and Delta Sharing enabled on their metastore. The CREATE CLEAN ROOM privilege must be granted to the service principal or user creating the room.
  2. Exchange sharing identifiers — each party's sharing identifier (metastore ID + workspace ID + user email) is what links workspaces into a room. Collect this from the customer and share yours before room creation.
  3. Create the room and attach your assets — create the clean room via the Databricks UI or REST API, then publish your tables, views, volumes, and notebooks. The room starts in a PROVISIONING state and moves to ACTIVE once both parties are connected.
  4. Customer attaches their assets — the customer adds their tables or volumes to the room on their side.
  5. Run approval and first execution — walk the customer through the notebook approval UI, or configure auto-approval rules for trusted workflows. Trigger the first run and validate outputs.
  6. Establish monitoring — configure alerting on the clean_room_events system table and review run history in the Clean Rooms UI.
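
Step 1 of the sequence above can be turned into a small pre-flight check in your onboarding tooling. This is a minimal sketch: the flag names are hypothetical, and it assumes you have already collected the facts about the customer workspace by some other means (for example, a questionnaire or an admin call).

```python
# Hypothetical flag names; each maps to one prerequisite from step 1.
REQUIRED_PREREQUISITES = (
    "unity_catalog_enabled",
    "serverless_compute_enabled",
    "delta_sharing_enabled",
    "create_clean_room_granted",
)


def missing_prerequisites(workspace_facts):
    """Return the prerequisites that are absent or false, in checklist order."""
    return [
        name
        for name in REQUIRED_PREREQUISITES
        if not workspace_facts.get(name, False)
    ]


# Example: Delta Sharing is off and the privilege was never recorded.
customer = {
    "unity_catalog_enabled": True,
    "serverless_compute_enabled": True,
    "delta_sharing_enabled": False,
}
print(missing_prerequisites(customer))
# prints ['delta_sharing_enabled', 'create_clean_room_granted']
```

Treating an unrecorded prerequisite as missing means a gap in the intake form surfaces before room creation rather than during it.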

See Create clean rooms for the full setup reference.

Service principal ownership

Always use a group-owned service principal as the clean room owner — not an individual user. This is critical: if an individual owns the clean room and leaves the organization, the room becomes inaccessible. The service principal must maintain USE CATALOG, USE SCHEMA, and SELECT (or equivalent) on all assets it publishes to the room for the lifetime of the room. Revocation of these privileges breaks the room.
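
Because a revoked privilege breaks the room, it can help to check grants on a schedule. The sketch below assumes you can already obtain the service principal's effective privileges for each published asset; the asset names and the input shape are hypothetical. It does not model "equivalent" privileges such as broader grants.

```python
# Privileges the owning service principal must keep, per the text above.
REQUIRED_PRIVILEGES = frozenset({"USE CATALOG", "USE SCHEMA", "SELECT"})


def missing_privileges(grants_by_asset):
    """Map each published asset to the required privileges it currently lacks.

    grants_by_asset: dict of asset name -> iterable of effective privileges.
    Assets with nothing missing are omitted from the result.
    """
    gaps = {}
    for asset, granted in grants_by_asset.items():
        missing = REQUIRED_PRIVILEGES - set(granted)
        if missing:
            gaps[asset] = sorted(missing)
    return gaps


# Hypothetical snapshot of effective privileges.
snapshot = {
    "main.shared.features": ["USE CATALOG", "USE SCHEMA", "SELECT"],
    "main.shared.segments": ["USE CATALOG", "USE SCHEMA"],
}
print(missing_privileges(snapshot))
# prints {'main.shared.segments': ['SELECT']}
```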


Data modeling for clean rooms

Separate PII from derived features

The single most important data modeling decision is keeping raw PII in separate tables from derived features and signals. Publish only the derived features into the clean room, never the source PII tables. This limits the blast radius if an approved notebook behaves unexpectedly, and it makes it easier to reason about what could be reconstructed from outputs.

A practical pattern: maintain a keys table (hashed identifiers only) and a features table (aggregated signals, no direct identifiers), and share only the features table plus a hashed key join column. Customers bring their own hashed key table. The clean room joins on the hashed key and never needs access to either party's raw PII.
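
The keys-plus-features pattern can be sketched in plain Python, with dictionaries standing in for tables. The hashed values, column names, and segment labels are placeholders; in a real room the join would run in a notebook over the shared tables.

```python
# Partner side: features keyed by an already-hashed identifier (no raw PII).
partner_features = {
    "hash_001": {"segment_id": "high_value"},
    "hash_002": {"segment_id": "lapsed"},
}

# Customer side: their own hashed key table.
customer_keys = {"hash_001", "hash_003"}


def join_on_hashed_key(features, keys):
    """Return features only for hashed keys present on both sides."""
    return {key: features[key] for key in sorted(keys) if key in features}


matched = join_on_hashed_key(partner_features, customer_keys)
print(matched)
# prints {'hash_001': {'segment_id': 'high_value'}}
```

Neither input contains a raw identifier, so the join result cannot expose one either.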

Design for joinability

If your product requires matching records across parties, the join key must be agreed on in advance and normalized consistently on both sides. Mismatched hashing algorithms, case normalization differences, or salt variations produce zero or low match rates that are very hard to diagnose from inside the clean room, where neither party can inspect the other's raw keys. Document your key normalization contract explicitly (which hash function, which fields, whether to lowercase, trim, or strip special characters) and validate it during onboarding before any production data is committed.
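
One way to make the normalization contract concrete is to ship it as a small reference function and compare match rates on synthetic records during onboarding. The choices below (trim, lowercase, SHA-256, optional shared salt prepended to the value) are one example contract, not a prescribed one.

```python
import hashlib


def hashed_join_key(raw_email, salt=""):
    """Example contract: trim, lowercase, prepend salt, SHA-256, hex digest."""
    normalized = raw_email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def match_rate(left_keys, right_keys):
    """Fraction of distinct left keys that also appear on the right."""
    left = set(left_keys)
    if not left:
        return 0.0
    return len(left & set(right_keys)) / len(left)


# Synthetic records with inconsistent casing and whitespace.
synthetic = [" Alice@Example.com", "BOB@example.com ", "carol@example.com"]

partner_side = [hashed_join_key(e) for e in synthetic]

# A customer that follows the contract matches fully...
customer_ok = [hashed_join_key(e.upper()) for e in synthetic]
# ...while one that hashes raw strings without normalizing matches only
# the record that already happened to be in normalized form.
customer_raw = [hashlib.sha256(e.encode("utf-8")).hexdigest() for e in synthetic]

print(match_rate(partner_side, customer_ok))
print(match_rate(partner_side, customer_raw))
```

Running the same synthetic records through both parties' pipelines before any production data is shared exposes a contract mismatch as a low match rate while the raw inputs are still inspectable.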

Column aliasing

When you publish a table into a clean room, collaborators see the column alias, not your internal column names. Use aliases to expose a clean, stable, customer-facing schema that is independent of your internal naming conventions. Treat aliases like an API contract: once a collaborator has built notebooks against them, changing an alias is a breaking change that requires re-approval and re-testing.
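
Treating aliases as a contract suggests checking proposed changes before you publish them. In this sketch the column names are invented, and a change counts as breaking only when an alias that collaborators already see disappears; renaming an internal column behind a stable alias does not.

```python
def breaking_alias_changes(published, proposed):
    """Return aliases collaborators currently see that the proposal drops.

    Both arguments map internal column name -> customer-facing alias.
    """
    return sorted(set(published.values()) - set(proposed.values()))


# Hypothetical schemas.
published = {"cust_seg_cd_v2": "segment_id", "mtch_scr": "match_score"}
proposed = {"customer_segment_code": "segment_id", "mtch_scr": "match_confidence"}

print(breaking_alias_changes(published, proposed))
# prints ['match_score']
```

Here the internal rename behind `segment_id` is safe, while replacing `match_score` with `match_confidence` would break collaborator notebooks.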

Asset naming conventions

Clean room names are immutable and must be unique across all collaborator metastores. Plan your naming scheme before you create your first room — include enough context in the name to identify the partner, use case, and environment (e.g., partner-acme-measurement-prod, partner-acme-measurement-dev). You cannot rename a clean room, and the name will appear in the other party's metastore.
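
Since names cannot be changed later, it may be worth generating them from a single helper instead of typing them by hand. The sketch follows the example scheme above; the allowed environments and the lowercase-alphanumeric rule are house conventions assumed for illustration, not platform constraints.

```python
import re

# Assumed house conventions, not platform rules.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")
_NAME_PART = re.compile(r"^[a-z0-9]+$")


def clean_room_name(partner, use_case, environment):
    """Build a name like 'partner-acme-measurement-prod' or raise ValueError."""
    parts = [p.strip().lower() for p in (partner, use_case, environment)]
    if parts[2] not in ALLOWED_ENVIRONMENTS:
        raise ValueError("unknown environment: " + repr(environment))
    for part in parts:
        if not _NAME_PART.match(part):
            raise ValueError("name parts must be alphanumeric: " + repr(part))
    return "partner-" + "-".join(parts)


print(clean_room_name("Acme", "measurement", "prod"))
# prints partner-acme-measurement-prod
```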


Cross-cloud architecture decisions

Central clean room region

When you create a clean room, you specify the cloud and region where the central clean room (CCR) runs. This is where compute executes and where output tables are temporarily materialized. Choose the region carefully:

  • Minimize egress: Place the CCR in the same region as the data-heavy party. If your tables are 10x larger than the customer's, place the CCR close to your data. Otherwise, cross-region reads of the larger dataset incur egress fees on the data-heavy side.
  • Respect data residency: Some customers will have hard requirements that data not leave a specific region or country. The CCR region must satisfy both parties' requirements. For EU customers, this typically means an EU CCR even if one party is US-based.
  • Availability: Not all Databricks regions support Clean Rooms. Confirm supported regions with your Databricks account team before committing to a customer deployment, especially for GCP or less common Azure regions.
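
The three considerations above can be combined into a simple selection rule: keep only supported regions that satisfy every party's residency constraint, then pick the one that leaves the least data outside the CCR region. This is a sketch under simplifying assumptions (one region per party, data size as the only cost proxy); the region names and sizes are made up.

```python
def choose_ccr_region(parties, supported_regions):
    """Pick the allowed region that minimizes data read across regions.

    parties: list of dicts with 'region', 'data_gb', and 'allowed_regions'
    (a set of acceptable regions, or None if the party has no constraint).
    """
    candidates = set(supported_regions)
    for party in parties:
        if party["allowed_regions"] is not None:
            candidates &= set(party["allowed_regions"])
    if not candidates:
        raise ValueError("no supported region satisfies every residency constraint")

    def cross_region_gb(region):
        # Data held by parties outside the candidate region must cross regions.
        return sum(p["data_gb"] for p in parties if p["region"] != region)

    # Sort first so ties resolve deterministically.
    return min(sorted(candidates), key=cross_region_gb)


supported = {"us-east-1", "eu-west-1", "eu-central-1"}
partner = {"region": "us-east-1", "data_gb": 1000, "allowed_regions": None}
eu_customer = {
    "region": "eu-west-1",
    "data_gb": 100,
    "allowed_regions": {"eu-west-1", "eu-central-1"},
}

print(choose_ccr_region([partner, eu_customer], supported))
# prints eu-west-1
```

In the example, residency overrides the egress preference: the partner holds ten times more data, but the customer's constraint rules out the partner's region.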

Cross-cloud cost implications

Cross-cloud collaborations (e.g., provider on AWS, customer on Azure) are supported but incur cross-cloud egress fees in addition to serverless compute costs. Serverless compute is billed to whoever runs the notebook — typically the customer in a standard clean room. Factor this into your pricing model if you are designing a managed service where you run notebooks on the customer's behalf.


Output table design

What belongs in outputs

Output tables are what collaborators take back to their own metastores. Design them with the same discipline you would apply to an API response: return only what is necessary, and make outputs auditable and explainable.

Good candidates for output tables: segment IDs, match confidence scores, offer flags, aggregated metrics, model predictions per hashed ID. These are actionable, can't be used to reconstruct raw input data, and are easy to audit.

Bad candidates: row-level joins of both parties' data, raw features that reveal the other party's schema, computed columns that could enable reverse inference of source records.

Preventing signal leakage through outputs

Even outputs that don't contain raw data can leak signal if not designed carefully. Examples: a high-precision match confidence score might reveal information about your identity graph; a per-record segment flag on a small input table could reveal segment membership by process of elimination. Consider:

  • Applying minimum group size thresholds before writing to output tables (suppress rows where the underlying group has fewer than N records)
  • Returning bucketed scores rather than continuous precision values
  • Adding noise or rounding to aggregates where the underlying population is small
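
The three mitigations above might look like this as post-processing steps applied before writing an output table. The thresholds, bucket boundaries, and column names are arbitrary illustrative choices; appropriate values depend on your data and your privacy review.

```python
from collections import Counter


def suppress_small_groups(rows, group_key, min_group_size):
    """Drop rows whose group has fewer than min_group_size members."""
    counts = Counter(row[group_key] for row in rows)
    return [row for row in rows if counts[row[group_key]] >= min_group_size]


def bucket_score(score):
    """Replace a continuous score with a coarse label (example boundaries)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def round_to_multiple(value, base):
    """Round an aggregate to the nearest multiple of base."""
    return base * round(value / base)


rows = [
    {"segment_id": "A", "score": 0.93},
    {"segment_id": "A", "score": 0.61},
    {"segment_id": "A", "score": 0.12},
    {"segment_id": "B", "score": 0.88},  # only member of its group
]

safe_rows = [
    {"segment_id": r["segment_id"], "score_bucket": bucket_score(r["score"])}
    for r in suppress_small_groups(rows, "segment_id", min_group_size=2)
]
print(safe_rows)
print(round_to_multiple(47, 10))
```

Segment B is dropped entirely because a single-row group would reveal that record's membership, and the remaining rows carry only a bucket label instead of the raw score.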

Output table lifecycle

Output tables are temporary by design: they have a managed TTL and are not permanently retained in the clean room. Build your downstream workflows to consume and persist output tables promptly after each run. If runs are scheduled, ensure your consumption pipeline can handle the case where an output table expires before it is read.
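
A consumption step that tolerates expiry could be structured as below. Everything here is a stand-in: `OutputExpiredError`, the reader, the durable store, and the alert hook are hypothetical names, and how an expired table actually surfaces in your environment is something to confirm during the dry run.

```python
class OutputExpiredError(Exception):
    """Raised by the (hypothetical) reader when the output table is gone."""


def consume_output(read_output, persist, on_expired):
    """Copy one run's output to durable storage; report expiry instead of crashing.

    Returns the number of rows persisted, or None if the output had expired.
    """
    try:
        rows = list(read_output())
    except OutputExpiredError:
        on_expired()  # for example: alert, or schedule a re-run
        return None
    persist(rows)
    return len(rows)


# In-memory stand-ins for a durable table and an alerting channel.
durable_store = []
alerts = []


def read_fresh():
    return [{"hashed_id": "hash_001", "segment_id": "s1"}]


def read_expired():
    raise OutputExpiredError("output table no longer available")


fresh_result = consume_output(read_fresh, durable_store.extend, lambda: alerts.append("expired"))
expired_result = consume_output(read_expired, durable_store.extend, lambda: alerts.append("expired"))
print(fresh_result, expired_result, alerts)
# prints 1 None ['expired']
```

Returning a distinct value for the expired case lets the scheduler decide whether to re-run the notebook or only raise an alert.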


What's next

  • Review security, including the customer security review checklist and approval workflow guidance
  • Explore use cases for vertical-specific data modeling examples
  • See Create clean rooms for step-by-step setup instructions