Use cases
Databricks Clean Rooms unlock a wide range of collaborative analytics and AI use cases across industries. The common thread: multiple parties need to derive joint insights from sensitive data without exposing it to each other.
Use case overview
| Industry | Use case |
|---|---|
| Advertising & Retail | Audience segmentation & targeting, measurement & attribution, lookalike modeling |
| Financial Services | Fraud detection & prevention, regulatory risk & compliance, targeted product development |
| Healthcare & Life Sciences | Population health research, drug discovery, genomic target identification, ML on EHR data |
| Cross-Industry | Identity resolution, cross-organizational ML model training |
Advertising & retail media
As the partner (measurement firm, identity provider, or data provider): You own the clean room, publish the notebooks, and protect your matching logic or attribution methodology as IP. You are Party 1.
As the customer (brand, retailer, or broadcaster): You publish your first-party data (purchase records, CRM, viewership) into the room. You review and run the partner's notebooks. You receive the output tables. You are Party 2.
Reference architecture: measurement partner integration
What the partner owns: The clean room, the attribution notebook, the scoring library. The customer never sees the notebook source or library contents — only the input parameters and output schema.
What the customer owns: Their conversion and sales data, and the hashed key table used for joining. They review the notebook's declared behavior (inputs, outputs) and approve it before running.
What leaves the room: Campaign performance scores, ROAS by segment, audience reach metrics — aggregated results only, no row-level data from either party.
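As a rough illustration of what "aggregated results only" can mean in practice, the sketch below rolls joined row-level records up to ROAS by segment. The field names (segment, ad_spend, revenue) and the roas_by_segment helper are assumptions made for this example; they are not taken from any partner's actual notebook.

```python
from collections import defaultdict


def roas_by_segment(rows):
    """Aggregate row-level records to ROAS (revenue / ad spend) per segment.

    `rows` is an iterable of dicts with hypothetical keys
    'segment', 'ad_spend', and 'revenue'. Only the aggregate is returned.
    """
    spend = defaultdict(float)
    revenue = defaultdict(float)
    for row in rows:
        spend[row["segment"]] += row["ad_spend"]
        revenue[row["segment"]] += row["revenue"]
    # Skip segments with zero spend to avoid division by zero.
    return {seg: revenue[seg] / spend[seg] for seg in spend if spend[seg] > 0}


# Made-up joined rows, for illustration only.
joined_rows = [
    {"segment": "A", "ad_spend": 100.0, "revenue": 250.0},
    {"segment": "A", "ad_spend": 100.0, "revenue": 150.0},
    {"segment": "B", "ad_spend": 50.0, "revenue": 25.0},
]
result = roas_by_segment(joined_rows)
```

The output table in this sketch carries one row per segment, so no individual purchase or impression record from either party appears in it.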
Audience segmentation and targeting
Retailers and media companies collaborate with brands and advertisers to build audience segments from combined first-party data — without either party exposing their customer lists.
Pattern:
- Retailer shares purchase transaction data (as tables) into the clean room
- Brand shares CRM data (as tables) into the clean room
- Approved notebook performs audience matching and segment creation
- Output table contains anonymized segment IDs, not raw PII
Try-before-you-buy evaluation
Data providers can stand up a Clean Room where prospects explore a sample or production subset of their data under strict privacy rules — validating schema, join logic, and business value before any data is exported or a contract is signed. When the prospect is ready, they graduate to a Delta Sharing subscription for ongoing delivery.
This pattern is especially useful for premium data providers, identity resolution vendors demonstrating match rates, and analytics firms showing attribution methodology before a deal closes.
Measurement and attribution
Ad platforms and publishers can run joint attribution models using impression data and conversion data without sharing the underlying datasets.
A streaming TV provider can give broadcasters access to viewership and ad impression data through an approved notebook — broadcasters get attribution insights, but never access the raw audience data directly. This enables advertisers to optimize campaigns with full data fidelity while the provider retains full privacy control.
Identity providers and measurement firms can own the notebooks and offer this as a recurring subscription service across multiple brands and retailers.
Lookalike modeling
Parties can collaboratively train lookalike models on combined customer data, with the model output (not the training data) as the clean room result.
Financial services
As the partner (data/analytics provider, fintech, or consulting firm): You typically publish the notebooks, the scoring methodology, and optionally a reference dataset (e.g., fraud signals, credit features). You are Party 1.
As the customer (bank, insurer, or fintech): You publish regulated transaction data, account records, or customer features. You run the partner's pre-approved notebooks and receive the output flags or scores. You are Party 2.
In consortium models (e.g., several banks jointly detecting fraud rings), each institution is a co-equal party. The partner may act as the neutral operator who provisions and manages the room.
Fraud detection and prevention
Financial institutions can combine transaction signals across organizations to train fraud detection models on shared behavioral patterns without exposing customer account data. Clean Rooms enable multi-cloud collaboration using approved notebooks, allowing organizations to standardize the landing zone for external data while meeting unique privacy requirements.
Consortiums of multiple financial institutions can run joint fraud detection across all parties, with each institution keeping regulated data in its own metastore. The Clean Room performs hashed joins and modeling according to strict policies — only flags and scores leave, not raw records.
Secure partner matching
A common pattern for co-branded card programs and fintech partnerships:
- Each party hashes or HMACs shared identifiers (e.g., hashed email, device ID) using the same algorithm
- Each party shares only hashed keys and derived features — not raw PII — into the Clean Room
- The Clean Room runs approved notebooks that join on hashed keys, apply eligibility or risk logic, and emit only offer flags or scores back to each side
- Each party receives only their portion of the output — no raw data from the other party crosses the boundary
Data and analytics providers can plug their datasets into this pattern as a neutral third party, and consulting partners can package it as a reusable collaboration workflow.
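The hashed-key matching steps above can be sketched with Python's standard hmac and hashlib modules. This is a minimal sketch, assuming both parties agree out of band on a shared secret and a normalization rule. The secret value, the sample emails, the scores, and the 0.5 eligibility threshold are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret, agreed between the parties outside the clean room.
SECRET = b"example-secret-agreed-out-of-band"


def normalize_email(email: str) -> str:
    """Apply the same normalization on both sides before hashing."""
    return email.strip().lower()


def hmac_key(identifier: str, shared_secret: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the identifier."""
    return hmac.new(shared_secret, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Each party prepares only hashed keys plus derived features (no raw PII).
bank_rows = {
    hmac_key(normalize_email(email), SECRET): score
    for email, score in [("Alice@Example.com ", 0.91), ("bob@example.com", 0.40)]
}
partner_keys = {
    hmac_key(normalize_email(email), SECRET)
    for email in ["alice@example.com", "carol@example.com"]
}

# Inside the room: join on hashed keys, apply (hypothetical) eligibility logic,
# and emit only a flag per matched key.
offer_flags = {
    key: score >= 0.5 for key, score in bank_rows.items() if key in partner_keys
}
```

A keyed HMAC is used here rather than a plain hash because unsalted hashes of emails can be reversed by hashing a list of known addresses. Whether that trade-off matters depends on the agreement between the parties.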
Regulatory risk and compliance
Regulated institutions can run joint compliance checks, risk models, or stress tests against combined datasets while maintaining strict data residency and access controls required by regulators.
Secure lending partner collaboration
Financial technology companies can share customer financial signals with lending partners to improve loan decisioning without exposing raw PII. Clean Rooms allow fintech providers to enforce privacy controls while seamlessly integrating with partners regardless of their platform or cloud provider.
Healthcare and life sciences
As the partner (ML firm, ISV, or CRO): You develop the training notebooks and package your algorithms as private Python wheels. You publish the notebook and library into the clean room but never access the underlying health records. You are the ML Expert.
As the customer (health system, payer, or pharma company): You own and govern the sensitive data (EHR, claims, genomics). You create the clean room, review the partner's notebook, and run it against your data. The model or output stays in your metastore. You are the Data Owner.
ML on electronic health records
Healthcare organizations can train machine learning models on sensitive EHR data without the data science team ever directly accessing the underlying records.
Pattern (two-actor model):
| Actor | Role |
|---|---|
| Data Owner | Governs EHR data, publishes tables to the clean room, runs the notebook |
| ML Expert | Develops the training code as a private Python library, publishes the library as a volume, publishes the notebook |
Step-by-step:
- Data Owner creates the clean room, invites the ML Expert using their sharing identifier
- Data Owner publishes raw EHR tables — ML Expert can see column metadata, not data
- ML Expert packages training code as a Python wheel, publishes it as a volume
- ML Expert publishes a notebook that uses the private library and outputs a trained model
- Data Owner reviews and runs the notebook — the model is the output, not the raw data
- ML Expert can update the library at any time; each update requires a new round of review
This pattern supports readmission prediction, patient outcome classification, and other sensitive clinical models.
Drug discovery and genomics
Pharmaceutical companies and CROs (contract research organizations) can collaborate on clinical trial data, genomic datasets, and observational research studies across organizational boundaries — enabling multi-party clinical analysis while maintaining strict data partitioning.
Population health research
Public health agencies and healthcare systems can run joint population health analyses across combined patient datasets without creating a centralized repository of protected health information.
Identity resolution
As the partner (identity provider): You own the identity graph, the matching notebooks, and the enrichment logic. You publish your graph as tables and your matching algorithm as a notebook (optionally hidden). You are Party 1.
As the customer (brand, retailer, or publisher): You publish a key table (hashed emails, MAIDs, device IDs) and optionally a seed audience. You run the partner's notebook. You receive enriched segment IDs or match rates as output — never the partner's raw graph. You are Party 2.
Identity resolution is one of the most common Clean Rooms use cases across industries. When joining disparate data assets, organizations need to match entities across datasets (e.g., matching an advertiser's customer list with a publisher's user graph) without sharing raw PII.
The challenge without Clean Rooms: Traditional approaches require sharing PII directly with an identity provider — creating privacy risk, compliance exposure, and dependency on third-party data movement.
With Clean Rooms, the key benefits are:
- No raw PII exposure to the identity provider
- No data movement to a third-party system
- Scalable — works across large datasets using Databricks serverless compute
- Auditable — all matching logic is captured in the approval workflow
Common identity resolution patterns
Graph extension / enrichment: A customer uploads a key table (emails, device IDs, MAIDs). The identity provider's notebook enriches with graph attributes — demographics, affinities, segment memberships — and returns only enhanced IDs or segment flags, not graph internals.
Audience overlap and lookalike: The customer passes a seed audience table. The provider computes overlap with graph segments and generates lookalike segment IDs. Only segment IDs or reach/frequency aggregates are returned — no raw cross-party data.
Multi-brand / retail media hub: The identity provider acts as a neutral hub connecting brand and retailer datasets, applying shared identifiers and audiences across multiple parties in a single room.
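To make the audience overlap pattern concrete, here is a small sketch that compares a seed audience against graph segments and returns only per-segment aggregates. The overlap_report function, the segment names, and the hashed keys are hypothetical; a real provider notebook would run over tables rather than in-memory sets.

```python
def overlap_report(seed_keys, graph_segments):
    """Return per-segment match counts and match rates for a seed audience.

    `seed_keys` is an iterable of hashed identifiers from the customer.
    `graph_segments` maps a segment ID to the hashed identifiers in it.
    Only aggregates are returned, never the matched identifiers themselves.
    """
    seed = set(seed_keys)
    report = {}
    for segment_id, members in graph_segments.items():
        matched = len(seed & set(members))
        report[segment_id] = {
            "matched": matched,
            "match_rate": matched / len(seed) if seed else 0.0,
        }
    return report


# Made-up hashed keys, for illustration only.
seed_audience = ["h1", "h2", "h3", "h4"]
graph = {"seg_sports": ["h1", "h2", "h9"], "seg_travel": ["h4"]}
report = overlap_report(seed_audience, graph)
```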
Identity providers can package these patterns using a Codeless Clean Room approach so customers see a simple guided interface — not schemas or notebook code. Leading identity resolution providers offer these capabilities natively through Databricks Clean Rooms.
Productization patterns
Most partners start with a one-off clean room for a single customer. Turning that into a scalable, repeatable product requires deliberate design from the beginning.
Templating notebooks across customers
Your core business logic — a scoring model, a matching algorithm, an attribution framework — is the same across every customer. What changes is which tables the customer brings, what they name their columns, and what outputs they need. Design your notebooks from day one to accept parameterized inputs:
- Use notebook parameters for table names, schema names, and column mappings rather than hardcoding them
- Document the expected input schema as a contract — column names, data types, join key format — so customers know exactly what to prepare
- Keep a versioned notebook library internally: v1.2.0 of your attribution notebook is what you deploy to all active rooms, and an upgrade is a deliberate release event, not a silent file change
This approach lets you manage dozens of active rooms without maintaining dozens of divergent notebook variants.
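One way to express the input contract and parameterization described above is sketched below. The contract columns, the NotebookParams class, and validate_input are illustrative assumptions. In a real notebook the parameter values would come from notebook parameters rather than being constructed inline.

```python
from dataclasses import dataclass, field

# Hypothetical input contract: contract column name -> expected type name.
EXPECTED_SCHEMA = {
    "hashed_email": "string",
    "order_ts": "timestamp",
    "order_value": "double",
}


@dataclass
class NotebookParams:
    """Per-customer inputs that would otherwise be hardcoded."""
    catalog: str
    schema: str
    table: str
    # Customer's column name -> contract column name.
    column_mapping: dict = field(default_factory=dict)

    @property
    def full_table_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def validate_input(params, customer_columns):
    """Return a list of contract violations (an empty list means OK)."""
    mapped = {
        params.column_mapping.get(name, name): dtype
        for name, dtype in customer_columns.items()
    }
    problems = []
    for column, expected in EXPECTED_SCHEMA.items():
        if column not in mapped:
            problems.append(f"missing column: {column}")
        elif mapped[column] != expected:
            problems.append(f"{column}: expected {expected}, got {mapped[column]}")
    return problems


params = NotebookParams(
    "customer_cat", "sales", "orders", {"email_sha256": "hashed_email"}
)
columns = {"email_sha256": "string", "order_ts": "timestamp", "order_value": "int"}
problems = validate_input(params, columns)
```

Running a check like this at the top of the notebook surfaces contract mismatches as a readable message instead of a failure deep inside the scoring logic.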
Managing a fleet of clean rooms
When you have many customers, each in their own clean room, managing them manually through the UI becomes unworkable. Build operational tooling from the start:
- Use the Databricks REST API or UI to create rooms, attach assets, and manage approvals — not manual one-off setup per customer
- Maintain a registry (a simple database table or config file) mapping customer ID → clean room name → current notebook etag → last run timestamp
- Build a deployment script that, when you release a new notebook version, iterates over your registry and publishes the update to all active rooms. Notify customer operators that re-approval is needed before the new version can run
- Set up monitoring on clean_room_events for each room: failed runs, unexpected auto-approvals, or long periods without any run activity are all signals worth alerting on
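The registry and deployment loop from the list above might look roughly like the following. This is a sketch under stated assumptions: RoomRecord and roll_out are hypothetical names, and the publish callable is a stand-in supplied by the caller. It does not represent an actual Databricks API call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoomRecord:
    """One registry row: customer -> clean room -> current notebook etag."""
    customer_id: str
    clean_room_name: str
    notebook_etag: str
    last_run: Optional[str] = None  # ISO timestamp of the last run, if any
    active: bool = True


def roll_out(registry, publish: Callable[[str], str]):
    """Publish the new notebook to every active room.

    `publish` is a caller-supplied (hypothetical) function that pushes the
    notebook to the named clean room and returns the new etag.
    Returns the customer IDs whose operators need a re-approval notice.
    """
    to_notify = []
    for record in registry:
        if not record.active:
            continue
        record.notebook_etag = publish(record.clean_room_name)
        to_notify.append(record.customer_id)
    return to_notify


registry = [
    RoomRecord("cust-001", "room_acme", "etag-old"),
    RoomRecord("cust-002", "room_globex", "etag-old", active=False),
]
# Stand-in publisher for illustration; a real one would call your tooling.
notified = roll_out(registry, publish=lambda room: f"etag-new-{room}")
```

Keeping the publish step behind a callable makes the loop easy to dry-run against a fake publisher before it touches any customer room.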
Handling notebook version upgrades at scale
Every notebook update in Clean Rooms requires a new approval from the other party before it can run. In a one-customer context this is manageable. Across fifty customers it requires coordination.
A workable process:
- Release candidate notebook is tested in your internal dry-run room
- A release communication goes to customer operators explaining what changed and what they need to re-approve
- You push the new notebook version to all rooms via the REST API or by updating each room through the UI
- Customers approve on their timeline — old versions remain runnable until they upgrade
- After a cutover window, you deprecate support for older versions and confirm all rooms are on the current etag
Treat this identically to how you manage API version deprecations — with advance notice, a transition window, and a hard cutover date.
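The final confirmation step in the process above can be reduced to a simple check over the registry. The sketch below assumes a hypothetical mapping of room name to deployed etag and a cutover date chosen by you. None of these names come from the product.

```python
from datetime import date


def cutover_status(room_etags, current_etag, cutover, today):
    """Summarize which rooms are still behind and whether action is due.

    `room_etags` maps a clean room name to the etag it currently runs.
    """
    behind = sorted(
        room for room, etag in room_etags.items() if etag != current_etag
    )
    past_cutover = today > cutover
    return {
        "behind": behind,
        "past_cutover": past_cutover,
        # Follow-up is needed only once the transition window has closed.
        "action_required": bool(behind) and past_cutover,
    }


# Made-up registry contents and dates, for illustration only.
room_etags = {
    "room_acme": "etag-v2",
    "room_globex": "etag-v1",
    "room_initech": "etag-v2",
}
status = cutover_status(
    room_etags, "etag-v2", cutover=date(2025, 6, 30), today=date(2025, 7, 1)
)
```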
Multi-tenant billing and cost allocation
Serverless compute used by Clean Rooms is billed to whoever triggers the notebook run. In the standard model, the customer runs the notebook — compute cost lands in their Databricks account. If you operate rooms where your service principal triggers runs on behalf of customers, the cost lands in your account.
Think through your pricing model before your first customer conversation:
- Pass-through: Customers run their own notebooks; they absorb compute cost directly. Your pricing covers the value of your data and IP, not compute
- Bundled: You run notebooks on behalf of customers; you absorb compute cost and build it into your subscription pricing. Simpler for customers, but requires you to model and cap your compute exposure
- Hybrid: Customers run standard workloads; you trigger premium or high-volume runs as a managed service add-on
Estimate expected run frequency and data volumes upfront. Serverless compute for Clean Rooms is billed per DBU; a room running hourly attribution jobs on large tables has meaningfully different economics than a room running a weekly match job on a small key table.
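A back-of-the-envelope estimate makes the difference between those two workloads visible. Every number below (DBUs per run and price per DBU) is a made-up placeholder, not a Databricks rate. Substitute figures from your own contract and observed runs.

```python
def monthly_compute_cost(runs_per_month, dbus_per_run, price_per_dbu):
    """Estimate monthly compute cost as runs x DBUs per run x price per DBU."""
    return runs_per_month * dbus_per_run * price_per_dbu


# Placeholder inputs for illustration only.
PRICE_PER_DBU = 0.50  # hypothetical price, in your billing currency

# Hourly attribution job on large tables: 24 runs/day over a 30-day month.
hourly_attribution = monthly_compute_cost(24 * 30, 8.0, PRICE_PER_DBU)

# Weekly match job on a small key table: about 4 runs per month.
weekly_match = monthly_compute_cost(4, 0.5, PRICE_PER_DBU)
```

With these placeholder inputs the hourly job comes to 2880.0 per month and the weekly job to 1.0. That gap is what a bundled pricing model has to absorb or cap.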
What's next
- Understand the architecture before building your first clean room
- Review the security and IP protection model to prepare for discussions with collaborators
- See Create clean rooms for hands-on setup
- Read the privacy-centric ML blog post for a detailed walkthrough of the EHR use case