Software-Defined Storage

Unlocking On-Premises and Hybrid Data Estates with Delta Sharing

The Software-Defined Storage blueprint outlines an open-source-led solution to bridge a persistent gap in enterprise data architecture: the lack of consistent governance between established on-premises data estates and modern cloud-native data platforms.

The proposed solution centers on embedding the OSS Delta Sharing protocol natively within on-premises storage vendors' products, enabling governed, real-time data sharing without requiring data migration or replication. This approach is positioned as a foundational step for storage partners to progressively integrate with the Databricks Data Intelligence Platform.

Why Hybrid Governance?

Modern enterprises operate in an inherently hybrid world. Despite the widespread adoption of cloud data platforms, a significant portion of enterprise data — particularly at petabyte scale — remains anchored on-premises.

Three converging forces explain why this data remains local and why the problem is growing rather than shrinking:

Regulation: Data Sovereignty Laws Force Data to Stay Local

Data sovereignty legislation exists in over 120 countries, including frameworks such as GDPR in Europe, LGPD in Brazil, and PDPA across Southeast Asia. These regulations impose strict requirements on where data can reside, how it can be transferred across borders, and who can access it. For multinational enterprises, this means that large segments of their data estate are legally prohibited from being moved to public cloud infrastructure. This is not a transitional constraint — it is a permanent architectural reality that governance tooling must accommodate.

Economics: On-Premises Storage Is Often More Cost-Effective at Scale

For predictable, high-volume workloads, petabyte-scale on-premises object storage delivers a significantly lower total cost of ownership compared to equivalent cloud storage. As cloud egress costs, storage pricing, and data gravity become increasingly relevant concerns, many enterprises are actively repatriating data from cloud back to on-premises infrastructure. This trend is accelerating, not reversing, which means cloud-first governance assumptions are becoming increasingly misaligned with enterprise reality.

Governance Gap: No Parity Between Cloud and On-Premises Governance

Cloud data platforms like Databricks Unity Catalog provide comprehensive governance capabilities: fine-grained access control, automated policy enforcement, data lineage tracking, and audit logging — all managed through a unified catalog.

On-premises data today lacks any equivalent. Data sitting in on-premises object stores is effectively invisible to these governance systems. There is no standardized mechanism for applying the same access control policies, capturing lineage, or enforcing compliance rules across the cloud boundary. This gap represents both a compliance risk and a significant barrier to data sharing and monetization.

Delta Sharing

Delta Sharing is an open protocol developed by Databricks and donated to the Linux Foundation. It is designed to address the challenge of securely sharing live data across organizational and infrastructure boundaries — without copying or moving the underlying data. Unlike proprietary data exchange mechanisms, Delta Sharing is built on open standards, making it interoperable across a wide range of compute engines, query tools, and storage backends.

Key characteristics of the protocol include:

  1. No vendor lock-in: Delta Sharing is an open specification. Any storage vendor, compute engine, or analytics tool can implement it without licensing constraints or dependency on a single vendor's ecosystem.
  2. Live data, no replication: Data is shared in place. Consumers access the latest version of a dataset directly from object storage, eliminating the latency and cost of ETL pipelines, data copies, or staging environments.
  3. Built-in security controls: Authentication and authorization are handled at the protocol level. The sharing server enforces access policies before generating short-lived, pre-signed URLs, ensuring that consumers can only access data they are explicitly permitted to read.
  4. Cross-environment compatibility: Delta Sharing works across on-premises infrastructure, private cloud, and all major public clouds (AWS, Azure, GCP), making it well-suited for hybrid and multi-cloud architectures.
  5. Broad consumer ecosystem: The protocol is supported by a wide range of client tools and frameworks, including Tableau, Apache Spark, Pandas, Power BI, Microsoft Excel, Java, and Databricks itself, meaning data consumers do not need to change their existing tooling to benefit from shared data.
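To make the consumer side concrete, the sketch below builds a `.share` profile file and the table coordinate a client would use. The profile field names (`shareCredentialsVersion`, `endpoint`, `bearerToken`) follow the Delta Sharing protocol's profile file format; the endpoint, token, and share/schema/table names are placeholders. The actual client call is left commented out because it requires a live sharing server.

```python
import json, os, tempfile

# Illustrative .share profile; in practice it is provisioned by the data
# provider and distributed out-of-band to the recipient.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # placeholder
    "bearerToken": "<token-provisioned-out-of-band>",
}
profile_path = os.path.join(tempfile.mkdtemp(), "config.share")
with open(profile_path, "w") as f:
    json.dump(profile, f)

# Clients address a table as <profile-path>#<share>.<schema>.<table>.
table_url = f"{profile_path}#retail_share.sales.orders"

# With the open-source client installed (pip install delta-sharing) and a
# live server behind the endpoint, the table loads into pandas directly:
#   import delta_sharing
#   df = delta_sharing.load_as_pandas(table_url)
print(table_url)
```

The same profile works unchanged from Spark, Power BI, or any other connector in the ecosystem, which is the point of the open specification.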

Architecture

The Delta Sharing protocol defines a stateless client-server interaction model built on top of standard HTTPS and object storage pre-signed URL mechanisms. The architecture is intentionally simple, which contributes to its broad adoptability.

The data sharing flow proceeds as follows:

  1. Authentication: The Delta Sharing client presents a bearer token to the Delta Sharing Server. This token is provisioned out-of-band and stored in a .share profile file distributed to the recipient. It identifies the consumer and scopes their access to specific shares.
  2. Table Request: The authenticated client issues a request for a specific table, identified by its share, schema, and table name. The request can optionally include partition filters or version predicates to limit the scope of the response.
  3. Access Control Enforcement: The server evaluates the request against its access control policies. In Databricks' implementation, this is backed by Unity Catalog, which provides fine-grained, attribute-based access control. In OSS implementations, access control can be managed through configuration files or integrated with external identity providers.
  4. Pre-Signed URL Generation: If the request is authorized, the server generates a set of pre-signed, short-lived URLs pointing directly to the underlying Parquet files in object storage. These URLs are time-bounded, typically expiring within minutes, limiting the window of exposure if intercepted.
  5. Direct Object Storage Access: The client uses the pre-signed URLs to read Parquet files directly from object storage — bypassing the server entirely for the data transfer itself. This design means the server handles only metadata and access control, while data transfer happens directly between the client and storage, maximizing throughput and minimizing server-side load.

This architecture is deliberately storage-agnostic. The server abstracts the underlying storage backend, meaning it can generate pre-signed URLs for S3, Azure Blob Storage, GCS, or any S3-compatible store such as MinIO.

Serving Iceberg Tables via Delta Sharing (On-Premises)

A key technical challenge is that the Delta Sharing protocol natively speaks Delta Lake format — its server logic is built around reading Delta transaction logs (_delta_log) to resolve table snapshots and identify the relevant Parquet files. Iceberg tables use a different metadata format (manifest files and snapshot trees), which the standard Delta Sharing Server does not understand natively.

Two approaches are available to bridge this gap in on-premises deployments:

Path 1: Convert to Delta Lake

The most straightforward approach is to convert the source table to Delta Lake format before exposing it via Delta Sharing. This can be done in two ways:

  1. Direct conversion: For Hive-managed tables, Spark can write the table into Delta format, generating the _delta_log alongside the existing Parquet data files.
  2. Delta Shallow Clone: For Iceberg tables, a shallow clone creates a Delta Lake metadata layer that references the existing Parquet files in place, without physically copying data. This minimizes storage overhead while making the table readable by a Delta Sharing Server.
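To make the "metadata layer over existing files" idea concrete, the sketch below hand-writes a minimal Delta transaction log whose `add` entries point at pre-existing Parquet paths, the way a shallow clone references source files in place. This is a deliberately simplified illustration of the log's JSON-lines structure (paths and sizes are placeholders), not a substitute for Spark's `CONVERT TO DELTA` or `CREATE TABLE ... SHALLOW CLONE`, which also write protocol, schema, and statistics entries.

```python
import json, os, tempfile

# Pre-existing data files that must not be copied or moved (placeholder paths).
existing_files = [
    "s3://warehouse/sales/orders/part-0000.parquet",
    "s3://warehouse/sales/orders/part-0001.parquet",
]

# The clone's _delta_log lives next to (or apart from) the data it references.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# Version 0 of the log: one JSON action per line. A real log also carries
# protocol and metaData actions (reader/writer versions, table schema).
actions = [{"add": {"path": p, "partitionValues": {}, "size": 0,
                    "modificationTime": 0, "dataChange": True}}
           for p in existing_files]
log_path = os.path.join(log_dir, "00000000000000000000.json")
with open(log_path, "w") as f:
    f.write("\n".join(json.dumps(a) for a in actions))

# The "cloned" table's metadata references the original files in place:
log = open(log_path).read()
assert all(p in log for p in existing_files)
```

The storage cost of the clone is therefore just this metadata; the Parquet files are shared with the source table.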

Path 2: Apache XTable Metadata Translation

For environments where the source table must remain in Iceberg format — for example, because other consumers depend on the Iceberg catalog — Apache XTable provides a non-destructive metadata translation layer. XTable reads the Iceberg metadata and generates an equivalent _delta_log folder alongside it, making the table simultaneously readable as both Iceberg and Delta Lake. The Delta Sharing Server can then serve the table using the generated Delta metadata, while Iceberg-native consumers continue reading it through the Iceberg catalog without any disruption. No data files are duplicated — only metadata is translated.

This second path is particularly valuable in heterogeneous environments where multiple table formats coexist and where a single source of truth for the data files is required.
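As a rough sketch of how such a translation run is configured: the key names below follow Apache XTable's documented sync configuration, but the paths and table name are placeholders, so consult the XTable documentation for the authoritative schema and the exact command used to invoke its sync utility.

```yaml
# Illustrative XTable sync configuration: translate Iceberg metadata to
# Delta in place, leaving the Parquet data files untouched.
sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  - tableBasePath: s3://warehouse/sales/orders   # placeholder path
    tableName: orders                            # placeholder name
```

Re-running the sync after new Iceberg commits keeps the generated `_delta_log` in step with the Iceberg snapshot tree.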

What It Unlocks

By embedding Delta Sharing natively in on-premises storage, this blueprint unlocks several high-value data use cases that are currently blocked by the governance gap:

  • Tabular Lakehouse Data Sharing: Organizations can share live Delta, Iceberg, and Hudi tables across organizational boundaries for enrichment and analytical use cases — without ETL pipelines, data transfers, or format conversions visible to the end consumer.
  • Marketplace Data Collaboration: Data producers can expose curated datasets to external partners or customers through a governed, auditable sharing mechanism. This is the foundation for data marketplace business models where data is a monetizable asset.
  • Data Clean Rooms: Two or more parties can share sensitive datasets for joint analysis without either party gaining direct access to the other's raw data. Delta Sharing's access control model, combined with compute environments that enforce query restrictions, enables privacy-preserving collaboration on regulated or confidential data.
  • Unstructured Data (under consideration): While the current protocol focuses on tabular data in Parquet format, extending Delta Sharing to cover unstructured data — such as documents, images, or media files stored in object storage — is an area of active exploration. The pre-signed URL mechanism is format-agnostic, making this a natural extension of the protocol.

This Software-Defined Storage blueprint presents a simple, pragmatic way to extend enterprise-grade data governance beyond the cloud boundary. By leveraging the open Delta Sharing protocol as the integration layer, on-premises storage vendors can become active participants in the modern data lakehouse ecosystem — enabling governed, real-time data sharing without requiring data migration.

For storage partners, this represents both a technical integration path and a strategic positioning opportunity: embedding Delta Sharing natively transforms on-premises storage from a data silo into a governed, interoperable node in the broader Databricks Data Intelligence Platform.
