Data as a Product

Treat your data as a product, not just a dataset. A strong data product is curated, documented, and maintained with clear ownership and purpose. Start by identifying who will use the data and how. Define your target personas, their business objectives, and the primary use cases the data is meant to support.

What is a data product?

A data product is a dataset packaged for consumption. Unlike raw data extracts or ad-hoc exports, it comes with defined SLAs, versioning, quality standards, and documentation that make it immediately actionable.

Unity Catalog layout

Use a clean object layout pattern in Unity Catalog:

  • Catalog: The product or brand boundary
  • Schemas: Organized by domain
  • Tables: The raw facts and dimensions customers will query
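A minimal sketch of this layout, assuming hypothetical names (sales_product, finance, fct_orders) and a Databricks notebook, where spark is the built-in SparkSession:

    # Catalog = the product boundary
    spark.sql("CREATE CATALOG IF NOT EXISTS sales_product")

    # Schemas = one per domain
    spark.sql("CREATE SCHEMA IF NOT EXISTS sales_product.finance")
    spark.sql("CREATE SCHEMA IF NOT EXISTS sales_product.operations")

    # Tables = the facts and dimensions consumers will query
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_product.finance.fct_orders (
            order_id    BIGINT,
            order_ts    TIMESTAMP,
            amount_usd  DECIMAL(18, 2)
        )
    """)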

If you share an entire schema, recipients automatically get access to all assets currently in it and anything you add later (tables, views, volumes, functions, and models). Treat schema boundaries as part of your product contract.
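If you use Delta Sharing, the schema-level grant looks roughly like this (the share and recipient names are hypothetical):

    # Adding a schema to a share exposes everything in it, now and later.
    spark.sql("CREATE SHARE IF NOT EXISTS sales_share")
    spark.sql("ALTER SHARE sales_share ADD SCHEMA sales_product.finance")
    spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_recipient")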

Conventions that reduce consumer friction

  • Use consistent timestamp types and timezones (and document them)
  • Avoid string-typed columns for dates and numerics
  • Keep column naming predictable (snake_case is common)
  • Prefer stable IDs over names for joins (names change; IDs shouldn't)
  • Keep enumerations and codes in reference tables
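An expanded version of the hypothetical fct_orders table with these conventions applied (all names are illustrative):

    # snake_case names, typed timestamps, stable join keys, and a
    # reference table instead of free-text status strings.
    spark.sql("""
        CREATE OR REPLACE TABLE sales_product.finance.fct_orders (
            order_id    BIGINT         COMMENT 'Stable surrogate key; use for joins',
            customer_id BIGINT         COMMENT 'FK to dim_customers.customer_id',
            order_ts    TIMESTAMP      COMMENT 'Order creation time, stored in UTC',
            status_code STRING         COMMENT 'FK to ref_order_status.status_code',
            amount_usd  DECIMAL(18, 2) COMMENT 'Order total in USD'
        )
    """)
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_product.finance.ref_order_status (
            status_code STRING COMMENT 'Enumeration key',
            description STRING COMMENT 'Human-readable meaning'
        )
    """)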

Metadata and AI readiness

Unity Catalog allows you to include detailed metadata such as table and column descriptions, tags, business terms, and data relationships. Rich metadata makes a dataset usable, searchable, and valuable across multiple tools and teams.
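For example, descriptions and tags can be attached with standard SQL, continuing the hypothetical fct_orders table:

    # Table- and column-level descriptions plus discoverability tags.
    spark.sql("""
        COMMENT ON TABLE sales_product.finance.fct_orders IS
        'One row per customer order; refreshed hourly from the order service'
    """)
    spark.sql("""
        ALTER TABLE sales_product.finance.fct_orders
        ALTER COLUMN order_ts COMMENT 'Order creation time in UTC'
    """)
    spark.sql("""
        ALTER TABLE sales_product.finance.fct_orders
        SET TAGS ('domain' = 'finance', 'certified' = 'true')
    """)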

Good metadata also makes your shared data easier to use with Databricks Genie and other AI-powered tools. When Genie can read complete and accurate metadata, it can help users query and explore your data through natural language, reducing the effort required to get insights and making your data accessible to a wider audience.

Think of metadata as the foundation for AI readiness. When your data is self-describing and semantically rich, it becomes far easier for both humans and AI systems to interpret and apply it effectively. For a comprehensive checklist, see AI readiness.

Change Data Feed

Enable Change Data Feed (CDF) where it makes sense, to support incremental reads and automation. Plan for schema evolution and backward compatibility so that updates do not break consumer workloads.
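Enabling CDF is a single table property, shown here on the hypothetical fct_orders table:

    # Enable CDF on an existing table; for new tables, set the same
    # property in TBLPROPERTIES at creation time.
    spark.sql("""
        ALTER TABLE sales_product.finance.fct_orders
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)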

When to enable CDF

  • High-volume tables where full refresh is expensive (egress + compute) and most changes are incremental
  • Automation and integration patterns where consumers want a scheduled job that pulls only new or changed rows (see the read sketch after this list)
  • Slowly changing dimensions (SCD), corrections, or late-arriving data where you need to represent insert, delete, and update events explicitly, including update pre/post images
  • Operational and audit needs where you want an immutable history of row-level change events between table versions
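
On the consumer side, an incremental pull might look like this sketch. The starting version (5 here) is a placeholder for a checkpoint the job would persist between runs:

    # Read only the rows that changed since the last processed version.
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)  # placeholder checkpoint value
        .table("sales_product.finance.fct_orders")
    )

    # CDF adds _change_type, _commit_version, and _commit_timestamp columns;
    # _change_type is insert, delete, update_preimage, or update_postimage.
    changes.filter("_change_type != 'update_preimage'").show()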

What's next