Data as a Product
Treat your data as a product, not just a dataset. A strong data product is curated, documented, and maintained with clear ownership and purpose. Start by identifying who will use the data and how. Define your target personas, their business objectives, and the primary use cases the data is meant to support.
Unlike raw data extracts or ad-hoc exports, a data product is packaged for consumption—with defined SLAs, versioning, quality standards, and documentation that makes it immediately actionable.
Unity Catalog layout
Use a clean object layout pattern in Unity Catalog:
- Catalog: The product or brand boundary
- Schemas: Organized by domain
- Tables: The raw facts and dimensions customers will query
If you share an entire schema, recipients automatically get access to every asset currently in it and anything you add later (tables, views, volumes, and models). Treat schema boundaries as part of your product contract.
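A minimal sketch of that layout, run from a notebook in a Unity Catalog-enabled workspace with Delta Sharing; the catalog, schema, table, and share names are illustrative, not prescribed:

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` is predefined; this line keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Catalog = product boundary, schema = domain, tables = facts and dimensions.
spark.sql("CREATE CATALOG IF NOT EXISTS acme_retail")
spark.sql("CREATE SCHEMA IF NOT EXISTS acme_retail.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS acme_retail.sales.fact_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        ordered_at  TIMESTAMP,
        amount      DECIMAL(18, 2)
    )
""")

# Sharing a whole schema exposes everything in it, now and in the future,
# so the schema boundary is part of the product contract.
spark.sql("CREATE SHARE IF NOT EXISTS acme_retail_share")
spark.sql("ALTER SHARE acme_retail_share ADD SCHEMA acme_retail.sales")
```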
Conventions that reduce consumer friction
- Use consistent timestamp types and timezones (and document them)
- Avoid string-typed columns for dates and numerics
- Keep column naming predictable (snake_case is common)
- Prefer stable IDs over names for joins (names change; IDs shouldn't)
- Keep enumerations and codes in reference tables
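For example, a table definition that follows these conventions might look like the following; all names and comments are hypothetical, and `spark` is the session from the sketch above:

```python
# Illustrative DDL applying the conventions above: typed timestamps with a
# documented timezone, snake_case names, stable IDs for joins, and codes
# backed by a reference table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS acme_retail.sales.fact_shipments (
        shipment_id BIGINT    COMMENT 'Stable surrogate key; join on IDs, not names',
        order_id    BIGINT    COMMENT 'References fact_orders.order_id',
        status_code STRING    COMMENT 'Allowed values in ref_shipment_status',
        shipped_at  TIMESTAMP COMMENT 'UTC',
        weight_kg   DECIMAL(10, 3)
    )
""")

# Enumerations and codes live in a reference table, not free-text columns.
spark.sql("""
    CREATE TABLE IF NOT EXISTS acme_retail.sales.ref_shipment_status (
        status_code STRING COMMENT 'Enumeration value',
        description STRING
    )
""")
```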
Metadata and AI readiness
Unity Catalog allows you to include detailed metadata such as table and column descriptions, tags, business terms, and data relationships. Rich metadata makes a dataset usable, searchable, and valuable across multiple tools and teams.
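As a sketch, you might attach descriptions and tags like this; the table name and tag keys are illustrative:

```python
# Table-level description.
spark.sql(
    "COMMENT ON TABLE acme_retail.sales.fact_orders IS "
    "'One row per customer order; refreshed hourly.'"
)

# Column-level description.
spark.sql(
    "ALTER TABLE acme_retail.sales.fact_orders "
    "ALTER COLUMN amount COMMENT 'Order total in USD, tax included'"
)

# Tags for discovery and governance.
spark.sql(
    "ALTER TABLE acme_retail.sales.fact_orders "
    "SET TAGS ('domain' = 'sales', 'certified' = 'true')"
)
```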
Good metadata also makes your shared data easier to use with Databricks Genie and other AI-powered tools. When Genie can read complete and accurate metadata, it can help users query and explore your data through natural language, reducing the effort required to get insights and making your data accessible to a wider audience.
Think of metadata as the foundation for AI readiness. When your data is self-describing and semantically rich, it becomes far easier for both humans and AI systems to interpret and apply it effectively. For a comprehensive checklist, see AI readiness.
Change Data Feed
Enable Change Data Feed (CDF) where it makes sense, so consumers can run incremental reads and build automation on top of your tables. Plan for schema evolution and backward compatibility so that updates do not break consumer workloads.
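CDF is a table property. A sketch for an existing table, reusing the illustrative table name from the earlier examples:

```python
# Turn on the change feed for an existing table. For new tables, set the
# same property in TBLPROPERTIES at CREATE TABLE time.
spark.sql("""
    ALTER TABLE acme_retail.sales.fact_orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```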
When to enable CDF
- High-volume tables where full refresh is expensive (egress + compute) and most changes are incremental
- Automation and integration patterns where consumers want a scheduled job that pulls only new or changed rows (see the consumer-side sketch after this list)
- Slowly changing dimensions (SCD), corrections, or late-arriving data where you need to represent insert, delete, and update events explicitly, including update pre/post images
- Operational and audit needs where you want an immutable history of row-level change events between table versions
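On the consumer side, an incremental pull might look like the following sketch; the starting version is a hypothetical checkpoint the consumer persists between runs:

```python
# Read only the rows that changed since the last processed table version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)  # hypothetical consumer-side checkpoint
    .table("acme_retail.sales.fact_orders")
)

# _change_type distinguishes insert, delete, update_preimage, update_postimage;
# drop pre-images if you only need the new state of updated rows.
changes.filter("_change_type != 'update_preimage'").show()
```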
What's next
- Review the AI readiness checklist to optimize for Genie and AI tools
- Learn about the different share types you can offer
- Explore Databricks-to-Databricks (D2D) sharing patterns for structured and unstructured data
- Set up dynamic views for fine-grained access control