Cost Management
Cost management is fundamental to understanding margins, pricing customers accurately, and scaling profitably. Databricks provides a comprehensive suite of tools for monitoring, attributing, and controlling costs across your data and AI workloads. By implementing proper cost management practices from the start, you can gain visibility into per-customer spending, accurately attribute usage to customers and internal operations, and proactively manage budgets.
For additional tips, see Easy Ways to Optimize Your Costs.
Overview
The Databricks cost management framework is built around four key capabilities:
| Capability | Description |
|---|---|
| System Tables | Billable usage logs stored in system.billing.usage provide granular details about account usage, including metadata about resources, custom tags, and user identity |
| Tagging | Custom tags enable accurate attribution of Databricks usage to business units, teams, and projects for chargeback purposes |
| Budgets | Create budget thresholds with email alerts to stay informed about usage across your account, workspaces, or specific tag-based groups |
| AI/BI Dashboards | Pre-built cost management dashboards and AI/BI Genie spaces for visualizing and exploring usage data |
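To make these capabilities concrete, the following minimal query sketches what the billable usage log exposes. Column names such as custom_tags, usage_metadata, and identity_metadata reflect the current system table schema and should be verified against the documentation; the query assumes a Unity Catalog-enabled workspace with SELECT access on the system.billing schema:
SELECT
  usage_date,
  sku_name,
  usage_quantity,
  usage_unit,
  custom_tags,
  usage_metadata.cluster_id,
  identity_metadata.run_as
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE() - INTERVAL 7 DAYS
LIMIT 100;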
Tagging Strategy
For Built On Databricks solutions, the tagging strategy should be a part of system design—not an operational afterthought. Whether you charge customers via usage-based pricing or flat subscriptions, attribution is essential for understanding true cost, gross margin, and scalability over time.
If you are a Built-On partner already, refer to guidance in the Partner Portal on the tagging requirements for the Built-On program.
Design-Time, Not Retrofit
Tagging should be designed alongside your pricing and packaging, and built into your automation from day one. Retroactive tagging is incomplete, error-prone, and often impossible at scale.
Key implications:
- Tagging decisions directly affect billing accuracy, margin visibility, and contract viability
- Manual tagging does not scale—automated enforcement is necessary
- Tags should be applied programmatically at resource creation (clusters, jobs, SQL warehouses, serverless workloads)
This differs from internal cost management models that can afford to "start small and iterate." Partners building commercial solutions need attribution in place before onboarding the first customer.
How Tagging Works
Tagging operates at the compute level—clusters, jobs, SQL warehouses, and serverless workloads are the resources that generate DBU consumption and carry attribution tags.
Tags follow a parent-child relationship: tags applied at the workspace level are inherited by the compute resources inside that workspace:
| Deployment Model | Tagging Approach |
|---|---|
| Workspace per customer | Tag at the workspace level; all underlying compute automatically inherits those tags. Simplifies attribution since all usage in the workspace belongs to one customer. |
| Shared workspace (multi-customer) | Tag at the compute level per customer. If you require per-customer attribution, provision dedicated compute resources for each customer. |
Your deployment model and tagging strategy are interrelated decisions. While workspace-per-customer simplifies attribution through inheritance, shared workspaces offer infrastructure efficiency—but require disciplined compute-level tagging.
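The two models surface differently in the billing data. The sketch below is illustrative: the customer_workspace_map table is a hypothetical mapping you would maintain yourself for workspace-per-customer attribution, while the second query relies only on the customer_id tag:
-- Workspace-per-customer: join usage to your own (hypothetical) workspace-to-customer mapping
SELECT m.customer_name, SUM(u.usage_quantity) AS total_dbus
FROM system.billing.usage u
JOIN customer_workspace_map m ON u.workspace_id = m.workspace_id
WHERE u.usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY m.customer_name;

-- Shared workspace: attribute by the customer_id tag carried on each compute resource
SELECT custom_tags['customer_id'] AS customer_id, SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY custom_tags['customer_id'];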
Supported Resources for Custom Tags
| Resource | UI Tagging | API Tagging |
|---|---|---|
| Workspace | Azure Only | Account API |
| Pool | Pools UI | Instance Pool API |
| All-purpose & Job Compute | Compute UI | Clusters API |
| SQL Warehouse | SQL Warehouse UI | Warehouses API |
| Database Instance | Database Instance UI | Database API |
| Serverless Workloads | Account Console | Serverless Budget Policies |
For detailed guidance, see Use tags to attribute and track usage.
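Pools are the one resource above that the later examples do not cover. The following is a minimal sketch using the Python SDK's instance pools API; names and capacity values are illustrative, so verify parameters against the Instance Pool API documentation:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create an instance pool carrying attribution tags
pool = w.instance_pools.create(
    instance_pool_name="customer-acme-pool",
    node_type_id="i3.xlarge",
    min_idle_instances=1,
    max_capacity=10,
    custom_tags={
        "customer_id": "acme_corp",
        "environment": "production",
    },
)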
Implementation Steps
Step 1: Set Up Tagging Best Practices
Establish tagging standards using customer_id as a foundational tag key for external billing attribution. Tags are formatted as key:value pairs and can be applied to compute resources, SQL warehouses, jobs, and serverless workloads.
Example Tagging Schema:
| Tag Key | Purpose | Example Values |
|---|---|---|
| customer_id | Primary billing entity | acme_corp, customer_12345 |
| tenant_id | Sub-customer or business unit | acme_marketing, acme_engineering |
| environment | Operational vs. customer workloads | production, internal, staging |
| service | Application component or feature | etl_pipeline, analytics_api, ml_inference |
| cost_center | Internal cost allocation | platform_ops, data_engineering |
For external consumption (customer billing), customer_id is typically sufficient. Additional tags help with internal margin analysis and operational cost tracking.
Implementation Example: Programmatic Tagging
The following examples demonstrate tagging patterns. Always verify API syntax and parameters against the official Databricks documentation.
Apply tags when creating clusters:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Create cluster with customer tags
cluster = w.clusters.create(
cluster_name="customer-acme-etl",
spark_version="13.3.x-scala2.12",
node_type_id="i3.xlarge",
num_workers=3,
autotermination_minutes=30,
custom_tags={
"customer_id": "acme_corp",
"environment": "production",
"service": "etl_pipeline"
}
)
Apply tags when creating SQL warehouses:
from databricks.sdk.service.sql import EndpointTagPair, EndpointTags

# Create SQL warehouse with customer tags
warehouse = w.warehouses.create(
    name="customer-acme-warehouse",
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=3,
    auto_stop_mins=15,
    tags=EndpointTags(
        custom_tags=[
            EndpointTagPair(key="customer_id", value="acme_corp"),
            EndpointTagPair(key="environment", value="production"),
            EndpointTagPair(key="service", value="analytics_api"),
        ]
    ),
)
Apply tags to jobs:
from databricks.sdk.service.jobs import NotebookTask, Task

# Create job with customer tags
job = w.jobs.create(
    name="customer-acme-daily-report",
    tasks=[
        Task(
            task_key="generate_report",
            notebook_task=NotebookTask(notebook_path="/Workspace/reports/daily_summary"),
            existing_cluster_id=cluster.cluster_id,
        )
    ],
    tags={
        "customer_id": "acme_corp",
        "service": "reporting",
    },
)
Step 2: Implement Tag Enforcement Policies
Implement fixed policies requiring pre-defined tags to be applied to all workloads. This ensures completeness and accuracy of your cost data and prevents untagged resources from incurring unattributed costs.
See Tag enforcement for detailed guidance.
| Policy Type | Description |
|---|---|
| Compute Policies | Enforce required tags on clusters and pools |
| Serverless Budget Policies | Apply tags to serverless compute workloads including notebooks, jobs, pipelines, and model serving endpoints |
Implementation Example: Tag Enforcement
The following examples demonstrate enforcement patterns. Always verify syntax against the official Databricks documentation.
Cluster policy requiring a customer_id tag (replace the {{customer_id}} placeholder with the customer's actual value when you generate the policy):
{
"custom_tags.customer_id": {
"type": "fixed",
"value": "{{customer_id}}",
"hidden": false
},
"custom_tags.environment": {
"type": "fixed",
"value": "production",
"hidden": true
},
"autotermination_minutes": {
"type": "fixed",
"value": 30,
"hidden": false
},
"spark_version": {
"type": "allowlist",
"values": [
"13.3.x-scala2.12",
"14.3.x-scala2.12"
],
"defaultValue": "14.3.x-scala2.12"
},
"node_type_id": {
"type": "allowlist",
"values": [
"i3.xlarge",
"i3.2xlarge",
"i3.4xlarge"
],
"defaultValue": "i3.xlarge"
}
}
Serverless budget policy (applied at account level):
Serverless policies automatically apply tags to notebooks, jobs, pipelines, and serving endpoints. Configure in Account Console → Compute → Serverless → Budget Policies:
- Policy name: customer-acme-serverless
- Monthly budget: $5,000
- Tags: customer_id=acme_corp, environment=production
- Scope: Apply to specific workspaces or users
Python SDK - Create cluster policy:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Create policy requiring customer_id tag
policy = w.cluster_policies.create(
name="customer-attribution-required",
definition="""{
"custom_tags.customer_id": {
"type": "fixed",
"value": "{{customer_id}}",
"hidden": false
},
"autotermination_minutes": {
"type": "fixed",
"value": 30,
"hidden": false
}
}"""
)
Step 3: Develop Budget Alerts
Create budgets and budget alerts to monitor usage associated with appropriate tags. Budgets help you stay informed about spending and can trigger email notifications when thresholds are exceeded.
Budgets are created through the Account Console UI. For automated cost monitoring beyond budget alerts, query system.billing.usage directly (see Step 4 below).
See Create and monitor budgets for complete instructions.
To create a budget:
- In the Account Console sidebar, click Usage
- Click the Budgets tab, then click Add budget
- Enter a name and monthly budget amount (in USD)
- In the Definitions section, limit tracking to specific workspaces and/or custom tags (e.g., customer_id=acme_corp)
- Enter email addresses for notifications when the budget is reached
- Click Create
Best practices for partner budgets:
- Create per-customer budgets filtered by the customer_id tag
- Set thresholds at 80% and 100% of expected monthly usage
- Include both technical and business stakeholders in alerts
- Review budget vs. actual monthly to refine capacity planning
Budgets improve your ability to monitor usage but do not stop usage or charges from occurring. Your final bill can exceed your budget amount.
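Because budgets alert but do not enforce, it can help to pair them with a scheduled query that compares month-to-date spend per customer against an expected amount. The sketch below uses the same list-price join as Step 4; the 5000 USD figure and the expected_budget_usd column are illustrative placeholders, not Databricks settings:
SELECT
  custom_tags['customer_id'] AS customer_id,
  SUM(usage.usage_quantity * list_prices.pricing.default) AS mtd_cost_usd,
  5000 AS expected_budget_usd,
  SUM(usage.usage_quantity * list_prices.pricing.default) / 5000 AS budget_consumed_ratio
FROM system.billing.usage
LEFT JOIN system.billing.list_prices
  ON usage.sku_name = list_prices.sku_name
  AND usage.cloud = list_prices.cloud
  AND usage.usage_start_time >= list_prices.price_start_time
  AND (list_prices.price_end_time IS NULL OR usage.usage_start_time < list_prices.price_end_time)
WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE())
  AND custom_tags['customer_id'] IS NOT NULL
GROUP BY custom_tags['customer_id'];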
Step 4: Analyze Usage Data
Account admins can import customizable AI/BI cost management dashboards to any Unity Catalog-enabled workspace. These dashboards contain usage breakdowns by product, SKU name, and custom tags, along with analysis of the most expensive usage sources.
See Usage dashboards for more information.
To import the dashboard:
- From the Account Console, click the Usage tab
- Click the Setup dashboard button
- Select whether to reflect the entire account's usage or just a single workspace
- Select the destination workspace and click Import
Once imported, the dashboard is fully customizable and can be published like any other dashboard. You can also use AI/BI Genie to explore spending trends, anomalies, and cost-saving recommendations through a natural language interface.
Implementation Example: Query System Tables for Cost Analysis
The following examples demonstrate cost analysis patterns. Always verify table schema and functions against the official Databricks documentation.
Analyze costs by customer:
SELECT
custom_tags['customer_id'] AS customer_id,
usage_date,
SUM(usage_quantity) AS total_dbus,
COUNT(DISTINCT workspace_id) AS workspaces_used,
SUM(usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage
LEFT JOIN system.billing.list_prices
ON usage.sku_name = list_prices.sku_name
AND usage.cloud = list_prices.cloud
AND usage.usage_start_time >= list_prices.price_start_time
AND (list_prices.price_end_time IS NULL OR usage.usage_start_time < list_prices.price_end_time)
WHERE usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
AND custom_tags['customer_id'] IS NOT NULL
GROUP BY customer_id, usage_date
ORDER BY usage_date DESC, total_dbus DESC;
Identify untagged resources:
SELECT
usage_date,
workspace_id,
sku_name,
usage_unit,
SUM(usage_quantity) AS total_usage
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE() - INTERVAL 7 DAYS
AND custom_tags['customer_id'] IS NULL
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY total_usage DESC;
Monthly cost by customer and service:
SELECT
DATE_TRUNC('month', usage_date) AS month,
custom_tags['customer_id'] AS customer_id,
custom_tags['service'] AS service,
usage.sku_name,
SUM(usage_quantity) AS total_dbus,
SUM(usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage
LEFT JOIN system.billing.list_prices
ON usage.sku_name = list_prices.sku_name
AND usage.cloud = list_prices.cloud
AND usage.usage_start_time >= list_prices.price_start_time
AND (list_prices.price_end_time IS NULL OR usage.usage_start_time < list_prices.price_end_time)
WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE() - INTERVAL 3 MONTHS)
GROUP BY month, customer_id, service, usage.sku_name
ORDER BY month DESC, estimated_cost_usd DESC;
Top 10 most expensive workloads:
SELECT
custom_tags['customer_id'] AS customer_id,
usage_metadata.cluster_id,
usage_metadata.job_id,
usage_metadata.job_name,
SUM(usage_quantity) AS total_dbus,
SUM(usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage
LEFT JOIN system.billing.list_prices
ON usage.sku_name = list_prices.sku_name
AND usage.cloud = list_prices.cloud
AND usage.usage_start_time >= list_prices.price_start_time
AND (list_prices.price_end_time IS NULL OR usage.usage_start_time < list_prices.price_end_time)
WHERE usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY customer_id, usage_metadata.cluster_id, usage_metadata.job_id, usage_metadata.job_name
ORDER BY estimated_cost_usd DESC
LIMIT 10;
Workspace Management Scenarios
Your tagging implementation approach depends on how much control you have over the Databricks environment.
Partner-Managed Workspaces
When you provision and manage the workspace, you have full access to Databricks' native tagging and policy enforcement:
- Use cluster policies to require and lock attribution tags at compute creation
- Configure serverless budget policies to automatically apply tags to notebooks, jobs, pipelines, and serving endpoints
- Restrict warehouse creation to administrators who ensure proper tagging on setup
- Leverage system tables and the usage dashboard for usage reporting and margin analysis
For partner-managed workspaces, track two categories of consumption:
| Type | Purpose | Example Tags |
|---|---|---|
| Internal consumption | Track your own platform operations, engineering, and overhead | environment, team, service |
| External consumption | Attribute usage for customer billing or margin analysis | customer_id, line_of_business, cost_center |
For external consumption, tag granularity should match what you bill. In most cases this means per-customer attribution, but for large enterprise deployments you may need finer granularity. See Monitor usage using tags for implementation details.
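A simple way to keep the two categories separate in reporting is to classify usage by its tags. The sketch below assumes the tagging schema from Step 1 and treats anything without a customer_id tag as internal consumption:
SELECT
  CASE WHEN custom_tags['customer_id'] IS NOT NULL THEN 'external' ELSE 'internal' END AS consumption_type,
  COALESCE(custom_tags['customer_id'], custom_tags['service'], 'untagged') AS attribution,
  SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY 1, 2
ORDER BY total_dbus DESC;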
Implementation Example: Enforcing Tags at Workspace Provisioning
The following examples demonstrate enforcement patterns for partner-managed workspaces. Always verify API syntax against the official Databricks documentation.
Provision workspace with default policies and tags:
from databricks.sdk import AccountClient, WorkspaceClient
from databricks.sdk.service.iam import AccessControlRequest, PermissionLevel
from databricks.sdk.service.sql import EndpointTagPair, EndpointTags

# Initialize account client
account = AccountClient()
# 1. Create workspace for customer and wait until it is running
workspace = account.workspaces.create(
    workspace_name="customer-acme-prod",
    aws_region="us-west-2",
    credentials_id="<credentials-id>",
    storage_configuration_id="<storage-config-id>"
).result()
# 2. Initialize workspace client against the new workspace URL
w = WorkspaceClient(host=f"https://{workspace.deployment_name}.cloud.databricks.com")
# 3. Create cluster policy requiring customer tags
policy = w.cluster_policies.create(
name="customer-acme-policy",
definition="""{
"custom_tags.customer_id": {
"type": "fixed",
"value": "acme_corp",
"hidden": false
},
"custom_tags.environment": {
"type": "fixed",
"value": "production",
"hidden": true
},
"autotermination_minutes": {
"type": "fixed",
"value": 30,
"hidden": false
}
}"""
)
# 4. Assign policy to customer's group
w.permissions.update(
    request_object_type="cluster-policies",
    request_object_id=policy.policy_id,
    access_control_list=[
        AccessControlRequest(
            group_name="customer-acme-users",
            permission_level=PermissionLevel.CAN_USE,
        )
    ],
)
# 5. Create pre-tagged SQL warehouse
warehouse = w.warehouses.create(
    name="customer-acme-warehouse",
    cluster_size="Small",
    max_num_clusters=3,
    auto_stop_mins=15,
    tags=EndpointTags(
        custom_tags=[
            EndpointTagPair(key="customer_id", value="acme_corp"),
            EndpointTagPair(key="environment", value="production"),
        ]
    ),
).result()
# 6. Grant warehouse access to customer group via the Permissions API
w.permissions.update(
    request_object_type="warehouses",
    request_object_id=warehouse.id,
    access_control_list=[
        AccessControlRequest(
            group_name="customer-acme-users",
            permission_level=PermissionLevel.CAN_USE,
        )
    ],
)
Enforce tagging compliance check (housekeeping job):
# Run daily to identify untagged resources
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Check for clusters without required tags
clusters = w.clusters.list()
for cluster in clusters:
tags = cluster.custom_tags or {}
if 'customer_id' not in tags:
print(f"⚠️ Untagged cluster: {cluster.cluster_id} ({cluster.cluster_name})")
# Option: Auto-terminate, alert, or tag programmatically
# Check for warehouses without required tags
warehouses = w.warehouses.list()
for warehouse in warehouses:
    tag_pairs = warehouse.tags.custom_tags if warehouse.tags and warehouse.tags.custom_tags else []
    tag_dict = {t.key: t.value for t in tag_pairs}
    if 'customer_id' not in tag_dict:
        print(f"⚠️ Untagged warehouse: {warehouse.id} ({warehouse.name})")
Customer-Managed Workspaces
Customer-controlled workspaces limit your ability to enforce tagging and view consumption. This requires alternative telemetry approaches—see Customer Managed for guidance.
Enforcement Best Practices
- Compute as a Managed Service — Compute should be preconfigured with tags and cost controls in place. Users consume resources, they don't configure them.
  - Cluster policies — Assign by group with VMs, libraries, auto-scaling, and auto-termination defined. Users select from governed options.
  - Serverless policies — Attach by default so tags apply automatically to all serverless resources.
  - Warehouses — Provision by group. Users can start or restart warehouses they have access to—not create new ones.
- Budget alerts — Configure budget monitoring to track usage against thresholds by tag
- Housekeeping jobs — Build automated compliance checks to identify and remediate untagged or mistagged resources
Additional Cost Controls
Beyond tagging and budgets, Databricks provides proactive controls to prevent cost overruns before they occur.
Compute Policies
Compute policies define governance rules for cluster creation, restricting instance types, autoscaling behavior, auto-termination settings, and enforcing required tags. Policies are assigned to groups and ensure users can only create compliant compute resources.
Key policy controls for cost management:
- Instance type restrictions: Limit to cost-effective instance families
- Auto-termination: Force clusters to terminate after idle period (e.g., 30 minutes)
- Autoscaling limits: Cap maximum workers to prevent runaway costs
- Spot instance usage: Require spot/preemptible instances for non-critical workloads
See Cluster policy definitions for complete reference.
Implementation Example: Cost-Optimized Compute Policy
The following example demonstrates cost control patterns. Always verify policy syntax against the official Databricks documentation.
Cost-optimized cluster policy:
{
"node_type_id": {
"type": "allowlist",
"values": ["i3.xlarge", "i3.2xlarge"],
"defaultValue": "i3.xlarge"
},
"driver_node_type_id": {
"type": "fixed",
"value": "i3.xlarge",
"hidden": true
},
"autotermination_minutes": {
"type": "range",
"minValue": 10,
"maxValue": 60,
"defaultValue": 30
},
"autoscale": {
"type": "fixed",
"value": {
"min_workers": 1,
"max_workers": 10
}
},
"aws_attributes.availability": {
"type": "fixed",
"value": "SPOT_WITH_FALLBACK",
"hidden": true
},
"aws_attributes.spot_bid_price_percent": {
"type": "fixed",
"value": 100,
"hidden": true
},
"custom_tags.customer_id": {
"type": "fixed",
"value": "{{customer_id}}",
"hidden": false
},
"custom_tags.environment": {
"type": "fixed",
"value": "production",
"hidden": true
}
}
This policy:
- Restricts to cost-effective i3 instances
- Forces 10-60 minute auto-termination
- Caps autoscaling at 10 workers
- Uses spot instances with on-demand fallback
- Requires customer_id tag
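With the policy assigned, users (or your provisioning automation) create compute against it and the enforced tags and defaults are applied automatically. The sketch below assumes the Python SDK and a policy_id captured when the policy was created; apply_policy_default_values asks Databricks to fill omitted fields from the policy's defaults:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()

# Create a cluster governed by the cost-optimized policy; node type, spot
# settings, and customer tags come from the policy definition
cluster = w.clusters.create(
    cluster_name="acme-adhoc-analysis",
    spark_version="14.3.x-scala2.12",
    policy_id="<policy-id>",
    apply_policy_default_values=True,
    autoscale=AutoScale(min_workers=1, max_workers=10),
).result()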
SQL Warehouse Size and Auto-Stop
SQL warehouses offer built-in cost controls through sizing and auto-stop configuration:
- Cluster size: Start with "Small" or "Medium" for most workloads
- Scaling: Enable auto-scaling to handle burst traffic without over-provisioning
- Auto-stop: Configure 10-15 minute idle timeout to prevent runaway costs
- Serverless warehouses: Consider serverless SQL for variable workloads (pay only for active query time)
See SQL warehouse types for sizing guidance.
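As a concrete illustration of these settings, the sketch below creates a small serverless warehouse with a short auto-stop window using the Python SDK. The enable_serverless_compute flag assumes serverless SQL is available in your workspace and region; verify parameters against the Warehouses API documentation:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import EndpointTagPair, EndpointTags

w = WorkspaceClient()

# Small serverless warehouse that stops after 10 idle minutes
warehouse = w.warehouses.create(
    name="customer-acme-adhoc",
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=2,
    auto_stop_mins=10,
    enable_serverless_compute=True,
    tags=EndpointTags(custom_tags=[EndpointTagPair(key="customer_id", value="acme_corp")]),
).result()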
Serverless Overspend Protection
Serverless compute includes built-in timeout protection:
- Default timeout: 2.5 hours for serverless notebooks and jobs
- Configurable: Adjust via the spark.databricks.execution.timeout Spark config
- Budget policies: Apply spending limits to serverless workloads at account or workspace level
For partners, consider setting tighter timeouts for customer-facing workloads and monitoring serverless usage patterns to identify optimization opportunities.
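One way to apply a tighter ceiling on customer-facing work is a job-level timeout, which is independent of the Spark config above. The sketch below sets timeout_seconds on a job task using the Python SDK and assumes serverless jobs are enabled so a task without a cluster specification runs on serverless compute; the one-hour value is illustrative:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import NotebookTask, Task

w = WorkspaceClient()

# Job task capped at one hour; runs exceeding the timeout are cancelled
job = w.jobs.create(
    name="customer-acme-hourly-scoring",
    tasks=[
        Task(
            task_key="score",
            notebook_task=NotebookTask(notebook_path="/Workspace/pipelines/score_customers"),
            timeout_seconds=3600,
        )
    ],
    tags={"customer_id": "acme_corp", "service": "ml_inference"},
)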
Cost Visibility Enables Better Decisions
Even with flat subscription fees, your costs are usage-based. Customer-level attribution lets you:
- Analyze account profitability
- Understand margin by segment
- Make informed pricing and packaging decisions
- Position for a future shift to usage-based or hybrid models
Tagging is what makes your cost data actionable.
Beyond internal operations, proper tagging is also a requirement for co-sell motions within the Databricks ecosystem. The Built On Databricks Partner Program—which provides go-to-market support, technical resources, and partnership benefits—requires accurate customer attribution through standardized tagging.
What's Next
- Automation — Automate tagging enforcement with Terraform and DABs
- Onboarding — Configure tagging during customer onboarding
- Scale & Limits — Understand resource quotas and limits