Cost Management

Cost management is fundamental to understanding margins, pricing customers accurately, and scaling profitably. Databricks provides a comprehensive suite of tools for monitoring, attributing, and controlling costs across your data and AI workloads. By implementing proper cost management practices from the start, you can gain visibility into per-customer spending, accurately attribute usage to customers and internal operations, and proactively manage budgets.

For additional tips, see Easy Ways to Optimize Your Costs.

Overview

The Databricks cost management framework is built around four key capabilities:

| Capability | Description |
| --- | --- |
| System Tables | Billable usage logs stored in system.billing.usage provide granular details about account usage, including metadata about resources, custom tags, and user identity |
| Tagging | Custom tags enable accurate attribution of Databricks usage to business units, teams, and projects for chargeback purposes |
| Budgets | Create budget thresholds with email alerts to stay informed about usage across your account, workspaces, or specific tag-based groups |
| AI/BI Dashboards | Pre-built cost management dashboards and AI/BI Genie spaces for visualizing and exploring usage data |

Tagging Strategy

For Built On Databricks solutions, the tagging strategy should be a part of system design—not an operational afterthought. Whether you charge customers via usage-based pricing or flat subscriptions, attribution is essential for understanding true cost, gross margin, and scalability over time.

tip

If you are already a Built On partner, refer to the guidance in the Partner Portal on the tagging requirements for the Built On program.

Design-Time, Not Retrofit

Tagging should be designed alongside your pricing and packaging, and built into your automation from day one. Retroactive tagging is incomplete, error-prone, and often impossible at scale.

Key implications:

  • Tagging decisions directly affect billing accuracy, margin visibility, and contract viability
  • Manual tagging does not scale—automated enforcement is necessary
  • Tags should be applied programmatically at resource creation (clusters, jobs, SQL warehouses, serverless workloads)

This differs from internal cost management models that can afford to "start small and iterate." Partners building commercial solutions need attribution in place before onboarding the first customer.

How Tagging Works

Tagging operates at the compute level—clusters, jobs, SQL warehouses, and serverless workloads are the resources that generate DBU consumption and carry attribution tags.

Tags follow a parent-child relationship: tags applied at the workspace level are inherited by the compute resources within it:

| Deployment Model | Tagging Approach |
| --- | --- |
| Workspace per customer | Tag at the workspace level; all underlying compute automatically inherits those tags. This simplifies attribution since all usage in the workspace belongs to one customer. |
| Shared workspace (multi-customer) | Tag at the compute level per customer. If you require per-customer attribution, provision dedicated compute resources for each customer. |

Your deployment model and tagging strategy are interrelated decisions. While workspace-per-customer simplifies attribution through inheritance, shared workspaces offer infrastructure efficiency—but require disciplined compute-level tagging.

Supported Resources for Custom Tags

| Resource | UI Tagging | API Tagging |
| --- | --- | --- |
| Workspace | Azure only | Account API |
| Pool | Pools UI | Instance Pool API |
| All-purpose & job compute | Compute UI | Clusters API |
| SQL warehouse | SQL Warehouse UI | Warehouses API |
| Database instance | Database Instance UI | Database API |
| Serverless workloads | Account Console | Serverless Budget Policies |

For detailed guidance, see Use tags to attribute and track usage.
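
For example, pools accept custom tags at creation through the Instance Pool API, and clusters attached to the pool run on instances that carry those tags. A minimal sketch using the Python SDK (the pool name and sizing values below are illustrative, not prescribed):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create an instance pool that carries attribution tags
pool = w.instance_pools.create(
    instance_pool_name="customer-acme-pool",  # illustrative name
    node_type_id="i3.xlarge",
    min_idle_instances=1,
    max_capacity=10,
    idle_instance_autotermination_minutes=15,
    custom_tags={
        "customer_id": "acme_corp",
        "environment": "production",
    },
)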

Implementation Steps

Step 1: Set Up Tagging Best Practices

Establish tagging standards using customer_id as a foundational tag key for external billing attribution. Tags are formatted as key:value pairs and can be applied to compute resources, SQL warehouses, jobs, and serverless workloads.

Example Tagging Schema:

| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| customer_id | Primary billing entity | acme_corp, customer_12345 |
| tenant_id | Sub-customer or business unit | acme_marketing, acme_engineering |
| environment | Operational vs. customer workloads | production, internal, staging |
| service | Application component or feature | etl_pipeline, analytics_api, ml_inference |
| cost_center | Internal cost allocation | platform_ops, data_engineering |

For external consumption (customer billing), customer_id is typically sufficient. Additional tags help with internal margin analysis and operational cost tracking.

Implementation Example: Programmatic Tagging
note

The following examples demonstrate tagging patterns. Always verify API syntax and parameters against the official Databricks documentation.

Apply tags when creating clusters:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create cluster with customer tags
cluster = w.clusters.create(
    cluster_name="customer-acme-etl",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=3,
    autotermination_minutes=30,
    custom_tags={
        "customer_id": "acme_corp",
        "environment": "production",
        "service": "etl_pipeline",
    },
)

Apply tags when creating SQL warehouses:

from databricks.sdk.service import sql

# Create SQL warehouse with customer tags
warehouse = w.warehouses.create(
    name="customer-acme-warehouse",
    cluster_size="Small",
    min_num_clusters=1,
    max_num_clusters=3,
    auto_stop_mins=15,
    tags=sql.EndpointTags(
        custom_tags=[
            sql.EndpointTagPair(key="customer_id", value="acme_corp"),
            sql.EndpointTagPair(key="environment", value="production"),
            sql.EndpointTagPair(key="service", value="analytics_api"),
        ]
    ),
)

Apply tags to jobs:

from databricks.sdk.service import jobs

# Create job with customer tags
job = w.jobs.create(
    name="customer-acme-daily-report",
    tasks=[
        jobs.Task(
            task_key="generate_report",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/reports/daily_summary"
            ),
            existing_cluster_id=cluster.cluster_id,
        )
    ],
    tags={
        "customer_id": "acme_corp",
        "service": "reporting",
    },
)

Step 2: Implement Tag Enforcement Policies

Implement fixed policies requiring pre-defined tags to be applied to all workloads. This ensures completeness and accuracy of your cost data and prevents untagged resources from incurring unattributed costs.

See Tag enforcement for detailed guidance.

| Policy Type | Description |
| --- | --- |
| Compute Policies | Enforce required tags on clusters and pools |
| Serverless Budget Policies | Apply tags to serverless compute workloads including notebooks, jobs, pipelines, and model serving endpoints |

Implementation Example: Tag Enforcement
note

The following examples demonstrate enforcement patterns. Always verify syntax against the official Databricks documentation.

Cluster policy requiring customer_id tag:

{
  "custom_tags.customer_id": {
    "type": "fixed",
    "value": "{{customer_id}}",
    "hidden": false
  },
  "custom_tags.environment": {
    "type": "fixed",
    "value": "production",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "fixed",
    "value": 30,
    "hidden": false
  },
  "spark_version": {
    "type": "allowlist",
    "values": [
      "13.3.x-scala2.12",
      "14.3.x-scala2.12"
    ],
    "defaultValue": "14.3.x-scala2.12"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      "i3.xlarge",
      "i3.2xlarge",
      "i3.4xlarge"
    ],
    "defaultValue": "i3.xlarge"
  }
}

Serverless budget policy (applied at account level):

Serverless policies automatically apply tags to notebooks, jobs, pipelines, and serving endpoints. Configure in Account Console → Compute → Serverless → Budget Policies:

  • Policy name: customer-acme-serverless
  • Tags: customer_id=acme_corp, environment=production
  • Scope: Apply to specific workspaces or users
  • Budget monitoring: pair the policy's tags with a budget alert (e.g., $5,000 monthly; see Step 3)

Python SDK - Create cluster policy:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create policy requiring customer_id tag
policy = w.cluster_policies.create(
    name="customer-attribution-required",
    definition="""{
        "custom_tags.customer_id": {
            "type": "fixed",
            "value": "{{customer_id}}",
            "hidden": false
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30,
            "hidden": false
        }
    }""",
)

Step 3: Develop Budget Alerts

Create budgets and budget alerts to monitor usage associated with appropriate tags. Budgets help you stay informed about spending and can trigger email notifications when thresholds are exceeded.

note

Budgets are created in the Account Console UI. For automated cost monitoring and alerting beyond what budgets provide, query system.billing.usage directly (see Step 4 below).

See Create and monitor budgets for complete instructions.

To create a budget:

  1. In the Account Console sidebar, click Usage
  2. Click the Budgets tab, then click Add budget
  3. Enter a name and monthly budget amount (in USD)
  4. In the Definitions section, limit tracking to specific workspaces and/or custom tags (e.g., customer_id=acme_corp)
  5. Enter email addresses for notifications when the budget is reached
  6. Click Create

Best practices for partner budgets:

  • Create per-customer budgets filtered by customer_id tag
  • Set thresholds at 80% and 100% of expected monthly usage
  • Include both technical and business stakeholders in alerts
  • Review budget vs. actual monthly to refine capacity planning
note

Budgets improve your ability to monitor usage but do not stop usage or charges from occurring. Your final bill can exceed your budget amount.
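
Because budgets alert but do not block usage, you can complement them with an automated spend check against system tables. Below is a minimal sketch using the SDK's Statement Execution API; the warehouse ID and the DBU threshold are illustrative assumptions, not prescribed values:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Month-to-date DBUs per customer tag
QUERY = """
    SELECT custom_tags['customer_id'] AS customer_id,
           SUM(usage_quantity) AS mtd_dbus
    FROM system.billing.usage
    WHERE usage_date >= DATE_TRUNC('month', CURRENT_DATE())
      AND custom_tags['customer_id'] IS NOT NULL
    GROUP BY ALL
"""

resp = w.statement_execution.execute_statement(
    statement=QUERY,
    warehouse_id="<warehouse-id>",  # assumption: a running SQL warehouse you operate
    wait_timeout="30s",
)

DBU_THRESHOLD = 5000  # illustrative monthly DBU budget per customer
for row in resp.result.data_array or []:
    customer_id, mtd_dbus = row[0], float(row[1])
    if mtd_dbus > DBU_THRESHOLD:
        print(f"Over budget: {customer_id} has used {mtd_dbus:.0f} DBUs this month")

Schedule this as a daily job to catch overruns between budget alert emails.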

Step 4: Analyze Usage Data

Account admins can import customizable AI/BI cost management dashboards to any Unity Catalog-enabled workspace. These dashboards contain usage breakdowns by product, SKU name, and custom tags, along with analysis of the most expensive usage sources.

See Usage dashboards for more information.

To import the dashboard:

  1. From the Account Console, click the Usage tab
  2. Click the Setup dashboard button
  3. Select whether to reflect the entire account's usage or just a single workspace
  4. Select the destination workspace and click Import

Once imported, the dashboard is fully customizable and can be published like any other dashboard. You can also use AI/BI Genie to explore spending trends, anomalies, and cost-saving recommendations through a natural language interface.

Implementation Example: Query System Tables for Cost Analysis
note

The following examples demonstrate cost analysis patterns. Always verify table schema and functions against the official Databricks documentation.

Analyze costs by customer:

SELECT
  usage.custom_tags['customer_id'] AS customer_id,
  usage.usage_date,
  SUM(usage.usage_quantity) AS total_dbus,
  COUNT(DISTINCT usage.workspace_id) AS workspaces_used,
  SUM(usage.usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage AS usage
LEFT JOIN system.billing.list_prices AS list_prices
  ON usage.cloud = list_prices.cloud
  AND usage.sku_name = list_prices.sku_name
  AND usage.usage_start_time >= list_prices.price_start_time
  AND (usage.usage_end_time <= list_prices.price_end_time
    OR list_prices.price_end_time IS NULL)
WHERE usage.usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
  AND usage.custom_tags['customer_id'] IS NOT NULL
GROUP BY customer_id, usage.usage_date
ORDER BY usage_date DESC, total_dbus DESC;

Identify untagged resources:

SELECT
  usage_date,
  workspace_id,
  sku_name,
  usage_unit,
  SUM(usage_quantity) AS total_usage
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE() - INTERVAL 7 DAYS
  AND custom_tags['customer_id'] IS NULL
GROUP BY usage_date, workspace_id, sku_name, usage_unit
ORDER BY total_usage DESC;

Monthly cost by customer and service:

SELECT
  DATE_TRUNC('month', usage.usage_date) AS month,
  usage.custom_tags['customer_id'] AS customer_id,
  usage.custom_tags['service'] AS service,
  usage.sku_name,
  SUM(usage.usage_quantity) AS total_dbus,
  SUM(usage.usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage AS usage
LEFT JOIN system.billing.list_prices AS list_prices
  ON usage.cloud = list_prices.cloud
  AND usage.sku_name = list_prices.sku_name
  AND usage.usage_start_time >= list_prices.price_start_time
  AND (usage.usage_end_time <= list_prices.price_end_time
    OR list_prices.price_end_time IS NULL)
WHERE usage.usage_date >= DATE_TRUNC('month', CURRENT_DATE() - INTERVAL 3 MONTHS)
GROUP BY month, customer_id, service, usage.sku_name
ORDER BY month DESC, estimated_cost_usd DESC;

Top 10 most expensive workloads:

SELECT
  usage.custom_tags['customer_id'] AS customer_id,
  usage.usage_metadata.cluster_id,
  usage.usage_metadata.job_id,
  usage.usage_metadata.job_name,
  SUM(usage.usage_quantity) AS total_dbus,
  SUM(usage.usage_quantity * list_prices.pricing.default) AS estimated_cost_usd
FROM system.billing.usage AS usage
LEFT JOIN system.billing.list_prices AS list_prices
  ON usage.cloud = list_prices.cloud
  AND usage.sku_name = list_prices.sku_name
  AND usage.usage_start_time >= list_prices.price_start_time
  AND (usage.usage_end_time <= list_prices.price_end_time
    OR list_prices.price_end_time IS NULL)
WHERE usage.usage_date >= CURRENT_DATE() - INTERVAL 30 DAYS
GROUP BY customer_id, usage.usage_metadata.cluster_id, usage.usage_metadata.job_id, usage.usage_metadata.job_name
ORDER BY estimated_cost_usd DESC
LIMIT 10;

Workspace Management Scenarios

Your tagging implementation approach depends on how much control you have over the Databricks environment.

Partner-Managed Workspaces

When you provision and manage the workspace, you have full access to Databricks' native tagging and policy enforcement:

  • Use cluster policies to require and lock attribution tags at compute creation
  • Configure serverless budget policies to automatically apply tags to notebooks, jobs, pipelines, and serving endpoints
  • Restrict warehouse creation to administrators who ensure proper tagging on setup
  • Leverage system tables and the usage dashboard for usage reporting and margin analysis

For partner-managed workspaces, track two categories of consumption:

| Type | Purpose | Example Tags |
| --- | --- | --- |
| Internal consumption | Track your own platform operations, engineering, and overhead | environment, team, service |
| External consumption | Attribute usage for customer billing or margin analysis | customer_id, line_of_business, cost_center |

For external consumption, tag granularity should match what you bill. In most cases this means per-customer attribution, but for large enterprise deployments you may need finer granularity. See Monitor usage using tags for implementation details.

Implementation Example: Enforcing Tags at Workspace Provisioning
note

The following examples demonstrate enforcement patterns for partner-managed workspaces. Always verify API syntax against the official Databricks documentation.

Provision workspace with default policies and tags:

from databricks.sdk import AccountClient, WorkspaceClient
from databricks.sdk.service import iam, sql

# Initialize account client
account = AccountClient()

# 1. Create workspace for customer and wait until it is running
workspace = account.workspaces.create(
    workspace_name="customer-acme-prod",
    aws_region="us-west-2",
    credentials_id="<credentials-id>",
    storage_configuration_id="<storage-config-id>",
).result()

# 2. Initialize workspace client against the new deployment
w = WorkspaceClient(host=f"https://{workspace.deployment_name}.cloud.databricks.com")

# 3. Create cluster policy requiring customer tags
policy = w.cluster_policies.create(
    name="customer-acme-policy",
    definition="""{
        "custom_tags.customer_id": {
            "type": "fixed",
            "value": "acme_corp",
            "hidden": false
        },
        "custom_tags.environment": {
            "type": "fixed",
            "value": "production",
            "hidden": true
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30,
            "hidden": false
        }
    }""",
)

# 4. Assign policy to customer's group
w.permissions.update(
    request_object_type="cluster-policies",
    request_object_id=policy.policy_id,
    access_control_list=[
        iam.AccessControlRequest(
            group_name="customer-acme-users",
            permission_level=iam.PermissionLevel.CAN_USE,
        )
    ],
)

# 5. Create pre-tagged SQL warehouse
warehouse = w.warehouses.create(
    name="customer-acme-warehouse",
    cluster_size="Small",
    max_num_clusters=3,
    auto_stop_mins=15,
    tags=sql.EndpointTags(
        custom_tags=[
            sql.EndpointTagPair(key="customer_id", value="acme_corp"),
            sql.EndpointTagPair(key="environment", value="production"),
        ]
    ),
)

# 6. Grant warehouse access to customer group
w.permissions.update(
    request_object_type="sql/warehouses",
    request_object_id=warehouse.id,
    access_control_list=[
        iam.AccessControlRequest(
            group_name="customer-acme-users",
            permission_level=iam.PermissionLevel.CAN_USE,
        )
    ],
)

Enforce tagging compliance check (housekeeping job):

# Run daily to identify untagged resources
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Check for clusters without required tags
for cluster in w.clusters.list():
    tags = cluster.custom_tags or {}
    if "customer_id" not in tags:
        print(f"⚠️ Untagged cluster: {cluster.cluster_id} ({cluster.cluster_name})")
        # Option: auto-terminate, alert, or tag programmatically

# Check for warehouses without required tags
for warehouse in w.warehouses.list():
    pairs = warehouse.tags.custom_tags if warehouse.tags and warehouse.tags.custom_tags else []
    tag_dict = {pair.key: pair.value for pair in pairs}
    if "customer_id" not in tag_dict:
        print(f"⚠️ Untagged warehouse: {warehouse.id} ({warehouse.name})")

Customer-Managed Workspaces

Customer-controlled workspaces limit your ability to enforce tagging and view consumption. This requires alternative telemetry approaches—see Customer Managed for guidance.

Enforcement Best Practices

  • Compute as a Managed Service — Compute should be preconfigured with tags and cost controls in place. Users consume resources; they don't configure them.

    • Cluster policies — Assign by group with VMs, libraries, auto-scaling, and auto-termination defined. Users select from governed options.
    • Serverless policies — Attach by default so tags apply automatically to all serverless resources.
    • Warehouses — Provision by group. Users can start or restart warehouses they have access to—not create new ones.
  • Budget alerts — Configure budget monitoring to track usage against thresholds by tag

  • Housekeeping jobs — Build automated compliance checks to identify and remediate untagged or mistagged resources

Additional Cost Controls

Beyond tagging and budgets, Databricks provides proactive controls to prevent cost overruns before they occur.

Compute Policies

Compute policies define governance rules for cluster creation: they restrict instance types, autoscaling behavior, and auto-termination settings, and enforce required tags. Policies are assigned to groups and ensure users can only create compliant compute resources.

Key policy controls for cost management:

  • Instance type restrictions: Limit to cost-effective instance families
  • Auto-termination: Force clusters to terminate after idle period (e.g., 30 minutes)
  • Autoscaling limits: Cap maximum workers to prevent runaway costs
  • Spot instance usage: Require spot/preemptible instances for non-critical workloads

See Cluster policy definitions for complete reference.

Implementation Example: Cost-Optimized Compute Policy
note

The following example demonstrates cost control patterns. Always verify policy syntax against the official Databricks documentation.

Cost-optimized cluster policy:

{
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge"],
    "defaultValue": "i3.xlarge"
  },
  "driver_node_type_id": {
    "type": "fixed",
    "value": "i3.xlarge",
    "hidden": true
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 60,
    "defaultValue": 30
  },
  "autoscale.min_workers": {
    "type": "fixed",
    "value": 1
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 10,
    "defaultValue": 10
  },
  "aws_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK",
    "hidden": true
  },
  "aws_attributes.spot_bid_price_percent": {
    "type": "fixed",
    "value": 100,
    "hidden": true
  },
  "custom_tags.customer_id": {
    "type": "fixed",
    "value": "{{customer_id}}",
    "hidden": false
  },
  "custom_tags.environment": {
    "type": "fixed",
    "value": "production",
    "hidden": true
  }
}

This policy:

  • Restricts to cost-effective i3 instances
  • Forces 10-60 minute auto-termination
  • Caps autoscaling at 10 workers
  • Uses spot instances with on-demand fallback
  • Requires customer_id tag

SQL Warehouse Size and Auto-Stop

SQL warehouses offer built-in cost controls through sizing and auto-stop configuration:

  • Cluster size: Start with "Small" or "Medium" for most workloads
  • Scaling: Enable auto-scaling to handle burst traffic without over-provisioning
  • Auto-stop: Configure 10-15 minute idle timeout to prevent runaway costs
  • Serverless warehouses: Consider serverless SQL for variable workloads (pay only for active query time)

See SQL warehouse types for sizing guidance.
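
These controls map directly onto warehouse creation parameters in the Python SDK. A minimal sketch (the warehouse name and sizing values are illustrative, and serverless availability depends on your workspace):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Small serverless warehouse with burst scaling and a short idle timeout
warehouse = w.warehouses.create(
    name="cost-controlled-warehouse",  # illustrative name
    cluster_size="Small",              # start small; resize only if queries queue
    min_num_clusters=1,
    max_num_clusters=3,                # auto-scale to absorb burst traffic
    auto_stop_mins=10,                 # stop after 10 idle minutes
    enable_serverless_compute=True,    # pay only for active query time
)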

Serverless Overspend Protection

Serverless compute includes built-in timeout protection:

  • Default timeout: 2.5 hours for serverless notebooks and jobs
  • Configurable: Adjust via spark.databricks.execution.timeout spark config
  • Budget policies: Apply tags to serverless workloads at the account or workspace level so spend can be tracked against budgets

For partners, consider setting tighter timeouts for customer-facing workloads and monitoring serverless usage patterns to identify optimization opportunities.
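
For example, assuming the default session config applies to your serverless workload, the timeout can be tightened in the notebook or job task itself. A sketch (the value is in seconds, and the one-hour setting is an illustrative choice):

# In a Databricks notebook, `spark` is provided by the runtime.
# Lower the serverless execution timeout from the 2.5-hour default
# (9000 seconds) to 1 hour for this session.
spark.conf.set("spark.databricks.execution.timeout", "3600")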

Cost Visibility Enables Better Decisions

Even with flat subscription fees, your costs are usage-based. Customer-level attribution lets you:

  • Analyze account profitability
  • Understand margin by segment
  • Make informed pricing and packaging decisions
  • Position for a future shift to usage-based or hybrid models

Tagging is what makes your cost data actionable.

Beyond internal operations, proper tagging is also a requirement for co-sell motions within the Databricks ecosystem. The Built On Databricks Partner Program—which provides go-to-market support, technical resources, and partnership benefits—requires accurate customer attribution through standardized tagging.

What's Next

  • Automation — Automate tagging enforcement with Terraform and DABs
  • Onboarding — Configure tagging during customer onboarding
  • Scale & Limits — Understand resource quotas and limits