Automation
Automation is fundamental to building scalable, reliable, and secure Databricks deployments. By adopting infrastructure-as-code (IaC) principles and implementing robust CI/CD pipelines, organizations can ensure consistency, reduce manual errors, and accelerate delivery cycles.
Core Philosophy: Automate Everything
The guiding principle for Databricks deployments should be automate everything. This includes workspace provisioning, security configurations, data pipelines, jobs, and monitoring.
Automation provides:
- Version control and auditability
- Repeatability across environments
- Rapid recovery from failures
- Consistent scaling across deployments
Terraform for Infrastructure Provisioning
Terraform is the recommended tool for provisioning and managing Databricks workspaces and the underlying cloud infrastructure. The Databricks Terraform Provider enables you to define workspaces, clusters, jobs, notebooks, Unity Catalog resources, permissions, and more as declarative code.
See the Databricks Terraform Provider documentation for the complete resource and API reference.
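For illustration, a minimal workspace-level configuration might look like the following sketch. It assumes authentication is already available from the environment (for example, a Databricks CLI profile or service principal environment variables); the cluster name and sizes are placeholders.

```hcl
# Minimal sketch; authentication comes from the environment (a CLI profile or
# DATABRICKS_HOST / DATABRICKS_CLIENT_ID / DATABRICKS_CLIENT_SECRET variables).
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "databricks" {}

# Look up the smallest available node type and the latest LTS Spark runtime.
data "databricks_node_type" "smallest" {
  local_disk = true
}

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

# A shared autoscaling cluster declared as code; name and sizes are illustrative.
resource "databricks_cluster" "shared" {
  cluster_name            = "shared-autoscaling"
  spark_version           = data.databricks_spark_version.latest_lts.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 30

  autoscale {
    min_workers = 1
    max_workers = 4
  }
}
```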
When to Use Terraform
Terraform is ideal for:
- Workspace provisioning and initial setup across AWS, Azure, or GCP
- Cloud infrastructure configuration (VPCs, subnets, security groups, storage accounts)
- Unity Catalog setup (metastores, catalogs, schemas, external locations; see the sketch after this list)
- Identity and access management (groups, service principals, permissions)
- Network security configurations and private connectivity
- Multi-workspace deployments with consistent configurations
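The Unity Catalog and access-management bullets above translate directly into provider resources. The sketch below assumes an AWS deployment; the bucket, IAM role ARN, and group name are placeholders.

```hcl
# Illustrative Unity Catalog wiring on AWS; all identifiers are placeholders.
resource "databricks_storage_credential" "lake" {
  name = "lake-credential"
  aws_iam_role {
    role_arn = "arn:aws:iam::123456789012:role/lake-access"
  }
}

resource "databricks_external_location" "raw" {
  name            = "raw-zone"
  url             = "s3://example-raw-bucket/"
  credential_name = databricks_storage_credential.lake.id
}

# Grant a group read/write access on the external location.
resource "databricks_grants" "raw" {
  external_location = databricks_external_location.raw.id
  grant {
    principal  = "data-engineers"
    privileges = ["READ_FILES", "WRITE_FILES"]
  }
}
```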
Security Reference Architecture (SRA)
For organizations with stringent security requirements, particularly those in regulated industries or government sectors, the Security Reference Architecture (SRA) Terraform Templates provide pre-configured security best practices.
The SRA enables deployment of Databricks workspaces with hardened configurations modeled on those of the most security-conscious customers. It covers AWS, AWS GovCloud, Azure, and GCP deployments, providing a strong foundation for secure infrastructure that aligns with Databricks Security Best Practices.
Databricks Asset Bundles (DABs)
Databricks Asset Bundles facilitate the adoption of software engineering best practices for data and AI projects. Bundles provide an infrastructure-as-code approach specifically designed for managing Databricks resources like jobs, pipelines, and ML experiments.
What DABs Include
A bundle provides an end-to-end definition of a project:
- Source files (notebooks, Python files) containing business logic
- Definitions for Databricks Jobs, Delta Live Tables pipelines, and Dashboards
- MLflow Experiments and registered models configuration
- Model Serving endpoint definitions
- Unit and integration test configurations
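As a minimal sketch, a bundle is declared in a databricks.yml file at the project root; the project name, notebook path, and workspace URLs below are placeholders.

```yaml
# databricks.yml -- illustrative bundle definition; names, paths, and hosts are placeholders.
bundle:
  name: my_project

resources:
  jobs:
    my_job:
      name: my-project-job
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/etl_notebook.py  # notebook-format source file in the bundle
          # compute configuration (job cluster or serverless) omitted for brevity

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
```

Each target maps the same project definition onto a different workspace, so promoting to prod deploys exactly the code and resource definitions that were validated in dev.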
When to Use DABs
DABs are ideal for:
- Team-based development of data, analytics, and ML projects
- Managing ML pipeline resources with production best practices from day one
- Setting organizational standards through custom bundle templates
- Maintaining version history for regulatory compliance
Getting Started with DABs
Install the Databricks CLI and initialize a new bundle:
# Install the Databricks CLI (Asset Bundles require the current CLI; the legacy pip package does not support bundles)
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
# Initialize a new bundle from template
databricks bundle init
# Deploy to development
databricks bundle deploy -t dev
# Run a workflow
databricks bundle run my_job -t dev
See the Bundle CLI reference for complete command documentation.
Terraform vs. DABs: Choosing the Right Tool
Both Terraform and DABs serve important but distinct purposes. Understanding when to use each is critical for building an effective automation strategy.
| Aspect | Terraform | DABs |
|---|---|---|
| Primary Purpose | Infrastructure provisioning and platform configuration | Application/project deployment and workflow management |
| Scope | Workspaces, cloud resources, Unity Catalog, IAM, networking | Jobs, pipelines, notebooks, ML experiments, dashboards |
| Change Frequency | Less frequent (infrastructure changes) | More frequent (code and workflow updates) |
| Typical Users | Platform/DevOps engineers, infrastructure teams | Data engineers, data scientists, ML engineers |
| State Management | Terraform state files (remote backend recommended) | Workspace-based (no external state) |
| Template Support | Terraform modules for reusability | Custom bundle templates for project standards |
Recommended Approach: Use Both Together
The most effective automation strategy combines both tools:
- Use Terraform to provision and configure the foundational infrastructure: workspaces, networking, security configurations, Unity Catalog metastores, and identity management. Terraform establishes the secure, compliant platform foundation.
- Use DABs to deploy and manage the applications and workflows that run on that infrastructure: data pipelines, ML training jobs, model serving endpoints, and dashboards. DABs enable rapid iteration on business logic while maintaining production best practices.
CI/CD Best Practices
Continuous integration and continuous delivery (CI/CD) automates the building, testing, and deployment of code, enabling more reliable and frequent releases.
See CI/CD Best Practices Documentation for detailed guidance.
High-Level CI/CD Flow
| Stage | Description |
|---|---|
| Version | Store code and notebooks in Git. Use Databricks Git folders for development and testing before committing changes. |
| Code | Develop in Databricks notebooks or locally with VS Code using the Databricks extension. |
| Build | Use DABs to automatically build artifacts during deployments. Leverage pylint with the Databricks Labs plugin for code quality. |
| Deploy | Deploy changes using DABs with GitHub Actions, Azure DevOps, or Jenkins. |
| Test | Run automated tests with pytest to validate code changes before production deployment. |
| Run | Execute bundle workflows using the Databricks CLI. |
| Monitor | Track performance with Databricks jobs monitoring to identify and resolve production issues. |
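As an illustrative sketch of the Deploy and Run stages, the GitHub Actions workflow below validates and deploys a bundle using a service principal; the workflow name, secret names, and target are assumptions, not fixed conventions.

```yaml
# .github/workflows/deploy-bundle.yml -- illustrative only; adapt names, secrets, and targets.
name: deploy-bundle

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Installs the Databricks CLI used by the bundle commands below.
      - uses: databricks/setup-cli@main

      # Service principal OAuth credentials supplied via repository secrets (assumed names).
      - name: Validate and deploy the bundle
        run: |
          databricks bundle validate -t dev
          databricks bundle deploy -t dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
```

With OAuth token federation (described in the next section), the stored client secret can be replaced by the runner's short-lived identity token, so no long-lived Databricks credential lives in the CI/CD system.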
Authentication for CI/CD
Use service principals instead of user accounts for CI/CD authentication. For the most secure approach, implement OAuth token federation (workload identity federation), which eliminates the need to store Databricks secrets in your CI/CD system.
See CI/CD authentication best practices for detailed guidance on securing your deployment pipelines.
Quick Reference
| Tool/Resource | Link |
|---|---|
| Databricks Terraform Provider | github.com/databricks/terraform-provider-databricks |
| Terraform Provider Documentation | registry.terraform.io/providers/databricks/databricks/latest/docs |
| Security Reference Architecture (SRA) | github.com/databricks/terraform-databricks-sra |
| Databricks Asset Bundles | docs.databricks.com/dev-tools/bundles/ |
| Bundle Templates | docs.databricks.com/dev-tools/bundles/templates |
| Databricks CLI | docs.databricks.com/dev-tools/cli/ |
| CI/CD Best Practices | docs.databricks.com/dev-tools/ci-cd/ |
| GitHub Actions for DABs | docs.databricks.com/dev-tools/bundles/ci-cd#github-actions |
| VS Code Extension | marketplace.visualstudio.com/items?itemName=databricks.databricks |
What's Next
- Onboarding — Customer and user onboarding workflows
- Cost Management — Automate tagging enforcement
- Governance — Unity Catalog automation patterns