brickster has 1:1 mappings with the clusters REST API, enabling full control of Databricks clusters from your R session.
Cluster Creation
Clusters have a number of parameters and can be configured to match the needs of a given workload. db_cluster_create() facilitates creation of a cluster in a Databricks workspace for all cloud platforms (AWS, Azure, GCP).
Depending on the cloud, you will need to change the node types and set cloud_attrs to one of aws_attributes(), azure_attributes(), or gcp_attributes().
Below we will create a cluster on AWS and then step through using the other supporting functions.
library(brickster)
# create a small cluster on AWS with DBR 9.1 LTS
new_cluster <- db_cluster_create(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  )
)
Refer to the documentation for details on how to use other parameters not mentioned here (e.g. spark_conf).
Before creating a cluster you may want to check the supported values for a number of the parameters. There are functions to assist with this:
Function | Purpose |
---|---|
db_cluster_runtime_versions() | List of runtime versions available for the workspace, useful for finding a relevant spark_version |
db_cluster_list_node_types() | List of supported node types available in the workspace/region, useful for finding a relevant node_type_id/driver_node_type_id |
db_cluster_list_zones() | AWS only, lists the availability zones (AZs) clusters can occupy |
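As a minimal sketch, the output of db_cluster_runtime_versions() could be narrowed to LTS runtimes before choosing a spark_version. This assumes the result mirrors the REST API response: a list with a `versions` element whose entries carry `key` and `name` fields. The helper name `lts_runtime_keys()` and the mock data are hypothetical and only stand in for a real workspace response.

```r
# Hypothetical helper: keep only runtime keys whose display name mentions "LTS".
# Assumes `runtimes` has the shape list(versions = list(list(key, name), ...)).
lts_runtime_keys <- function(runtimes) {
  keys <- vapply(runtimes$versions, function(v) v$key, character(1))
  display_names <- vapply(runtimes$versions, function(v) v$name, character(1))
  keys[grepl("LTS", display_names, fixed = TRUE)]
}

# Mock data standing in for db_cluster_runtime_versions(); not real output.
mock_runtimes <- list(
  versions = list(
    list(key = "9.1.x-scala2.12", name = "9.1 LTS (Scala 2.12)"),
    list(key = "10.0.x-scala2.12", name = "10.0 (Scala 2.12)")
  )
)

lts_keys <- lts_runtime_keys(mock_runtimes)
```

If the real return value is shaped differently, only the two vapply() accessors would need adjusting.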
db_cluster_get() will provide details for the cluster we just created, including information such as the state. This is useful when you need to wait for the cluster to be RUNNING; get_and_start_cluster() uses it internally for exactly this, completing only once the cluster is running.
cluster_info <- db_cluster_get(cluster_id = new_cluster$cluster_id)
cluster_info$state
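The waiting pattern described above can be sketched as a simple polling loop. This is an illustration under stated assumptions, not the implementation of get_and_start_cluster(): the helper `wait_for_state()` and its parameters are hypothetical, and a stub replaces the real API call so the example runs offline.

```r
# Hypothetical helper: poll `get_state()` until it returns `target` or we time out.
# `get_state` is any zero-argument function returning the cluster state as a string.
wait_for_state <- function(get_state, target = "RUNNING",
                           interval_secs = 10, timeout_secs = 600) {
  deadline <- Sys.time() + timeout_secs
  repeat {
    state <- get_state()
    if (identical(state, target)) {
      return(invisible(state))
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for state '", target, "'; last state: ", state)
    }
    Sys.sleep(interval_secs)
  }
}

# Stub that reports PENDING twice and then RUNNING, so the sketch runs offline.
fake_states <- c("PENDING", "PENDING", "RUNNING")
call_count <- 0
fake_get_state <- function() {
  call_count <<- call_count + 1
  fake_states[min(call_count, length(fake_states))]
}

final_state <- wait_for_state(fake_get_state, interval_secs = 0)
```

Against a real workspace, the stub would be replaced with something like function() db_cluster_get(cluster_id = new_cluster$cluster_id)$state.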
Editing Clusters
You can edit Databricks clusters to change various parameters using db_cluster_edit(). For example, we may decide we want our cluster to autoscale between 2-8 nodes and add some tags.
# we are required to input all parameters
db_cluster_edit(
  cluster_id = new_cluster$cluster_id,
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  autoscale = cluster_autoscale(min_workers = 2, max_workers = 8),
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  custom_tags = list(
    purpose = "brickster_cluster_demo"
  )
)
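Because every parameter has to be supplied on each edit, one option is to keep the cluster specification in a single list and merge in only what changes. The sketch below uses base R's modifyList(); the helper `build_edit_args()` is hypothetical, and the list holds plain values only so that it runs without a workspace.

```r
# Keep the full cluster specification in one list so edits only state what changes.
base_config <- list(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  num_workers = 2
)

# Hypothetical helper: merge changes into the base specification.
# modifyList() drops any element whose new value is NULL.
build_edit_args <- function(base, changes) {
  utils::modifyList(base, changes)
}

edit_args <- build_edit_args(
  base_config,
  list(
    num_workers = NULL,  # removed, since an autoscale range would replace it
    custom_tags = list(purpose = "brickster_cluster_demo")
  )
)
```

The merged list could then be passed on with do.call(db_cluster_edit, c(list(cluster_id = new_cluster$cluster_id, autoscale = cluster_autoscale(min_workers = 2, max_workers = 8)), edit_args)), with cloud_attrs added to the base list in the same way.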
However, if the intention is only to change the size of a given cluster, the db_cluster_resize() function is a simpler alternative.
You can either adjust the number of workers or change the autoscale range. If the range is adjusted via autoscale, the number of workers active on the cluster will be increased or decreased if it falls outside the new bounds.
# adjust the autoscale range to be between 4-6 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  autoscale = cluster_autoscale(min_workers = 4, max_workers = 6)
)
It’s important to note that if you specify num_workers instead of autoscale on a cluster that has an existing autoscale range, it will have a fixed number of workers from that point onward.
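The effect of a new autoscale range on the active worker count can be modelled with a small clamp function. This is a hypothetical model of the behaviour described above, not brickster or Databricks code, and it assumes that a count outside the range moves to the nearest bound.

```r
# Hypothetical model: a worker count outside the new autoscale range
# is assumed to move to the nearest bound; counts inside it are unchanged.
clamp_workers <- function(current, min_workers, max_workers) {
  stopifnot(min_workers <= max_workers)
  max(min_workers, min(current, max_workers))
}

# A cluster with 2, 5, or 8 active workers resized to a 4-6 range.
below_range <- clamp_workers(2, min_workers = 4, max_workers = 6)
within_range <- clamp_workers(5, min_workers = 4, max_workers = 6)
above_range <- clamp_workers(8, min_workers = 4, max_workers = 6)
```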
Databricks clusters can be “pinned”, which stops them from being removed 30 days after termination. db_cluster_pin() and db_cluster_unpin() are the functions used to change whether a cluster is “pinned” or not.
# pin the cluster
db_cluster_pin(cluster_id = new_cluster$cluster_id)
# unpin the cluster
# db_cluster_unpin(cluster_id = new_cluster$cluster_id)
Cluster State
There are a few functions that can be used to manage the state of an existing cluster:
Function | Purpose |
---|---|
db_cluster_start() | Start a cluster that is inactive |
db_cluster_restart() | Restart a cluster; the cluster must already be running |
db_cluster_delete()/db_cluster_terminate() | Terminate an active cluster; does not remove the cluster configuration from Databricks |
db_cluster_perm_delete() | Stops (if active) and permanently deletes a cluster; it will no longer appear in Databricks |
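The constraints in the table can be sketched as a small lookup that suggests which function applies to a given cluster state. The helper `action_for_state()` is hypothetical; the state names TERMINATED and RUNNING are assumed to match the values reported in the state field returned by db_cluster_get().

```r
# Hypothetical helper: suggest which brickster function applies to a given state.
# Returns the function name as a string, or "wait" for transitional states.
action_for_state <- function(state) {
  switch(
    state,
    TERMINATED = "db_cluster_start",   # inactive clusters can be started
    RUNNING = "db_cluster_restart",    # restart requires a running cluster
    "wait"                             # e.g. PENDING: check again later
  )
}

for_terminated <- action_for_state("TERMINATED")
for_running <- action_for_state("RUNNING")
for_pending <- action_for_state("PENDING")
```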
Cluster Libraries
Databricks clusters can have libraries installed from a number of sources using db_libs_install() and the associated lib_*() functions:
Function | Library Source |
---|---|
lib_cran() | CRAN |
lib_pypi() | PyPI |
lib_egg() | Python egg (file) |
lib_whl() | Python wheel (file) |
lib_maven() | Maven |
lib_jar() | JAR (file) |
# installing a package from CRAN on cluster
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins"),
    lib_cran(package = "dplyr")
  )
)
For convenience, the wait_for_lib_installs() function will block until all the libraries for the specified cluster have finished installing.
wait_for_lib_installs(cluster_id = new_cluster$cluster_id)
Installation of libraries is asynchronous and will complete in the background. db_libs_cluster_status() is used to check the installation status of libraries for a given cluster, while db_libs_all_cluster_statuses() returns the status of all libraries across all clusters in the workspace.
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
Libraries can be uninstalled using db_libs_uninstall().
db_libs_uninstall(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins")
  )
)
Using db_libs_cluster_status() shows that the library will be uninstalled upon restart (e.g. db_cluster_restart()).
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
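To see at a glance which libraries are waiting on a restart, the status result can be flattened into a data frame. The sketch below assumes the result mirrors the REST API response, with a `library_statuses` element whose entries hold a `library` and a `status`, and that the uninstall-pending status is reported as UNINSTALL_ON_RESTART. The helper `summarise_lib_statuses()` and the mock data are hypothetical, and the helper only handles sources that carry a `package` field (CRAN and PyPI).

```r
# Hypothetical helper: flatten library statuses into a data frame.
# Assumes entries look like list(library = list(cran = list(package = ...)), status = ...).
summarise_lib_statuses <- function(status) {
  entries <- status$library_statuses
  data.frame(
    source = vapply(entries, function(e) names(e$library)[1], character(1)),
    package = vapply(entries, function(e) e$library[[1]]$package, character(1)),
    status = vapply(entries, function(e) e$status, character(1)),
    stringsAsFactors = FALSE
  )
}

# Mock data standing in for db_libs_cluster_status(); not real output.
mock_status <- list(
  cluster_id = "hypothetical-cluster-id",
  library_statuses = list(
    list(library = list(cran = list(package = "palmerpenguins")),
         status = "UNINSTALL_ON_RESTART"),
    list(library = list(cran = list(package = "dplyr")),
         status = "INSTALLED")
  )
)

lib_summary <- summarise_lib_statuses(mock_status)
pending_removal <- lib_summary$package[lib_summary$status == "UNINSTALL_ON_RESTART"]
```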
Events
A list of events regarding the cluster's activity can be fetched via db_cluster_events(). There are many event types that can occur, and by default the 50 most recent events are returned.
events <- db_cluster_events(cluster_id = new_cluster$cluster_id)
head(events, 1)
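For a quick overview, the returned events can be flattened into a data frame of times and types. This sketch assumes `events` is a list in which each event has a `type` string and a `timestamp` in milliseconds since the Unix epoch, as in the REST API; the helper `events_to_df()` and the mock events are hypothetical.

```r
# Hypothetical helper: flatten a list of events into a data frame.
# Assumes each event has `timestamp` (milliseconds since epoch) and `type`.
events_to_df <- function(events) {
  data.frame(
    time = as.POSIXct(
      vapply(events, function(e) e$timestamp / 1000, numeric(1)),
      origin = "1970-01-01", tz = "UTC"
    ),
    type = vapply(events, function(e) e$type, character(1)),
    stringsAsFactors = FALSE
  )
}

# Mock events standing in for db_cluster_events(); not real output.
mock_events <- list(
  list(timestamp = 1633046460000, type = "RUNNING"),
  list(timestamp = 1633046400000, type = "STARTING")
)

event_df <- events_to_df(mock_events)
event_counts <- table(event_df$type)
```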