Data Dictionary - 0.6.1.x

Consumption Layer “Tables” (Views)

All end users should be hitting consumer tables first. Digging into lower layers gets significantly more complex. Below is the data model for the consumption layer. The consumption layer is often in a stand-alone database apart from the ETL tables to minimize clutter and confusion. These entities in this layer are actually not tables at all (with a few minor exceptions such as lookup tables) but rather views. This allows for the Overwatch development team to alter the underlying columns, names, types, and structures without breaking existing transformations. Instead, view column names will remain the same but may be repointed to a newer version of a column, etc.

ETL should not be developed atop the consumption layer views but rather the gold layer. Before Overwatch version upgrades, it’s important that the engineering team review the change list and upgrade requirements before upgrading. These upgrades may require a remap depending on the changes. As of version 1.0 release, all columns in the gold layer will be underscored with their schema version number, column changes will reference the later release version but the views published with Overwatch will almost always point to the latest version of each column and will not include the schema suffix to simplify the data model for the average consumer.

Data Organization

The large gray boxes in the simplified ERD below depict the two major, logical sections of the data model:

  • Databricks Platform - Metadata captured by the Databricks platform that can be used to assist in workspace governance. This data can also be enriched with the Spark data enabling in-depth analyses. The breadth of metadata is continuing to grow, stay tuned for additional capabilities.
  • Spark UI The spark UI section is derived from the spark event logs and essentially contains every single piece of data from the Spark UI. There are a few sections that are not included in the first release but the data is present in spark_events_bronze albeit extremely complex to derive. The Overwatch development team is working tirelessly to expose additional SparkUI data and will publish as soon as it’s ready.

OverwatchERD

Column Descriptions

Complete column descriptions are only provided for the consumption layer. The entity names are linked below.

Most tables below provide a data SAMPLE for reference. You may either click to view it or right click the SAMPLE link and click saveTargetAs or saveLinkAs and save the file. Note that these files are TAB delimited, so you will need to view as such if you save to local file. The data in the files were generated from an Azure, test deployment created by Overwatch Developers.

Cluster

SAMPLE

KEY – organization_id + cluster_id + unixTimeMS

Incremental Columns – unixTimeMS

Partition Columns – organization_id

Write Mode – Append

Column Type Description
cluster_id string Canonical Databricks cluster ID (more info in Common Meta Fields)
action string create, edit, or snapImpute – depicts the type of action for the cluster – **snapImpute is used on first run to initialize the state of the cluster even if it wasn’t created/edited since audit logs began
timestamp timestamp timestamp the action took place
cluster_name string user-defined name of the cluster
driver_node_type string Canonical name of the driver node type.
node_type string Canonical name of the worker node type.
num_workers int The number of workers defined WHEN autoscaling is disabled
autoscale struct The min/max workers defined WHEN autoscaling is enabled
auto_termination_minutes int The number of minutes before the cluster auto-terminates due to inactivity
enable_elastic_disk boolean Whether autoscaling disk was enabled or not
is_automated booelan Whether the cluster is automated (true if automated false if interactive)
cluster_type string Type of cluster (i.e. Serverless, SQL Analytics, Single Node, Standard)
security_profile struct Complex type to describe secrity features enabled on the cluster. More information Below
cluster_log_conf string Logging directory if configured
init_script array Array of init scripts
custom_tags string User-Defined tags AND also includes Databricks JobID and Databricks RunName when the cluster is created by a Databricks Job as an automated cluster. Other Databricks services that create clusters also store unique information here such as SqlEndpointID when a cluster is created by “SqlAnalytics”
cluster_source string Shows the source of the action **(TODO – checking on why null scenario with BUI)
spark_env_vars string Spark environment variables defined on the cluster
spark_conf string custom spark configuration on the cluster that deviate from default
acl_path_prefix string Automated jobs pass acl to clusters via a path format, the path is defined here
instance_pool_id string Canononical pool id from which workers receive nodes
driver_instance_pool_id string Canononical pool id from which driver receives node
instance_pool_name string Name of pool from which workers receive nodes
driver_instance_pool_name string Name of pool from which driver receives node
spark_version string DBR version - scala version
idempotency_token string Idempotent jobs token if used

ClusterStateFact

SAMPLE

KEY – organization_id + cluster_id + state + unixTimeMS_state_start

Incremental Columns – state_start_date + unixTimeMS

Partition Columns – organization_id + state_start_date

Z-Order Columns – cluster_id + unixTimeMS_state_start

Write Mode – Merge

Costs and state details by cluster at every state in the cluster lifecycle.

NOTE This fact table is not normalized on time. Some states will span multiple days and must be smoothed across days (i.e. divide by days_in_state) when trying to calculate costs by day. All states are force-terminated at the end of the Overwatch run to the until-timestamp of the run. If the state was still active at this time, it will be updated on the subsequent run.

Column Type Description
cluster_id string Canonical Databricks cluster ID (more info in Common Meta Fields)
cluster_name string Name of cluster at beginning of state
custom_tags string JSON string of key/value pairs for all cluster associated custom tags give to the cluster
*_state_start various timestamp reference column at the time the state began
*_state_end various timestamp reference column at the time the state ended
state string state of the cluster – full list HERE
current_num_workers long number of workers in use by the cluster at the start of the state
target_num_worers long number of workers targeted to be present by the completion of the state. Should be equal to current_num_workers except during RESIZING state
uptime_since_restart_S double Seconds since the cluster was last restarted / terminated
uptime_in_state_S double Seconds the cluster spent in current state
uptime_in_state_H double Hours the cluster spent in current state
driver_node_type_id string KEY of driver node type to enable join to instanceDetails
node_type_id string KEY of worker node type to enable join to instanceDetails
cloud_billable boolean All current known states are cloud billable. This means that cloud provider charges are present during this state
databricks_billable boolean State incurs databricks DBU costs. All states incur DBU costs except: INIT_SCRIPTS_FINISHED, INIT_SCRIPTS_STARTED, STARTING, TERMINATING, CREATING, RESTARTING
isAutomated boolean Whether the cluster was created as an “automated” or “interactive” cluster
dbu_rate double Effective dbu rate used for calculations (effective at time of pipeline run)
state_dates array Array of all dates across which the state spanned
days_in_state int Number of days in state
worker_potential_core_H double Worker core hours available to execute spark tasks
core_hours double All core hours of entire cluster (including driver). Nodes * cores * hours in state
driver_compute_cost double Compute costs associated with driver runtime
driver_dbu_cost double DBU costs associated with driver runtime
worker_compute_cost double Compute costs associated with worker runtime
worker_dbu_cost double DBU costs associated with cumulative runtime of all worker nodes
total_driver_cost double Driver costs including DBUs and compute
total_worker_cost double Worker costs including DBUs and compute
total_compute_cost double All compute costs for Driver and Workers
total_dbu_cost double All dbu costs for Driver and Workers
total_cost double Total cost from Compute and DBUs for all nodes (including Driver)
driverSpecs struct Driver node details
workerSpecs struct Worker node details
Cost Functions Explained

EXPECTATIONS – Note that Overwatch costs are derived. This is good and bad. Good as it allows for costs to be broken down by any dimension at the millisecond level. Bad because there can be significant differences between the derived costs and actual costs. These should generally be very close to equal but may differ within margin of error by as much as 10%. To verify the cost functions and the elements therein feel free to review them in more detail. If your costs are off by a large marine, please review all the components of the cost function and correct any configurations as necessary to align your reality with the Overwatch config. The default costs are list price and often do not accurately reflect a customer’s costs.

  • driver_compute_cost: when cloudBillable –> Driver Node Compute Contract Price Hourly (instanceDetails) * Uptime_In_State_H –> otherwise 0
  • worker_compute_cost: when cloudBillable –> Worker Node Compute Contract Price Hourly (instanceDetails) * Uptime_In_State_H * target_num_workers –> otherwise 0
    • target_num_workers used here is ambiguous. Assuming all targeted workers can be provisioned, the calculation is most accurate; however, if some workers cannot be provisioned the worker_compute_cost will be slightly higher than actual while target_num_workers > current_num_workers. target_num_workers used here because the compute costs begin accumulating as soon as the node is provisioned, not at the time it is added to the cluster.
  • driver_dbu_cost: when databricks_billable –> driver_hourly_dbus (instancedetails.hourlyDBUs) * houry_dbu_rate for dbu type (dbuCostDetails.contract_price) * uptime_in_state_H –> otherwise 0
  • worker_dbu_cost: when databricks_billable –> driver_hourly_dbus (instancedetails.hourlyDBUs) * houry_dbu_rate for dbu type (dbuCostDetails.contract_price) * current_num_workers * uptime_in_state_H –> otherwise 0
    • current_num_workers used here as dbu costs do not begin until the node able to receive workloads (i.e. node is moved from target_worker to current_worker / “upsize_complete” state)
  • cloudBillable: Cluster is in a running state
    • GAP: Note that cloud billable ends at the time the cluster is terminated even though the nodes remain provisioned in the cloud provider for several more minutes; these additional minutes are not accounted for in this cost function.

InstanceDetails

AWS Sample | AZURE_Sample

KEY – Organization_ID + API_name

Incremental Columns – Pipeline_SnapTS

Partition Columns – organization_id

Write Mode – Append

This table is unique and it’s purpose is to enable users to identify node specific contract costs associated with Databricks and the Cloud Provider through time. Defaults are loaded as an example by workspace. These defaults are meant to be reasonable, not accurate by default as there is a wide difference between cloud discount rates and prices between regions / countries. Everytime Overwatch runs, it validates the presence of this table and whether it has any data present for the current workspace, if it does not it creates and appends the relevant data to it; otherwise no action is taken. This gives the user the ability to extend / customize this table to fit their needs by workspace. Each organization_id (workspace), should provide complete cost data for each node used in that workspace. If you decide to completely customize the table, it’s critical to note that some columns are required for the ETL to function; these fields are indicated below in the table with an asterisk.

The organization_id (i.e. workspace id) is automatically generated for each workspace if that organization_id is not present in the table already (or the table is not present at all). Each workspace (i.e. organization_id) often has unique costs, this table enables you to customize compute pricing.

IMPORTANT This table must be configured such that there are no overlapping costs (by time) and no gaps (by time) in costs for any key (organization_id + API_name) between primordial date and current date. This means that for a record to be “expired” the following must be true:

  • original key expired by setting activeUntil == expiry date
  • original key must be created with updated information and must:
    • have activeFrom == expiry date of previous record (no gap, no overlap)
    • have activeUntil == lit(null).cast(“date”)

Azure VM Pricing Page

AWS EC2 Pricing Page

Column Type Description
instance string Common name of instance type
API_name* string Canonical KEY name of the node type – use this to join to node_ids elsewhere
vCPUs* int Number of virtual cpus provisioned for the node type
Memory_GB double Gigabyes of memory provisioned for the node type
Compute_Contract_Price* double Contract price for the instance type as negotiated between customer and cloud vendor. This is the value used in cost functions to deliver cost estimates. It is defaulted to equal the on_demand compute price
On_Demand_Cost_Hourly double On demand, list price for node type DISCLAIMER – cloud provider pricing is dynamic and this is meant as an initial reference. This value should be validated and updated to reflect actual pricing
Linux_Reserved_Cost_Hourly double Reserved, list price for node type DISCLAIMER – cloud provider pricing is dynamic and this is meant as an initial reference. This value should be validated and updated to reflect actual pricing
Hourly_DBUs* double Number of DBUs charged for the node type
is_active boolean whether the contract price is currently active. This must be true for each key where activeUntil is null
activeFrom* date The start date for the costs in this record. NOTE this MUST be equal to one other record’s activeUntil unless this is the first record for these costs. There may be no overlap in time or gaps in time.
activeUntil* date The end date for the costs in this record. Must be null to indicate the active record. Only one record can be active at all times. The key (API_name) must have zero gaps and zero overlaps from the Overwatch primordial date until now indicated by null (active)

dbuCostDetails

KEY – Organization_ID + sku

Incremental Columns – activeFrom

Partition Columns – organization_id

Write Mode – Append

Slow-changing dimension to track DBU contract costs by workspace through time. This table should only need to be edited in very rare circumstances such as historical cost correction. Note that editing these contract prices will not retroactively modify historical pricing in the costing table such as clusterStateFact or jobRunCostPotentialFact. For prices to be recalculated, the gold pipeline modules must be rolled back properly such that the costs can be rebuilt with the updated values.

Column Type Description
sku string One of automated, interactive, jobsLight, sqlCompute
contract_price double Price paid per DBU on the sku
is_active boolean whether the contract price is currently active. This must be true for each key where activeUntil is null
activeFrom* date The start date for the costs in this record. NOTE this MUST be equal to one other record’s activeUntil unless this is the first record for these costs. There may be no overlap in time or gaps in time.
activeUntil* date The end date for the costs in this record. Must be null to indicate the active record. Only one record can be active at all times. The key (API_name) must have zero gaps and zero overlaps from the Overwatch primordial date until now indicated by null (active)

Job

SAMPLE

KEY – organization_id + job_id + unixTimeMS + action + request_id

Incremental Columns – unixTimeMS

Partition Columns – organization_id

Write Mode – Append

Column Type Description
organization_id string Canonical workspace id
job_id string Databricks job id
action string Action type defined by the record. One of: create, reset, update, delete, resetJobAcl, changeJobAcl. More information about these actions can be found here
timestamp timestamp timestamp the action took place
job_name string User defined name of job. NOTE, all jobs created through the UI are initialized with the name, “Untitled” therefore UI-created-named jobs will have an edit action to set the name. The cluster is also set to automated and defaulted on UI create as well
job_type string ?? TBD ?? – there is a job_task_type but that’s in Run_id
timeout_seconds string null unless specified, default == null. Timeout seconds specified in UI or via api
schedule string JSON - quartz cron expression of scheduled job and timezone_id
notebook_path string null if job task does not point to a notebook task. If job points to notebook for execution, this is path to that notebook
new_settings dynamic struct job action with “reset” or “update” where settings were changed. Includes complex type of cluster. JobSettings Structure Found Here
cluster dyanmic struct Where relevant, contains the “new_cluster” spec when cluster definition is “new_cluster” or automated. If job definition points to existing cluster the cluster_id can be found here
aclPermissionSet string Predefined aclPermissionsSet such as “Admin” or “Owner”. More information on these can be found HERE
grants string JSON Array of explicit grants given to explicit user list
target_user_id string Databricks canonical user id to which the aclPermissionSet is to be applied
session_id string session_id that requested the action
request_id string request_id of the action
user_agent string request origin such as browser, terraform, api, etc.
response struct response of api call including errorMessage, result, and statusCode (HTTP 200,400, etc)
source_ip_address string Origin IP of action requested

JobRun

SAMPLE

KEY – organization_id + run_id + startEpochMS

Incremental Columns – startEpochMS

Partition Columns – organization_id

Write Mode – Merge

Inventory of every canonical job run executed in a databricks workspace.

Column Type Description
organization_id string Canonical workspace id
run_id int Incremental Canonical run ID for the workspaced
run_name string Name of run (if named)
job_runtime struct Complex type with all, standard, runtime information regarding the runtime of the job. The start begins from the moment the job run start is requested (including cluster create time if relevant) and the end time marks the moment the workspace identifies the run as terminated
job_id int Canonical ID of job
id_in_job int Run instance of the job_id. This can be seen in the UI in a job spec denoted as “Run N”. Each Job ID has first id_in_job as “Run 1” and is incrented and canonical ONLY for the job_id. This field omits the “Run " prefix and results in an integer value of run instance within job.
job_cluster_type string Either new OR existing. New == automated and existing == interactive cluster type
job_task_type string Job Task Type - such as Notebook, Python, etc. See JobTask
job_terminal_state string Result state of run such as SUCCESS, FAILED, TIMEDOUT, CANCELLED. See RunResultState
job_trigger_type string Type of trigger: PERIODIC, ONE_TIME, RETRY. See TriggerType
cluster_id string Canonical workspace cluster id
notebook_params string JSON A map of (String, String) parameters sent to notebook parameter overrides. See ParamPair
libraries string JSON Array of Libraries in the format defined HERE
children array Array of structs that show all children of this job
workflow_context string ?? REVIEW ??
task_detail struct Unified location for JobTask contingent upon jobrun task. See JobTask
cancellation_detail struct All cancellation request detail and status
time_detail struct All time dimensions tied to the run. runBeginTime == Time the run began to execute on the cluster, SubmissionTime == Time the run request was received by the endpoint (before cluster build/start), completionTime == time the workspace denoted the runID was terminal state
started_by struct Run request received from user – email is recorded here – usage == started_by.email
request_detail struct Complete request detail received by the endpoint

JobRunCostPotentialFact

SAMPLE

KEY – organization_id + run_id + startEpochMS

Incremental Columns – startEpochMS

Partition Columns – organization_id

Write Mode – Merge

This fact table defines the job, the cluster, the cost, the potential, and utilization (if cluster logging is enabled) of a cluster associated with a specific Databricks Job Run.

Dimensionality Note that this fact table is not normalized by time but rather by job run and cluster state. Costs are not derived from job runs but from clusters thus the state[s] of the cluster are what’s pertinent when tying to cost. This is extremely important in the case of long running jobs, such as streaming.

SCENARIO: Imagine a streaming job with 12 concurrent runs on an existing cluster that run for 20 days at the end of which the driver dies for some reason causing all runs fail and begin retrying but failing. When the 20 days end, the cost will be captured solely on that date and even more importantly, not only will all 20 days be captured at that date but the cost associated will be cluster runtime for 20 days * number of runs. Overwatch will automatically smooth the costs across the concurrent runs but not the days running since this fact table is not based by on an equidistant time axis.

  • Potential: Total core_milliseconds for which the cluster COULD execute spark tasks. This derivation only includes the worker nodes in a state ready to receive spark tasks (i.e. Running). Nodes being added or running init scripts are not ready for spark jobs thus those core milliseconds are omitted from the total potential.
  • Cost: Derived from the instanceDetails table and DBU configured contract price (see Configuration for more details). The compute costs in instanceDetails table are taken from the “Compute_Contract_Price” values associated with the instance type in instanceDetails.
  • Utilization: Utilization is a function of core milliseconds used during spark task execution divided by the total amount of core milliseconds available given the cluster size and state. (i.e. spark_task_runtime_H / worker_potential_core_H)
  • Cluster State: The state[s] of a cluster during a run. As the cluster scales and morphs to accommodate the run’s needs, the state changes. The number of state changes are recorded in this table as “run_cluster_states”.
  • Run State: Advanced Topic for data engineers and developers. This topic is discussed in considerable detail in the Advanced Topics section. Given a cluster state, the run state is a state of all runs on a cluster at a given moment in time. This is the measure used to calculate shared costs across concurrent runs. A run state cannot pass the boundaries of a cluster state, a run that continues across cluster-state lines will result in a new run state.
Column Type Description
organization_id string Canonical workspace id
job_id long Canonical ID of job
id_in_job long Run instance of the job_id. This can be seen in the UI in a job spec denoted as “Run N”. Each Job ID has first id_in_job as “Run 1” and is incrented and canonical ONLY for the job_id. This field omits the “Run " prefix and results in an integer value of run instance within job.
job_runtime struct Time details of job start/end in epoch millis and timestamps
cluster_id string Canonical workspace cluster id
cluster_name string Name of cluster at time of run
cluster_type string Either new OR existing. New == automated and existing == interactive cluster type
custom_tags string JSON string of key/value pairs for all cluster associated custom tags give to the cluster
run_terminal_state string Final state of the job such as “Succeeded”, “Failed” or “Cancelled”
run_trigger_type string How the run was triggered (i.e. cron / manual)
run_task_type string Type of task in the job (i.e. notebook, jar, spark-submit, etc.)
driver_node_type_id string KEY of driver node type to enable join to instanceDetails
node_type_id string KEY of worker node type to enable join to instanceDetails
dbu_rate double Effective DBU rate at time of job run used for calculations based on configured contract price in instanceDetails at the time of the Overwatch Pipeline Run
running_days array Array (or list) of dates (not strings) across which the job run executed. This simplifies day-level cost attribution, among other metrics, when trying to smooth costs for long-running / streaming jobs
avg_cluster_share double Average share of the cluster the run had available assuming fair scheduling. This DOES NOT account for activity outside of jobs (i.e. interactive notebooks running alongside job runs), this measure only splits out the share among concurrent job runs. Measure is only calculated for interactive clusters, automated clusters assume 100% run allocation. For more granular utilization detail, enable cluster logging and utilize “job_run_cluster_util” column which derives utilization at the spark task level.
avg_overlapping_runs double Number of concurrent runs shared by the cluster on average throughout the run
max_overlapping_runs long Highest number of concurrent runs on the cluster during the run
run_cluster_states long Count of cluster states during the job run
worker_potential_core_H double cluster core hours capable of executing spark tasks, “potential”
driver_compute_cost double Compute costs associated with driver runtime
driver_dbu_cost double DBU costs associated with driver runtime
worker_compute_cost double Compute costs associated with worker runtime
worker_dbu_cost double DBU costs associated with cumulative runtime of all worker nodes
total_driver_cost double Driver costs including DBUs and compute
total_worker_cost double Worker costs including DBUs and compute
total_compute_cost double All compute costs for Driver and Workers
total_dbu_cost double All dbu costs for Driver and Workers
total_cost double Total cost from Compute and DBUs for all nodes (including Driver)
spark_task_runtimeMS long Spark core execution time in milliseconds (i.e. task was operating/locking on core)
spark_task_runtime_H double Spark core execution time in Hours (i.e. task was operating/locking on core)
job_run_cluster_util double Cluster utilization: spark task execution time / cluster potential. True measure by core of utilization. Only available when cluster logging is enabled.

Notebook

SAMPLE

KEY – organization_id + notebook_id + request_id + action + unixTimeMS

Incremental Columns – unixTimeMS

Partition Columns – organization_id

Write Mode – Append

Column Type Description
notebook_id string Canonical notebook id
notebook_name string Name of notebook at time of action requested
notebook_path string Path of notebook at time of action requested
cluster_id string Canonical workspace cluster id
action string action recorded
timestamp timestamp timestamp the action took place
old_name string When action is “renameNotebook” this holds notebook name before rename
old_path string When action is “moveNotebook” this holds notebook path before move
new_name string When action is “renameNotebook” this holds notebook name after rename
new_path string When action is “moveNotebook” this holds notebook path after move
parent_path string When action is “renameNotebook” notebook containing, workspace path is recorded here
user_email string Email of the user requesting the action
request_id string Canonical request_id
response struct HTTP response including errorMessage, result, and statusCode

InstancePool

KEY – organization_id + instance_pool_id + timestamp

Incremental Columns – timestamp

Partition Columns – organization_id

Write Mode – Merge

Column Type Description
instance_pool_id string Canonical notebook id
instance_pool_name string Name of notebook at time of action requested
actionName string action recorded
timestamp long timestamp the action took place
node_type_id string Type of node in the pool
idle_instance_autotermination_minutes long Minutes after which a node shall be terminated if unused
min_idle_instances long Minimum number of hot instances in the pool
max_capacity long Maximum number of nodes allowed in the pool
preloaded_spark_versions string Spark versions preloaded on nodes in the pool

Account Tables

Not exposed in the consumer database. These tables contain more sensitive information and by default are not exposed in the consumer database but held back in the ETL database. This is done purposely to simplify security when/if desired. If desired, this can be exposed in consumer database with a simple vew definition exposing the columns desired.

For deeper insights regarding audit, please reference auditLogSchema. This is simplified through the use of the ETL_DB.audit_log_bronze and filter where serviceName == accounts for example. Additionally, you may filter down to specific actions using “actionName”. An example query is provided below:

spark.table("overwatch.audit_log_bronze")
  .filter('serviceName === "accounts" && 'actionName === "createGroup")
  .selectExpr("*", "requestParams.*").drop("requestParams")      

Slow changing dimension of user entity through time. Also used as reference map from user_email to user_id

Column Type Description
organization_id string Canonical workspace id
user_id string Canonical user id for which the action was requested (within the workspace) (target)
user_email string User’s email for which the action was requested (target)
action string Action requested to be performed
added_from_ip_address string Source IP of the request
added_by string Authenticated user that made the request
user_agent string request origin such as browser, terraform, api, etc.

AccountMod

SAMPLE

KEY – organization_id + acton + mod_unixTimeMS + request_id

Incremental Columns – mod_unixTimeMS

Partition Columns – organization_id

Write Mode – Append

TODO

AccountLogin

SAMPLE

KEY – organization_id + login_type + login_unixTimeMS + from_ip_address

Incremental Columns – login_unixTimeMS

Partition Columns – organization_id

Write Mode – Append

Not exposed in the consumer database. This table contains more sensitive information and by default is not exposed in the consumer database but held back in the etl datbase. This is done purposely to simplify security when/if desired. If desired, this can be exposed in consumer database with a simple vew definition exposing the columns desired.

Column Type Description
user_id string Canonical user id (within the workspace)
user_email string User’s email
login_type string Type of login such as web, ssh, token
ssh_username string username used to login via SSH
groups_user_name string ?? To research ??
account_admin_userID string ?? To research ??
login_from_ip_address struct Details about the source login and target logged into
user_agent string request origin such as browser, terraform, api, etc.

The following sections are related to Spark. Everything that can be seend/found in the SparkUI is visibel in the spark tables below. A reasonable understanding of the Spark hierarchy is necessary to make this section simpler. Please reference Spark Hierarchy For More Details for more details.

SparkExecution

SAMPLE

KEY – organization_id + spark_context_id + execution_id + date + unixTimeMS

Incremental Columns – date + unixTimeMS

Partition Columns – organization_id

Write Mode – Merge

Column Type Description
organization_id string Canonical workspace id
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
execution_id long Spark Execution ID
description string Description provided by spark
details string Execution StackTrace
sql_execution_runtime struct Complete runtime detail breakdown

SparkExecutor

SAMPLE

KEY – organization_id + spark_context_id + executor_id + date + unixTimeMS

Incremental Columns – date + unixTimeMS

Partition Columns – organization_id

Write Mode – Merge

Column Type Description
organization_id string Canonical workspace id
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
executor_id int Executor ID
executor_info string Executor Detail
removed_reason string Reason executor was removed
executor_alivetime struct Complete lifetime detail breakdown

SparkJob

SAMPLE

KEY – organization_id + spark_context_id + job_id + unixTimeMS

Incremental Columns – date + unixTimeMS

Partition Columns – organization_id + date

Z-Order Columns – cluster_id

Write Mode – Merge

Column Type Description
organization_id string Canonical workspace id
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
job_id string Spark Job ID
job_group_id string Spark Job Group ID – NOTE very powerful for many reasons. See SparkEvents
execution_id string Spark Execution ID
stage_ids array[long] Array of all Spark Stage IDs nested within this Spark Job
notebook_id string Canonical Databricks Workspace Notebook ID
notebook_path string Databricks Notebook Path
user_email string email of user that owned the request, for Databricks jobs this will be the job owner
db_job_id string Databricks Job Id executing the Spark Job
db_id_in_job string “id_in_job” such as “Run 10” without “Run " prefix. This is a critical join column when working looking up Databricks Jobs metadata
job_runtime string Complete job runtime detail breakdown
job_result struct Job Result and Exception if present

SparkStage

SAMPLE

KEY – organization_id + spark_context_id + stage_id + stage_attempt_id + unixTimeMS

Incremental Columns – date + unixTimeMS

Partition Columns – organization_id + date

Z-Order Columns – cluster_id

Write Mode – Merge

Column Type Description
organization_id string Canonical workspace id
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
stage_id string Spark Stage ID
stage_attempt_id string Spark Stage Attempt ID
stage_runtime string Complete stage runtime detail
stage_info string Lineage of all accumulables for the Spark Stage

SparkTask

SAMPLE

KEY – organization_id + spark_context_id + task_id + task_attempt_id + stage_id + stage_attempt_id + host + unixTimeMS

Incremental Columns – date + unixTimeMS

Partition Columns – organization_id + date

Z-Order Columns – cluster_id

Write Mode – Merge

USE THE PARTITION COLUMN (date) and Indexed Column (cluster_id) in all joins and filters where possible. This table can get extremely large, select samples or smaller date ranges and reduce joins and columns selected to improve performance.

Column Type Description
organization_id string Canonical workspace id
workspace_name string Customizable human-legible name of the workspace, should be globally unique within the organization
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
task_id string Spark Task ID
task_attempt_id string Spark Task Attempt ID
stage_id string Spark Stage ID
stage_attempt_id string Spark Stage Attempt ID
executor_id string Spark Executor ID
host string Internal IP address of node
task_runtime string Complete task runtime detail
task_metrics string Lowest level compute metrics provided by spark such as spill bytes, read/write bytes, shuffle info, GC time, Serialization, etc.
task_info string Lineage of all accumulables for the Spark Task
task_type string Spark task Type (i.e. ResultTask, ShuffleMapTask, etc)
task_end_reason string Task end status, state, and details plus stake trace when error

SparkStream_preview

KEY – organization_id + spark_context_id + cluster_id + stream_id + stream_run_id + stream_batch_id + stream_timestamp

Incremental Columns – date + stream_timestamp

Partition Columns – organization_id + date

Z-Order Columns – cluster_id

Write Mode – Merge

Remains in preview through version 0.6.0 as more feedback is requested from users and use-cases before this table structure solidifes.

Column Type Description
spark_context_id string Canonical context ID – One Spark Context per Cluster
cluster_id string Canonical workspace cluster id
stream_id string GUID ID of the spark stream
stream_name string Name of stream if named
stream_run_id string GUID ID of the spark stream run
stream_batch_id long GUID ID of the spark stream run batch
stream_timestamp long Unix time (millis) the stream reported its batch complete metrics
streamSegment string Type of event from the event listener such as ‘Progressed’
streaming_metrics dynamic struct All metrics available for the stream batch run
execution_ids array Array of execution_ids in the spark_context. Can explode and tie back to sparkExecution and other spark tables

Common Meta Fields

Column Type Description
organization_id string Workspace / Organization ID on which the cluster was instantiated
cluster_id string Canonical workspace cluster id
unixTimeMS long unix time epoch as a long in milliseconds
timestamp string unixTimeMS as a timestamp type in milliseconds
date string unixTimeMS as a date type
created_by string
last_edited_by string last user to edit the state of the entity
last_edited_ts string timestamp at which the entitiy’s sated was last edited
deleted_by string user that deleted the entity
deleted_ts string timestamp at which the entity was deleted
event_log_start string Spark Event Log BEGIN file name / path
event_log_end string Spark Event Log END file name / path
Pipeline_SnapTS string Snapshot timestmap of Overwatch run that added the record
Overwatch_RunID string Overwatch canonical ID that resulted in the record load

ETL Tables

The following are the list of potential tables, the module with which it’s created and the layer in which it lives. This list consists of only the ETL tables created to facilitate and deliver the consumption layer The gold and consumption layers are the only layers that maintain column name uniformity and naming convention across all tables. Users should always reference Consumption and Gold layers unless the data necessary has not been curated.

Bronze

Table Scope Layer Description
audit_log_bronze audit bronze Raw audit log data full schema
audit_log_raw_events audit bronze (azure) Intermediate staging table responsible for coordinating intermediate events from azure Event Hub
cluster_events_bronze clusterEvents bronze Raw landing of dataframe derived from JSON response from cluster events api call. Note: cluster events expire after 30 days of last termination. (reference)
clusters_snapshot_bronze clusters bronze API snapshot of existing clusters defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
jobs_snapshot_bronze jobs bronze API snapshot of existing jobs defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
pools_snapshot_bronze pools bronze API snapshot of existing pools defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
spark_events_bronze sparkEvents bronze Raw landing of the master sparkEvents schema and data for all cluster logs. Cluster log locations are defined by cluster specs and all locations will be scanned for new files not yet captured by Overwatch. Overwatch uses an implicit schema generation here, as such, a lack of real-world can cause unforeseen issues.
spark_events_processedfiles sparkEvents bronze Table that keeps track of all previously processed cluster log files (spark event logs) to minimize future file scanning and improve performance. This table can be used to reprocess and/or find specific eventLog files.
pipeline_report NA tracking Tracking table used to identify state and status of each Overwatch Pipeline run. This table is also used to control the start and end points of each run. Altering the timestamps and status of this table will change the ETL start/end points.

Silver

Table Scope Layer Description
account_login_silver accounts silver Login events
account_mods_silver accounts silver Account modification events
cluster_spec_silver clusters silver Slow changing dimension used to track all clusters through time including edits but excluding state change.
cluster_state_detail_silver clusterEvents silver State detail for each cluster event enriched with cost information
job_status_silver jobs silver Slow changing dimension used to track all jobs specifications through time
jobrun_silver jobs silver Historical run of every job since Overwatch began capturing the audit_log_data
notebook_silver notebooks silver Slow changing dimension used to track all notebook changes as it morphs through time along with which user instigated the change. This does not include specific change details of the commands within a notebook just metadata changes regarding the notebook.
pools_silver pools silver Slow changing dimension used to track all changes to instance pools
spark_executions_silver sparkEvents silver All spark event data relevant to spark executions
spark_executors_silver sparkEvents silver All spark event data relevant to spark executors
spark_jobs_silver sparkEvents silver All spark event data relevant to spark jobs
spark_stages_silver sparkEvents silver All spark event data relevant to spark stages
spark_tasks_silver sparkEvents silver All spark event data relevant to spark tasks

Gold

Table Scope Layer Description
account_login_gold accounts gold Login events
account_mods_gold accounts gold Account modification events
cluster_gold clusters gold Slow-changing dimension with all cluster creates and edits through time. These events DO NOT INCLUDE automated cluster resize events or cluster state changes. Automated cluster resize and cluster state changes will be in clusterstatefact_gold. If user changes min/max nodes or node count (non-autoscaling) the event will be registered here AND clusterstatefact_gold.
clusterStateFact_gold clusterEvents gold All cluster event changes along with the time spent in each state and the core hours in each state. This table should be used to find cluster anomalies and/or calculate compute/DBU costs of some given scope.
job_gold jobs gold Slow-changing dimension of all changes to a job definition through time
jobrun_gold jobs gold Dimensional data for each job run in the databricks workspace
notebook_gold notebooks gold Slow changing dimension used to track all notebook changes as it morphs through time along with which user instigated the change. This does not include specific change details of the commands within a notebook just metadata changes regarding the notebook.
instancepool_gold pools gold Slow changing dimension used to track all changes to instance pools
sparkexecution_gold sparkEvents gold All spark event data relevant to spark executions
sparkexecutor_gold sparkEvents gold All spark event data relevant to spark executors
sparkjob_gold sparkEvents gold All spark event data relevant to spark jobs
sparkstage_gold sparkEvents gold All spark event data relevant to spark stages
sparktask_gold sparkEvents gold All spark event data relevant to spark tasks
sparkstream_gold sparkEvents gold All spark event data relevant to spark streams