
Data Dictionary (Latest)

ERD

The “ERD” below is a visual representation of the consumer layer data model. Many of the joinable lines have been omitted to reduce chaos and complexity in the visualization. All columns with the same name are joinable (even if there’s not a line from one table to the other). The relations depicted are to call the analyst’s attention to less obvious joins.
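
As a rough illustration of joining on identically named columns, the sketch below builds two toy tables in an in-memory SQLite database and joins them with `USING`. The table contents and the `run_id` column are hypothetical stand-ins, not the real Overwatch schemas, and the real `cluster` entity holds several records per cluster over time, which this toy data ignores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for two consumer-layer entities (hypothetical data).
    CREATE TABLE cluster (organization_id TEXT, cluster_id TEXT, cluster_name TEXT);
    CREATE TABLE jobrun  (organization_id TEXT, cluster_id TEXT, run_id INTEGER);
    INSERT INTO cluster VALUES ('ws-1', 'c-100', 'etl-nightly'), ('ws-1', 'c-200', 'adhoc');
    INSERT INTO jobrun  VALUES ('ws-1', 'c-100', 1), ('ws-1', 'c-100', 2), ('ws-1', 'c-999', 3);
""")

# USING joins on the columns that carry the same name in both tables.
rows = conn.execute("""
    SELECT run_id, cluster_name
    FROM jobrun
    JOIN cluster USING (organization_id, cluster_id)
    ORDER BY run_id
""").fetchall()
print(rows)
```

Run 3 drops out of the result because its `cluster_id` has no match in the toy `cluster` table; a left join would keep it with a null name.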

The goal is to present a data model that unifies the different parts of the platform. The Overwatch team will continue to work with Databricks platform teams to publish and simplify this data. The gray boxes annotated as “Backlog/Research” mark known gaps that the Overwatch dev team is pursuing. They do NOT mean a release is coming soon, only that we are aware of the missing component and hope to enable gold-level data here in the future.

OverwatchERD

Consumption Layer “Tables” (Views)

All end users should start with the consumer tables; digging into lower layers gets significantly more complex. Below is the data model for the consumption layer. The consumption layer is often in a stand-alone database apart from the ETL tables to minimize clutter and confusion. The entities in this layer are not actually tables (with a few minor exceptions such as lookup tables) but views. This allows the Overwatch development team to alter the underlying columns, names, types, and structures without breaking existing transformations. View column names remain the same but may be repointed to a newer version of a column.

ETL should not be developed atop the consumption layer views but rather atop the gold layer. Before upgrading Overwatch, the engineering team should review the change list and upgrade requirements, since an upgrade may require a remap depending on the changes. As of the version 1.0 release, all columns in the gold layer will carry an underscore suffix with their schema version number, and column changes will reference the later release version. The views published with Overwatch will almost always point to the latest version of each column and will not include the schema suffix, which simplifies the data model for the average consumer.
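
The view indirection described above can be sketched with an in-memory SQLite database. This is only an illustration: the `_v1` / `_v2` suffix format, the table contents, and the upgrade steps are assumptions, not the actual Overwatch naming or upgrade mechanics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical gold table with a version-suffixed column.
    CREATE TABLE cluster_gold (cluster_id TEXT, cluster_name_v1 TEXT);
    INSERT INTO cluster_gold VALUES ('c-100', 'etl-nightly');
    -- Consumer view exposes the column without the suffix.
    CREATE VIEW cluster AS
        SELECT cluster_id, cluster_name_v1 AS cluster_name FROM cluster_gold;
""")

consumer_query = "SELECT cluster_id, cluster_name FROM cluster"
before = conn.execute(consumer_query).fetchall()

# Simulated upgrade: a new column version is added and the view is repointed.
conn.executescript("""
    DROP VIEW cluster;
    ALTER TABLE cluster_gold ADD COLUMN cluster_name_v2 TEXT;
    UPDATE cluster_gold SET cluster_name_v2 = upper(cluster_name_v1);
    CREATE VIEW cluster AS
        SELECT cluster_id, cluster_name_v2 AS cluster_name FROM cluster_gold;
""")

# The consumer query is unchanged but now reads the newer column version.
after = conn.execute(consumer_query).fetchall()
print(before, after)
```

A transformation written directly against `cluster_gold` would have had to change its column reference, which is the remap the paragraph above warns about.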

Data Organization

The large gray boxes in the simplified ERD below depict the two major, logical sections of the data model:

- Databricks Platform - Metadata captured by the Databricks platform that can be used to assist in workspace governance. This data can also be enriched with the Spark data, enabling in-depth analyses. The breadth of metadata continues to grow; stay tuned for additional capabilities.
- Spark UI - The Spark UI section is derived from the Spark event logs and contains essentially every piece of data from the Spark UI. A few sections are not included in the first release; their data is present in spark_events_bronze but is extremely complex to derive. The Overwatch development team is working to expose additional Spark UI data and will publish it as soon as it’s ready.

Column Descriptions

Complete column descriptions are only provided for the consumption layer. The entity names are linked below.

cluster
clusterStateFact
instanceDetails
job
jobrun
jobRunCostPotentialFact
sqlQueryHistory
notebook
instancePool
dbuCostDetail
accountLogin
accountMod
sparkExecution
sparkExecutor
sparkJob
sparkStage
sparkTask
sparkStream
warehouse
warehouseDbuDetails
warehouseStateFact
notebookCommands
Common Meta Fields
- There are several fields that are present in all tables. Instead of cluttering each table with them, this section was created as a reference to each of these.

Most tables below provide a data SAMPLE for reference. You may either click to view it or right click the SAMPLE link and click saveTargetAs or saveLinkAs and save the file. Note that these files are TAB delimited, so you will need to view as such if you save to local file. The data in the files were generated from an Azure, test deployment created by Overwatch Developers.
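
Since the sample files are tab delimited, a minimal way to load one is Python's `csv` module with `delimiter="\t"`. The inline text below is a made-up stand-in for a downloaded sample, not real Overwatch data.

```python
import csv
import io

# Hypothetical stand-in for the contents of a saved SAMPLE file.
sample_text = (
    "cluster_id\tcluster_name\tnum_workers\n"
    "0101-abc\tetl-nightly\t4\n"
    "0102-def\tadhoc\t2\n"
)

# DictReader keys each row by the header line; values are read as strings.
reader = csv.DictReader(io.StringIO(sample_text), delimiter="\t")
records = list(reader)
print(records[0]["cluster_name"])
```

For a file saved locally, replace the `io.StringIO(...)` argument with a handle from `open(path, newline="")`.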

Cluster

Column	Type	Description
cluster_id	string	Canonical Databricks cluster ID (more info in Common Meta Fields)
action	string	create, edit, or snapImpute – depicts the type of action for the cluster. snapImpute is used on the first run to initialize the state of the cluster even if it wasn’t created/edited since audit logs began
timestamp	timestamp	timestamp the action took place
cluster_name	string	user-defined name of the cluster
driver_node_type	string	Canonical name of the driver node type.
node_type	string	Canonical name of the worker node type.
num_workers	int	The number of workers defined WHEN autoscaling is disabled
autoscale	struct	The min/max workers defined WHEN autoscaling is enabled
auto_termination_minutes	int	The number of minutes before the cluster auto-terminates due to inactivity
enable_elastic_disk	boolean	Whether autoscaling disk was enabled or not
is_automated	boolean	Whether the cluster is automated (true if automated, false if interactive)
cluster_type	string	Type of cluster (i.e. Serverless, SQL Analytics, Single Node, Standard)
security_profile	struct	Complex type to describe security features enabled on the cluster. More information Below
cluster_log_conf	string	Logging directory if configured
init_script	array	Array of init scripts
custom_tags	string	User-Defined tags AND also includes Databricks JobID and Databricks RunName when the cluster is created by a Databricks Job as an automated cluster. Other Databricks services that create clusters also store unique information here such as SqlEndpointID when a cluster is created by “SqlAnalytics”
cluster_source	string	Shows the source of the action
spark_env_vars	string	Spark environment variables defined on the cluster
spark_conf	string	custom spark configuration on the cluster that deviate from default
acl_path_prefix	string	Automated jobs pass acl to clusters via a path format, the path is defined here
instance_pool_id	string	Canonical pool id from which workers receive nodes
driver_instance_pool_id	string	Canonical pool id from which driver receives node
instance_pool_name	string	Name of pool from which workers receive nodes
driver_instance_pool_name	string	Name of pool from which driver receives node
spark_version	string	DBR version - scala version
idempotency_token	string	Idempotent jobs token if used
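
Because `num_workers` applies only when autoscaling is disabled and `autoscale` only when it is enabled, a consumer usually has to check both. The helper below is a sketch of that check. The `min_workers` / `max_workers` field names follow the Databricks Clusters API and are an assumption about how the `autoscale` struct appears in this table.

```python
from typing import Optional, Tuple


def worker_range(num_workers: Optional[int], autoscale: Optional[dict]) -> Tuple[int, int]:
    """Return (min_workers, max_workers) for one cluster record."""
    if autoscale:
        # Autoscaling enabled: the struct carries the bounds (assumed field names).
        return autoscale["min_workers"], autoscale["max_workers"]
    if num_workers is None:
        raise ValueError("record has neither num_workers nor autoscale")
    # Fixed-size cluster: min and max are the same.
    return num_workers, num_workers


print(worker_range(4, None))
print(worker_range(None, {"min_workers": 2, "max_workers": 8}))
```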

InstanceDetails

Column	Type	Description
instance	string	Common name of instance type
API_name*	string	Canonical KEY name of the node type – use this to join to node_ids elsewhere
vCPUs*	int	Number of virtual cpus provisioned for the node type
Memory_GB	double	Gigabytes of memory provisioned for the node type
Compute_Contract_Price*	double	Contract price for the instance type as negotiated between customer and cloud vendor. This is the value used in cost functions to deliver cost estimates. It is defaulted to equal the on_demand compute price
On_Demand_Cost_Hourly	double	On demand, list price for node type DISCLAIMER – cloud provider pricing is dynamic and this is meant as an initial reference. This value should be validated and updated to reflect actual pricing
Linux_Reserved_Cost_Hourly	double	Reserved, list price for node type DISCLAIMER – cloud provider pricing is dynamic and this is meant as an initial reference. This value should be validated and updated to reflect actual pricing
Hourly_DBUs*	double	Number of DBUs charged for the node type
is_active	boolean	whether the contract price is currently active. This must be true for each key where activeUntil is null
activeFrom*	date	The start date for the costs in this record. NOTE this MUST be equal to one other record’s activeUntil unless this is the first record for these costs. There may be no overlap in time or gaps in time.
activeUntil*	date	The end date for the costs in this record. Must be null to indicate the active record. Only one record can be active at all times. The key (API_name) must have zero gaps and zero overlaps from the Overwatch primordial date until now indicated by null (active)
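
The activeFrom / activeUntil rules above (no gaps, no overlaps, exactly one active record per key) can be checked before cost records are loaded. The function below is a hypothetical validator written for this page, not part of Overwatch. It checks one key's records at a time and does not verify that the first record starts at the primordial date.

```python
from datetime import date
from typing import List, Optional, Tuple

# (activeFrom, activeUntil) for a single key such as one API_name.
Record = Tuple[date, Optional[date]]


def timeline_problems(records: List[Record]) -> List[str]:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    ordered = sorted(records, key=lambda r: r[0])
    active = [r for r in ordered if r[1] is None]
    if len(active) != 1:
        problems.append(f"expected exactly 1 active record, found {len(active)}")
    # Each record's activeUntil must equal the next record's activeFrom.
    for (_, prev_until), (next_from, _) in zip(ordered, ordered[1:]):
        if prev_until is None:
            problems.append(f"active record is followed by a record starting {next_from}")
        elif prev_until < next_from:
            problems.append(f"gap between {prev_until} and {next_from}")
        elif prev_until > next_from:
            problems.append(f"overlap between {next_from} and {prev_until}")
    return problems


good = [(date(2021, 1, 1), date(2022, 1, 1)), (date(2022, 1, 1), None)]
gapped = [(date(2021, 1, 1), date(2021, 6, 1)), (date(2022, 1, 1), None)]
print(timeline_problems(good))
print(timeline_problems(gapped))
```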

dbuCostDetails

Column	Type	Description
sku	string	One of automated, interactive, jobsLight, sqlCompute
contract_price	double	Price paid per DBU on the sku
is_active	boolean	whether the contract price is currently active. This must be true for each key where activeUntil is null
activeFrom*	date	The start date for the costs in this record. NOTE this MUST be equal to one other record’s activeUntil unless this is the first record for these costs. There may be no overlap in time or gaps in time.
activeUntil*	date	The end date for the costs in this record. Must be null to indicate the active record. Only one record can be active at all times. The key (sku) must have zero gaps and zero overlaps from the Overwatch primordial date until now indicated by null (active)
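
To show how a time-bounded price record might be used, the sketch below picks the `contract_price` in effect for a sku on a given day. It assumes activeFrom is inclusive and activeUntil is exclusive, which is consistent with one record's activeUntil equaling the next record's activeFrom but is not stated in the source. The prices are invented for the example.

```python
from datetime import date
from typing import List, Optional, Tuple

# (sku, contract_price, activeFrom, activeUntil); prices here are made up.
PriceRecord = Tuple[str, float, date, Optional[date]]


def price_on(records: List[PriceRecord], sku: str, day: date) -> float:
    """Return the contract price per DBU for `sku` that was active on `day`."""
    for rec_sku, price, active_from, active_until in records:
        in_window = active_from <= day and (active_until is None or day < active_until)
        if rec_sku == sku and in_window:
            return price
    raise LookupError(f"no price record for {sku} on {day}")


prices = [
    ("interactive", 0.5, date(2021, 1, 1), date(2022, 1, 1)),
    ("interactive", 0.25, date(2022, 1, 1), None),
]
# Cost of 10 DBUs consumed on a day covered by the first record.
cost = 10 * price_on(prices, "interactive", date(2021, 6, 15))
print(cost)
```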

Job

Action	API
SnapImpute	Only created during the first Overwatch run to initialize records of existing jobs not present in the audit logs. These jobs are still available in the UI but have not been modified since the collection of audit logs began, so no events have been identified for them and their records must be imputed to maximize coverage
Create	“Create New Job API”
Update	“Partially Update a Job”
Reset	“Overwrite All Settings for a Job”
Delete	“Delete A Job”
ChangeJobAcl	“Update Job Permissions”
ResetJobAcls	“Replace Job Permissions” – Not yet supported

Column	Type	Description
organization_id	string	Canonical workspace id
workspace_name	string	Customer defined name of the workspace or workspace_id (default)
job_id	long	Databricks job id
action	string	Action type defined by the record. One of: create, reset, update, delete, resetJobAcl, changeJobAcl. More information about these actions can be found in the Action/API table above
date	date	Date of the action for the key
timestamp	timestamp	Timestamp the action took place
job_name	string	User defined name of job.
tags	map	The tags applied to the job if they exist
tasks	array	The tasks defined for the job
job_clusters	array	The job clusters defined for the job
libraries	array	LEGACY – Libraries defined in the job – Nested within tasks as of API 2.1
timeout_seconds	string	Job-level timeout seconds. Databricks supports timeout seconds at both the job level and the task level. Task level timeout_seconds can be found nested within tasks
max_concurrent_runs	long	Job-level – maximum concurrent executions of the job
max_retries	long	LEGACY – Max retries for legacy jobs – Nested within tasks as of API 2.1
retry_on_timeout	boolean	LEGACY – whether or not to retry if a job run times out – Nested within tasks as of API 2.1
min_retry_interval_millis	long	LEGACY – Minimal interval in milliseconds between the start of the failed run and the subsequent retry run. The default behavior is that unsuccessful runs are immediately retried. – Nested within tasks as of API 2.1
schedule	struct	Schedule by which the job should execute and whether or not it is paused
existing_cluster_id	string	LEGACY – If compute is existing interactive cluster the cluster_id will be here – Nested within tasks as of API 2.1
new_cluster	struct	LEGACY – The cluster_spec identified as an automated cluster for legacy jobs – Can be found nested within tasks now but ONLY for direct API calls, editing legacy jobs, AND sparkSubmit tasks (as they cannot use job_clusters); otherwise, new_clusters defined through the UI will be defined as “job_clusters” and referenced by a “job_cluster_key” in the tasks field.
git_source	struct	Specification for a remote repository containing the notebooks used by this job’s notebook tasks.
task_detail_legacy	struct	LEGACY – The job execution details used to be defined at the root level for API 2.0 as of API 2.1 they have been nested within tasks. The logic definition will be defined here for legacy jobs only (or new jobs created using the 2.0 jobs API)
is_from_dlt	boolean	Whether or not the job was created from DLT – Unsupported as OW doesn’t yet support DLT but left here as a reference in case it can be helpful
aclPermissionSet	struct	Only populated for “ChangeJobAcl” actions. Defines the new ACLs for a job
target_user_id	string	Databricks canonical user id to which the aclPermissionSet is to be applied
session_id	string	session_id that requested the action
request_id	string	request_id of the action
user_agent	string	request origin such as browser, terraform, api, etc.
response	struct	response of api call including errorMessage, result, and statusCode (HTTP 200,400, etc)
source_ip_address	string	Origin IP of action requested
created_by	string	Email account that created the job
created_ts	long	Timestamp the job was created
deleted_by	string	Email account that deleted the job – will be null if job has not been deleted
deleted_ts	long	Timestamp the job was deleted – will be null if job has not been deleted
last_edited_by	string	Email account that made the previous edit – defaults to created by if no edits made
last_edited_ts	long	Timestamp the job was last edited
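
Several columns above are LEGACY at the root and nested within tasks as of API 2.1, so finding a job's compute means checking both places. The function below is a sketch of that lookup over a job record represented as a plain dict. The nested field names (`task_key`, `job_cluster_key`, `existing_cluster_id`, `new_cluster`) follow the Jobs API 2.1; that the `tasks` array in this table keeps exactly those names is an assumption, and `<legacy>` is a placeholder key invented here.

```python
def task_compute(job: dict) -> dict:
    """Map each task to a (kind, reference) pair describing its compute."""
    tasks = job.get("tasks") or []
    if not tasks:
        # Legacy (API 2.0) layout: compute is defined at the job root.
        if job.get("existing_cluster_id"):
            return {"<legacy>": ("existing_cluster", job["existing_cluster_id"])}
        if job.get("new_cluster"):
            return {"<legacy>": ("new_cluster", None)}
        return {}
    out = {}
    for task in tasks:
        if task.get("existing_cluster_id"):
            out[task["task_key"]] = ("existing_cluster", task["existing_cluster_id"])
        elif task.get("job_cluster_key"):
            # Refers to an entry in the job-level job_clusters array.
            out[task["task_key"]] = ("job_cluster", task["job_cluster_key"])
        elif task.get("new_cluster"):
            out[task["task_key"]] = ("new_cluster", None)
    return out


job_21 = {"tasks": [
    {"task_key": "ingest", "job_cluster_key": "shared"},
    {"task_key": "report", "existing_cluster_id": "0101-abc"},
]}
job_20 = {"existing_cluster_id": "0101-abc"}
print(task_compute(job_21))
print(task_compute(job_20))
```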

Notebook

Column	Type	Description
notebook_id	string	Canonical notebook id
notebook_name	string	Name of notebook at time of action requested
notebook_path	string	Path of notebook at time of action requested
cluster_id	string	Canonical workspace cluster id
action	string	action recorded
timestamp	timestamp	timestamp the action took place
old_name	string	When action is “renameNotebook” this holds notebook name before rename
old_path	string	When action is “moveNotebook” this holds notebook path before move
new_name	string	When action is “renameNotebook” this holds notebook name after rename
new_path	string	When action is “moveNotebook” this holds notebook path after move
parent_path	string	When action is “renameNotebook”, the workspace path containing the notebook is recorded here
user_email	string	Email of the user requesting the action
request_id	string	Canonical request_id
response	struct	HTTP response including errorMessage, result, and statusCode

AccountLogin

Column	Type	Description
login_unixTimeMS	string	Unix Timestamp when the user logged in
login_date	string	Date when user logged in
login_type	string	How did the user log in. One of aadTokenLogin, login, aadBrowserLogin, tokenLogin, samlLogin, jwtLogin, ssh
login_user	string	Canonical user id (within the workspace)
user_email	string	User’s email
login_type	string	Type of login such as web, ssh, token
from_ip_address	struct	Details about the source login and target logged into
user_agent	string	request origin such as browser, terraform, api, etc.
request_id	string	GUID of the login request
response	struct	HTTP Response to login attempt, including statusCode, error message, and result (if any)

AccountMod

Column	Type	Description
mod_unixTimeMS	bigint	Unix timestamp when the modification happened
mod_date	date	Date when the modification happened
action	string	Action performed, one of: add, addPrincipalToGroup, removePrincipalFromGroup, setAdmin, updateUser, delete
endpoint	string	Mechanism for making the change, one of: scim, adminConsole, autoUserCreation, roleAssignment
modified_by	string	Email of user making the change
user_name	string	Email or username of user profile being changed
user_id	string	Canonical user id (within the workspace) of user profile
group_name	string	In case the modification is to a group, the group name, otherwise this will be NULL
group_id	string	In case the modification is to a group, the group ID, otherwise this will be NULL
from_ip_address	string	IP Address where the change originated from
user_agent	string	request origin such as browser, terraform, api, etc.
request_id	string	GUID of the login request
response	struct	HTTP Response to login attempt, including statusCode, error message, and result (if any)

SparkStream

Column	Type	Description
spark_context_id	string	Canonical context ID – One Spark Context per Cluster
cluster_id	string	Canonical workspace cluster id
stream_id	string	GUID ID of the spark stream
stream_name	string	Name of stream if named
stream_run_id	string	GUID ID of the spark stream run
stream_batch_id	long	GUID ID of the spark stream run batch
stream_timestamp	long	Unix time (millis) the stream reported its batch complete metrics
streamSegment	string	Type of event from the event listener such as ‘Progressed’
streaming_metrics	dynamic struct	All metrics available for the stream batch run
execution_ids	array	Array of execution_ids in the spark_context. Can explode and tie back to sparkExecution and other spark tables
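
The `execution_ids` array can be exploded into one row per execution so that each row can be tied back to sparkExecution. In plain Python the idea looks like the sketch below; the sample rows are invented, and the output column name `execution_id` is an assumption about the join key on the sparkExecution side.

```python
# Hypothetical sparkStream rows, reduced to a few columns.
stream_batches = [
    {"stream_id": "s-1", "stream_batch_id": 0, "execution_ids": [11, 12]},
    {"stream_id": "s-1", "stream_batch_id": 1, "execution_ids": [13]},
]

# One output row per (batch, execution id); the array column is dropped.
exploded = [
    {**{k: v for k, v in batch.items() if k != "execution_ids"}, "execution_id": eid}
    for batch in stream_batches
    for eid in batch["execution_ids"]
]
for row in exploded:
    print(row)
```

In Spark the same reshaping is normally done with the `explode` function on the array column.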

WarehouseDbuDetails

Column	Type	Description
cloud	string	AWS or Azure or GCP.
cluster_size	string	Size of the DBSQL warehouse.
driver_size	string	Driver size/type running behind the DBSQL warehouse
worker_count	long	Total number of clusters in the DBSQL warehouse
total_dbu	long	Total dbus charged for the cluster per hour
is_active	boolean	whether the contract price is currently active. This must be true for each key where activeUntil is null
activeFrom*	date	The start date for the costs in this record. NOTE this MUST be equal to one other record’s activeUntil unless this is the first record for these costs. There may be no overlap in time or gaps in time.
activeUntil*	date	The end date for the costs in this record. Must be null to indicate the active record. Only one record can be active at all times. The key (API_name) must have zero gaps and zero overlaps from the Overwatch primordial date until now indicated by null (active)

WarehouseStateFact

Column	Type	Description
warehouse_id	string	Canonical Databricks Warehouse ID (more info in Common Meta Fields)
warehouse_name	string	Name of warehouse at beginning of state
tags	string	JSON string of key/value pairs for all custom tags given to the cluster
unixTimeMS_state_start	various	timestamp reference column at the time the state began
unixTimeMS_state_end	various	timestamp reference column at the time the state ended
state_start_date	date	warehouse state start date
state	string	state of the warehouse – full list HERE
cluster_size	string	configured size of the warehouse clusters.
current_num_clusters	long	current number of clusters in use by the warehouse at the start of the state
target_num_clusters	long	maximum number of clusters targeted to be present by the completion of the state. Should be equal to current_num_clusters except during RESIZING state
uptime_since_restart_S	double	Seconds since the cluster was last restarted / terminated
uptime_in_state_S	double	Seconds the cluster spent in current state
uptime_in_state_H	double	Hours the cluster spent in current state
cloud_billable	boolean	All current known states are cloud billable. This means that cloud provider charges are present during this state
databricks_billable	boolean	State incurs databricks DBU costs.
warehouse_type	string	Warehouse type (PRO, Serverless, Classic)
state_dates	array	Array of all dates across which the state spanned
days_in_state	int	Number of days in state
worker_potential_core_H	double	Worker core hours available to execute spark tasks
total_dbu_cost	double	All dbu costs for Driver and Workers
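
The relationship between the state start/end timestamps and the derived duration and date columns can be sketched as follows. This is one plausible reading, not the Overwatch implementation: it assumes the timestamps are Unix milliseconds, that dates are computed in UTC, and that `days_in_state` counts the entries of `state_dates`.

```python
from datetime import datetime, timedelta, timezone


def state_summary(start_ms: int, end_ms: int) -> dict:
    """Derive duration and spanned dates for one state from its bounds."""
    seconds = (end_ms - start_ms) / 1000
    start_day = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc).date()
    end_day = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc).date()
    # Every calendar date touched by the state, inclusive of both ends.
    n_days = (end_day - start_day).days + 1
    dates = [start_day + timedelta(days=i) for i in range(n_days)]
    return {
        "uptime_in_state_S": seconds,
        "uptime_in_state_H": seconds / 3600,
        "state_dates": dates,
        "days_in_state": len(dates),
    }


# A two-hour state that starts at 23:00 UTC and crosses midnight.
summary = state_summary(86_400_000 - 3_600_000, 86_400_000 + 3_600_000)
print(summary)
```

A state that crosses midnight spans two dates even though it lasts only two hours, which matters when attributing cost per day.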

Common Meta Fields

Column	Type	Description
organization_id	string	Workspace / Organization ID on which the cluster was instantiated
cluster_id	string	Canonical workspace cluster id
unixTimeMS	long	unix time epoch as a long in milliseconds
timestamp	string	unixTimeMS as a timestamp type in milliseconds
date	string	unixTimeMS as a date type
created_by	string	user that created the entity
last_edited_by	string	last user to edit the state of the entity
last_edited_ts	string	timestamp at which the entity’s state was last edited
deleted_by	string	user that deleted the entity
deleted_ts	string	timestamp at which the entity was deleted
event_log_start	string	Spark Event Log BEGIN file name / path
event_log_end	string	Spark Event Log END file name / path
Pipeline_SnapTS	string	Snapshot timestamp of Overwatch run that added the record
Overwatch_RunID	string	Overwatch canonical ID that resulted in the record load

ETL Tables

Bronze

Table	Scope	Layer	Description
audit_log_bronze	audit	bronze	Raw audit log data full schema
audit_log_raw_events	audit	bronze (azure)	Intermediate staging table responsible for coordinating intermediate events from azure Event Hub
cluster_events_bronze	clusterEvents	bronze	Raw landing of dataframe derived from JSON response from cluster events api call. Note: cluster events expire after 30 days of last termination. (reference)
clusters_snapshot_bronze	clusters	bronze	API snapshot of existing clusters defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
jobs_snapshot_bronze	jobs	bronze	API snapshot of existing jobs defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
pools_snapshot_bronze	pools	bronze	API snapshot of existing pools defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
spark_events_bronze	sparkEvents	bronze	Raw landing of the master sparkEvents schema and data for all cluster logs. Cluster log locations are defined by cluster specs and all locations will be scanned for new files not yet captured by Overwatch. Overwatch uses implicit schema generation here; as such, a lack of real-world data can cause unforeseen issues.
spark_events_processedfiles	sparkEvents	bronze	Table that keeps track of all previously processed cluster log files (spark event logs) to minimize future file scanning and improve performance. This table can be used to reprocess and/or find specific eventLog files.
warehouses_snapshot_bronze	DBSQL	bronze	API snapshot of existing warehouses defined in Databricks workspace at the time of the Overwatch run. Snapshot is taken on each run
pipeline_report	NA	tracking	Tracking table used to identify state and status of each Overwatch Pipeline run. This table is also used to control the start and end points of each run. Altering the timestamps and status of this table will change the ETL start/end points.

Silver

Table	Scope	Layer	Description
account_login_silver	accounts	silver	Login events
account_mods_silver	accounts	silver	Account modification events
cluster_spec_silver	clusters	silver	Slow changing dimension used to track all clusters through time including edits but excluding state change.
cluster_state_detail_silver	clusterEvents	silver	State detail for each cluster event enriched with cost information
job_status_silver	jobs	silver	Slow changing dimension used to track all jobs specifications through time
jobrun_silver	jobs	silver	Historical run of every job since Overwatch began capturing the audit_log_data
notebook_silver	notebooks	silver	Slow changing dimension used to track all notebook changes as it morphs through time along with which user instigated the change. This does not include specific change details of the commands within a notebook just metadata changes regarding the notebook.
pools_silver	pools	silver	Slow changing dimension used to track all changes to instance pools
spark_executions_silver	sparkEvents	silver	All spark event data relevant to spark executions
spark_executors_silver	sparkEvents	silver	All spark event data relevant to spark executors
spark_jobs_silver	sparkEvents	silver	All spark event data relevant to spark jobs
spark_stages_silver	sparkEvents	silver	All spark event data relevant to spark stages
spark_tasks_silver	sparkEvents	silver	All spark event data relevant to spark tasks
sql_query_history_silver	DBSQL	silver	History of all the sql queries executed through SQL warehouses
warehouse_spec_silver	DBSQL	silver	State detail for each warehouse event

Gold

Table	Scope	Layer	Description
account_login_gold	accounts	gold	Login events
account_mods_gold	accounts	gold	Account modification events
cluster_gold	clusters	gold	Slow-changing dimension with all cluster creates and edits through time. These events DO NOT INCLUDE automated cluster resize events or cluster state changes. Automated cluster resize and cluster state changes will be in clusterstatefact_gold. If user changes min/max nodes or node count (non-autoscaling) the event will be registered here AND clusterstatefact_gold.
clusterStateFact_gold	clusterEvents	gold	All cluster event changes along with the time spent in each state and the core hours in each state. This table should be used to find cluster anomalies and/or calculate compute/DBU costs of some given scope.
job_gold	jobs	gold	Slow-changing dimension of all changes to a job definition through time
jobrun_gold	jobs	gold	Dimensional data for each job run in the databricks workspace
notebook_gold	notebooks	gold	Slow changing dimension used to track all notebook changes as it morphs through time along with which user instigated the change. This does not include specific change details of the commands within a notebook just metadata changes regarding the notebook.
instancepool_gold	pools	gold	Slow changing dimension used to track all changes to instance pools
sparkexecution_gold	sparkEvents	gold	All spark event data relevant to spark executions
sparkexecutor_gold	sparkEvents	gold	All spark event data relevant to spark executors
sparkjob_gold	sparkEvents	gold	All spark event data relevant to spark jobs
sparkstage_gold	sparkEvents	gold	All spark event data relevant to spark stages
sparktask_gold	sparkEvents	gold	All spark event data relevant to spark tasks
sparkstream_gold	sparkEvents	gold	All spark event data relevant to spark streams
sql_query_history_gold	DBSQL	gold	History of all the sql queries executed through SQL warehouses
warehouse_gold	DBSQL	gold	Slow-changing dimension with all warehouse creates and edits through time.
notebookCommands_gold	audit,notebooks,clusterEvents	gold	Information related to Notebook Commands
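
Several of the tables above are slow-changing dimensions, where each record describes an entity from its timestamp until the next record for the same key. A common way to read such a table is an "as of" lookup: take the latest record at or before the time of interest. The sketch below shows that lookup for a single cluster's history. The data is invented and the helper is illustrative, not an Overwatch function.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def spec_as_of(history: List[Tuple[int, dict]], ts: int) -> Optional[dict]:
    """history: (timestamp, spec) pairs for one key, sorted ascending by timestamp."""
    times = [t for t, _ in history]
    # Index of the first record strictly after ts; the one before it is in effect.
    idx = bisect_right(times, ts)
    return history[idx - 1][1] if idx else None


# Hypothetical create and edit events for one cluster_id.
history = [
    (1000, {"action": "create", "num_workers": 2}),
    (5000, {"action": "edit", "num_workers": 8}),
]
print(spec_as_of(history, 4999))
print(spec_as_of(history, 5000))
print(spec_as_of(history, 10))
```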
