DeltaOMS can be configured through multiple methods. Command line parameters override Spark configurations.
For this tutorial we will use command line parameters. More details about the other configuration options can be found in the Additional Configurations section.
Follow the steps below to initialize the centralized DeltaOMS database and tables.
1. Import and open the DeltaOMS Setup notebook (STEP 1) into your Databricks environment.
2. Modify the values of the variables `omsBaseLocation`, `omsDBName`, `omsCheckpointSuffix`, and `omsCheckpointBase` as appropriate for your environment (see the table and example below).
| Variable | Description |
|---|---|
| `omsBaseLocation` | Base location/path of the OMS database on the Delta Lakehouse |
| `omsDBName` | DeltaOMS database name. This is the centralized database with all the Delta logs |
| `omsCheckpointBase` | DeltaOMS ingestion is a streaming process. This defines the base path for the checkpoints |
| `omsCheckpointSuffix` | Suffix to be added to the checkpoint path (helps in making the path unique) |
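For example, a Scala cell in the setup notebook might set these variables as follows. The values below are illustrative only, taken from the example job parameters later in this guide; substitute locations and names for your environment:

```scala
// Illustrative values only -- substitute locations and names for your environment
val omsBaseLocation = "dbfs:/user/hive/warehouse/oms"                  // base path of the OMS database
val omsDBName = "oms_test_aug31"                                       // centralized DeltaOMS database name
val omsCheckpointBase = "dbfs:/user/hive/warehouse/oms/_checkpoints"   // base path for streaming checkpoints
val omsCheckpointSuffix = "_aug31_171000"                              // suffix to keep the checkpoint path unique
```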
3. Attach the DeltaOMS jar (as a library through Maven) to a running cluster:
   - Click Install New from the cluster's Libraries tab.
   - In the Install Library window, select Maven and click Search Packages.
   - Select Maven Central from the drop-down.
   - Search for delta-oms and select the latest release version.
   - Click Install to install the DeltaOMS library into the cluster.
4. Attach the imported notebook to the cluster and start executing the cells.
5. Execute the `com.databricks.labs.deltaoms.init.InitializeOMS.main` method to create the OMS DB and tables (see the sketch after this list).
6. Validate that the DeltaOMS database and tables were created (Cmd 5 and Cmd 6).
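A minimal sketch of what steps 5 and 6 might look like in a notebook cell, reusing the variables set earlier and assuming `InitializeOMS.main` accepts the same command-line style flags used in the example job parameters later in this guide:

```scala
// Sketch only: assumes InitializeOMS.main accepts the same command-line style
// flags used in the example job parameters later in this guide.
import com.databricks.labs.deltaoms.init.InitializeOMS

InitializeOMS.main(Array(
  s"--dbName=$omsDBName",
  s"--baseLocation=$omsBaseLocation"
))

// Validate that the OMS database and tables were created
spark.sql(s"SHOW TABLES IN $omsDBName").show(false)
```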
Next, we will add a few input sources (existing Delta databases or tables) to be tracked by DeltaOMS. This is done using the same notebook.
Add the names of the databases you want to track via DeltaOMS to the `sourceconfig` table in the DeltaOMS DB.
This is done by using a simple SQL INSERT statement:
```sql
INSERT INTO <omsDBName>.sourceconfig VALUES('<Database Name>',false, Map('wildCardLevel','0'))
```
Refer to the Developer Guide for more details on the tables.
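For example, to track two databases you could issue the INSERTs through `spark.sql` in a Scala cell, reusing the `omsDBName` variable from the setup. The database names below are placeholders:

```scala
// "sales_db" and "marketing_db" are placeholder database names -- replace with your own.
spark.sql(s"INSERT INTO $omsDBName.sourceconfig VALUES('sales_db', false, Map('wildCardLevel','0'))")
spark.sql(s"INSERT INTO $omsDBName.sourceconfig VALUES('marketing_db', false, Map('wildCardLevel','0'))")

// Verify the configured sources
spark.sql(s"SELECT * FROM $omsDBName.sourceconfig").show(false)
```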
Configure the internal DeltaOMS configuration tables by executing `com.databricks.labs.deltaoms.init.ConfigurePaths.main`.
This will populate the internal configuration table `pathconfig` with the detailed path information for all Delta tables under the configured database(s).
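A sketch of this step in a notebook cell, assuming `ConfigurePaths.main` accepts the same `--dbName`/`--baseLocation` flags shown in the example job parameters below; the `pathconfig` query is just a quick sanity check:

```scala
import com.databricks.labs.deltaoms.init.ConfigurePaths

// Assumption: ConfigurePaths.main accepts the same flag style as the job parameters below.
ConfigurePaths.main(Array(
  s"--dbName=$omsDBName",
  s"--baseLocation=$omsBaseLocation"
))

// Inspect the generated path information for the tracked databases
spark.sql(s"SELECT * FROM $omsDBName.pathconfig").show(false)
```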
Next, we will create a couple of Databricks jobs for executing the solution. These jobs can be created manually by following the configuration options mentioned below.
The first Databricks job stream-ingests the Delta logs from the configured Delta tables and persists them into the `rawactions` DeltaOMS table.
For example, you could name the job OMSIngestion_Job. The main configurations for the job are:

- Main class: `com.databricks.labs.deltaoms.ingest.StreamPopulateOMS`
- Example Parameters: `["--dbName=oms_test_aug31","--baseLocation=dbfs:/user/hive/warehouse/oms","--checkpointBase=dbfs:/user/hive/warehouse/oms/_checkpoints","--checkpointSuffix=_aug31_171000","--skipPathConfig","--skipInitializeOMS","--startingStream=1","--endingStream=50"]`
The second job will process the raw actions and organize them into Commit Info and Action snapshots for querying and further analytics.
You could name the job OMSProcessing_Job. The main configurations for the job are:

- Main class: `com.databricks.labs.deltaoms.process.OMSProcessRawActions`
- Example Parameters: `["--dbName=oms_test_aug31","--baseLocation=dbfs:/user/hive/warehouse/oms"]`
The ingestion job can also be created through a sample script provided as part of the solution. The steps to run the sample script are:
1. Modify the values of `omsBaseLocation`, `omsDBName`, `omsCheckpointSuffix`, and `omsCheckpointBase` in the script as appropriate for your environment.
2. Modify `num_streams_per_job` to change the number of streams per job.
3. Run the script, then check the Jobs UI to look at the created jobs.

Note: Instead of setting up two different Databricks jobs, you could also set up a single job with multiple tasks using the Multi-task Job feature.
Refer to the Developer Guide for more details on the multiple-stream approach for DeltaOMS ingestion and the processing job.