DeltaOMS provides the component com.databricks.labs.deltaoms.init.InitializeOMS
for initializing
the centralized OMS database. The component creates the OMS DB at the location specified by the configuration settings.
Note: This process will delete all existing data in the specified location.
Refer to Additional Configurations for full configuration settings.
Once DeltaOMS has been initialized, and the input sources configured Source Config,
the component provided by com.databricks.labs.deltaoms.init.ConfigurePaths
is executed to
populate the internal path config tables. This component is also responsible for
discovering delta tables under a database or under a directory (based on how the source is configured).
Refer to Additional Configurations for full configuration settings.
This component should be executed whenever there are updates to the source config table.
DeltaOMS uses the delta log json files as its input. Details of the Delta Protocol Json files can be found here
Based on the databases, tables or individual path configured in DeltaOMS, it will fetch the
json files under the _delta_log
folder as streams. To prevent creation of multiple streams
(applicable for large number of databases and streams) , the number of streams are optimized by
creating streams on wildcard paths instead of individual paths for table(s).
For example, if a tracked table has the path as
dbfs:/user/hive/warehouse/sample.db/table1
, DeltaOMS uses the path
dbfs:/user/hive/warehouse/sample.db/*/_delta_log/*.json
to read the Delta logs.
The benefits of this approach are :
For ways to configure the wildcard behaviour , refer to Source Config
During ingestion, DeltaOMS reads the list of wildcardpath
s from the pathconfig
table and
creates separate read streams for each path. OMS also maintains separate checkpoint location, pool
and queryname for each stream. The Delta log json files are read, enriched with additional fields
and persisted to the rawactions
table.
The ingestion process is executed from the com.databricks.labs.deltaoms.ingest.StreamPopulateOMS
object. By default, each streaming ingestion job can support upto 50 streams. If more than 50
streams ( for example more than 50 distinct wildcard paths) would be involved , we recommend
creating separate Databricks jobs. The com.databricks.labs.deltaoms.ingest.StreamPopulateOMS
supports command line parameters --startingStream
and --endingStream
for setting the
number of streams in each job.By default, these are set to 1 and 50 respectively.
Refer to Additional Configurations for full configuration settings.
DeltaOMS processes the ingested data from rawactions
and creates enriched / reconciled tables
for Commit Info and Snapshots.
The com.databricks.labs.deltaoms.process.OMSProcessRawActions
object looks for the newly added
raw actions by utilizing the
Change Data Feed feature.
The new actions are reconciled to get a list of data files conforming to
AddFile
and CommitInfo.
These are then persisted into Action Snapshots and Commit Info Snapshots respectively.
Refer to Additional Configurations for full configuration settings.