Execution
Q. How do I get started?
Please refer to the Getting Started guide.
Q. How do I create metadata for DLT-META?
DLT-META needs the following metadata files:
- Onboarding File: captures input/output metadata
- Data Quality Rules File: captures data quality rules
- Silver Transformation File: captures processing logic as SQL
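As a heavily abridged sketch, a single bronze entry in an onboarding file might look like this (illustrative only — field names follow the bronze examples later in this FAQ plus data_flow_id/data_flow_group; consult the Getting Started guide for the complete onboarding schema):

```json
[
  {
    "data_flow_id": "100",
    "data_flow_group": "A1",
    "source_format": "cloudFiles",
    "source_details": {
      "source_path_dev": "tests/resources/data/customers",
      "source_schema_path": "tests/resources/schema/customer_schema.ddl"
    },
    "bronze_reader_options": {
      "cloudFiles.format": "json",
      "cloudFiles.rescuedDataColumn": "_rescued_data"
    },
    "bronze_table": "customers_cdc"
  }
]
```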
Q. What are DataflowSpecs?
DLT-META translates the input metadata into a Delta table known as DataflowSpecs.
Q. How many DLT pipelines will be launched using DLT-META?
DLT-META uses data_flow_group to launch DLT pipelines, so all tables belonging to the same group are executed under a single DLT pipeline.
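For example, two onboarding entries sharing the same data_flow_group would be executed in one DLT pipeline (abridged sketch — table names are illustrative, and real entries carry many more fields):

```json
[
  {"data_flow_id": "100", "data_flow_group": "A1", "bronze_table": "customers_cdc"},
  {"data_flow_id": "101", "data_flow_group": "A1", "bronze_table": "transactions_cdc"}
]
```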
Q. Can we run onboarding for bronze layer only?
Yes! Please follow the steps below:
- Bronze Metadata preparation (example)
- Onboarding Job
- Option#1: DLT-META CLI
- Option#2: Manual Job: use the parameters below
{
  "onboard_layer": "bronze",
  "database": "dlt_demo",
  "onboarding_file_path": "dbfs:/dlt-meta/conf/onboarding.json",
  "bronze_dataflowspec_table": "bronze_dataflowspec_table",
  "import_author": "Ravi",
  "version": "v1",
  "uc_enabled": "True",
  "overwrite": "True",
  "env": "dev"
}
- Option#3: Databricks Notebook
onboarding_params_map = {
"database": "uc_name.dlt_demo",
"onboarding_file_path": "dbfs:/dlt-meta/conf/onboarding.json",
"bronze_dataflowspec_table": "bronze_dataflowspec_table",
"overwrite": "True",
"env": "dev",
"version": "v1",
"import_author": "Ravi"
}
from src.onboard_dataflowspec import OnboardDataflowspec
OnboardDataflowspec(spark, onboarding_params_map, uc_enabled=True).onboard_bronze_dataflow_spec()
Q. Can we run onboarding for silver layer only?
Yes! Please follow the steps below:
- Silver Metadata preparation (example)
- Onboarding Job
- Option#1: DLT-META CLI
- Option#2: Manual Job: use the parameters below
{
  "onboard_layer": "silver",
  "database": "dlt_demo",
  "onboarding_file_path": "dbfs:/dlt-meta/conf/onboarding.json",
  "silver_dataflowspec_table": "silver_dataflowspec_table",
  "import_author": "Ravi",
  "version": "v1",
  "uc_enabled": "True",
  "overwrite": "True",
  "env": "dev"
}
- Option#3: Databricks Notebook
onboarding_params_map = {
"database": "uc_name.dlt_demo",
"onboarding_file_path": "dbfs:/dlt-meta/conf/onboarding.json",
"silver_dataflowspec_table": "silver_dataflowspec_table",
"overwrite": "True",
"env": "dev",
"version": "v1",
"import_author": "Ravi"
}
from src.onboard_dataflowspec import OnboardDataflowspec
OnboardDataflowspec(spark, onboarding_params_map, uc_enabled=True).onboard_silver_dataflow_spec()
Q. How to chain multiple silver tables after bronze table?
Example: After customers_cdc bronze table, can I have customers silver table reading from customers_cdc and another customers_clean silver table reading from customers_cdc? If so, how do I define these in onboarding.json?
You can onboard the additional silver customer_clean table by adding it to the onboarding file along with a silver transformation that uses a filter condition for the fan-out.
Run the silver-layer onboarding in append mode ("overwrite": "False") so it appends to the existing silver DataflowSpecs. When you launch the DLT pipeline, it will read the silver onboarding and run DLT with the bronze table as the source and the silver tables as targets.
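Following the notebook pattern shown above, the append-mode silver onboarding might look like this (a sketch; the OnboardDataflowspec call itself requires a Databricks Spark session, so it is shown commented out):

```python
# Silver onboarding parameters for the fan-out run: identical to the
# earlier silver example except that "overwrite" is "False", so the new
# customer_clean entry is appended to the existing DataflowSpecs table
# instead of replacing it.
onboarding_params_map = {
    "database": "uc_name.dlt_demo",
    "onboarding_file_path": "dbfs:/dlt-meta/conf/onboarding.json",
    "silver_dataflowspec_table": "silver_dataflowspec_table",
    "overwrite": "False",  # append mode: keep existing silver specs
    "env": "dev",
    "version": "v1",
    "import_author": "Ravi",
}

# In a Databricks notebook (requires a Spark session):
# from src.onboard_dataflowspec import OnboardDataflowspec
# OnboardDataflowspec(spark, onboarding_params_map, uc_enabled=True).onboard_silver_dataflow_spec()
```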
Q. How can I do a type 1 or type 2 merge into a target table?
- Using DLT's dlt.apply_changes API, we can do SCD type 1 or type 2 merges.
- DLT-META provides the bronze_cdc_apply_changes and silver_cdc_apply_changes tags in the onboarding file, which map to DLT's apply_changes API:
"silver_cdc_apply_changes": {
"keys":[
"customer_id"
],
"sequence_by":"dmsTimestamp",
"scd_type":"2",
"apply_as_deletes":"Op = 'D'",
"except_column_list":[
"Op",
"dmsTimestamp",
"_rescued_data"
]
}
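As a rough sketch of how these onboarding fields line up with dlt.apply_changes keyword arguments (the actual translation happens inside DLT-META; the field-to-argument mapping here is illustrative):

```python
import json

# The silver_cdc_apply_changes block from the onboarding example above.
spec = json.loads("""
{
  "keys": ["customer_id"],
  "sequence_by": "dmsTimestamp",
  "scd_type": "2",
  "apply_as_deletes": "Op = 'D'",
  "except_column_list": ["Op", "dmsTimestamp", "_rescued_data"]
}
""")

# Illustrative mapping onto dlt.apply_changes keyword arguments.
apply_changes_kwargs = {
    "keys": spec["keys"],
    "sequence_by": spec["sequence_by"],
    "stored_as_scd_type": spec["scd_type"],
    "apply_as_deletes": spec["apply_as_deletes"],
    "except_column_list": spec["except_column_list"],
}

# Inside a DLT pipeline this would feed a call roughly like:
# dlt.apply_changes(target="customers", source="customers_cdc", **apply_changes_kwargs)
```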
Q. How can I write to the same target table from different sources?
- Using DLT's dlt.append_flow API, we can write to the same target from different sources.
- DLT-META provides the bronze_append_flows and silver_append_flows tags in the onboarding file, which map to DLT's dlt.append_flow API:
[
{
"name":"customer_bronze_flow1",
"create_streaming_table":false,
"source_format":"cloudFiles",
"source_details":{
"source_path_dev":"tests/resources/data/customers",
"source_schema_path":"tests/resources/schema/customer_schema.ddl"
},
"reader_options":{
"cloudFiles.format":"json",
"cloudFiles.inferColumnTypes":"true",
"cloudFiles.rescuedDataColumn":"_rescued_data"
},
"once":true
},
{
"name":"customer_bronze_flow2",
"create_streaming_table":false,
"source_format":"delta",
"source_details":{
"source_database":"{uc_catalog_name}.{bronze_schema}",
"source_table":"customers_delta"
},
"reader_options":{
},
"once":false
}
]
Q. How do I add Autoloader file metadata to the bronze table?
DLT-META provides a source_metadata tag in the onboarding JSON under source_details:
"source_metadata":{
"include_autoloader_metadata_column":"True",
"autoloader_metadata_col_name":"source_metadata",
"select_metadata_cols":{
"input_file_name":"_metadata.file_name",
"input_file_path":"_metadata.file_path"
}
}
- include_autoloader_metadata_column: when set to "True", adds the _metadata column to the target bronze dataframe.
- autoloader_metadata_col_name: if provided, the _metadata column is renamed to this value; otherwise the default name is source_metadata.
- select_metadata_cols: {key: value} pairs used to extract columns from _metadata, where the key is the target dataframe column name and the value is the expression that derives it from the _metadata column.
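As an illustrative sketch (not necessarily how DLT-META implements it internally), each select_metadata_cols pair can be thought of as producing a "expression as target-column" select expression:

```python
# select_metadata_cols from the example above: target column -> expression.
select_metadata_cols = {
    "input_file_name": "_metadata.file_name",
    "input_file_path": "_metadata.file_path",
}

# Each pair becomes an "<expression> as <target column>" string, roughly
# what a Spark df.selectExpr(...) call would receive in the pipeline.
select_exprs = [f"{expr} as {col}" for col, expr in select_metadata_cols.items()]
```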