Ingestion
Impulse's DefaultSolver reads from a
silver layer of three required tables: container_metrics,
channel_metrics, and channels. Two further tables, container_tags
and channel_tags, are fully optional — add them only if you want
tag-based container filtering or EAV channel selection
(query.channel(channel_name="Engine_RPM")); without channel_tags,
channels are selected directly from columns on channel_metrics. The full
schema is on the Silver Layer ER Diagram. This page
is for engineers who already have measurement data (CSV, MDF4, a
vendor-specific binary, or Delta with a different shape) and need a starting
point for landing it in that layout.
Impulse does not ship an ingestion component. The library reads from the silver layer; producing it is your responsibility. Landing your data in the shape below during ingest is the simplest path. If reshaping is impractical for your situation, see Adapting to existing data layouts at the bottom of this page.
If your data already lives in Delta with different physical column names
than the contract below, you do not need to rewrite it. Impulse supports a
per-table physical-to-internal column-name mapping for every silver table
via SolverConfig. See
Column-name remapping with SolverConfig
below.
1. The contract
The full schema is on the ER diagram page. When ingesting your own data, the required invariants are:
- container_id is the primary key on container_metrics and the foreign key on every other table. One container is one recording (one test drive, one bench run, one telemetry session). Pick a stable integer/long ID per recording.
- (container_id, channel_id) identifies a channel within a container. Channel IDs are local to their container: channel_id = 1 in container A has nothing to do with channel_id = 1 in container B.
- channels supports two formats. The query engine accepts either:
  - Raw, one row per sample: (container_id, channel_id, timestamp, value).
  - RLE, one row per stable interval: (container_id, channel_id, tstart, tend, value). Run-length encoded data, where identical consecutive values are collapsed into intervals to significantly reduce processing time during analysis.
  An optional boolean is_plausible column lets the solver drop implausible samples when configured to (drop_implausible_data=True on DefaultSolver).
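To make the difference between the two channels formats concrete, here is a minimal pure-Python sketch that collapses raw rows into RLE intervals. The function name rle_encode is hypothetical and not part of Impulse. The sketch assumes that tend is the timestamp of the last sample in a run; check the ER diagram page for the convention your deployment expects. On real data volumes the same logic would normally be expressed as a Spark job.

```python
from itertools import groupby
from operator import itemgetter


def rle_encode(rows):
    """Collapse raw (container_id, channel_id, timestamp, value) rows into
    (container_id, channel_id, tstart, tend, value) intervals.

    Assumption: tend is the timestamp of the last sample in the run.
    """
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    intervals = []
    for (container_id, channel_id), samples in groupby(ordered, key=itemgetter(0, 1)):
        current = None  # [tstart, tend, value] of the interval being built
        for _, _, ts, value in samples:
            if current is not None and current[2] == value:
                current[1] = ts  # same value: extend the open interval
            else:
                if current is not None:
                    intervals.append((container_id, channel_id, *current))
                current = [ts, ts, value]  # value changed: open a new interval
        if current is not None:
            intervals.append((container_id, channel_id, *current))
    return intervals


raw = [
    (1, 1, 0.0, 800), (1, 1, 0.1, 800), (1, 1, 0.2, 950),
    (1, 1, 0.3, 950), (1, 1, 0.4, 800), (1, 2, 0.0, 1),
]
rle = rle_encode(raw)
```

Six raw rows become four intervals here. The saving grows with how long each signal holds a value, which is why slowly changing channels benefit most.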
The tag tables are optional, strict EAV — add them only if you want
tag-based selection. container_tags is (container_id, key, value);
channel_tags is (container_id, channel_id, key, value). TSAL then selects
recordings and signals by tag key, e.g. query.channel(channel_name="Engine_RPM")
looks up channel_tags.value where key = 'channel_name'. Without
channel_tags, channel selectors match columns on channel_metrics instead.
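The EAV lookup described above can be illustrated with a small pure-Python sketch. It is not Impulse's implementation: select_channels is a hypothetical helper, and it assumes each channel has at most one value per tag key.

```python
def select_channels(channel_tags, **selectors):
    """Return sorted (container_id, channel_id) pairs whose tags match
    every key=value selector.

    channel_tags is an iterable of (container_id, channel_id, key, value).
    Assumption: at most one value per key for a given channel.
    """
    tags_by_channel = {}
    for container_id, channel_id, key, value in channel_tags:
        tags_by_channel.setdefault((container_id, channel_id), {})[key] = value
    return sorted(
        channel
        for channel, tags in tags_by_channel.items()
        if all(tags.get(key) == value for key, value in selectors.items())
    )


channel_tags = [
    (1, 1, "channel_name", "Engine_RPM"),
    (1, 1, "unit", "rpm"),
    (1, 2, "channel_name", "Vehicle_Speed"),
    (2, 1, "channel_name", "Engine_RPM"),
]
by_name = select_channels(channel_tags, channel_name="Engine_RPM")
by_name_and_unit = select_channels(channel_tags, channel_name="Engine_RPM", unit="rpm")
```

The first selector matches a channel in each container, because channel IDs only have meaning within their own container. The second selector narrows the match to the one channel that also carries a unit tag.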
The remaining columns on container_metrics and channel_metrics
(timestamps, durations, mean/min/max, etc.) are not fixed by the engine —
they are surfaced into the gold-layer dimensions through your
report configuration. Add the columns your
queries need; you do not have to match the demo schema column-for-column.
2. Worked example: the demo CSVs
The repository ships pre-shaped silver-layer fixtures at
demos/data/reporting/:
container_metrics.csv
container_tags.csv
channel_metrics.csv
channel_tags.csv
channels.csv # raw format: (container_id, channel_id, timestamp, value)
The Getting Started notebook
(demos/getting_started.ipynb)
loads them into Delta tables in five lines:
import os, pandas as pd

# DEMOS_DIR, CATALOG, SCHEMA, TABLE_PREFIX and spark are assumed to be defined already.
csv_dir = os.path.join(DEMOS_DIR, "data", "reporting")
for t in ["container_metrics", "container_tags",
          "channel_metrics", "channel_tags", "channels"]:
    (spark.createDataFrame(pd.read_csv(f"{csv_dir}/{t}.csv"))
        .write.mode("overwrite")
        .saveAsTable(f"{CATALOG}.{SCHEMA}.{TABLE_PREFIX}_{t}"))
If your data is already in this shape, that is your ingestion. The rest of this page is for the cases where it isn't.
3. The general pipeline shape
Real-world ingestion of measurement data on Databricks tends to follow the same skeleton, regardless of input format:
- File detection. Raw files arrive in a Unity Catalog Volume. Use Auto Loader (cloudFiles) to detect them and append a discovery row to a status Delta table you control.
- Format-specific decode. A Spark job picks up unprocessed rows from status, opens each file with the appropriate reader (asammdf for MDF4, the CSV reader for CSV, a vendor SDK for proprietary binary), and writes decoded numeric samples to a bronze Delta table.
- Bronze → silver. Either write samples directly as raw channels, or collapse consecutive identical samples per (container_id, channel_id) into intervals (RLE). Derive the four metadata tables (*_tags, *_metrics) from per-recording and per-channel attributes captured during decode.
- Run-status tracking. Mark each run_id succeeded or failed in status. On failure, roll back any partial silver writes for that run_id so the silver layer stays transactional with respect to source files.
- Maintenance. Periodically OPTIMIZE the silver tables. channels is by far the largest: cluster or Z-order it on container_id, channel_id.
This is a pattern, not a recipe. Implement only the steps your situation
needs (e.g. one-shot loads can skip Auto Loader and the status table
entirely).
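The run-status step is the easiest one to get subtly wrong, so here is a minimal sketch of its control flow. It uses in-memory stand-ins (a list for the silver table, a dict for status) in place of Delta tables, and it assumes every silver row carries the run_id that wrote it so partial writes can be found again. All names are hypothetical.

```python
def ingest_file(run_id, path, decode, silver, status):
    """Decode one file into `silver` and record the outcome in `status`.

    `silver` (list of dicts) and `status` (dict keyed by run_id) stand in
    for Delta tables; `decode` is a hypothetical format-specific reader.
    """
    status[run_id] = {"path": path, "state": "running"}
    try:
        for row in decode(path):
            silver.append({"run_id": run_id, **row})
    except Exception as exc:
        # Roll back: remove every row this run managed to write.
        silver[:] = [r for r in silver if r["run_id"] != run_id]
        status[run_id].update(state="failed", error=str(exc))
    else:
        status[run_id]["state"] = "succeeded"
    return status[run_id]["state"]


def decode_ok(path):
    yield {"container_id": 1, "channel_id": 1, "timestamp": 0.0, "value": 800.0}


def decode_truncated(path):
    yield {"container_id": 2, "channel_id": 1, "timestamp": 0.0, "value": 5.0}
    raise ValueError("truncated file")  # simulate a corrupt source file


silver, status = [], {}
first = ingest_file("run-1", "a.mf4", decode_ok, silver, status)
second = ingest_file("run-2", "b.mf4", decode_truncated, silver, status)
```

After both calls, only rows from run-1 remain in silver, and status records why run-2 failed. Against real Delta tables the rollback would be a delete scoped to the failed run_id instead of a list filter.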
4. Format-specific notes
CSV
The five-line loader in section 2 works as-is when the CSVs already match
the silver-layer shape. If your CSV uses different column names, rename
them in a select(...) before saveAsTable. If columns are spread across
multiple files (e.g. one CSV per signal), reshape during decode so each
container's samples land in channels together.
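As an illustration of the one-CSV-per-signal case, the sketch below merges several per-signal files into raw channels rows and derives matching channel_tags rows. It assumes each file has a header of time,value and that the signal name is known from the file name; both are assumptions about your data, not requirements of Impulse. channels_rows is a hypothetical helper.

```python
import csv
import io


def channels_rows(container_id, signal_files):
    """Merge per-signal CSVs into raw channels rows and channel_tags rows.

    signal_files maps a signal name to an open text handle whose CSV has
    the (assumed) header: time,value. Channel IDs are assigned 1..n in
    signal-name order and are local to this container.
    """
    channels, channel_tags = [], []
    for channel_id, (name, handle) in enumerate(sorted(signal_files.items()), start=1):
        channel_tags.append((container_id, channel_id, "channel_name", name))
        for record in csv.DictReader(handle):
            channels.append(
                (container_id, channel_id, float(record["time"]), float(record["value"]))
            )
    return channels, channel_tags


# In-memory files keep the example self-contained; real code would open paths.
signal_files = {
    "Vehicle_Speed": io.StringIO("time,value\n0.0,12.5\n"),
    "Engine_RPM": io.StringIO("time,value\n0.0,800\n0.1,820\n"),
}
channels, channel_tags = channels_rows(7, signal_files)
```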
MDF4 (ASAM)
A Databricks solutions accelerator for ingesting raw MDF4 data into the silver-layer model is in preparation. The pattern below describes the underlying approach.
Decode each file with asammdf in
a Spark UDF. For each numeric channel, emit
(container_id, channel_id, timestamp, value) rows into a bronze Delta
table, then run a Spark job that derives channels (raw or RLE) and the
metadata tables. Honor MDF4's per-sample invalidation bits — drop or mark
invalid samples before RLE encoding (the is_plausible column on channels
is the natural place to record them).
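One way to carry invalidation information forward is to emit it as is_plausible on each bronze row, so the decision to drop or keep can be made later. The sketch below takes plain sequences and does not call asammdf itself. It assumes the decoder hands you per-sample timestamps, values, and invalidation flags of equal length. bronze_rows is a hypothetical helper.

```python
def bronze_rows(container_id, channel_id, timestamps, samples, invalidation_bits=None):
    """Build (container_id, channel_id, timestamp, value, is_plausible) rows.

    invalidation_bits is a per-sample sequence where a truthy entry marks
    the sample invalid; None means the channel has no invalidation data.
    """
    if invalidation_bits is None:
        invalidation_bits = [False] * len(samples)
    return [
        (container_id, channel_id, float(ts), float(value), not bool(invalid))
        for ts, value, invalid in zip(timestamps, samples, invalidation_bits)
    ]


rows = bronze_rows(1, 3, [0.0, 0.1, 0.2], [10, 9999, 11], [False, True, False])
# Dropping instead of marking is a simple filter on the last field.
plausible_only = [row for row in rows if row[4]]
```

Marking keeps the original sample count visible for auditing, while dropping before RLE encoding prevents an invalid sample from splitting an otherwise stable interval.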
Already in Delta but in a different shape
Write a one-shot ETL: SELECT from your existing tables and saveAsTable
into the five silver tables. The most common gap is missing tags. If your
source data carries metadata as wide columns on the recordings table
(vehicle_brand, vehicle_model, ...), unpivot them into
(container_id, key, value) rows before writing to container_tags.
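The unpivot itself is small. The pure-Python sketch below shows the row-level transformation, assuming tag values are stored as strings and that missing values should produce no tag row. Both are assumptions to check against your schema. unpivot_tags is a hypothetical helper; in Spark the same reshaping is typically done with a stack expression.

```python
def unpivot_tags(recordings, id_column="container_id"):
    """Turn wide recording rows into (container_id, key, value) tag rows.

    Assumptions: values are cast to str, and None values emit no row.
    """
    rows = []
    for recording in recordings:
        for key, value in recording.items():
            if key == id_column or value is None:
                continue
            rows.append((recording[id_column], key, str(value)))
    return rows


recordings = [
    {"container_id": 1, "vehicle_brand": "Acme", "vehicle_model": "X1"},
    {"container_id": 2, "vehicle_brand": "Acme", "vehicle_model": None},
]
container_tags = unpivot_tags(recordings)
```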
Vendor-specific binary
The MDF4 pattern generalises: decode with the vendor SDK, emit numeric
samples to bronze, collapse to silver. If the vendor SDK is not Spark-native,
run the decode inside a mapPartitions function and accept that the decode stage is
your throughput bottleneck.
Adapting to existing data layouts
Reshaping into the silver-layer shape during ingest is the recommended path for new deployments. If your data already lives in Delta tables with different column names or a fundamentally different layout — and rewriting that data is impractical — Impulse offers two escape hatches.
Column-name remapping with SolverConfig
SolverConfig
declares per-table mappings from your physical column names to the
engine's internal names (container_id, channel_id, tstart, tend,
value, key, ...). Each silver table has its own TableConfig
section with a column_name_mapping dict and an optional filters dict
for equality scoping (project/toolbox/etc.). The mapping is applied
once, when each table is read; everything downstream uses the
internal names.
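Conceptually, the mapping step behaves like the sketch below. This is not SolverConfig's API or file format: read_table and the example column names are hypothetical. It also assumes filters are written against physical column names and applied before renaming. See the Solver column mappings and filters page for the actual schema.

```python
def read_table(rows, column_name_mapping, filters=None):
    """Apply equality filters, then rename physical columns to internal names.

    column_name_mapping maps physical name -> internal name. Columns that
    are not in the mapping pass through unchanged.
    Assumption: filters use physical column names.
    """
    filters = filters or {}
    result = []
    for row in rows:
        if any(row.get(column) != expected for column, expected in filters.items()):
            continue  # row is outside the configured scope
        result.append({column_name_mapping.get(c, c): v for c, v in row.items()})
    return result


physical_rows = [
    {"recording_id": 10, "signal_id": 1, "t_from": 0.0, "t_to": 0.5, "val": 800, "project": "alpha"},
    {"recording_id": 11, "signal_id": 1, "t_from": 0.0, "t_to": 0.2, "val": 750, "project": "beta"},
]
mapping = {
    "recording_id": "container_id",
    "signal_id": "channel_id",
    "t_from": "tstart",
    "t_to": "tend",
    "val": "value",
}
internal_rows = read_table(physical_rows, mapping, filters={"project": "alpha"})
```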
Use this when the logical shape of your silver layer matches Impulse's expectations — same set of tables and relationships — but the column names differ. See Solver column mappings and filters for the full schema.
Set query_engine.solver_config in your report config. DefaultSolver
consumes every section that applies to the tables you have configured —
column mappings, per-table filters, project_id, and the
channel_mapping / unit_conversion sections.
The trade-off: this gives you naming flexibility and per-table
scoping filters without writing code, but the underlying tables must
still follow the silver-layer relationships (EAV tag tables,
per-(container_id, channel_id) channels rows, etc.) and the internal
key names (container_id, channel_id) themselves are fixed
constants. Their column types are not fixed, though — container_id
in particular may be a long, int, or string; the engine adopts
whatever type your tables use, as long as it is consistent across them.
For different relationships or composite keys, see custom solvers below.
Custom solvers
For physical layouts that do not match the silver-layer relationships at
all — no EAV tag tables, alias lookup tables instead of channel_tags,
computed-column joins, JSON-encoded values, multi-column composite keys
that need pre-processing, etc. — you can implement a custom solver by
subclassing
QuerySolver
(or one of the existing solvers) and registering it in your report config.
This is significantly more involved than the SolverConfig path: you take
on responsibility for the four solver pipeline stages
(filter_container_tags, filter_container_metrics, filter_channel_tags,
filter_channel_metrics) and the solve method. Some advanced deployments
do this — e.g. when the customer's silver layer pre-dates Impulse and
synthesises Impulse-shaped views via SQL CTEs at query time. If you find
yourself heading down this path, it is usually worth first asking whether
a one-time ETL job to produce the standard silver-layer shape would be
cheaper.
The general rule: SolverConfig for naming differences, custom solver
for structural differences, ETL into the standard shape for everything
else.