Databricks Declarative Pipelines (Lakeflow / DLT)
Category: Data-Pipeline · Areas: data, infra
Description
Category
data-pipeline
Areas
data, infra
Platform
Platform-specific (Databricks). Lakeflow Spark Declarative Pipelines (SDP) — formerly Delta Live Tables (DLT) — is Databricks’ declarative ETL framework. It is one component of Lakeflow (with Lakeflow Connect for ingestion and Lakeflow Jobs, formerly Workflows, for orchestration). Existing DLT code runs unchanged; the current API names are preferred going forward.
Boundary
This concern owns declarative ETL on Databricks — defining datasets and transformations and data-quality expectations, and letting the framework manage the dependency DAG, incremental processing, and quality enforcement. It must not duplicate its neighbors:
- Generic data modeling /
domain-driven-designowns what the data means — the logical model the pipeline’s datasets realize. This concern owns how the transformation is declared and run. Do not re-derive the domain model in pipeline terms. enterprise-integration-patterns(EIP) owns asynchronous messaging and integration across system boundaries — channels, routers, idempotent receivers, dead-letter handling for message flow. Declarative pipelines own in-pipeline batch/streaming dataset transformation with a framework-managed DAG. They relate at ingestion (a pipeline may read a Kafka/ Kinesis source), but the expectation/DAG/incremental machinery here is not EIP’s channel/router machinery — name the pipeline pattern; leave cross-system messaging to EIP. (A pipeline reading from a broker still respects EIP’s at-least-once/idempotency reality at the ingestion edge.)unity-cataloggoverns the datasets the pipeline produces (its streaming tables and materialized views land in acatalog.schema); this concern produces them. Do not restate the grant model here.testing/verificationown test strategy and the evidence gate. This concern owns the declarative data-quality expectations that run as part of the pipeline — a complementary, in-pipeline quality control, not a replacement for unit/e2e testing.
Components
A pipeline is a set of declarative datasets; the framework builds the dependency DAG from how datasets reference each other, resolves order automatically, and processes data incrementally.
Datasets — declared, not orchestrated by hand
- Streaming table — incrementally processes an append/streaming source.
Python:
@dp.table(legacy@dlt.table); SQL:CREATE OR REFRESH STREAMING TABLE. - Materialized view — a transformation kept up to date. Python:
@materialized_view(legacy via@dlt.tableon a batch query); SQL:CREATE OR REFRESH MATERIALIZED VIEW. - Temporary view — intermediate, not published. Python:
@temporary_view(legacy@dlt.view). - Python import:
from pyspark import pipelines as dp(legacyimport dlt).
Data-quality expectations — declared on the dataset
Constraints evaluated on each row; three actions:
- warn (default / retain) — invalid records are kept and the violation is
reported in metrics.
- Python:
@dp.expect("valid_ts", "ts > '2012-01-01'") - SQL:
CONSTRAINT valid_ts EXPECT (ts > '2012-01-01')
- Python:
- drop — invalid records are dropped before write; reported in metrics.
- Python:
@dp.expect_or_drop("valid_page", "page_id IS NOT NULL") - SQL:
... EXPECT (...) ON VIOLATION DROP ROW
- Python:
- fail — invalid records halt the pipeline update.
- Python:
@dp.expect_or_fail("valid_count", "count > 0") - SQL:
... EXPECT (...) ON VIOLATION FAIL UPDATE
- Python:
- Grouped:
@dp.expect_all(dict),@dp.expect_all_or_drop(dict),@dp.expect_all_or_fail(dict). - Violations surface in the pipeline’s Data Quality tab / event log metrics (warn and drop record metrics; fail halts the run).
Framework-managed concerns
- Automatic dependency resolution and DAG construction from dataset references.
- Incremental processing for streaming tables; CDC via AUTO CDC /
APPLY CHANGESfor change-data-capture upserts. - Automatic lineage (composes with
unity-catalog). - Development vs production execution modes (dev: fast iteration / reused compute; prod: fresh compute, retries).
- Unit testing of transformation logic.
Constraints
Declare datasets and transformations; let the framework orchestrate
- Datasets are declared (streaming table / materialized view / temporary view) and the framework builds the dependency DAG, resolves order, and processes incrementally. Do not hand-roll orchestration, dependency wiring, or manual incremental bookkeeping that the framework owns.
- Use the streaming table for incremental/append sources and the materialized view for kept-fresh transformations — chosen deliberately, not by defaulting everything to one.
Every dataset declares data-quality expectations
- Each published dataset declares expectations with the action chosen to match intent: warn to observe-and-keep, drop to quarantine bad rows from the target, fail to halt on a violation that must never pass. A published dataset with no expectations is a quality gap.
- fail is for invariants that must never reach downstream; drop for rows that should not pollute the target but should not stop the pipeline; warn for observe-only. Reserve fail for genuine must-not-pass invariants (it halts the run).
CDC uses the framework’s change-apply, not hand-rolled merges
- Change-data-capture upserts use AUTO CDC /
APPLY CHANGES, not a hand-written merge that re-implements the framework’s CDC handling.
Datasets land in Unity Catalog
- The pipeline’s published tables/views land in a governed
catalog.schemaand are governed byunity-catalog(grants, ownership, lineage). Do not write to ungoverned/legacy locations.
Dev vs prod modes are distinct
- Use development mode for iteration and production mode for scheduled runs (fresh compute, retries) — do not ship dev-mode shortcuts to production.
Drift Signals (anti-patterns to reject in review)
- Hand-rolled orchestration / dependency wiring / manual incremental bookkeeping that the framework manages → declare datasets and let SDP build the DAG and process incrementally
- A published dataset with no expectations → declare expectations; choose warn/drop/fail to match the data-quality intent
- fail used where a row should merely be dropped (needlessly halting the pipeline), or warn used where a bad row must never reach downstream → match the action to intent
- A hand-written CDC merge re-implementing change handling → use AUTO CDC /
APPLY CHANGES - Pipeline output written to an ungoverned/legacy location instead of a
Unity Catalog
catalog.schema→ publish into the governed namespace - Dev-mode behavior (reused compute, skipped retries) shipped to production → run scheduled pipelines in production mode
- Expectations treated as a substitute for unit/e2e testing → expectations
are in-pipeline data-quality control; compose with
testing/verification
When to use
A product that runs ETL on the Databricks lakehouse as declarative pipelines
— ingest + transform into streaming tables and materialized views with
data-quality expectations, letting the framework own orchestration and
incremental processing. This is the data-pipeline member of the Databricks
family; select it together with unity-catalog (which governs the datasets
it produces) and, when the product is a Databricks-hosted app reading those
datasets, databricks-apps. It is composable (no slot); areas: data, infra
scopes its practices to the data and pipeline/infra work items.
Do not select it for ETL that does not run on Databricks, or for a thin app with no pipeline — use the generic data-modeling / EIP concerns there, and relate (do not duplicate) them here.
Artifact Impact
Selecting this concern requires these artifacts to change (a selected concern absent from them is drift):
- ADR: dataset shape (streaming-table vs materialized-view), expectation policy (warn/drop/fail), CDC strategy
- TD: declarative datasets + framework-managed DAG/incremental; AUTO CDC / APPLY CHANGES; dev vs prod modes
- DATA_DESIGN: published streaming tables/materialized views landing in a governed catalog.schema
- TEST_PLAN: per-dataset data-quality expectations + transformation-logic unit tests
ADR References
Record an ADR for the pipeline’s dataset shape (streaming-table vs
materialized-view boundaries), the data-quality expectation policy
(warn/drop/fail per dataset), and the CDC strategy (APPLY CHANGES). A material
uncertainty (source-system semantics, ingestion volume, CDC ordering) is a
tech-spike, not a silent assumption (see
workflows/references/concern-resolution.md).
Practices by activity
Agents working in any of these activities inherit the practices below via the bead’s context digest.
These practices govern declarative ETL on Databricks — declaring datasets,
transformations, and data-quality expectations and letting the framework manage
orchestration and incremental processing. They do not restate the logical data
model (domain-driven-design), the grant model (unity-catalog), or
cross-system messaging (enterprise-integration-patterns) — see the boundary in
concern.md.
Requirements (Frame activity)
- Confirm ETL runs on Databricks as declarative pipelines (Lakeflow / SDP, formerly DLT).
- Identify the sources (cloud storage, Kafka/Kinesis/etc.) and the published datasets the product needs, plus the data-quality invariants per dataset.
Design
- Decide each dataset’s shape: streaming table (incremental/append source) vs materialized view (kept-fresh transformation) vs temporary view (intermediate, unpublished).
- Design the data-quality expectation policy per dataset, choosing warn/drop/fail to match intent (observe-and-keep / quarantine / must-not-pass).
- Design CDC via AUTO CDC /
APPLY CHANGESwhere upserts from a change feed are needed — not a hand-rolled merge. - Target the pipeline output at a governed Unity Catalog
catalog.schema.
Implementation
- Import
from pyspark import pipelines as dp(legacyimport dltstill runs). - Declare datasets with
@dp.table(streaming table) /@materialized_view/@temporary_view, or the SQLCREATE OR REFRESH STREAMING TABLE/MATERIALIZED VIEW— let the framework build the dependency DAG and process incrementally. Do not hand-roll orchestration or incremental bookkeeping. - Declare expectations on each published dataset:
- warn:
@dp.expect("name", "predicate")/CONSTRAINT ... EXPECT (...) - drop:
@dp.expect_or_drop(...)/... ON VIOLATION DROP ROW - fail:
@dp.expect_or_fail(...)/... ON VIOLATION FAIL UPDATE - grouped:
@dp.expect_all/expect_all_or_drop/expect_all_or_fail
- warn:
- Use AUTO CDC /
APPLY CHANGESfor change-data-capture upserts. - Run scheduled pipelines in production mode (fresh compute, retries); use development mode only for iteration.
Testing / Verification
- Unit-test the transformation logic (the framework supports this).
- Verify expectations behave: feed a row that violates a drop expectation and confirm it is dropped from the target (and counted in metrics); feed a row that violates a fail expectation and confirm the update halts — observed in the Data Quality tab / event log, not assumed.
- Verify the published datasets land in the governed Unity Catalog
catalog.schemaand lineage is captured.
Boundary with neighbors
See concern.md for the canonical Boundary (vs domain modeling,
enterprise-integration-patterns, unity-catalog, testing /
verification). Composition in the Databricks family: this concern produces
the data; unity-catalog governs it; databricks-apps hosts apps that
consume it — each owns its piece, none restates the others.
Quality Gates
- ETL is expressed as declarative datasets (streaming tables / materialized views) with the framework owning the dependency DAG and incremental processing — no hand-rolled orchestration.
- Every published dataset declares data-quality expectations, with the action (warn/drop/fail) chosen to match intent; fail is reserved for must-not-pass invariants.
- A drop expectation is verified to drop violating rows from the target (with metrics), and a fail expectation is verified to halt the update — observed in the event log / Data Quality tab.
- CDC upserts use AUTO CDC /
APPLY CHANGES, not a hand-written merge. - Published datasets land in a governed Unity Catalog
catalog.schema(not a legacy/ungoverned location); scheduled runs use production mode.