Data Architecture

Purpose

Data Architecture is the highest-authority structural artifact for data pipeline design in the Design activity. Its unique job is to describe the durable pipeline shape: ingestion patterns, medallion layer topology, streaming vs. batch semantics, transformation patterns, governance boundaries, quality gates, and critical performance or cost tradeoffs.

Data Architecture is not a data model (captured in Data Design), an implementation plan, or an ADR. It bridges the requirements in the PRD (kind: data) and the implementation: “given these requirements, here is how the pipeline is structured.”

Example


---
ddx:
  id: example.data-architecture.customer-360
  # Previous depends_on: example.data-prd.customer-360 — dropped when
  # data-prd collapsed into prd as kind: data variant (ADR-008). No
  # equivalent example.prd.customer-360 is yet published.
---

# Data Architecture: Customer-360 Analytics

## Scope

This architecture covers the Customer-360 medallion pipeline: daily batch ingestion
of Salesforce accounts and opportunities, plus Stripe customers, subscriptions, invoices,
and charges into a Databricks Lakehouse. It includes Bronze raw-layer storage,
Silver reconciliation and deduplication, and Gold fact/dimension tables for
analytics queries. Historical loads of 12 months are supported; incremental daily
loads begin in week 2. Streaming ingestion, ML training stores, and external data
warehouse federation are outside v1 scope.

## Level 1: System Context

| Element | Type | Purpose | Protocol |
|---------|------|---------|----------|
| Salesforce | External source | Customer accounts, opportunities, ownership | HTTPS REST API; daily full export |
| Stripe | External source | Customers, subscriptions, invoices, charges | HTTPS REST API; daily full export (webhook in v2) |
| Databricks Lakehouse | Data Platform | Medallion storage and compute for ingestion and queries | Databricks SQL; PySpark jobs |
| BI Tool (Tableau/Sigma) | Consumer | Sales and finance dashboards querying Gold tables | Databricks SQL via ODBC |
| Data Engineer | Role | Orchestrates jobs, monitors SLAs, maintains schemas | Databricks workflows, notebooks |

```mermaid
graph TB
    SF[Salesforce<br/>Accounts + Opps] -->|HTTPS API<br/>Daily export| DBX[Databricks Lakehouse<br/>Bronze + Silver + Gold]
    Stripe[Stripe<br/>Customers + Subscriptions<br/>+ Invoices + Charges] -->|HTTPS API<br/>Daily export| DBX
    DBX -->|Databricks SQL| BI[BI Tool<br/>Sales Dashboards]
    DBX -->|Databricks SQL| DE[Data Engineer<br/>Monitoring]
```

## Level 2: Medallion Architecture

### Bronze Layer (Raw)

Immutable copies of source system exports, organized by system and entity.

| Table | Source | Partitioning | Retention | Notes |
|-------|--------|--------------|-----------|-------|
| bronze.salesforce_accounts | Salesforce API | date_loaded | 90 days | Full daily export; preserves all fields |
| bronze.salesforce_opportunities | Salesforce API | date_loaded | 90 days | Full daily export; includes closed_date |
| bronze.stripe_customers | Stripe API | date_loaded | 90 days | Full daily export; includes metadata tags |
| bronze.stripe_subscriptions | Stripe API | date_loaded | 90 days | Full daily export; includes status changes |
| bronze.stripe_invoices | Stripe API | date_loaded | 90 days | Full daily export; raw line items |
| bronze.stripe_charges | Stripe API | date_loaded | 90 days | Full daily export; includes payment outcomes |

**Quality**: No transformation; SLA violations block Silver load until Bronze is complete.
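
The Bronze-to-Silver gate can be pictured as a completeness check over the six Bronze tables. The following is a minimal sketch, not the pipeline's actual implementation: `missing_bronze_tables`, `bronze_is_complete`, and the `loaded_partitions` mapping are hypothetical names, and the in-memory dict stands in for whatever partition metadata the platform exposes.

```python
from datetime import date

# The six Bronze tables from the table above.
BRONZE_TABLES = [
    "bronze.salesforce_accounts",
    "bronze.salesforce_opportunities",
    "bronze.stripe_customers",
    "bronze.stripe_subscriptions",
    "bronze.stripe_invoices",
    "bronze.stripe_charges",
]


def missing_bronze_tables(loaded_partitions, load_date):
    """Return Bronze tables that have no partition for load_date.

    loaded_partitions is a hypothetical mapping of table name to the set of
    date_loaded values already present; it stands in for catalog metadata.
    """
    return [
        table
        for table in BRONZE_TABLES
        if load_date not in loaded_partitions.get(table, set())
    ]


def bronze_is_complete(loaded_partitions, load_date):
    """Silver may start only when every Bronze table has the day's partition."""
    return not missing_bronze_tables(loaded_partitions, load_date)


today = date(2024, 1, 15)
partitions = {table: {today} for table in BRONZE_TABLES}
partitions["bronze.stripe_charges"] = set()  # simulate a late Stripe export

print(missing_bronze_tables(partitions, today))
print(bronze_is_complete(partitions, today))
```

Returning the list of missing tables, instead of only a boolean, gives the on-call engineer something concrete to act on when the Silver load is held.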

### Silver Layer (Deduplicated & Reconciled)

Cleaned, deduplicated, and reconciled data with lineage and quality flags.

| Table | Source(s) | Partitioning | Retention | Key Transformations |
|-------|-----------|--------------|-----------|---------------------|
| silver.dim_customer | bronze.salesforce_accounts + bronze.stripe_customers | customer_id | 3 years | 1:1 Salesforce-to-Stripe match via email; hash PII; null-check on account names |
| silver.dim_date | N/A (calendar) | date_key | 5 years | Standard calendar table; fiscal month, quarter, year |
| silver.fct_subscription_event | bronze.stripe_subscriptions | subscription_id, event_date | 3 years | Deduplicate on Stripe subscription ID; flag late-arriving rows; join to dim_customer |
| silver.fct_payment_transaction | bronze.stripe_charges + bronze.stripe_invoices | charge_id, payment_date | 3 years | Flatten invoice line items; join charge to invoice and subscription; hash card brand |
| silver.reconciliation_log | N/A | load_date | 90 days | Count of matched/unmatched pairs per load; reconciliation confidence scores |

**Quality**: PII hashing, null validation, late-arriving fact flags, join lineage recorded.
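
The email match and PII hashing listed for `silver.dim_customer` could look roughly like the sketch below. It is illustrative only: the record fields (`id`, `email`), the function names, and the salt value are assumptions, and it performs exact matching on a normalized email, which is simpler than the fuzzy matching the architecture calls for.

```python
import hashlib


def normalize_email(email):
    """Lower-case and trim an email so the same address compares equal."""
    return email.strip().lower()


def hash_pii(value, salt="example-salt"):
    """Return a SHA-256 hex digest; the salt value here is a placeholder."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def match_customers(sf_accounts, stripe_customers):
    """Match Salesforce accounts to Stripe customers on normalized email.

    Returns (matched rows, unmatched Salesforce account IDs).
    """
    stripe_by_email = {
        normalize_email(c["email"]): c["id"] for c in stripe_customers
    }
    matched, unmatched = [], []
    for account in sf_accounts:
        key = normalize_email(account["email"])
        stripe_id = stripe_by_email.get(key)
        if stripe_id is None:
            unmatched.append(account["id"])
        else:
            matched.append(
                {
                    "salesforce_id": account["id"],
                    "stripe_id": stripe_id,
                    "email_hash": hash_pii(key),  # raw email is not kept
                }
            )
    return matched, unmatched


# Hypothetical sample records.
sf = [
    {"id": "001A", "email": " Ana@Example.com "},
    {"id": "001B", "email": "bo@example.com"},
]
stripe = [{"id": "cus_1", "email": "ana@example.com"}]

matched, unmatched = match_customers(sf, stripe)
print(len(matched), unmatched)
```

The matched and unmatched counts are the kind of figures that would feed `silver.reconciliation_log`.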

### Gold Layer (Aggregated Facts)

Business-ready tables for analytics and reporting.

| Table | Business Use | Grain | Partitioning | Retention |
|-------|------------------|-------|--------------|-----------|
| gold.fct_monthly_revenue | Sales forecasting, revenue metrics | 1 row per customer per month | customer_id, year_month | 3 years |
| gold.fct_subscription_health | Churn risk scoring, subscription metrics | 1 row per subscription | subscription_id, as_of_date | 3 years |
| gold.dim_customer_account | Account overview, drill-down | 1 row per customer | customer_id | 3 years |

**Computations**:
- `fct_monthly_revenue`: Sums paid invoices grouped by customer and calendar month; includes subscription state
- `fct_subscription_health`: Latest subscription status, months active, failed payment count, aging of unpaid invoices
- `dim_customer_account`: Joins Salesforce account attributes with current Stripe subscription status
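
As an illustration of the `fct_monthly_revenue` grain (one row per customer per month), here is a small sketch in plain Python. The invoice fields (`customer_id`, `status`, `paid_date`, `amount_cents`) are assumed for the example and are not taken from the Stripe schema; the subscription-state column is left out for brevity.

```python
from collections import defaultdict
from datetime import date


def monthly_revenue(invoices):
    """Sum paid invoice amounts per (customer_id, year_month)."""
    totals = defaultdict(int)
    for inv in invoices:
        if inv["status"] != "paid":
            continue  # only paid invoices count toward revenue
        year_month = inv["paid_date"].strftime("%Y-%m")
        totals[(inv["customer_id"], year_month)] += inv["amount_cents"]
    return dict(totals)


# Hypothetical sample invoices.
invoices = [
    {"customer_id": "c1", "status": "paid", "paid_date": date(2024, 1, 5), "amount_cents": 5000},
    {"customer_id": "c1", "status": "paid", "paid_date": date(2024, 1, 20), "amount_cents": 2500},
    {"customer_id": "c1", "status": "open", "paid_date": date(2024, 1, 25), "amount_cents": 9900},
    {"customer_id": "c2", "status": "paid", "paid_date": date(2024, 2, 1), "amount_cents": 1000},
]

revenue = monthly_revenue(invoices)
for key in sorted(revenue):
    print(key, revenue[key])
```

Amounts are kept as integer cents so that sums stay exact; in the real pipeline the same grouping would be a SQL aggregation over Silver tables.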

## Level 3: Data Flow

```mermaid
sequenceDiagram
    participant SF as Salesforce API
    participant Stripe as Stripe API
    participant DBX as Databricks
    participant Bronze as Bronze Tables
    participant Silver as Silver Tables
    participant Gold as Gold Tables
    participant BI as BI Tool

    SF->>DBX: Daily export (accounts, opps)
    Stripe->>DBX: Daily export (customers, subs, invoices, charges)
    DBX->>Bronze: Land raw data; validate schema and completeness
    Note over DBX: Reconciliation: match Salesforce-Stripe via email
    Bronze->>Silver: Deduplicate, hash PII, join and flag late arrivals
    Note over Silver: Check reconciliation accuracy (98% threshold)
    Silver->>Gold: Aggregate facts and dimensions
    Gold->>BI: SQL query for dashboards
    Note over BI: Sales forecast, churn alerts, AR aging
```

## Level 4: Deployment and Compute

### Orchestration

| Component | Technology | Schedule | Resource | SLA |
|-----------|----------|----------|----------|-----|
| Salesforce Export Job | Databricks Workflow + PySpark | 10pm UTC daily | 2-worker job cluster, 8 DBU | Complete by 2am UTC |
| Stripe Export Job | Databricks Workflow + PySpark | 10pm UTC daily | 2-worker job cluster, 8 DBU | Complete by 2am UTC |
| Reconciliation + Silver Load | Databricks Workflow + SQL | 3am UTC daily (after Bronze) | 2-worker job cluster, 8 DBU | Complete by 5am UTC |
| Gold Aggregation + Refresh | Databricks Workflow + SQL | 5am UTC daily (after Silver) | 2-worker job cluster, 8 DBU | Complete by 7am UTC |
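
The dependency chain in the orchestration table (both exports, then Silver, then Gold) can be sketched with the standard-library `graphlib` module. This is only a model of the ordering and SLA check, not a Databricks Workflows definition: the job keys are hypothetical, and finish times are simplified to hours after midnight UTC on the morning following the 10pm start.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
JOB_DEPENDENCIES = {
    "salesforce_export": set(),
    "stripe_export": set(),
    "silver_load": {"salesforce_export", "stripe_export"},
    "gold_refresh": {"silver_load"},
}

# SLA deadlines in UTC hours, taken from the orchestration table.
SLA_HOUR_UTC = {
    "salesforce_export": 2,
    "stripe_export": 2,
    "silver_load": 5,
    "gold_refresh": 7,
}


def run_order(dependencies):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def sla_breaches(finish_hour_utc):
    """Return jobs whose finish hour is later than their SLA hour."""
    return sorted(
        job for job, hour in finish_hour_utc.items() if hour > SLA_HOUR_UTC[job]
    )


order = run_order(JOB_DEPENDENCIES)
print(order)

# Hypothetical finish times for one night's run.
finishes = {
    "salesforce_export": 1.5,
    "stripe_export": 2.5,
    "silver_load": 4.0,
    "gold_refresh": 6.5,
}
print(sla_breaches(finishes))
```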

### Compute Sizing

- **Job Cluster**: 2 workers, 8 DBU/hour per cluster
- **Estimated Monthly Cost**: 4 jobs × 8 DBU × 30 days = 960 DBU ≈ $480 USD
- **Query Workload**: +50 DBU/month for analyst ad-hoc queries (estimate)
- **Total Budget**: ≤ $500 USD/month
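
The sizing arithmetic can be checked with a few lines of Python. Two assumptions are made here that the figures above imply but do not state: each job runs for about one hour per day at 8 DBU/hour, and the price per DBU is the $0.50 implied by "960 DBU ≈ $480".

```python
JOBS_PER_DAY = 4
DBU_PER_JOB_RUN = 8        # assumes each job runs about one hour at 8 DBU/hour
DAYS_PER_MONTH = 30
AD_HOC_DBU = 50            # analyst query estimate from the list above
USD_PER_DBU = 480 / 960    # rate implied by "960 DBU ≈ $480"
BUDGET_USD = 500


def monthly_cost_usd(include_ad_hoc):
    """Return the estimated monthly spend in USD."""
    dbu = JOBS_PER_DAY * DBU_PER_JOB_RUN * DAYS_PER_MONTH
    if include_ad_hoc:
        dbu += AD_HOC_DBU
    return dbu * USD_PER_DBU


print(monthly_cost_usd(include_ad_hoc=False))  # 480.0
print(monthly_cost_usd(include_ad_hoc=True))   # 505.0
```

Under these assumptions the ad-hoc query allowance brings the estimate to about $505, slightly above the $500 ceiling, so the actual DBU rate and job runtimes are worth confirming before the budget is treated as met.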

### Storage

| Layer | Format | Location | Retention Policy |
|-------|--------|----------|------------------|
| Bronze | Delta | s3://main-catalog/customer_360_bronze/ | Delete after 90 days |
| Silver | Delta | s3://main-catalog/customer_360_silver/ | Delete rows older than 3 years, then reclaim files with Delta VACUUM |
| Gold | Delta | s3://main-catalog/customer_360_gold/ | Delete rows older than 3 years, then reclaim files with Delta VACUUM |

## Quality Attributes

| Attribute | Target | Strategy | Verification |
|-----------|--------|----------|--------------|
| Data Freshness | Gold tables available by 7am UTC daily | Orchestrated daily batch (Silver complete by 5am UTC, Gold by 7am UTC); monitor job logs for failures | Scheduled report execution; query execution logs |
| Reconciliation Accuracy | ≥ 98% Salesforce-Stripe matched pairs | Fuzzy email matching in Silver; confidence scoring on match quality | Daily reconciliation_log audit; manual spot-check |
| Lineage Traceability | 100% of Gold rows trace to Bronze source records | Preserve source IDs and load timestamps through all layers | Audit queries joining Gold → Silver → Bronze |
| Cost Containment | ≤ $500 USD/month | Monitor job runtime and query execution time; set alarms on DBU overage | Monthly billing dashboard in Databricks |
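
The reconciliation accuracy target lends itself to a simple daily check over the matched and unmatched counts that `silver.reconciliation_log` is described as holding. The sketch below is a hedged illustration: the row shape (`load_date`, `matched`, `unmatched`) and the function names are assumptions, not the table's actual schema.

```python
ACCURACY_THRESHOLD = 0.98  # target from the Quality Attributes table


def match_rate(matched, unmatched):
    """Fraction of pairs that matched; returns 0.0 when there is no data."""
    total = matched + unmatched
    return matched / total if total else 0.0


def passes_reconciliation(matched, unmatched, threshold=ACCURACY_THRESHOLD):
    """True when the load meets the reconciliation accuracy target."""
    return match_rate(matched, unmatched) >= threshold


# Hypothetical rows shaped like silver.reconciliation_log entries.
log_rows = [
    {"load_date": "2024-01-14", "matched": 990, "unmatched": 10},
    {"load_date": "2024-01-15", "matched": 970, "unmatched": 30},
]

for row in log_rows:
    rate = match_rate(row["matched"], row["unmatched"])
    ok = passes_reconciliation(row["matched"], row["unmatched"])
    print(row["load_date"], round(rate, 3), ok)
```

Treating an empty load as a failure (rate 0.0) is a deliberate choice in this sketch, so that a silent no-data day does not pass the gate.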

## Key Design Decisions

| Decision | Rationale | Tradeoffs |
|----------|-----------|-----------|
| Daily batch, not streaming | Stripe webhook integration costs 2+ weeks; batch fully validates; sales SLA accepts 24-hour latency | Data can be up to 24 hours stale; no real-time churn alerts; easier to replay failed days |
| Separate Bronze/Silver/Gold schemas | Data governance: PII isolation, access control per layer, easy to backfill one layer without reprocessing others | More tables to maintain and document; requires clear naming conventions |
| Salesforce-Stripe match via email + fuzzy matching | Email is the most reliable cross-system identifier; fuzzy matching handles case and domain normalization | Less than 100% accuracy; requires manual linking for edge cases; depends on email data quality |
| Flatten Stripe invoice line items in Silver | Simplifies Gold aggregations; avoids multi-row-per-invoice complexity in joins | Denormalizes at Silver (but Silver is allowed to denormalize for analytics) |
| Hash card brand (not full card) in Silver | PCI compliance: no raw card tokens or full numbers stored | Aggregate metrics cannot distinguish card issuer; acceptable for v1 |

## Future Considerations

- **Streaming Subscriptions**: Stripe webhooks in v2 for sub-minute payment latency
- **ML Feature Store**: Separate feature-engineering layer for churn-scoring models
- **Cross-System Orchestration**: Airflow/dbt Cloud for multi-workspace lineage
- **Snowflake Federation**: External tables for cost optimization if query volume scales

Reference

Activity	Design — Decide how to build it. Capture trade-offs, contracts, and architecture decisions.
Default location	`docs/helix/02-design/data-architecture.md`
Requires	None
Enables	None
Informs	Data Quality Expectations, Technical Design, Solution Design
Referenced by	Data Quality Expectations, Implementation Plan, Runbook
Generation prompt

# Data Architecture Generation Prompt

Document the data pipeline architecture that the team needs to build, review, operate, and evolve the data product.

## Purpose

Data Architecture is the highest-authority structural artifact for data pipeline design in the Design activity. Its unique job is to describe the durable pipeline shape: ingestion patterns, medallion layer topology, streaming vs. batch semantics, transformation patterns, governance boundaries, quality gates, and critical performance or cost tradeoffs.

Data Architecture is not a data model (captured in Data Design), implementation plan, or ADR. It is the bridge between PRD (kind: data) (requirements) and implementation: "given these requirements, here is how the pipeline is structured."

## Reference Anchors

Use these local resource summaries as grounding:

- `docs/resources/databricks-lakehouse-medallion-architecture.md` grounds medallion topology (Bronze/Silver/Gold layer responsibilities, transformations, and quality gates).
- `docs/resources/databricks-auto-loader.md` grounds cloud-native ingestion patterns for incremental, scalable, schema-aware source connectors.
- `docs/resources/databricks-streaming-tables.md` grounds declarative streaming and materialized views for real-time transformations and quality enforcement.
- `docs/resources/databricks-sdp.md` grounds SDP lineage, governance, and quality-first design through `EXPECT ... ON VIOLATION ...` clauses and contract-driven pipeline composition.

## Focus

- Sketch the medallion layer flow: what lands in Bronze, what transformations happen in Silver, what business tables live in Gold.
- Name ingestion patterns (Auto Loader, Streaming Tables, batched SQL, CDC) and why each is used for its source.
- Document transformation semantics: idempotence, exactly-once vs. at-least-once, stateful operations, and how schema evolution is handled.
- Specify governance and quality checkpoints: where data is validated, which layers enforce which contracts, and how SLA compliance is monitored.
- Call out critical performance or cost tradeoffs: partitioning strategy, clustering, retention policy, incremental refresh vs. full rebuild.

## Role Boundary

Data Architecture describes pipeline topology and data flow, not the detailed data model (Data Design), not implementation sequences (Implementation Plan), and not individual quality checks (Data Quality Expectations).

Non-Databricks platforms: see `docs/resources/databricks-platform-substitution.md` for the equivalent terms on Snowflake, BigQuery, and on-prem stacks. The artifact shape and prompt stay the same.

## Completion Criteria

- Medallion layer diagram or description is clear (what lands where, why).
- Each layer's transformation responsibilities are explicit.
- Ingestion patterns name actual technologies and explain why each is used.
- Quality gates are named (where validation happens, what contracts are enforced).
- Performance/cost tradeoffs are visible (partitioning, clustering, retention, refresh strategy).
- Deployment topology is concrete (number of clusters, auto-scaling, failover).
- Major decisions link to PRD (kind: data) requirements or include inline rationale.
Template

---
ddx:
  id: data-architecture
---

# Data Architecture

Platform- and pipeline-level shape of the data product: medallion topology, processing-framework choices, governance model, and pipeline-level quality contracts. Entity-level modelling (logical schema, access patterns, constraints, migration) lives in [[data-design]].

## Overview

[Describe the data product being architected, the business problem it solves, and the system context. Name the key data flows and platform fit. Reference [[prd]] (kind: data) for the requirements and success metrics this architecture must satisfy.]

### Scope

[What data flows and systems are covered. What is deliberately out of bounds. Which requirements from [[prd]] (kind: data) drive the design decisions.]

### System Context

| External System | Role | Protocol | Data Volume |
|-----------------|------|----------|------------|
| [Source system] | [Role in the pipeline] | [API, batch export, CDC] | [Order-of-magnitude per period] |
| [Consumer system] | [How it consumes Gold] | [Delta share, SQL, API] | [Query volume] |

```mermaid
graph TB
    A["Source A"] -->|ingest| B["Data Platform"]
    C["Source B"] -->|ingest| B
    B -->|consumption layer| D["BI / Reporting"]
    B -->|feature store| E["ML Platform"]
```

## Medallion Topology

### Layer Strategy

[State the medallion strategy: Bronze (raw), Silver (validated), Gold (consumption). For each layer, name the transformation scope, quality gates, and consumer responsibilities. Justify the choice against [[prd]] (kind: data) freshness and quality requirements.]

### Bronze Layer (Raw Ingestion)

- Purpose: Land source data in its native form without transformation.
- Source integration pattern: [Auto Loader, Streaming Tables, scheduled batch import, CDC]
- Schema handling: [Strict / inferred / evolution policy]
- Retention policy: [Rationale tied to cost and replay needs]

Responsibilities:

- Ingest all records from source.
- Preserve source schema exactly (no renames or coercion).
- Tag records with ingest timestamp and source-system identifier.
- Quarantine records that fail schema validation.

Quality gates: ingest-metadata presence, no column truncation, source availability watchdog.

### Silver Layer (Validated and Transformed)

- Purpose: Cleansed, deduplicated, business-logic-ready data.
- Deduplication strategy: [Key + ordering rule]
- Type coercion / null policy: [Defaults vs reject]
- Referential integrity: [Which FK relationships are enforced and how]

Join strategy (pipeline-level — entity-level joins live in [[data-design]]):

| Join | Source Layers | Type | Cardinality | Latency Impact |
|------|---------------|------|-------------|----------------|
| [Logical join name] | [Left / Right] | [Inner / Outer] | [1:1 / 1:N] | [Qualitative] |

Quality gates: PK uniqueness, NOT NULL on critical columns, row-count reconciliation with Bronze within tolerance.

### Gold Layer (Consumption)

- Purpose: Business-ready tables optimised for consumer queries.
- Optimisation strategy: [Partitioning, clustering / z-order, materialised views — at the pipeline level, not column-level]
- Retention policy: [Compliance and analytics horizon]

Consumption tables (entity definitions live in [[data-design]]):

| Table | Use Case | Consumers | Freshness Target |
|-------|----------|-----------|------------------|
| [Gold table name] | [Use case from PRD (kind: data)] | [Persona] | [Target tied to SLA] |

Quality gates: aggregate reconciliation with Silver, referential integrity across Gold, latency within consumer SLA.

## Data Flow

[Describe how data moves through the medallion layers. Clarify ingestion frequency, transformation latency, and refresh strategy.]

```mermaid
graph LR
    A["Source"] -->|ingest pattern| B["Bronze"]
    B -->|transform job| C["Silver"]
    C -->|aggregate job| D["Gold"]
    D -->|published| E["Consumers"]
```

### Incremental vs Full Refresh

- Bronze: [CDC / append / full reload — rationale]
- Silver: [Incremental keys / full recalc — rationale]
- Gold: [Append-only / snapshot / merge — rationale]

## Processing Semantics

### Streaming vs Batch Decision

| Layer | Strategy | Rationale | SLA Implication |
|-------|----------|-----------|-----------------|
| Bronze | [Streaming / Batch / Incremental] | [Why] | [Freshness achieved] |
| Silver | [Streaming / Batch / Incremental] | [Why] | [Freshness achieved] |
| Gold | [Streaming / Batch / Incremental] | [Why] | [Freshness achieved] |

### Processing Framework

- Framework: [Databricks SQL, PySpark, dbt, Streaming Tables, Flink, …]
- Orchestration: [Workflows, Airflow, dbt Cloud, Dagster, …]
- Failure handling: [Retry policy, dead-letter queue, manual intervention]
- Idempotence / exactly-once posture: [Per layer]
- Schema evolution policy: [Auto-add / manual approval / strict]

### Latency and Throughput Targets

| Stage | Latency Target | Throughput Target | Binding Constraint |
|-------|----------------|-------------------|--------------------|
| Source → Bronze | [From PRD (kind: data) SLA] | [Order of magnitude] | [Rate limit, API quota] |
| Bronze → Silver | [From PRD (kind: data) SLA] | [Order of magnitude] | [Compute / dedup cost] |
| Silver → Gold | [From PRD (kind: data) SLA] | [Order of magnitude] | [Query complexity] |

## Pipeline-Level Quality Contracts

[Express the contracts the pipeline enforces at each layer boundary. Column-level field rules belong in [[data-quality-expectations]]; this section names which contracts the architecture commits to enforce and where.]

### Bronze → Silver

- Schema contract: [What Silver requires of Bronze]
- Volume contract: [Acceptable row-count delta]
- Freshness contract: [Max ingest lag before Silver is held]
- Violation handling: [Alert / hold / quarantine]

### Silver → Gold

- Uniqueness contract: [Which keys are unique at Gold]
- Referential contract: [Which FK relationships are guaranteed]
- Aggregate-reconciliation contract: [Sums and counts must agree within tolerance]
- Violation handling: [Reject / rollback / alert]

### Cross-Layer Contracts

| Contract | Assertion | If Violated |
|----------|-----------|-------------|
| [Row count Bronze → Silver] | [Within tolerance] | [Alert + manual audit] |
| [Cardinality Silver → Gold] | [Stable across refresh] | [Reject until reconciled] |
| [FK integrity across Gold] | [No orphans] | [Quarantine + alert] |

Detailed `EXPECT` clauses, field-level constraints, and freshness predicates live in [[data-quality-expectations]].

## Governance and Access Control

### Identity and Access Model

| Role | Catalog Scope | Layer Access | Permissions |
|------|---------------|--------------|-------------|
| [Role from PRD (kind: data) consumers] | [Catalog / schema] | [Bronze / Silver / Gold] | [SELECT / MODIFY / EXECUTE] |

### Data Classification and Retention

| Layer | Classification | Sensitive Categories | Retention Policy | Masking Policy |
|-------|----------------|----------------------|------------------|----------------|
| Bronze | [Class] | [Categories — not specific columns; those live in data-design] | [Policy tied to compliance] | [Who sees raw] |
| Silver | [Class] | [Categories] | [Policy] | [Who sees masked vs raw] |
| Gold | [Class] | [Categories] | [Policy] | [Default masking for BI] |

### Fine-Grained Access Control

- Row-level security: [Tenant / region predicate — policy, not the predicate code, which lives in [[data-design]]]
- Column-level security: [Which classifications are masked for which roles]
- Dynamic views: [Masking-function strategy]

## Platform Design

### Catalog Organisation

```
[catalog]
├── [schema]
│   ├── [bronze table family]
│   ├── [silver table family]
│   └── [gold table family]
├── metadata
│   ├── pipeline_runs
│   └── quality_metrics
```

### Compute Strategy

| Workload | Compute Tier | Sizing Approach | Rationale |
|----------|--------------|-----------------|-----------|
| Bronze ingestion | [Tier] | [Auto-scale bounds / fixed] | [Continuous vs scheduled] |
| Silver transformation | [Tier] | [Sizing approach] | [Batch vs streaming] |
| Gold consumption | [Tier] | [Sizing approach] | [Query pattern] |

Cost-shaping levers (qualitative — concrete numbers belong in operational runbooks, not the architecture):

- Spot / preemptible instances for retryable workloads.
- Auto-termination of idle clusters.
- Partition pruning and clustering for scan reduction.
- Materialised vs on-demand aggregates.

### Storage Strategy

| Layer | Format | Partitioning | Clustering / Optimisation |
|-------|--------|--------------|---------------------------|
| Bronze | [Delta / Iceberg / …] | [By date / source] | [Compaction policy] |
| Silver | [Format] | [By key / date] | [Z-order / cluster keys] |
| Gold | [Format] | [By query predicate] | [Materialised views / cache] |

### Platform Features in Use

| Feature | Use Case | Configuration Note |
|---------|----------|--------------------|
| [Auto Loader / equivalent] | [Bronze ingestion] | [Trigger mode, schema mode] |
| [Streaming Tables / equivalent] | [Bronze → Silver] | [Trigger / latency target] |
| [Pipeline orchestrator] | [End-to-end refresh] | [Schedule / dependency] |
| [Governance catalog] | [Access + lineage] | [Cross-team sharing posture] |

For non-Databricks platforms, see [`docs/resources/databricks-platform-substitution.md`](../../../../../docs/resources/databricks-platform-substitution.md) for the platform-equivalent terms.

## Decisions and Tradeoffs

### Key Architecture Decisions

| Decision | Choice | Rationale | Alternative Considered | Consequence |
|----------|--------|-----------|------------------------|-------------|
| [Medallion layering] | [Choice] | [Why] | [Alternative] | [Tradeoff] |
| [Streaming vs batch per layer] | [Choice] | [Why] | [Alternative] | [Tradeoff] |
| [Compute tier per workload] | [Choice] | [Why] | [Alternative] | [Tradeoff] |

### Performance vs Cost Tradeoffs

- [Real-time vs near-real-time ingestion — freshness gain vs sustained compute cost]
- [Materialised vs on-demand Gold aggregates — query latency vs storage]
- [Spot vs on-demand compute — cost savings vs interruption risk]

### Known Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| [Source rate limit causes backlog] | [Backoff + queue buffering + lag alert] |
| [PII exposure in Bronze] | [Masked views + audit logs] |
| [Schema drift from source] | [Schema registry + manual approval gate] |

---

## Review Checklist

- [ ] Scope clearly states which data flows are in / out of bounds.
- [ ] Medallion topology names Bronze / Silver / Gold purposes and transformation rules.
- [ ] Data flow diagrams show how data moves through layers and to consumers.
- [ ] Processing semantics explicitly state streaming vs batch per layer with latency targets tied to [[prd]] (kind: data).
- [ ] Pipeline-level quality contracts name which contracts each layer boundary enforces; detailed `EXPECT` clauses are deferred to [[data-quality-expectations]].
- [ ] Failure handling specifies what happens when a contract fails (alert, reject, quarantine, rollback).
- [ ] Access control model covers identity, row-level, column-level, and sensitive-data masking at the policy level.
- [ ] Platform design names catalog organisation, compute tiering, and storage strategy without committing to hardcoded cost numbers.
- [ ] Decisions and tradeoffs document key choices with rationale and alternatives considered.
- [ ] Cross-layer contracts are defined (reconciliation, cardinality, no orphans).
- [ ] SLA per layer is documented (freshness, latency, availability) and traces to [[prd]] (kind: data).
- [ ] No `[TBD]`, `[TODO]`, or `[NEEDS CLARIFICATION]` markers remain.
- [ ] Entity-level details (logical schema, indexes, migrations, store selection) are deferred to [[data-design]].
- [ ] For non-Databricks platforms, terms map via `docs/resources/databricks-platform-substitution.md`.