ADR-012: Runbook owns incident response procedures; monitoring-setup owns detection only
Source identity (from
02-design/adr/ADR-012-runbook-owns-incident-response.md):
ddx:
id: ADR-012
depends_on:
- helix.prdADR-012: Runbook owns incident response procedures; monitoring-setup owns detection only
| Date | Status | Deciders | Related | Confidence |
|---|---|---|---|---|
| 2026-05-30 | Accepted | HELIX maintainers | plan-2026-05-30-artifact-types-and-concerns-audit, monitoring-setup, runbook | High |
Context
| Aspect | Description |
|---|---|
| Problem | The 2026-05-30 artifact-types-and-concerns audit flagged an ownership collision between monitoring-setup and runbook: both currently define incident-response routing. monitoring-setup’s template carries an Incident Response section (escalation paths, response procedures) that duplicates and competes with runbook’s Common Incident Procedures and escalation content. With two artifact types claiming the same surface, operators must reconcile two sources of truth at the moment they can least afford to. |
| Current State | monitoring-setup/template.md includes an ## Incident Response section. runbook/template.md includes ## Common Incident Procedures (with per-incident sections and a security/data-safety incident path) and explicit escalation routing. The two artifact types overlap on response procedures and escalation rather than partitioning detection from response. |
| Requirements | The catalog must assign incident-response ownership to exactly one artifact type. The other type must stop defining that surface so operators have a single canonical procedure document during an incident. |
Decision
monitoring-setup owns detection only: SLI/SLO definitions, alert
routing inputs, threshold tuning, and the observability surface that
produces signals.
runbook owns incident response: the procedures operators run when
those signals fire — recovery steps, common incident procedures,
escalation paths, and incident commander routing.
In Phase 3, the Incident Response section is removed from the
monitoring-setup template. Any content from that section not already
covered moves to runbook. After Phase 3, the catalog validator can
treat presence of an Incident Response (or equivalent response-procedure)
H2 in monitoring-setup as a drift signal.
Key Points: monitoring-setup = detection surface | runbook = operator procedure | Phase 3 removes Incident Response from monitoring-setup template | content not already in runbook moves there
Alternatives
| Option | Pros | Cons | Evaluation |
|---|---|---|---|
| Keep both artifact types defining incident response; rely on authors to keep them in sync | No catalog change | The audit found they are already out of sync; two sources of truth during an incident is the worst possible time for ambiguity | Rejected: status quo the audit flagged |
Move detection ownership into runbook and remove monitoring-setup | Single artifact type for the whole operational surface | Detection (SLI/SLO definitions, alert thresholds) and response (procedures, escalation) are authored by different people at different cadences; collapsing them loses that separation | Rejected: conflates two genuinely distinct authoring surfaces |
monitoring-setup owns detection; runbook owns response; Phase 3 removes the Incident Response section from monitoring-setup | Clean partition aligned to the actual purpose of each artifact type; single source of truth for procedures; enforceable by validator | Requires a content migration for any monitoring-setup content not already in runbook | Selected: smallest sufficient ownership fix |
Consequences
| Type | Impact |
|---|---|
| Positive | Operators have a single canonical document (runbook) for incident procedures and escalation. |
| Positive | monitoring-setup becomes focused on its actual purpose — the observability surface — without bleeding into operator procedure. |
| Positive | The catalog validator can flag any future reintroduction of response procedures into monitoring-setup as drift. |
| Positive | Authoring cadences are no longer entangled: detection thresholds can evolve without re-touching response procedures, and vice versa. |
| Negative | Phase 3 must perform a content migration from monitoring-setup’s Incident Response section to runbook for anything not already covered. |
| Negative | Existing monitoring-setup artifacts authored under the previous contract need a one-time edit to remove the Incident Response section. |
| Neutral | The two artifact types continue to live side-by-side in the 05-deploy activity; only their boundaries change. |
Risks
| Risk | Prob | Impact | Mitigation |
|---|---|---|---|
Content in monitoring-setup’s Incident Response section is lost rather than migrated | L | H | Phase 3 migration explicitly diffs the removed content against the runbook template before deleting; anything not already covered is added to runbook |
Authors continue to add response procedures to monitoring-setup after Phase 3 | M | M | Catalog validator flags Incident Response (or equivalent response-procedure) H2 in monitoring-setup as drift |
| The detection/response partition is unclear at the boundary (e.g. alert routing) | M | M | Detection ends at “signal produced and routed to a destination”; response begins at “operator receives signal and acts.” Alert routing destinations (PagerDuty, channel) are detection; what the recipient does is response |
| Existing references from outside the catalog point to the removed section | L | L | Phase 3 ships a redirect note in commit message and updates in-tree references in the same change |
Validation
| Success Metric | Review Trigger |
|---|---|
monitoring-setup/template.md contains no Incident Response section after Phase 3 | A PR reintroduces an Incident Response H2 (or equivalent response-procedure section) into monitoring-setup |
All response-procedure content from the removed section is present in runbook (either pre-existing or migrated) | Phase 3 lands without a diff confirming migration coverage |
The catalog validator flags monitoring-setup artifacts that carry response-procedure sections | A monitoring-setup artifact ships with response-procedure content and validation passes |
Operators consulting runbook during an incident find escalation and procedure content without needing to cross-reference monitoring-setup | An incident retrospective surfaces split-source-of-truth as a contributing factor |
References
- Plan: artifact-types-and-concerns audit (2026-05-30)
- PRD
workflows/activities/05-deploy/artifacts/monitoring-setup/template.mdworkflows/activities/05-deploy/artifacts/runbook/template.md