Skip to content

Monitoring Setup

Purpose

Service-specific observability and alerting setup required before or during rollout.

Example

Show a worked example of this artifact
---
ddx:
  id: example.monitoring-setup.depositmatch
  depends_on:
    - example.deployment-checklist.depositmatch
    - example.security-architecture.depositmatch
    - example.metric-definition.depositmatch
---

# Monitoring Setup

## Service Summary

- Service: DepositMatch CSV-first pilot
- Signals that matter most: import success, match-review availability, review
  decision latency, cross-tenant authorization denials, restricted telemetry
  violations, and audit-log write failures.

## Metrics Collection

| Category | Metrics | Notes |
|----------|---------|-------|
| Application | request latency, 5xx rate, import job duration, import failure count | Split by endpoint and workload |
| System | worker queue depth, database connection saturation, object-storage errors | Only page when user work is blocked |
| Business | imports completed, suggestions reviewed, manual match overrides | Dashboard only during pilot |
| Security | authorization denials, unsafe CSV cells neutralized, telemetry redaction failures, support grant usage | Alerts only for likely incident or control failure |
| Quality | match suggestion acceptance rate, ambiguous suggestion rate | Feeds Iterate metrics, not paging |

## Alerting Rules

| Alert | Condition | Action |
|-------|-----------|--------|
| Import pipeline unavailable | `POST /imports` 5xx rate above 5% for 10 minutes or import queue stalled for 15 minutes | Page primary on-call; use runbook import triage |
| Review decision audit failure | Any accepted/rejected decision fails to write an audit event | Page primary on-call; disable decision endpoint if unresolved |
| Cross-tenant authorization anomaly | More than 5 denied cross-firm/client attempts for one actor in 10 minutes | Notify security lead; inspect actor and preserve logs |
| Restricted telemetry violation | Any prohibited fixture value appears in telemetry assertion or production log scan | Page security lead; rotate affected logs per runbook |
| Support access outside approved window | Any support grant used after expiry or without approval | Page security lead and incident coordinator |

## Dashboards

| Dashboard | Must Show |
|-----------|-----------|
| Operations | availability, latency, error rate, import queue depth, database saturation |
| Pilot Outcome | imports completed, suggestions reviewed, accepted suggestions, manual overrides |
| Security Controls | authorization denials, support grants, audit-log failures, telemetry violations |
| Import Diagnostics | file validation failures, unsupported encodings, malformed rows, formula-neutralized cells |

## Logs and Tracing

### Logging
- Required fields: `timestamp`, `level`, `service`, `trace_id`, `firm_id`,
  `client_id`, `actor_id`, `event_type`, `import_id` where applicable.
- Prohibited fields: raw account numbers, invoice details, payer identifiers,
  client names, and raw CSV row values.
- Retention: application logs hot for 14 days; security/audit events retained
  for the pilot audit window.

### Tracing
- Critical journeys: upload CSV, normalize import, generate suggestions,
  accept/reject match, export review log.
- Sampling: sample all failed imports and audit-write failures; baseline sample
  successful review flow at 10%.

## Health Checks

| Check | Endpoint or Mechanism | What It Verifies |
|-------|-----------------------|------------------|
| Liveness | `GET /health/live` | API process is running |
| Readiness | `GET /health/ready` | database reachable, migrations current, object storage reachable |
| Worker readiness | `GET /health/workers` | import worker can claim jobs and write results |
| Audit readiness | synthetic review-decision write in staging | audit writer can persist decision events |

## SLI/SLO Tracking

| Indicator | SLI | SLO |
|-----------|-----|-----|
| Import availability | successful import requests / total valid import requests | 99% during pilot business hours |
| Review availability | successful match queue and decision requests / total requests | 99.5% during pilot business hours |
| Review latency | p95 match queue response time | under 750 ms |
| Audit durability | decisions with audit event / total decisions | 100% |
| Telemetry safety | restricted telemetry violations | 0 |

### Error Budget
- Any audit durability or telemetry safety breach consumes the full pilot error
  budget and blocks further rollout until the runbook action is complete.
- Import/review SLO burn above 25% in a day triggers rollout hold and owner
  review.

## Incident Response

### Response Entry Points
- Primary owner or schedule: DepositMatch on-call engineer.
- Secondary owner or schedule: platform lead.
- Immediate containment actions: disable import, pause review decisions, or
  disable review-log export through feature flags.
- Existing runbook links: `docs/helix/05-deploy/runbook.md`.

### Routing
- Primary: application on-call.
- Secondary: platform lead.
- Escalation: product owner for pilot-impact decisions; security lead for data
  exposure or control failure.

Reference

ActivityDeploy — Ship to users with appropriate operational support, monitoring, and rollback plans.
Default locationdocs/helix/05-deploy/monitoring-setup.md
RequiresNone
EnablesNone
InformsDeployment Verification
Metrics Dashboard
HELIX documentsdocs/helix/05-deploy/monitoring-setup.md
Generation prompt
Show the full generation prompt
# Monitoring Setup Generation Prompt

Create a concise, service-specific monitoring setup for this deployment.

Reference Anchors

Use these local resources as grounding:

  • docs/resources/google-sre-monitoring-distributed-systems.md grounds operational signals, dashboards, alerting, and the four golden signals.

Focus

  • Include only the metrics, logs, alerts, dashboards, and tracing needed to operate this service.
  • Define measurable thresholds, routing, and escalation where they matter.
  • Connect health checks and SLOs to rollout safety and rollback decisions.
  • Avoid generic observability boilerplate that does not change operator behavior.
  • Separate user-impacting alerts from dashboards that are only for diagnosis.

Completion Criteria

  • Core metrics and dashboards are defined.
  • Alert thresholds and routing are explicit.
  • Logging, tracing, and health-check expectations are clear.
  • The setup is specific enough to support deployment and incident response.
  • Every page-worthy alert has an operator action or runbook entrypoint.
Template
Show the template structure
---
ddx:
  id: monitoring-setup
---

Monitoring Setup

Service Summary

  • Service: [service name]
  • Signals that matter most: [availability, latency, throughput, errors, business KPIs]

Metrics Collection

Category Metrics Notes
Application [Latency, throughput, error rate] [By endpoint or workload if needed]
System [CPU, memory, disk, network] [Only what affects service health]
Business [KPI names] [Only if operationally relevant]
Custom [Metric name] [Why it matters]

Alerting Rules

Alert Condition Action
[Critical alert] [Page threshold] [Page path]
[Warning alert] [Notification threshold] [Notify path]

Dashboards

Dashboard Must Show
Operations [Health, latency, errors, dependency status]
Business [Adoption or outcome metrics]
Technical [Resource use, queues, caches, jobs]

Logs and Tracing

Logging

  • Required fields: timestamp, level, service, trace_id, message
  • Retention: [hot and cold retention]

Tracing

  • Critical journeys: [What must be traceable]
  • Sampling: [Baseline and overrides]

Health Checks

Check Endpoint or Mechanism What It Verifies
Liveness GET /health/live [Process is running]
Readiness GET /health/ready [Dependencies, capacity, migrations]

SLI/SLO Tracking

Indicator SLI SLO
Availability [Formula] [Target]
Latency [Formula] [Target]
Quality [Formula] [Target]

Error Budget

  • [Budget and escalation thresholds]

Incident Response

Response Entry Points

  • Primary owner or schedule: [Who handles the first page]
  • Secondary owner or schedule: [Who backs up the first responder]
  • Immediate containment actions: [Safe first actions before deeper procedures exist]
  • Existing runbook links (optional): [Add links once a runbook exists]

Routing

  • Primary: [Schedule]
  • Secondary: [Schedule]
  • Escalation: [Manager]