Skip to content

Monitoring Setup — `ddx-server`

Monitoring Setup — ddx-server

Service Summary

  • Service: ddx-server (loopback 127.0.0.1:7743 by default; opt-in Tailscale exposure via tsnet).
  • Signals that matter most: server liveness, worker pool health, bead-queue depth (open / ready / blocked), claim freshness (orphan-recovery health), and provider-call success rate.

Metrics Collection

CategoryMetricsNotes
ApplicationWorker count (healthy / unhealthy), in-flight executions, recent execution success / failure counts, average execution durationFrom ddx server workers list --json; emitted to .ddx/exec-runs.jsonl for historical reads
SystemProcess memory (RSS), CPU, file-descriptor count, .ddx/agent-logs/ disk usageOperator workstation top / du -sh; not actively scraped
BusinessBead queue depth by status (open, in_progress, blocked, closed over the last 24h), close rateFrom ddx bead list --status <s> --json; surfaced via helix status
CustomOrphan-recovery sweeps performed, claims released, stale-claim age distributionComputed by HELIX from events[] arrays during helix status

Alerting Rules

AlertConditionAction
Server unhealthyGET /healthz returns non-200 for 2 consecutive checksOperator notification: ddx-server down (no automated paging — single-operator service)
All workers unhealthyddx server workers list --json reports zero healthy workers for 60sOperator-visible: helix status and ddx server workers list show the failure
Queue blocked depth growingddx bead list --status blocked --json count grows by ≥3 in the same helix run cycleOperator-visible warning in helix status; surfaces as backlog drift
Stale claims exceed thresholdAny claimed-at older than HELIX_ORPHAN_THRESHOLD (default 7200s) without a recovery sweepNext helix run cycle should sweep; if it doesn’t, operator runs ddx bead unclaim manually
Provider error spike>5 consecutive provider 4xx/5xx responses in .ddx/agent-logs/Operator-visible: tail of agent log; check API key + provider status

Dashboards

DashboardMust Show
OperationsGET /healthz status; worker pool size and health; in-flight executions; latest provider errors
BusinessBead counts by status; close rate over the last 24h; focused-epic state; ready-queue depth
Technical.ddx/agent-logs/ disk usage; recent execution evidence size in .ddx/exec-runs.d/; tracker file size and last-modified timestamp

In practice the “dashboards” surface for a single-operator developer service is helix status plus ddx server workers list. There is no Grafana / Datadog setup; alerting is operator-driven.

Logs and Tracing

Logging

  • Required fields (per agent-execution log entry): timestamp, level, service, run-id, bead-id, message. Provider names appear; provider request bodies and API keys never appear.
  • Retention: .ddx/agent-logs/ is gitignored runtime state. Operators prune on their own schedule; HELIX does not auto-rotate.
  • Authoritative execution audit: .ddx/exec-runs.jsonl (one record per agent run) plus the per-run evidence directory under .ddx/exec-runs.d/<run-id>/. These are durable and committed when appropriate.

Tracing

  • Critical journeys: helix run cycle → bead claim → ddx agent execute-bead dispatch → provider call → result capture → bead update / close.
  • Sampling: 100% — every agent run produces a durable execution record by design. Sampling is not configurable; the audit trail is the product.

Health Checks

CheckEndpoint or MechanismWhat It Verifies
LivenessGET http://127.0.0.1:7743/healthzProcess is running and responsive on the loopback bind
Readinessddx server workers list --json (returns ≥1 healthy worker)Worker pool is healthy and capable of executing
Tracker integrityddx bead list --status open --json exits 0.ddx/beads.jsonl is parseable and locks acquire cleanly
Provider connectivityddx agent execute-bead --dry-run <ready-bead>API key resolves; provider TLS handshake succeeds

SLI/SLO Tracking

IndicatorSLISLO
Availability(/healthz 200 responses) / (total checks) over a 24h operator session≥99% during active sessions
Execution success(Closed beads with status: closed) / (Total claimed beads) over a helix run cycle≥80% (the rest are blockers, supersessions, or stop-for-guidance — all valid outcomes)
Tracker durability(Cycles with no torn writes) / (Total cycles)100% (any tear is a release-blocker bug)

Error Budget

Because this is a developer-local service, the “error budget” is not formal. The operator stops the service and investigates whenever any SLI above degrades; there is no auto-escalation.

Incident Response

Response Entry Points

  • Primary owner or schedule: The operator running ddx-server.
  • Secondary owner or schedule: HELIX maintainers via the repo issue tracker for reproducible bugs in the service itself.
  • Immediate containment actions: ddx server stop is always safe — claims are durable in .ddx/beads.jsonl and orphan recovery picks up stale claims on next run.
  • Existing runbook links: runbook.md — the canonical incident-response surface.

Routing

  • Primary: Operator on the workstation.
  • Secondary: HELIX maintainers (issue tracker).
  • Escalation: Model-provider support for provider-side outages.