Runbook — `ddx-server`
Service Summary
- Service or component: `ddx-server`, the long-running platform service that owns the bead tracker (`.ddx/beads.jsonl`), agent worker pool, and execution evidence under `.ddx/exec-runs.d/`.
- Primary function: Accept `ddx agent execute-bead` / `ddx agent execute-loop` dispatch from HELIX skills, run the resulting model-provider calls, and persist execution records that HELIX consumes during review.
- Business impact if degraded: HELIX `helix run` and `helix build` cannot advance the bead queue. Existing claims may go stale (orphan-recovery threshold default 7200s), but no data is lost because `.ddx/beads.jsonl` is durable on disk under git.
- Ownership team: HELIX maintainers (this is a developer-local service; the operator running it is the on-call).
- On-call rotation: N/A (single-operator service). The operator is expected to be present when `ddx-server` is running.
- Environments covered: One per repo working tree. The default deployment is `systemd --user` on the operator’s workstation, listening on `127.0.0.1:7743`.
Operator Entry Points
| Situation | First dashboard, log, or query | First command or check | Owner |
|---|---|---|---|
| `helix run` reports BLOCKED with no obvious bead-side cause | `helix status` | `ddx server workers list` and `curl -fsS http://127.0.0.1:7743/healthz` | Operator |
| Agent dispatch hangs or times out | `.ddx/agent-logs/<latest>` | `ddx server workers list`, then `ddx server stop` / `ddx server start` | Operator |
| Tracker writes appear torn or out of order | `git diff -- .ddx/beads.jsonl` | `ddx bead list --status in_progress --json` to find unreleased claims | Operator |
| Port 7743 is in use at startup | `ss -lntp '( sport = :7743 )'` | Identify the conflicting process; kill it or change the bind port | Operator |
| Model-provider auth errors | `.ddx/agent-logs/<latest>` | `echo "${OPENROUTER_API_KEY:0:8}…"` to verify env, then re-source rc | Operator |
Dependencies and Failure Boundaries
| Dependency or boundary | Why it matters | Failure signal | Fallback or escalation |
|---|---|---|---|
| Model provider (OpenRouter, Anthropic, etc.) | Every agent call routes here | 4xx/5xx in agent logs; `ddx agent execute-bead` exits with provider error | Switch provider via env (`OPENROUTER_API_KEY` → `ANTHROPIC_API_KEY` etc.); operator-driven |
| `.ddx/beads.jsonl` on local filesystem | Tracker durability | `ddx bead list` errors; corrupt-JSON parse failures | Restore from git: `git checkout .ddx/beads.jsonl` after stopping the server |
| Loopback port 7743 | Local control-plane access | Bind failure on startup | Kill conflicting process, or set `DDX_SERVER_PORT` to a free port and update `ddx server start` config |
| `~/.ddx/` user-state dir | Server-managed state outside the repo | Permission errors at startup | Verify ownership; recreate if missing (state is reproducible from repo) |
| Tailscale (`tsnet`) sidecar (opt-in) | Remote control-plane access | Tailscale connectivity errors | Service still works on loopback; tailnet failure is non-blocking |
Alert Triage
| Alert or symptom | Likely causes | Immediate checks | Stop and escalate when |
|---|---|---|---|
| `helix run` repeatedly returns `NEXT_ACTION: WAIT` | No ready beads, or queue-drain gate is blocking on missing context | `ddx bead ready --json`; `helix status`; check focused-epic state | Beads exist as ready but workers are idle: restart `ddx-server` |
| `ddx server workers list` shows zero healthy workers | Server crashed, or all workers wedged on a long model call | Check `.ddx/agent-logs/<latest>`; check `systemctl --user status ddx-server` | Crash loops more than 3x: capture logs and stop, do not restart blindly |
| `claimed-at` ages exceed `HELIX_ORPHAN_THRESHOLD` (default 7200s) | Worker died without `--unclaim`; orphan recovery not yet swept | Wait for next sweep, or run `ddx bead unclaim <id>` manually after confirming the worker is dead | Recovery does not free the claim: investigate before forcing |
| Provider 401/403 spikes | API key revoked, expired, or rate-limited | `echo "${OPENROUTER_API_KEY:0:8}…"`; provider dashboard | Key is valid but provider rejects: escalate to provider support |
| Tracker file shows torn writes | Concurrent direct edit during a live run | `git diff .ddx/beads.jsonl`; check `events[]` for the affected bead | The bead `events[]` log does not match observable state: restore from git |
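The stale-claim check in the table above is a comparison of claim age against the orphan threshold. A minimal sketch follows, assuming `claimed-at` has already been converted to Unix epoch seconds (the runbook does not state the stored format). The helper name `claim_is_stale` is hypothetical and not part of `ddx` or `helix`.

```shell
# Hypothetical helper, not part of ddx or helix.
# Succeeds (exit 0) when the claim age exceeds the orphan threshold.
#   $1 = claimed-at, ASSUMED to be Unix epoch seconds
#   $2 = current time in epoch seconds
#   $3 = threshold in seconds (optional; falls back to
#        HELIX_ORPHAN_THRESHOLD, then to the documented default 7200)
claim_is_stale() (
  threshold="${3:-${HELIX_ORPHAN_THRESHOLD:-7200}}"
  age=$(( $2 - $1 ))
  # "exceed" is read strictly: an age equal to the threshold is not stale.
  [ "$age" -gt "$threshold" ]
)
```

A call such as `claim_is_stale "$claimed_epoch" "$(date +%s)"` could gate a manual unclaim, where `claimed_epoch` is a placeholder for the converted timestamp.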
Common Incident Procedures
Stuck Claim After Worker Death
- Trigger: `ddx bead list --status in_progress` shows a bead with stale `claimed-at` (older than the orphan threshold) and the recorded `claimed-pid` is no longer running.
- Immediate actions:
  - Confirm the PID is dead: `ps -p <claimed-pid>` lists no process (it prints at most the header row and exits non-zero).
  - Capture the bead state: `ddx bead show <id> --json > /tmp/<id>-stuck.json`.
  - Run `ddx bead unclaim <id>` to release the claim.
- Validation:
  - `ddx bead show <id>` reports `status: open` and no claim metadata.
  - `helix run` resumes and either reclaims or skips per the queue ordering.
- Escalate to: N/A (operator-only service).
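The immediate actions above can be sketched as a small guard that refuses to unclaim while the recorded PID is still alive. The function names `pid_is_dead` and `release_stuck_claim` are hypothetical. The `ddx` invocations are the ones listed in this procedure, and the PID is assumed to be passed in by the operator after reading it from the bead.

```shell
# Hypothetical helpers; only the ddx commands come from this runbook.

# Succeeds when no process with PID $1 can be signalled.
# kill -0 delivers no signal, it only checks the target. It also fails
# for a live process owned by another user, which is not expected on a
# single-operator workstation.
pid_is_dead() {
  ! kill -0 "$1" 2>/dev/null
}

# $1 = bead id, $2 = claimed-pid recorded on the bead
release_stuck_claim() {
  if ! pid_is_dead "$2"; then
    echo "PID $2 is still running; not unclaiming $1" >&2
    return 1
  fi
  # Capture evidence first; unclaim only if the capture succeeded.
  ddx bead show "$1" --json > "/tmp/$1-stuck.json" && ddx bead unclaim "$1"
}
```

Capturing the bead state before the unclaim keeps the evidence even if the release later turns out to be wrong.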
Tracker File Corruption
- Trigger: `ddx bead list` exits non-zero with a JSON parse error, or `git diff -- .ddx/beads.jsonl` shows a malformed line.
- Immediate actions:
  - Stop the server: `ddx server stop`.
  - Capture evidence: `cp .ddx/beads.jsonl /tmp/beads-corrupt.jsonl`.
  - Restore from git: `git checkout .ddx/beads.jsonl`.
  - Replay any lost work by inspecting recent agent logs and re-issuing `ddx bead update` calls if needed.
- Validation:
  - `ddx bead list --status open --json | jq length` returns the expected count.
  - `helix status` shows a coherent queue.
- Escalate to: HELIX maintainers via repo issues if corruption is reproducible.
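The evidence-capture step above writes to a fixed path, so a second incident would overwrite the first copy. One possible variation, shown only as a sketch, adds a timestamp to the file name. The function name `snapshot_tracker` and the timestamped naming are assumptions, not part of the documented procedure.

```shell
# Hypothetical helper: copy the tracker to a timestamped evidence file
# and print the path of the copy.
#   $1 = tracker file (defaults to .ddx/beads.jsonl)
#   $2 = destination directory (defaults to /tmp)
snapshot_tracker() (
  src="${1:-.ddx/beads.jsonl}"
  dest_dir="${2:-/tmp}"
  dest="$dest_dir/beads-corrupt-$(date +%Y%m%dT%H%M%S).jsonl"
  # Print the path only when the copy succeeded.
  cp -- "$src" "$dest" && printf '%s\n' "$dest"
)
```

Running `snapshot_tracker` after `ddx server stop` and before `git checkout .ddx/beads.jsonl` preserves the corrupt file for later comparison.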
Provider Auth Failure
- Trigger: Agent logs show repeated 401 / 403 from the model provider.
- Immediate actions:
  - Verify the env var is set in the server’s environment: `systemctl --user show-environment | grep -i api_key | cut -d= -f1` (presence only; `cut` drops the value so that it is never echoed).
  - Re-source the operator rc file or run `direnv reload`, then restart the server: `ddx server stop && ddx server start`.
  - If the key is genuinely expired, rotate at the provider, update the env, and restart.
- Validation:
  - One successful `ddx agent execute-bead --dry-run` against any ready bead.
- Escalate to: Provider support if the key is valid but rejected.
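A presence-only check can also be run in the operator's own shell before restarting the server. This sketch reports whether a variable is exported and non-empty without printing its value. The helper name `key_presence` is hypothetical. It inspects the current shell's environment, not the systemd user manager's, so it complements the `systemctl --user show-environment` check and does not replace it.

```shell
# Hypothetical helper: report whether an exported variable is set and
# non-empty, without ever printing its value.
#   $1 = variable name, e.g. OPENROUTER_API_KEY
key_presence() {
  # printenv sees exported variables only; its output is captured
  # for the emptiness test and never echoed.
  if [ -n "$(printenv "$1")" ]; then
    echo "$1: set"
  else
    echo "$1: unset"
  fi
}
```

For example, `key_presence OPENROUTER_API_KEY` prints either `OPENROUTER_API_KEY: set` or `OPENROUTER_API_KEY: unset`.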
Rollback and Recovery
Rollback Entry Conditions
- A `ddx-server` upgrade caused crash loops on startup.
- A schema-incompatible bead was written by a newer DDx version and the current `ddx bead` cannot parse it.
Rollback Procedure
- Stop the running server: `ddx server stop`.
- Reinstall the previous DDx version (operator’s preferred path, typically `mise` or `asdf`).
- If the tracker file was rewritten by the newer version, restore the previous tracker state: `git checkout HEAD~1 -- .ddx/beads.jsonl` (after capturing the current file for forensics).
- Restart: `ddx server start`.
Recovery Validation
- `curl -fsS http://127.0.0.1:7743/healthz` returns `200 OK`.
- `ddx server workers list` shows at least one healthy worker.
- `helix run --dry-run` produces a coherent next-action summary against the current queue.
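Right after a restart the health endpoint may need a moment before it answers, so a single probe can fail spuriously. The sketch below is a generic retry wrapper. The function name `retry` and the attempt and delay values are assumptions; the runbook does not specify a startup grace period.

```shell
# Hypothetical helper: run a command up to N times, sleeping between
# failed attempts. Succeeds as soon as one attempt succeeds.
#   $1 = maximum attempts, $2 = delay in seconds, rest = command to run
retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    # Skip the sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)
```

For example, `retry 5 1 curl -fsS http://127.0.0.1:7743/healthz` probes the endpoint up to five times, one second apart.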
Routine Operations
| Operation | Trigger or cadence | Command or workflow | Verification |
|---|---|---|---|
| Orphan-claim sweep | Automatic on each `helix run` cycle | Internal; no operator action required | `ddx bead list --status in_progress` shows no claims older than `HELIX_ORPHAN_THRESHOLD` |
| `.ddx/agent-logs/` rotation | Operator discretion (logs are gitignored runtime state) | Manual: prune older than 30 days | Disk usage in `.ddx/agent-logs/` stays bounded |
| Worker pool restart | After provider config or env change | `ddx server stop && ddx server start` | One successful execute against a ready bead |
| API-key rotation | When provider issues a new key | Update env in operator rc; restart server | Single dry-run against a ready bead succeeds |
No other recurring operational tasks are documented for this service. Other system maintenance (OS updates, disk cleanup) belongs to the operator’s host-level workflow.
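The manual log-pruning step can be sketched with `find`. The function name `prune_agent_logs` is hypothetical, and the sketch assumes the logs are regular files whose modification time reflects their age. The `-delete` action is available in GNU and BSD `find` but is not part of POSIX.

```shell
# Hypothetical helper: delete log files older than N days and print
# each path that was removed.
#   $1 = log directory (defaults to .ddx/agent-logs)
#   $2 = age in days (defaults to 30)
prune_agent_logs() {
  # -mtime +N matches files modified more than N whole days ago
  # (find rounds the age down to full 24-hour periods).
  find "${1:-.ddx/agent-logs}" -type f -mtime +"${2:-30}" -print -delete
}
```

Replacing `-print -delete` with `-print` alone gives a dry run that only lists what would be removed.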
Escalation and Communications
- Primary on-call: The operator running `ddx-server`.
- Secondary escalation: HELIX maintainers via the repo issue tracker for reproducible bugs in `ddx-server` itself.
- Incident coordinator or manager: N/A (single-operator service).
- External dependency or vendor support: Model provider support (OpenRouter / Anthropic / etc.) for provider-side outages or credential issues.
References
- Deployment checklist: `deployment-checklist.md`
- Monitoring setup: `monitoring-setup.md`
- Architecture: `../02-design/architecture.md`
- Security architecture: `../02-design/security-architecture.md`
- DDx/HELIX boundary contract: `../02-design/contracts/CONTRACT-001-ddx-helix-boundary.md`