Runbook — `ddx-server`

Service Summary

  • Service or component: ddx-server — the long-running platform service that owns the bead tracker (`.ddx/beads.jsonl`), the agent worker pool, and execution evidence under `.ddx/exec-runs.d/`.
  • Primary function: Accept `ddx agent execute-bead` / `ddx agent execute-loop` dispatch from HELIX skills, run the resulting model-provider calls, and persist execution records that HELIX consumes during review.
  • Business impact if degraded: HELIX `helix run` and `helix build` cannot advance the bead queue. Existing claims may go stale (orphan-recovery threshold default 7200s), but no data is lost: `.ddx/beads.jsonl` is durable on disk under git.
  • Ownership team: HELIX maintainers (this is a developer-local service; the operator running it is the on-call).
  • On-call rotation: N/A — single-operator service. The operator is expected to be present when ddx-server is running.
  • Environments covered: One per repo working tree. The default deployment is systemd --user on the operator’s workstation, listening on 127.0.0.1:7743.
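
For a fresh setup, a minimal systemd --user unit might look like the sketch below. The unit body, the install path under ~/.local/bin, and the assumption that `ddx server start` stays in the foreground are all illustrative rather than taken from DDx documentation; adjust them to match how your build actually runs.

```bash
# Illustrative sketch only: install a user unit for ddx-server.
# Assumes `ddx server start` runs in the foreground and that ddx lives at
# ~/.local/bin/ddx; both are hypothetical, verify against your install.
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/ddx-server.service <<'EOF'
[Unit]
Description=ddx-server (bead tracker, agent worker pool)

[Service]
Type=simple
# Hypothetical working tree; the server is one-per-repo.
WorkingDirectory=%h/src/my-repo
ExecStart=%h/.local/bin/ddx server start
Restart=on-failure

[Install]
WantedBy=default.target
EOF
systemctl --user daemon-reload
systemctl --user enable --now ddx-server
```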

Operator Entry Points

| Situation | First dashboard, log, or query | First command or check | Owner |
| --- | --- | --- | --- |
| `helix run` reports BLOCKED with no obvious bead-side cause | `helix status` | `ddx server workers list` and `curl -fsS http://127.0.0.1:7743/healthz` | Operator |
| Agent dispatch hangs or times out | `.ddx/agent-logs/<latest>` | `ddx server workers list`, then `ddx server stop` / `ddx server start` | Operator |
| Tracker writes appear torn or out of order | `git diff -- .ddx/beads.jsonl` | `ddx bead list --status in_progress --json` to find unreleased claims | Operator |
| Port 7743 is in use at startup | `ss -lntp '( sport = :7743 )'` | Identify the conflicting process; kill it or change the bind port | Operator |
| Model-provider auth errors | `.ddx/agent-logs/<latest>` | Presence-check the key env var (e.g. `test -n "${OPENROUTER_API_KEY:-}"`) without echoing its value, then re-source rc | Operator |
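
As a convenience, the first two checks can be run as one block; this sketch uses only commands from the table above.

```bash
# First-pass triage: control-plane reachability, worker state, newest log.
curl -fsS http://127.0.0.1:7743/healthz && echo "healthz: OK"
ddx server workers list
ls -t .ddx/agent-logs/ | head -n 1   # most recent agent log to inspect
```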

Dependencies and Failure Boundaries

| Dependency or boundary | Why it matters | Failure signal | Fallback or escalation |
| --- | --- | --- | --- |
| Model provider (OpenRouter, Anthropic, etc.) | Every agent call routes here | 4xx/5xx in agent logs; `ddx agent execute-bead` exits with a provider error | Switch provider via env (`OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, etc.); operator-driven |
| `.ddx/beads.jsonl` on the local filesystem | Tracker durability | `ddx bead list` errors; corrupt-JSON parse failures | Restore from git: `git checkout .ddx/beads.jsonl` after stopping the server |
| Loopback port 7743 | Local control-plane access | Bind failure on startup | Kill the conflicting process, or set `DDX_SERVER_PORT` to a free port and update the `ddx server start` config |
| `~/.ddx/` user-state dir | Server-managed state outside the repo | Permission errors at startup | Verify ownership; recreate if missing (state is reproducible from the repo) |
| Tailscale (tsnet) sidecar (opt-in) | Remote control-plane access | Tailscale connectivity errors | Service still works on loopback; tailnet failure is non-blocking |
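
A sketch of the port-7743 fallback row; whether `ddx server start` honors `DDX_SERVER_PORT` directly or needs a config edit depends on the deployment, so treat the export as illustrative.

```bash
# Find the process holding 7743, then either stop it or move ddx-server.
ss -lntp '( sport = :7743 )'
# Option A: kill the conflicting process (pid from the ss output).
# kill <pid>
# Option B: bind ddx-server elsewhere, per the fallback column above.
export DDX_SERVER_PORT=7744   # example free port; illustrative only
ddx server start
```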

Alert Triage

| Alert or symptom | Likely causes | Immediate checks | Stop and escalate when |
| --- | --- | --- | --- |
| `helix run` repeatedly returns `NEXT_ACTION: WAIT` | No ready beads, or the queue-drain gate is blocking on missing context | `ddx bead ready --json`; `helix status`; check focused-epic state | Beads show as ready but workers sit idle: restart ddx-server |
| `ddx server workers list` shows zero healthy workers | Server crashed, or all workers are wedged on a long model call | Check `.ddx/agent-logs/<latest>`; check `systemctl --user status ddx-server` | Crash loops more than 3x: capture logs and stop; do not restart blindly |
| claimed-at ages exceed `HELIX_ORPHAN_THRESHOLD` (default 7200s) | Worker died without `--unclaim`; orphan recovery has not yet swept | Wait for the next sweep, or run `ddx bead unclaim <id>` manually after confirming the worker is dead | Recovery does not free the claim: investigate before forcing |
| Provider 401/403 spikes | API key revoked, expired, or rate-limited | Presence-check the key env var without echoing its value; check the provider dashboard | The key is valid but the provider rejects it: escalate to provider support |
| Tracker file shows torn writes | Concurrent direct edit during a live run | `git diff -- .ddx/beads.jsonl`; check `events[]` for the affected bead | The bead `events[]` log does not match observable state: restore from git |
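
To make the claimed-at check mechanical, a jq filter like the one below can list stale claims. The field name `claimed-at` inside the `--json` output, and its ISO-8601 format, are assumptions; verify against your actual output before relying on this.

```bash
# Flag in-progress beads whose claim is older than the orphan threshold.
# ASSUMPTION: records form a JSON array with an ISO-8601 "claimed-at" field.
THRESHOLD="${HELIX_ORPHAN_THRESHOLD:-7200}"
ddx bead list --status in_progress --json |
  jq --argjson t "$THRESHOLD" \
     '.[] | select((now - (."claimed-at" | fromdateiso8601)) > $t) | .id'
```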

Common Incident Procedures

Stuck Claim After Worker Death

  • Trigger: ddx bead list --status in_progress shows a bead with stale claimed-at (older than the orphan threshold) and the recorded claimed-pid is no longer running.
  • Immediate actions (combined in the sketch after this procedure):
    1. Confirm the PID is dead: ps -p <claimed-pid> returns nothing.
    2. Capture the bead state: ddx bead show <id> --json > /tmp/<id>-stuck.json.
    3. Run ddx bead unclaim <id> to release the claim.
  • Validation:
    • ddx bead show <id> reports status: open and no claim metadata.
    • helix run resumes and either reclaims or skips per the queue ordering.
  • Escalate to: N/A (operator-only service).
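
A minimal sketch of the actions above; `<id>` and `<claimed-pid>` come from the bead record, and only commands already listed are used.

```bash
# Release a claim held by a dead worker; fill in values from `ddx bead show`.
ID="<id>"; PID="<claimed-pid>"
if ps -p "$PID" > /dev/null; then
  echo "worker $PID is still alive; refusing to unclaim" >&2
  exit 1
fi
ddx bead show "$ID" --json > "/tmp/${ID}-stuck.json"   # capture evidence first
ddx bead unclaim "$ID"
ddx bead show "$ID"   # expect status: open with no claim metadata
```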

Tracker File Corruption

  • Trigger: ddx bead list exits non-zero with a JSON parse error, or git diff -- .ddx/beads.jsonl shows a malformed line.
  • Immediate actions (steps 1–3 are combined in the sketch after this procedure):
    1. Stop the server: ddx server stop.
    2. Capture evidence: cp .ddx/beads.jsonl /tmp/beads-corrupt.jsonl.
    3. Restore from git: git checkout .ddx/beads.jsonl.
    4. Replay any lost work by inspecting recent agent logs and re-issuing ddx bead update calls if needed.
  • Validation:
    • ddx bead list --status open --json | jq length returns the expected count.
    • helix status shows a coherent queue.
  • Escalate to: HELIX maintainers via repo issues if corruption is reproducible.
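
Steps 1 through 3 as a single sequence, with the validation check appended; step 4 (replaying lost work) stays manual.

```bash
# Stop, preserve the corrupt file for forensics, restore from git, recheck.
ddx server stop
cp .ddx/beads.jsonl /tmp/beads-corrupt.jsonl
git checkout .ddx/beads.jsonl
ddx bead list --status open --json | jq length   # expect the known bead count
```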

Provider Auth Failure

  • Trigger: Agent logs show repeated 401 / 403 from the model provider.
  • Immediate actions (see the sketch after this procedure):
    1. Verify the env var is set in the server’s environment: systemctl --user show-environment | grep -i api_key (presence only — never echo the value).
    2. Re-source the operator rc file (or run direnv reload) and restart the server: ddx server stop && ddx server start.
    3. If the key is genuinely expired, rotate at the provider, update the env, and restart.
  • Validation:
    • One successful ddx agent execute-bead --dry-run against any ready bead.
  • Escalate to: Provider support if the key is valid but rejected.
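
The check-reload-restart cycle as one block. The direnv path is just one way to reload the environment, and whether `ddx agent execute-bead --dry-run` needs an explicit bead id is left to your version of the CLI.

```bash
# Presence-only check: confirm a key variable exists without printing it.
systemctl --user show-environment | grep -qi api_key \
  && echo "API key variable present" || echo "API key variable missing"
# Reload the environment and restart the server.
direnv reload               # or re-source your rc file
ddx server stop && ddx server start
# Validate with one dry run (target a ready bead if your CLI requires an id).
ddx agent execute-bead --dry-run
```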

Rollback and Recovery

Rollback Entry Conditions

  • A ddx-server upgrade caused crash loops on startup.
  • A schema-incompatible bead was written by a newer DDx version and the current ddx bead cannot parse it.

Rollback Procedure

  1. Stop the running server: ddx server stop.
  2. Reinstall the previous DDx version (operator’s preferred path — typically mise or asdf).
  3. If the tracker file was rewritten by the newer version, restore the previous tracker state: git checkout HEAD~1 -- .ddx/beads.jsonl (after capturing the current file for forensics).
  4. Restart: ddx server start.
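
Steps 1, 3, and 4 as one sequence; step 2, the reinstall, depends on the operator’s tooling (mise, asdf, or otherwise) and is left as a comment.

```bash
# Roll back: stop, keep the newer tracker for forensics, restore, restart.
ddx server stop
cp .ddx/beads.jsonl /tmp/beads-newer-version.jsonl   # forensics copy
git checkout HEAD~1 -- .ddx/beads.jsonl
# ...reinstall the previous DDx version here (mise/asdf/your preferred path)...
ddx server start
```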

Recovery Validation

  • curl -fsS http://127.0.0.1:7743/healthz returns 200 OK.
  • ddx server workers list shows at least one healthy worker.
  • helix run --dry-run produces a coherent next-action summary against the current queue.
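
The three checks as one script; the worker-list and dry-run outputs are surfaced for the operator to read, since their exact formats aren’t specified here.

```bash
#!/usr/bin/env bash
# Post-recovery validation: fails fast if the health endpoint is down.
set -euo pipefail
curl -fsS http://127.0.0.1:7743/healthz > /dev/null && echo "healthz: OK"
ddx server workers list   # expect at least one healthy worker
helix run --dry-run       # read the next-action summary for coherence
```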

Routine Operations

| Operation | Trigger or cadence | Command or workflow | Verification |
| --- | --- | --- | --- |
| Orphan-claim sweep | Automatic on each `helix run` cycle | Internal; no operator action required | `ddx bead list --status in_progress` shows no claims older than `HELIX_ORPHAN_THRESHOLD` |
| `.ddx/agent-logs/` rotation | Operator discretion (logs are gitignored runtime state) | Manual: prune files older than 30 days | Disk usage in `.ddx/agent-logs/` stays bounded |
| Worker pool restart | After a provider config or env change | `ddx server stop && ddx server start` | One successful execute against a ready bead |
| API-key rotation | When the provider issues a new key | Update env in the operator rc; restart the server | A single dry run against a ready bead succeeds |
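
The 30-day prune from the rotation row, as a concrete pair of commands; preview before deleting.

```bash
# Preview, then delete, agent logs older than 30 days.
find .ddx/agent-logs/ -type f -mtime +30 -print
find .ddx/agent-logs/ -type f -mtime +30 -delete
```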

No recurring operational tasks exist beyond these. Other system maintenance (OS updates, disk cleanup) belongs to the operator’s host-level workflow, not to this service.

Escalation and Communications

  1. Primary on-call: The operator running ddx-server.
  2. Secondary escalation: HELIX maintainers via the repo issue tracker for reproducible bugs in ddx-server itself.
  3. Incident coordinator or manager: N/A — single-operator service.
  4. External dependency or vendor support: Model provider support (OpenRouter / Anthropic / etc.) for provider-side outages or credential issues.
