Skip to content

Plan — HELIX self-improvement from the yolo-demo arc (2026-05-24, rev 2: codex-reviewed)

Plan — HELIX self-improvement from the yolo-demo arc (2026-05-24, rev 2: codex-reviewed)

Context

Distilled from an autonomous “one-shot a working app at max autonomy” build (~/Projects/helix-yolo/): HELIX drove a real, accessible Next.js + bun invoicing app (build 0, 123 unit + 32 e2e, WCAG AA) across several codex rounds (NOT-READY → SHIP-WITH-FIXES). Seven failure modes recurred. Fixes captured as three buckets: 1) docs/ governing specs, 2) workflows/ templates/prompts/examples + actions/concerns/references/ratchets, 3) the skill (skills/helix/SKILL.md). Each learning spans the buckets (spec authorizes → template/action operationalizes → skill performs). Rev 2 incorporates codex review.

The seven learnings

  • L1 — Claims-vs-reality is the dominant artifact defect (template-conformant docs claiming tests/coverage/metrics that don’t exist). Prototype built (uncommitted): ASSERTED_UNBACKED + phantom-claim ratchet.
  • L2 — Autonomy is first-class. Scaffolding already exists but is dangling: input.md defines low/medium/high semantics governed by FEAT-011 / TD-011, which were removed in the collapse (823aa1ac). High autonomy must also infer concern selection.
  • L3 — Concerns library is the lever but was inert (bare API, no UI, until selection forced).
  • L4 — Meta-gates exist but don’t run in the loop (validate/align/check).
  • L5 — Progressive evolve beats re-splat; convergence ≠ “reviewer says SHIP.”
  • L6 — Templates should bake in AC↔test traceability (stable AC IDs, G/W/T, matrix, layer allocation).
  • L7a — Headless activation is phrasing-dependent (slash inert; by-name works). L7b — composed concerns should ship known friction (bun:sqlite under next buildbun --bun).

Bucket 1 — docs/ spec updates

#ChangeFile(s)Done when
L2Restore FEAT-011 (slider-autonomy) + TD-011 to resolve the dangling refs (input.md:16, CONTRACT-002 recorded hash, CONTRACT-001). Keep the existing low/medium/high vocab + semantics; extend with: concern-inference at high autonomy, resolution precedence, hard-stop invariant, and explicit “autonomy changes checkpoint density, never collapses the 7-activity loop.”docs/helix/01-frame/features/FEAT-011-slider-autonomy.md, docs/helix/02-design/technical-designs/TD-011-slider-autonomy-implementation.mdboth files exist; CONTRACT-002 hash re-blessed; input.md refs resolve
L2Confirm PRD autonomy goaldocs/helix/01-frame/prd.md:97already reworded (current tree) — just verify on commit
L2ADR for the autonomy spectrum decision (cite FEAT-011/TD-011)docs/helix/02-design/adr/ADR-0NN-autonomy-spectrum.mdADR exists, cited by FEAT-011 + skill
L1New FEAT-016 — artifact-honesty / claims-vs-reality (codex: split out, it’s global artifact honesty not just tests). Cross-ref from FEAT-008 + FEAT-014.docs/helix/01-frame/features/FEAT-016-artifact-honesty.mdFEAT exists; describes ASSERTED_UNBACKED + zero-floor ratchet
L4/L5Meta-gates-in-loop + evolve-default + convergence criterionextend FEAT-014-workflow-coverage.mdFEAT mandates self-validation mode-gates + states convergence = “verified + finding-classes folded into gates”
L3Concern selection mandatory + propagation gateextend FEAT-006-concerns-practices-context-digest.mdFEAT requires concern selection in frame + a propagation gate
L6AC↔test traceability is a template requirementextend FEAT-008-artifact-template-quality.mdrequires stable AC IDs + matrix + layer allocation
L7aHeadless activation phrasingdocs/install/*.md, cross-ref FEAT-013install docs teach by-name activation; slash noted interactive-only

Bucket 2 — workflows/ template / prompt / example / action updates

#ChangeFile(s)Done when
L1ASSERTED_UNBACKED + claims-vs-reality check (built, uncommitted)workflows/actions/reconcile-alignment.md (≈:216-235)fixture naming a nonexistent test is flagged + blocks
L1Phantom-claim zero-floor sub-ratchet (built, uncommitted)workflows/ratchets.md (≈:135-144)documented; tied to acceptance-criteria ratchet
L4Self-validation as workflow-mode gates (NOT literal “run validate+align+check” — avoid recursion)workflows/actions/implementation.md, evolve.md, + measure/report actionseach names a verify-activity gate that fails on findings
L3Concern selection a required frame step + auto-infer at high autonomy; check/polish are propagation gates only (not selection)workflows/actions/frame.md, workflows/references/concern-resolution.mdframe with no concerns flags; autonomy=high auto-selects; check flags missing propagation
L7bComposed-concern friction + conflict handlingworkflows/concerns/react-nextjs/practices.md, workflows/concerns/typescript-bun/practices.md, workflows/references/concern-resolution.mdbun:sqlite/next-build friction + bun --bun documented
L6Stable AC IDs (US-x-ACn) + Given/When/Then in user storiesworkflows/activities/01-frame/artifacts/user-stories/{template.md,prompt.md,example.md,meta.yml}meta required_sections enforce AC IDs; example shows them
L6Build on the existing story-test-plan AC-mapping (template.md:26-31); test-plan aggregates strategy — don’t duplicate the story-level matrixworkflows/activities/03-test/artifacts/{story-test-plan,test-plan}/*story-test-plan AC matrix strengthened; test-plan aggregates + layer allocation
L5Evolve-default + convergence criterion (not principles.md)workflows/actions/evolve.md, workflows/actions/fresh-eyes-review.md, reconcile-alignment.md (or new review-integration reference)stated + referenced from loop/review actions

Bucket 3 — skill (skills/helix/SKILL.md) — normative & mode-based only

#ChangeDone when
L2Autonomy section: low/medium/high + semantics (checkpoint density, assumption-recording, concern-inference depth, stack depth), resolution precedence (per-invocation override → artifact frontmatter / project policy → runtime default; no CLAUDE.md), hard-stop invariant, and “never collapses the 7-activity loop”SKILL.md autonomy section; routes honor the level
L3Frame route makes concern selection prominent; at high autonomy auto-selects by inferring product natureSKILL.md frame route + autonomy interplay documented
L4Point build/evolve/review routes at the workflow-mode self-validation gates (reference modes, not literal commands)SKILL.md references the gate-bearing actions
L7aDo NOT put runtime-specific “Skill tool_use” expectations in SKILL.md (runtime-neutrality). Activation mechanics live in docs/install/, tests/install/, FEAT-013/014; SKILL.md stays mode-basedno runtime-specific assertions added to skill body

Resolved open questions (per codex)

  • Claims-vs-reality home: its own FEAT-016 (global artifact honesty), cross-ref’d from FEAT-008/FEAT-014.
  • Autonomy resolution precedence: per-invocation override → artifact frontmatter / project policy → runtime default. No CLAUDE.md dependency.
  • Autonomy vocab: reuse existing low/medium/high (input.md/REFERENCE/EXECUTION) — do not introduce a new vocabulary.
  • Evolve/convergence home: the evolve/fresh-eyes-review/reconcile-alignment actions (or a new review-integration reference), not workflows/principles.md.

Sequencing & dependencies

  • Phase 0 (commit what’s built): L1 ratchet (reconcile-alignment.md + ratchets.md) + PRD :97 (already reworded). Re-bless ddx hashes; commit.
  • Phase 1 (honesty + meta-gates): L1 (new FEAT-016, fixture, enforcement) + L4 (mode-gates). Highest leverage.
  • Phase 2 (autonomy): L2 — restore FEAT-011/TD-011 (reconcile CONTRACT-002 hash + input.md), extend with concern-inference + precedence + loop invariant → SKILL.md autonomy section. Foundational for L3.
  • Phase 3 (concerns): L3 (depends on L2) + L7b.
  • Phase 4 (templates): L6 (independent; builds on story-test-plan).
  • Phase 5 (activation + convergence policy): L7a (install/tests/FEAT, folds into the live integration-test epic) + L5.

Invariants

  • Runtime neutrality: autonomy + concern signals in runtime-neutral artifacts (frontmatter/project policy), never CLAUDE.md; no runtime-specific assertions in the skill body.
  • High autonomy changes checkpoint density only — never collapses the seven-activity loop (enforce in FEAT-011/ADR/skill).
  • Every change keeps the full repo gates green (the repo test suite / just test + git diff --check, not only tests/validate-skills.sh) and re-blesses the ddx review hashes it touches.

codex verdict on rev 1: SOUND-WITH-CHANGES; this rev 2 incorporates all top-3 must-fixes + bucket-fit notes.