Phase 8 results — bench infra + 5 surgical fixes
Example from HELIX’s own docs. This generated page comes from
docs/helix/. Use it to see the method in practice; start with the artifact-type catalog for reusable templates. Historical plans and reports may describe retired architecture.
Headline
| Metric | Phase 7 baseline | Phase 8 result | Delta |
|---|---|---|---|
| Stable_pass on the 14-row failure subset | 0/14 | 6/14 (43%) | +6 |
| Broken (docker eviction) | 2/17 (Phase 7 ran 17, 2 broken) | 0/14 | infra fixed |
Spend: ~$25 (Sonnet, 14 rows × 3 passes, ~$0.60/probe avg).
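
The spend figure follows from the probe count. A quick check, assuming every row ran exactly three passes and using the approximate per-probe average quoted above (the variable names are illustrative, not taken from the bench code):

```python
rows = 14
passes_per_row = 3
avg_cost_per_probe = 0.60  # USD, the approximate average quoted above

probes = rows * passes_per_row
spend = probes * avg_cost_per_probe
stable_pass_rate = 6 / rows  # 6 of the 14 rows reached stable_pass

print(f"{probes} probes, ~${spend:.2f}, stable_pass {stable_pass_rate:.0%}")
```
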
Wins (6)
| Row | Fix attribution | Phase 7 → Phase 8 |
|---|---|---|
| CF-02-pipeline-needs-dns | sibling-skill name fix: expected_flow_instance: helix-data → helix (post canonical-skill collapse) | F/F/F → P/P/P |
| CF-03-whats-blocked-multi-flow | same fix: skill_id: helix-data → helix | F/F/F → P/P/P |
| EA-01-prd-feat-candidates-guided | [^\n]{0,80} → [\s\S]{0,200}: confirmation_marker_pattern now tolerates newlines in Sonnet’s bulleted FEAT enumeration | F/F/T → P/P/P |
| EA-02-prd-feat-candidates-guided-named | same newline fix | F/T/F → P/P/P |
| EA-03-prd-feat-candidates-autonomous | same newline fix | T/F/F → P/P/P |
| G5-upstream-discovery | bench infra (image survival): full conversation completes now instead of failing at docker pull | broken → P/P/P |
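
The newline fix in the EA rows can be illustrated with a small sketch. Only the two gap fragments come from this report; the surrounding `candidates ... confirm` pattern and the sample reply are invented stand-ins, since the real confirmation_marker_pattern is not shown here.

```python
import re

# Gap fragments quoted in the Wins table.
OLD_GAP = r"[^\n]{0,80}"    # cannot cross a line break
NEW_GAP = r"[\s\S]{0,200}"  # matches any character, including newlines


def marker_pattern(gap: str) -> re.Pattern:
    # Hypothetical stand-in for confirmation_marker_pattern:
    # an opening keyword, the gap, then a closing keyword.
    return re.compile(r"candidates" + gap + r"confirm")


# A bulleted enumeration, similar in shape to what the report describes.
reply = (
    "Here are the FEAT candidates:\n"
    "- FEAT-001 receipt export\n"
    "- FEAT-002 audit log\n"
    "Please confirm before I write them."
)

old_match = marker_pattern(OLD_GAP).search(reply)
new_match = marker_pattern(NEW_GAP).search(reply)
print(bool(old_match), bool(new_match))  # False True
```

The old fragment stops at the first line break after the opening keyword, so a bulleted list between the two keywords defeats it. The wider fragment spans the list as long as it fits in 200 characters.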
Near-misses — flake (4 rows)
| Row | Phase 7 | Phase 8 | Note |
|---|---|---|---|
C011-what-methodologies-active | [P/F/F] (1/3) | [T/F/T] (2/3) | Inconsistent reading of marker — sometimes engages, sometimes uses training knowledge |
C017-plan-rollout-ambiguous-autonomous | [F/F/F] | [T/T/F] (2/3) | The autonomous-routing prose-attribution hint is biting on 2/3 samples; one more iteration of the SKILL.md wording might land it |
G3-artifact-canonicalization | [F/F/F] | [F/T/T] (2/3) | Discriminator widening from Tier 1 + Phase 7 + Phase 8 cumulatively helping; close to stable |
EA-04-prd-feat-candidates-autonomous-many | [P/F/F] (1/3) | [F/F/T] (1/3) | Newline regex helped EA-01..03 but EA-04’s 4-FEAT case still inconsistent — Sonnet doesn’t reliably name all 4 candidates |
Stable fails (4)
| Row | Why | Recommended next |
|---|---|---|
| C014-reject-unauthorized-flow [T/F/F] | The [^\n] → [\s\S] regex fix landed; pass0 PASS confirms it works. But Sonnet doesn’t always emit the marker-pointing refusal; sometimes it just refuses generically. The skill body §5 wording could be tighter, OR the row’s prompt needs framing that makes the marker-citation more natural | SKILL.md §5 wording iteration |
| CD-02-positive-control-no-edge | file_read matcher needs both read_indices AND surfaced to fire. Sonnet surfaces “market-validation-brief” from training memory without reading graph.yml | Workspace-fixture defense: change the edge signature to something unmistakably project-local (e.g. prd requires snake-oil-validation) that can’t be guessed |
| CD-04-robustness-variant | Same file_read defect as CD-02 with regulatory-impact-assessment edge | Same fixture-defense approach |
| G4-orchestration-with-gates | Prompt rewrite from Phase 7 added “FEAT-X (Receipt export, spec at docs/…)” but the workspace has no docs/helix/01-frame/FEAT-X-receipt-export.md file. Sonnet correctly asks for it | Add the FEAT-X fixture file to the workspace |
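
The CD-02/CD-04 failure mode is easier to see as a conjunction. The sketch below is a guess at the shape of the check, not the bench's actual matcher: only the names read_indices and surfaced come from this report, while `Transcript` and `file_read_fires` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Indices of tool calls that read the target file (empty if never read).
    read_indices: list[int] = field(default_factory=list)
    # True if the expected edge name appeared in the model's answer.
    surfaced: bool = False


def file_read_fires(t: Transcript) -> bool:
    """Fire only when the file was read AND the edge was surfaced."""
    return bool(t.read_indices) and t.surfaced


# CD-02-style sample: the edge name is surfaced from memory, file never read.
guessed = Transcript(read_indices=[], surfaced=True)
# Intended behaviour: graph.yml is read first, then the edge is surfaced.
grounded = Transcript(read_indices=[3], surfaced=True)

print(file_read_fires(guessed), file_read_fires(grounded))  # False True
```

Under this reading, an edge name that cannot be guessed forces the model to read the file before it can surface anything, which is the point of the fixture defense.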
Infra fix (8.1) — validated end-to-end
family-test/docker/run-probe.sh now rebuilds family-test-claude:latest on demand if docker has evicted it. Verified by deleting the image and running a probe — image rebuilt automatically, probe succeeded. Phase 8 re-bench (14 rows over ~80 min) completed with zero docker evictions.
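
The rebuild-on-demand check can be sketched as follows. This is a Python rendering of the idea, not the contents of run-probe.sh: the function names and the build-context argument are assumptions, and the command runner is injectable so the logic can be exercised without docker installed.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_cmd(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def ensure_image(image: str, build_context: str, run: Runner = run_cmd) -> bool:
    """Rebuild `image` if it is missing. Returns True if a rebuild happened."""
    # `docker image inspect` exits non-zero when the image is absent locally.
    if run(["docker", "image", "inspect", image]) == 0:
        return False
    # Image was evicted (or never built): rebuild it from the given context.
    if run(["docker", "build", "-t", image, build_context]) != 0:
        raise RuntimeError(f"failed to rebuild {image}")
    return True
```

A probe wrapper would call something like `ensure_image("family-test-claude:latest", "family-test/docker")` before starting the container; the context path in that call is a guess based on the script's location.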
Aggregate scoreboard across all phases
27 unique stable_pass rows confirmed:
- AM-01..04 (4 rows — autonomy matrix, Phase 5a)
- C003, C008, C013, C015, C016, C018, C024 (7 rows — marker edges, prompt rewrites, discriminator widenings)
- CD-01, CD-03 (2 rows — Phase 7 discriminator widenings)
- CF-02, CF-03 (2 rows — Phase 8 sibling-skill fixes)
- EA-01, EA-02, EA-03 (3 rows — Phase 8 newline tolerance)
- G1 (Phase 5a — new gap row, scope_write_path discriminator)
- G2 (Phase 6 — discriminator widening)
- G5 (Phase 8 — bench infra fix)
- MI-01..06 (6 rows — Phase 6 matcher_change for multi-instance disambiguation)
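
The total can be checked by summing the groups listed above. The 71-row denominator is the full conversation set cited in the Stage 5b recommendation; dividing by it gives only the share already confirmed, not a prediction.

```python
# Row counts per group, copied from the list above.
stable_pass = {
    "AM-01..04": 4,
    "C003, C008, C013, C015, C016, C018, C024": 7,
    "CD-01, CD-03": 2,
    "CF-02, CF-03": 2,
    "EA-01..03": 3,
    "G1": 1,
    "G2": 1,
    "G5": 1,
    "MI-01..06": 6,
}

total = sum(stable_pass.values())
full_set = 71  # size of the full conversation set
confirmed_share = round(total / full_set, 2)

print(total, confirmed_share)  # 27 0.38
```
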
Stage 5b recommendation
Conditional YES — the floor is now substantially clean:
- 27 rows confirmed stable_pass (up from ~0 measurable at session start)
- Docker eviction issue resolved (8.1)
- 8 known remaining defects (4 stable_fail, 4 near-miss flake) are characterized and need either small skill-body iteration or workspace fixture work; they’re not architectural
If the goal is clean ratchet seeding, Stage 5b (~$107 at current per-probe rates, ~3-4h wall-clock) is now safe to run. Expect:
- Routing baseline ~0.7-0.8 (vs current 0.6667; the v0.2.4 → v0.2.5 description tweaks may have nudged it)
- Stable_pass on the full 71-conversation set probably 0.45-0.55 (extrapolating from the 27 confirmed wins + new rows that haven’t yet been Phase-8-touched)
If the goal is to finalize the remaining 8 rows first, Phase 9 would be:
- SKILL.md §5 sharpening for C014 stable-emission
- Workspace fixture defense for CD-02/CD-04
- Add FEAT-X file to G4 workspace
- EA-04 4-FEAT regex one more iteration
- Then Stage 5b on a near-clean bench
I’d suggest proceeding with Stage 5b now for the ratchet reset: the Phase 9 work is iterative refinement on a bench that’s already broadly clean. Ratchets seeded now provide a useful baseline; refinements move them up, not down.