
Phase 5a results — canonical promotion targeted re-bench

Example from HELIX’s own docs. This generated page comes from docs/helix/. Use it to see the method in practice; start with the artifact-type catalog for reusable templates. Historical plans and reports may describe retired architecture.


Headline

| Subset (24 rows) | Pre-port baseline | Phase 5a result | Delta |
|---|---|---|---|
| Stable_pass (3-of-3) | 0/24 | 7/24 (29.2%) | +7 rows, +29.2 pp |
| Routing failure subset (27 rows) | 3/27 | 3/27 | +0 (description-saturated) |

The 24-row subset consists of 19 conv rows that were FFF (failed all three passes) at baseline plus 5 new G* gap rows. The baseline is 0/24 because the 19 FFF rows failed every pass and the 5 G* rows did not exist yet.

Wins (7 stable_pass)

| Row | Why it now passes |
|---|---|
| AM-01-prd-cascade-manual | Inline §“Apply The Autonomy Level”: manual asks before every tool use, including Read |
| AM-02-prd-cascade-guided | Inline autonomy: guided pauses before first state-changing tool use |
| AM-05-adr-singleton-manual | Inline autonomy applied to ADR cascade |
| AM-06-adr-singleton-guided | Inline autonomy applied to ADR cascade |
| C003-prd-vision-orphan-no-marker | §2 marker-absent-no-heuristic clause: returns {"active": []} instead of improvising |
| C018-iterate-current-sprint | Internal routing modes drove the iterate behavior |
| G1-spec-gap-orchestration | New row. scope_write_path matcher (forbid ^docs/helix/, allow .workflow-scratch/): skill respected the operator’s “do NOT modify specs” |
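
The G1 row's scope_write_path check can be pictured as a pair of path patterns applied to every file the skill writes. The following is a minimal sketch, not the bench's actual matcher: only the two patterns come from the row description, while the function names, the `unscoped` verdict, and the `^` anchor on the allow pattern are assumptions made for illustration.

```python
import re

# Patterns taken from the G1 row description. Anchoring the allow pattern
# with ^ is an assumption; the report only says "allow .workflow-scratch/".
FORBID = [re.compile(r"^docs/helix/")]
ALLOW = [re.compile(r"^\.workflow-scratch/")]


def write_path_verdict(path: str) -> str:
    """Classify one written path (hypothetical helper, not the bench's API)."""
    if any(p.search(path) for p in FORBID):
        return "forbidden"
    if any(p.search(path) for p in ALLOW):
        return "allowed"
    return "unscoped"


def row_passes(written_paths: list[str]) -> bool:
    """Assumed pass rule: no write may land under a forbidden prefix."""
    return all(write_path_verdict(p) != "forbidden" for p in written_paths)


if __name__ == "__main__":
    print(row_passes([".workflow-scratch/gap-notes.md"]))   # True
    print(row_passes(["docs/helix/01-frame/prd.md"]))       # False
```

Checking forbidden patterns first means a path can never be rescued by an overlapping allow rule, which matches the operator's "do NOT modify specs" intent.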

Near-misses (signal there, not stable)

| Row | Result (passes 1 to 3) | Note |
|---|---|---|
| C008-manage-infrastructure-empty | [True, False, False] | First pass behaved; passes 2 and 3 did not |
| C011-what-methodologies-active | [True, False, False] | Same flake pattern |
| G5-upstream-discovery | [True, True, False] | 2/3, one bad sample away from stable |
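
The three buckets used in this report (stable_pass, near-miss, stable fail) follow directly from the three per-pass results. The sketch below shows one way to derive them; the `classify` helper and the `near_miss` label spelling are assumptions, and only the per-row results are taken from the table above.

```python
def classify(passes: list[bool]) -> str:
    """Bucket a row by its per-pass results (3-of-3 rule for stable_pass)."""
    if all(passes):
        return "stable_pass"
    if not any(passes):
        return "stable_fail"
    return "near_miss"  # mixed results: some signal, but not stable


# Per-pass results copied from the near-miss table.
results = {
    "C008-manage-infrastructure-empty": [True, False, False],
    "C011-what-methodologies-active": [True, False, False],
    "G5-upstream-discovery": [True, True, False],
}

labels = {row: classify(passes) for row, passes in results.items()}

# Headline rate: 7 stable_pass rows out of the 24-row subset.
stable_rate_pct = round(100 * 7 / 24, 1)

if __name__ == "__main__":
    for row, label in labels.items():
        print(f"{row}: {label}")
    print(f"stable_pass rate: {stable_rate_pct}%")
```

Under this rule a 2/3 row such as G5 counts the same as a 1/3 row; both stay out of the headline number until all three passes agree.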

Stable fail (14)

C013, C014, C015, C016, C017, C024, CD-01, CD-03, CD-04, CF-02, CF-03, G2, G3, G4.

Common patterns to investigate in Phase 6:

  • C014 “reject-unauthorized-flow”: skill engages despite marker not authorizing the flow
  • C015, C016, C017 (“what’s next” / “plan the rollout”): cross-flow query mode not driving the right output shape
  • CD-01, CD-03, CD-04: concern slot resolution defects
  • CF-02, CF-03: cross-flow query / cross-instance routing
  • G2: scale recalibration math discriminator too tight (needs more permissive regex)
  • G3: artifact canonicalization 3-conjunction may need looser middle clause
  • G4: orchestration-with-gates regex may not match Sonnet’s natural phrasing

Routing

Unchanged (3/27 PASS, same 3 rows as pre-port: RE-POS-005, 008, 028). The body changes (autonomy, internal routing modes, activation discipline) are INVISIBLE to routing-evals — routing grades whether the Skill tool_use fires, not what the skill does after.
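
To illustrate why body changes cannot move the routing number, here is a minimal sketch of a routing-style check. The event shape (dicts with `type` and `name` keys) and the function name are assumptions for illustration; the bench's real transcript format may differ.

```python
def skill_tool_use_observed(events: list[dict]) -> bool:
    """Routing-style grade: did a Skill tool_use appear anywhere in the run?

    Nothing after (or about) the tool call is inspected, so changes to the
    skill body cannot affect this result.
    """
    return any(
        e.get("type") == "tool_use" and e.get("name") == "Skill"
        for e in events
    )


# Two hypothetical transcripts that differ only in post-activation behavior.
old_body = [
    {"type": "text", "text": "Let me check the project."},
    {"type": "tool_use", "name": "Skill"},
]
new_body = old_body + [
    {"type": "text", "text": "Autonomy level is manual; asking before Read."},
]
no_route = [{"type": "text", "text": "I can answer that directly."}]

if __name__ == "__main__":
    print(skill_tool_use_observed(old_body))  # True
    print(skill_tool_use_observed(new_body))  # True
    print(skill_tool_use_observed(no_route))  # False
```

Both transcripts with a Skill call grade identically, which is the sense in which the autonomy and activation-discipline changes are invisible here.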

Multi-instance (MI-*) was 0/5, the same as baseline. This is a bench-grading limitation: the skill engages and should then emit the disambiguation banner, but the bench’s “Skill tool_use observed” check fires before the banner appears. To be investigated in Phase 6.

Cost

~$65 total (Sonnet; 27 routing probes + 24 conv rows × 3 passes = 99 probes, ~$0.65/probe on average).

Under the $75 budget. Phase 5b (full bench, $260) is deferred until Phase 6 closes the stable-fail rows.
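
The probe count and per-probe average above can be re-derived with a few lines of arithmetic. This is only a worked check of the reported figures; the variable names are illustrative, and the total is treated as exactly $65 even though the report gives it as approximate.

```python
ROUTING_PROBES = 27      # routing rows, one probe each
CONV_ROWS = 24           # conv subset rows
CONV_PASSES = 3          # each conv row is run three times
TOTAL_COST_USD = 65.0    # reported as "~$65", so treat as approximate
BUDGET_USD = 75.0

total_probes = ROUTING_PROBES + CONV_ROWS * CONV_PASSES
avg_cost_per_probe = TOTAL_COST_USD / total_probes
headroom_usd = BUDGET_USD - TOTAL_COST_USD

if __name__ == "__main__":
    print(f"probes: {total_probes}")                      # probes: 99
    print(f"avg per probe: ${avg_cost_per_probe:.3f}")
    print(f"budget headroom: ${headroom_usd:.2f}")        # budget headroom: $10.00
```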

Ratchets

NOT seeded from Phase 5a — the 24-row subset is not comparable to the 36-row baseline that produced routing=0.6667 / conv=0.4722. Phase 5b (full bench) is the right surface for ratchet reset.