Phase 5a results — canonical promotion targeted re-bench

Example from HELIX’s own docs. This generated page comes from docs/helix/. Use it to see the method in practice; start with the artifact-type catalog for reusable templates. Historical plans and reports may describe retired architecture.

Phase 5a results — canonical promotion targeted re-bench

Headline

Subset (24 rows)	Pre-port baseline	Phase 5a result	Delta
Stable_pass (3-of-3)	0/24	7/24 (29.2%)	+7 rows, +29.2 pp
Routing failure subset (27 rows)	3/27	3/27	+0 (description-saturated)

The 24-row subset = 19 baseline-FFF conv rows + 5 new G* gap rows. 0/24 baseline because the 19 FFF rows were all-fail and the 5 G* rows didn’t exist.

Wins (7 stable_pass)

Row	Why it now passes
AM-01-prd-cascade-manual	Inline §“Apply The Autonomy Level” — manual asks before every tool use, including Read
AM-02-prd-cascade-guided	Inline autonomy — guided pauses before first state-changing tool use
AM-05-adr-singleton-manual	Inline autonomy applied to ADR cascade
AM-06-adr-singleton-guided	Inline autonomy applied to ADR cascade
C003-prd-vision-orphan-no-marker	§2 marker-absent-no-heuristic clause: returns `{"active": []}` instead of improvising
C018-iterate-current-sprint	Internal routing modes drove the iterate behavior
G1-spec-gap-orchestration	New row — `scope_write_path` matcher (forbid `^docs/helix/`, allow `.workflow-scratch/`) — skill respected the operator’s “do NOT modify specs”

Near-misses (signal there, not stable)

Row	Result	Note
C008-manage-infrastructure-empty	[True, False, False]	First pass behaved; 2 + 3 didn’t
C011-what-methodologies-active	[True, False, False]	Same flake pattern
G5-upstream-discovery	[True, True, False]	2/3 — one bad sample away from stable

Stable fail (14)

C013, C014, C015, C016, C017, C024, CD-01, CD-03, CD-04, CF-02, CF-03, G2, G3, G4.

Common patterns to investigate in Phase 6:

C014 “reject-unauthorized-flow” — skill engages despite marker not authorizing the flow
C015, C016, C017 — “what’s next” / “plan the rollout” — cross-flow query mode not driving the right output shape
CD-01..04 — concern slot resolution defects
CF-02, CF-03 — cross-flow query / cross-instance routing
G2 — scale recalibration math discriminator too tight (needs more permissive regex)
G3 — artifact canonicalization 3-conjunction may need looser middle clause
G4 — orchestration-with-gates regex may not match Sonnet’s natural phrasing

Routing

Unchanged (3/27 PASS, same 3 rows as pre-port: RE-POS-005, 008, 028). The body changes (autonomy, internal routing modes, activation discipline) are INVISIBLE to routing-evals — routing grades whether the Skill tool_use fires, not what the skill does after.

Multi-instance (MI-*) was 0/5 — same as baseline. Bench-grading limitation: the skill engages then SHOULD emit the disambiguation banner, but the bench’s Skill tool_use observed check fires before the banner. Investigation for Phase 6.

Cost

~$65 total (Sonnet, 27 routing + 24 conv × 3 passes = 99 probes total, ~$0.65/probe avg).

Under the $75 budget. Stage 5b ($260 full bench) deferred until Phase 6 closes the stable-fail rows.

Ratchets

NOT seeded from Phase 5a — the 24-row subset is not comparable to the 36-row baseline that produced routing=0.6667 / conv=0.4722. Phase 5b (full bench) is the right surface for ratchet reset.