Feature Specification: FEAT-010 — Testing Strategy Templates
Example from HELIX’s own docs. This generated page comes from
docs/helix/. Use it to see the method in practice; start with the artifact-type catalog for reusable templates. Historical plans and reports may describe retired architecture.
Source identity (from 01-frame/features/FEAT-010-testing-strategy-templates.md):
ddx:
  id: FEAT-010
  depends_on:
    - helix.prd
    - FEAT-006
    - FEAT-008
  review:
    self_hash: 099a209181da34a93b78420c9ea8e21453f6af8e0c3b69bd125a8fc8e3d29306
    deps:
      FEAT-006: 1711c08bf5e041cd041c762594f278d35744351d4a25b8251566de2dd778abd3
      FEAT-008: 898b8004c60c52c73185f68b907445f057b797bcc03706bab227b181cc07281f
      helix.prd: 2b22383538b33c6ecee57f43d85128dfef7d56254766b757aa36439e35f2bfc9
    reviewed_at: "2026-05-25T22:13:43Z"
Feature ID: FEAT-010
Status: Draft
Priority: P1
Owner: HELIX maintainers
Overview
“How do I know my software works?” is the central challenge of agentic development. HELIX should provide structured guidance on testing approaches — not test frameworks, but structured prompts and patterns that help agents write better tests and help humans define what “works” means.
Problem Statement
- Current situation: HELIX has a Plan activity that converts acceptance criteria into tests, but no structured templates for different testing strategies. Agents default to unit tests and miss integration patterns, property-based testing, performance benchmarking, and flow testing.
- Pain points:
  - Acceptance criteria are often too vague to generate meaningful tests.
  - Agents write narrow unit tests that pass but miss integration failures.
  - No standard way to define performance targets and verify them.
  - Property-based testing (invariants that hold across all inputs) is rarely used because agents aren’t prompted for it.
  - Resilience testing (what happens when things fail?) is never considered unless explicitly requested.
Functional Requirements
FR-1: Acceptance Test Generation
Templates and prompts for generating test scaffolding directly from acceptance criteria. Given a feature spec with acceptance criteria, the template guides agents to produce:
- One test per acceptance criterion.
- Setup/teardown that matches the criterion’s preconditions.
- Assertions that map directly to the criterion’s expected outcomes.
FR-2: Property-Based Testing Patterns
Templates for defining invariants — properties that should hold across all inputs. Includes:
- How to identify good properties for a given domain.
- How to express properties as test assertions.
- When property-based testing adds value vs. example-based testing.
- Integration with common property-testing libraries (e.g., fast-check, proptest, hypothesis).
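To make the idea of an invariant concrete, here is a small sketch that checks three properties of a hypothetical `dedupe` function. It uses only the standard library's `random` module as a hand-rolled stand-in; libraries such as hypothesis additionally provide input strategies and counterexample shrinking, which this sketch does not attempt.

```python
import random


def dedupe(items):
    """Hypothetical function under test: drop repeats, keep first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_property(prop, trials=200, seed=0):
    """Run prop on random integer lists; return the first counterexample or None."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if not prop(xs):
            return xs
    return None


# Invariants that should hold for every input, not just hand-picked examples.
def no_duplicates(xs):
    out = dedupe(xs)
    return len(out) == len(set(out))


def same_elements(xs):
    return set(dedupe(xs)) == set(xs)


def idempotent(xs):
    return dedupe(dedupe(xs)) == dedupe(xs)
```

A call such as `check_property(idempotent)` returns `None` when no counterexample is found within the trial budget; that is evidence for the property, not a proof.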
FR-3: Integration and Flow Testing
Templates for testing multi-step workflows where individual units pass but the composition fails:
- API contract testing between components.
- End-to-end flow testing for user-facing workflows.
- State machine testing for stateful protocols.
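For the state machine case, one possible approach is to write the allowed transitions down as a table and drive whole event sequences through it. The sketch below assumes a hypothetical order protocol; the state and event names are illustrative only. In a real flow test the table would act as the reference model, and the same sequences would be replayed against the system under test.

```python
# Hypothetical order protocol: (current state, event) -> next state.
TRANSITIONS = {
    ("created", "pay"): "paid",
    ("paid", "ship"): "shipped",
    ("created", "cancel"): "cancelled",
    ("paid", "cancel"): "cancelled",
}


def run_flow(events, start="created"):
    """Drive a sequence of events; raise ValueError on an illegal transition."""
    state = start
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event!r} from {state!r}")
        state = TRANSITIONS[key]
    return state


def rejects(events):
    """True when the protocol refuses the given event sequence."""
    try:
        run_flow(events)
    except ValueError:
        return True
    return False
```

Testing sequences such as pay, cancel, ship catches composition failures that per-step unit tests miss, because each step can be valid on its own while the sequence is not.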
FR-4: Performance Benchmarking
Templates for defining and measuring performance targets:
- How to define measurable performance criteria (latency percentiles, throughput, cost budgets).
- How to write reproducible benchmarks.
- How to integrate performance checks into the build loop.
- How to set ratchet floors for performance metrics.
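A latency target and its ratchet can be sketched in a few lines. The example below assumes a nearest-rank percentile and a 5% tolerance band; both choices, along with the `check_ratchet` name, are assumptions made for illustration rather than anything HELIX prescribes.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_ratchet(measured_ms, floor_ms, tolerance=0.05):
    """Compare a latency measurement with the recorded bound.

    Returns (passed, new_floor). For latency, lower is better, so the
    recorded bound only ever tightens and never loosens after a failure.
    """
    passed = measured_ms <= floor_ms * (1 + tolerance)
    new_floor = min(floor_ms, measured_ms) if passed else floor_ms
    return passed, new_floor
```

In a build loop, the p95 of a benchmark run would be passed to `check_ratchet` together with the stored bound, and the returned bound written back only when the check passes. The tolerance absorbs run-to-run noise; for a throughput metric the comparison would be reversed.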
FR-5: Concern-Specific Testing
Testing templates that activate based on project concerns:
- a11y: accessibility audit patterns, screen reader testing.
- o11y: observability verification (are metrics emitted? are traces connected?).
- i18n: locale testing, string extraction verification.
- security: input validation, authentication boundary testing.
FR-6: AC→Test Coverage Floor
Every acceptance criterion must trace to ≥1 test that exercises it — drives
the criterion’s action and asserts its observable outcome. The story test plan’s
per-AC matrix names the asserting behavior (not just a test name) so coverage is
mechanically checkable. A criterion with no exercising test is a blocking
coverage gap (reconcile-alignment Acceptance Criteria Validation), with one
escape hatch: a recorded reviewed exception (“manual verification accepted” /
“non-automatable AC”) with evidence. A weak or irrelevant test does not count as
coverage — it is classified UNTESTED / ASSERTED_UNBACKED. This is a floor on
minimum rigor; tests beyond one-per-AC are always welcome.
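The floor described above is meant to be mechanically checkable, which could look roughly like the following. The `MatrixRow` shape and its field names are hypothetical; the story test plan's actual matrix format is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class MatrixRow:
    """One row of a hypothetical per-AC matrix; the field names are assumptions."""
    ac_id: str
    # What each exercising test asserts (the behavior, not just a test name).
    asserting_behaviors: list = field(default_factory=list)
    reviewed_exception: str = ""  # e.g. "manual verification accepted"
    exception_evidence: str = ""  # an exception only counts with evidence


def blocking_gaps(rows):
    """Return AC IDs that have neither an exercising test nor a valid exception."""
    gaps = []
    for row in rows:
        has_test = bool(row.asserting_behaviors)
        has_exception = bool(row.reviewed_exception and row.exception_evidence)
        if not has_test and not has_exception:
            gaps.append(row.ac_id)
    return gaps
```

An exception recorded without evidence still counts as a gap in this sketch, matching the requirement that the escape hatch carries evidence. Judging whether a listed test is weak or irrelevant remains a review step that this check does not automate.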
FR-7: E2E Core-Flow Coverage
A UI web app fills the e2e-framework slot (FEAT-006; default e2e-playwright).
Selecting the tool is not coverage: at least one core user-flow must have a
browser e2e test that runs green against the running app — a config plus an
unexecuted .spec file does not satisfy it. The run-core-flow gate lives in the
e2e-playwright practices; reconcile-alignment treats a core flow whose e2e was
never run against the app as UNTESTED.
FR-8: Verification Discipline (the evidence gate)
Aggressive end-to-end verification is a required discipline, carried by the
always-on verification concern (quality-attribute, areas: all). Its
distinct job is “work is NOT DONE until observed evidence of the running
system exists” — it is the evidence gate, not a second test strategy.
Boundary (no duplication): testing owns test strategy/frameworks;
e2e-framework owns the e2e tooling; verification owns the evidence gate.
The discipline requires, before “done”:
- Re-review — an adversarial second pass against the acceptance criteria and the integration risks the change touched, not a self-affirming check.
- Whole-stack exercise — real user flows driven against the running system end-to-end (the full stack), catching locally-correct / globally-wrong work. “Done” = whole stack exercised with recorded evidence, not unit-green.
- Recorded evidence artifacts — the command run + its exit status, the target URL/env, the core flows exercised, and a short re-review checklist against the ACs and integration risks.
- Verify-don’t-trust — never assert a result that was not observed; a self-reported “complete” is a hypothesis to verify (extends the measure action’s M4).
The discipline honors exceptions: library / docs-only / non-buildable work
(no running stack to exercise) and products where full-stack e2e is genuinely
infeasible (record the specific reason and substitute the strongest observable
evidence). High autonomy auto-selects verification for buildable products
(FEAT-011; concern-resolution), honoring these exceptions.
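One way to picture the evidence gate is as a record whose missing fields block "done". The sketch below is an assumption-laden illustration: `EvidenceRecord`, its field names, and the rule that a nonzero exit status blocks completion are all hypothetical, as is the choice to let a recorded exception reason waive the target and flow fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceRecord:
    """Hypothetical container for the recorded evidence artifacts."""
    command: str = ""
    exit_status: Optional[int] = None
    target: str = ""  # URL or environment name
    flows_exercised: list = field(default_factory=list)
    rereview_checklist: dict = field(default_factory=dict)  # item -> checked?
    exception_reason: str = ""  # e.g. "docs-only change, no running stack"


def missing_evidence(rec):
    """List what still blocks "done"; an empty list means the gate is satisfied."""
    missing = []
    if not rec.command or rec.exit_status is None:
        missing.append("command and exit status")
    elif rec.exit_status != 0:  # assumption: a failing run is not evidence of done
        missing.append("passing run")
    needs_stack = not rec.exception_reason
    if needs_stack and not rec.target:
        missing.append("target URL/env")
    if needs_stack and not rec.flows_exercised:
        missing.append("core flows exercised")
    if not rec.rereview_checklist or not all(rec.rereview_checklist.values()):
        missing.append("completed re-review checklist")
    return missing
```

Under this sketch an exception does not waive the command or the re-review checklist, since the strongest observable substitute evidence still has to be recorded somewhere.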
FR-9: AC-Citation Traceability
On top of the AC→test coverage floor (FR-6), every covering test must cite the
acceptance criterion it exercises in a canonical, parseable syntax — a
structured tag @covers <AC-ID> (e.g. @covers US-001-AC3) using the story’s
stable US-<n>-AC<m> ID — so AC→test traceability is machine-checkable rather
than guessed. This is an additional gate on top of exercise+pass+satisfy,
never a replacement. reconcile-alignment classifies a test that exercises and
passes but does not cite the AC ID as UNCITED_COVERAGE (not covered for
traceability; the fix is to add the citation, not write a new test) — distinct
from UNTESTED (no exercising test) and from ASSERTED_UNBACKED (cites an AC it
does not exercise — a phantom claim under FEAT-016 artifact honesty). The
story test plan’s per-AC matrix records each test’s @covers citation.
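Because the citation syntax is parseable, the classification above can be sketched with a regular expression. This is an illustration under stated assumptions: where the tag lives in a test (comment, docstring, or annotation) is not specified, so the sketch searches the whole source text; the `COVERED` label and the `classify` helper are hypothetical; and whether a test actually exercises the AC is supplied by the caller rather than inferred.

```python
import re

# Matches the tag form described above, e.g. "@covers US-001-AC3".
COVERS = re.compile(r"@covers\s+(US-\d+-AC\d+)\b")


def cited_acs(test_source):
    """Return the set of AC IDs cited anywhere in one test's source text."""
    return set(COVERS.findall(test_source))


def classify(ac_id, tests):
    """Classify one AC given {test_name: (source_text, exercises_ac)}.

    exercises_ac comes from the caller (a reviewer judgment or another
    tool); this sketch does not try to infer it.
    """
    exercising = {name for name, (_, ex) in tests.items() if ex}
    citing = {name for name, (src, _) in tests.items() if ac_id in cited_acs(src)}
    if exercising & citing:
        return "COVERED"  # hypothetical label for the passing case
    if exercising:
        return "UNCITED_COVERAGE"  # fix: add the citation
    if citing:
        return "ASSERTED_UNBACKED"  # cites an AC it does not exercise
    return "UNTESTED"
```

The ordering of the checks reflects that the citation gate sits on top of exercising coverage: a citation alone never upgrades a test that does not exercise the criterion.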
Non-Functional Requirements
NFR-1: Framework Agnostic
Templates must describe what to test and how to think about testing, not mandate specific test frameworks. Framework-specific details belong in concern practices (FEAT-006), not here.
NFR-2: Composable
Templates can be combined — a feature may need acceptance tests, property tests, and performance benchmarks simultaneously.
Acceptance Criteria
- Acceptance test generation template exists and can produce test scaffolding from a feature spec’s acceptance criteria.
- Property-based testing template exists with at least three domain-specific examples.
- Integration/flow testing template exists with multi-step workflow coverage.
- Performance benchmarking template exists with ratchet floor integration.
- At least two concern-specific testing templates exist (e.g., a11y, o11y).
- Templates are referenced from /helix frame and /helix plan prompts so agents use them when generating test strategies.
- Every acceptance criterion traces to ≥1 exercising test; an untested AC is a blocking coverage gap unless a reviewed manual/non-automatable exception is recorded with evidence (FR-6).
- A UI web app fills e2e-framework (default e2e-playwright) and has ≥1 core user-flow covered by a browser e2e that runs green against the running app; tool selection alone does not satisfy this (FR-7).
- The verification concern exists (quality-attribute, areas: all) as the evidence gate, with the testing/e2e-framework/verification boundary stated and its exceptions recorded; high autonomy auto-selects it for buildable products. “Done” requires recorded evidence artifacts (command+exit, URL/env, flows exercised, re-review checklist), not unit-green (FR-8).
- Every covering test cites the AC ID it exercises in the canonical @covers <AC-ID> syntax; a test that exercises and passes but does not cite is UNCITED_COVERAGE (fix = add the citation, not a new test), distinct from UNTESTED and ASSERTED_UNBACKED (FR-9; FEAT-016).
Constraints
- Must work with the existing concern system (FEAT-006).
- Must integrate with HELIX’s Plan activity workflow.
- Templates are advisory — they guide agents but don’t enforce specific tools.
Out of Scope
- Implementing actual test framework integrations (that’s concern-level).
- Chaos/fault injection testing (defer until resilience concerns exist).
- Mutation testing (interesting but not essential for initial release).