Skip to content

Sample Data

Category: Quality Attributes · Areas: data

Description

Category

quality-attribute

Areas

data

Boundary

This concern owns governed app seed/demo data — the data a data-backed product ships with so it feels real and exercises its UI states. It is distinct from testing, and the two must not blur:

  • testing owns test fixtures, factories, mocks, isolation, and assertion-specific data — the throwaway data a single test constructs to exercise one behavior and then tears down. (Its “fake data over fixtures” practice governs test data.)
  • sample-data owns the seed/demo dataset the running product is populated with — the rows a reviewer, a demo, or a screenshot actually sees. Its job is to make a data-backed product feel real and to drive every UI state (empty, full, long, overflowing, boundary, every status/enum variant).

“e2e runs against realistic data” is not this concern collapsing into testing: sample-data produces the governed seed the running system loads; testing decides what a test asserts. Keep them separate.

Components

  • Semantic fake-data library — a semantically-aware generator (names, emails, addresses, prices, dates, lorem) as a tech-stack-specific named default, following the active runtime concern:

    • typescript-bun@faker-js/faker
    • python-uvFaker
    • rust-cargofake
    • other stacks → the idiomatic semantic faker for that runtime

    This is the default tool, not the determinism mechanism (see Constraints).

  • Seed script — a governed, idempotent, re-runnable deliverable that populates the running product’s datastore with the generated dataset.

  • Curated edge-case records — literal rows inside the seed script that guarantee specific boundary cases the generator would otherwise hit only by chance.

Constraints

Determinism comes from the seed, not the library

  • Determinism is produced by an explicit seed passed to the faker, a stable locale/config, a pinned dependency version, and a deterministic generation order — not by the choice of library. Re-running the seed script with the same seed must produce the same dataset.
  • Record or print the seed so a dataset can be reproduced exactly.

Generate varied shapes and schema-relevant edge cases

  • Generate varied data — not N copies of one row. Vary string lengths, optional-field presence, value ranges, and timestamps.
  • Cover schema/domain-relevant edge cases so every UI state renders: empty (zero rows / empty collections), long (long strings, long lists), large (big numbers, large counts), boundary (min/max, just-over/just-under limits), and all status/enum variants.
  • Edge cases are bounded by default, not invalid-by-reflex — generate the realistic boundaries the schema and domain allow, not malformed data the product is not required to accept.

The seed script is a governed deliverable

  • The seed script is a first-class, idempotent/re-runnable deliverable, not an afterthought. Re-running it converges to the same governed dataset rather than duplicating rows.
  • Guardrails:
    • Set an explicit faker seed for reproducibility.
    • Use clearly synthetic, non-PII values — never real or realistic personal data.
    • Never execute against, or mix with, production data or a production datastore. Seed only non-production environments.
  • Avoid thin, ad-hoc, hardcoded-only data (a couple of uniform rows). But curated literal edge-case records inside the governed seed script are encouraged — they are often the clearest way to guarantee a specific boundary case exists.

Drift Signals (anti-patterns to reject in review)

  • A couple of thin, hardcoded, uniform rows as the only seed data → generate a varied dataset via the semantic faker plus curated edge-case records
  • No semantic faker dependency for a data-backed product → add the tech stack’s named default (@faker-js/faker, Faker, fake, …)
  • Seed data with no edge cases (no empty / long / large / boundary / all-status coverage) → add schema-relevant edge cases so every UI state renders
  • Nondeterministic seed generation (no explicit seed, unstable locale, unpinned faker version, unordered generation) → make it reproducible
  • Real or realistic PII in sample data → replace with clearly synthetic values
  • A seed script that can run against / mix with production → restrict it to non-production environments

When to use

Any data-backed product — one whose value shows through data it stores and renders. High autonomy auto-selects this concern for data-backed products (see workflows/references/concern-resolution.md); areas: data scopes its practices to data-layer work items. Compose with the tech-stack concern (which fixes the faker library) and with UX concerns (the varied data is what exercises the empty/overflow/large-number states the UI must handle).

Artifact Impact

Selecting this concern requires these artifacts to change (a selected concern absent from them is drift):

  • DATA_DESIGN: governed seed/demo dataset covering empty/long/large/boundary + all status-enum UI states
  • IMPLEMENTATION_PLAN: idempotent re-runnable seed script with explicit faker seed; non-PII, non-production only

ADR References

Practices by activity

Agents working in any of these activities inherit the practices below via the bead’s context digest.

These practices govern the app seed/demo dataset a data-backed product ships with — the data that makes it feel real and drives every UI state. They do not govern test data: testing owns fixtures, factories, and assertion-specific data (see the boundary in concern.md).

Frame

  • Use the semantic fake-data library that is the named default for the active runtime concern:
    • typescript-bun@faker-js/faker
    • python-uvFaker
    • rust-cargofake
    • other stacks → the idiomatic semantic faker for that runtime
  • Add it as a project dependency at a pinned version. The library is the default generator, not the source of determinism.

Design

Determinism comes from how the generator is driven, not from which one:

  • Pass an explicit seed to the faker at the top of the seed script.
  • Fix a stable locale/config so generated values do not shift with the environment.
  • Pin the faker version so an upgrade cannot silently change the dataset.
  • Generate in a deterministic order — no reliance on map/set iteration order or wall-clock time.
  • Record/print the seed so any dataset can be reproduced exactly.
  • Produce varied rows — vary string lengths, optional-field presence, value ranges, and timestamps. Never N copies of one row.
  • Cover the schema/domain-relevant edge cases so every UI state renders:
    • empty — zero rows / empty collections (the empty-state UI)
    • long — long strings, long lists (overflow / truncation UI)
    • large — big numbers, large counts (formatting / pagination UI)
    • boundary — min/max and just-over/just-under schema limits
    • all status/enum variants — one record per status so every badge/branch renders
  • Keep edge cases bounded — realistic extremes the schema and domain allow, not malformed data the product need not accept.

Build

  • The seed script is idempotent and re-runnable: running it again converges to the same governed dataset rather than appending duplicates.
  • Guardrails:
    • Explicit faker seed — reproducible runs.
    • Clearly synthetic, non-PII values — never real or realistic personal data.
    • Never against / mixed with production — seed only non-production environments; the script must refuse a production target.
  • Prefer the faker for bulk variety; add curated literal edge-case records inside the same governed script for boundary cases you must guarantee. Avoid thin hardcoded-only data as the whole dataset.

Test

  • A semantic faker library is a pinned dependency for the data-backed product.
  • A governed, idempotent seed script exists and populates the running product.
  • The seed dataset is varied and covers empty / long / large / boundary / all-status edge cases (verifiable by row-count + variety in the seeded store).
  • Generation is deterministic: explicit seed, stable locale, pinned version, ordered generation.
  • Sample data is clearly synthetic (no PII) and the seed never targets production.

Cross-cutting

Boundary with testing

  • sample-data populates the running product; testing constructs throwaway data for a single assertion. Do not seed the app from test fixtures, and do not assert product behavior against the demo seed instead of purpose-built test data.