
Testing Layers

Unit, contract, integration, visual regression — each layer for a different gap.

Testing is not a uniform activity. Different layers catch different mistakes. The corpus pattern is to use each layer for what it is good at and not lean on one layer to do the work of another.

The layers

| Layer | What it catches | Authored by | Frequency |
| --- | --- | --- | --- |
| Unit | Logic errors inside a function | Developer | Every story |
| Contract | Boundary errors between caller and callee | Developer | Every API/service boundary |
| Integration | Wiring errors across modules | Developer + QA | Every Epic |
| End-to-end (Gherkin) | The whole-flow scenario | QA writes, Developer implements | Every story's amigos output |
| Visual regression | Unintended UI changes | Designer + QA | Every UI change |
| Accessibility | Keyboard, screen reader, contrast | Designer + QA | Every UI change |
| Performance / load | SLO breaches under expected load | Tech Lead + QA | Per Epic when ility-relevant |
| Exploratory | The unknown unknowns | QA | Pre-merge |

Unit tests

Smallest unit of test. Single function or module. Mocks out the world. Runs in milliseconds. The corpus rule: a story with no unit tests is a story whose logic was never written down twice. Twice is the discipline — once in the implementation, once in the test that proves it does what was named.

Unit tests are not the place for Gherkin scenarios that span multiple modules. Those go in integration or end-to-end.
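A minimal sketch of the discipline: the logic written once in the implementation, once in the test that proves it does what was named. The `truncate_name` helper and its behavior are hypothetical, not from the corpus.

```python
# Hypothetical unit under test: a display-name truncation helper.
def truncate_name(name: str, limit: int = 80) -> str:
    """Truncate a display name to `limit` characters, appending an ellipsis."""
    if limit < 1:
        raise ValueError("limit must be positive")
    if len(name) <= limit:
        return name
    return name[: limit - 1] + "…"

# Unit tests: one behavior per test, no I/O, milliseconds to run.
def test_short_name_unchanged():
    assert truncate_name("Mira") == "Mira"

def test_long_name_truncated_to_limit():
    out = truncate_name("x" * 84, limit=80)
    assert len(out) == 80 and out.endswith("…")

def test_nonpositive_limit_rejected():
    try:
        truncate_name("Mira", limit=0)
        assert False, "expected ValueError"
    except ValueError:
        pass
```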

Contract tests

Boundary tests. Given this caller sends this shape, the service returns this shape; given this caller sends a malformed shape, the service returns this error.

The corpus rule: every API the project exposes has at least one contract test per endpoint per documented response. The contract tests are derived from the API contract written in Volume III Part 7.
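The shape-checking idea can be sketched without any framework. The endpoint, field names, and schemas below are hypothetical; a real project would derive them from the API contract in Volume III Part 7.

```python
# A contract-test sketch: assert the response *shape*, not the exact values.
def check_shape(payload: dict, schema: dict) -> list:
    """Return a list of contract violations (empty means the shape holds)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(payload[field]).__name__}")
    return errors

# Documented 200 response for a hypothetical GET /users/{id}
USER_SCHEMA = {"id": str, "display_name": str, "created_at": str}
# Documented 404 error shape
ERROR_SCHEMA = {"error": str, "code": int}

ok_response = {"id": "u1", "display_name": "Mira", "created_at": "2026-05-22"}
err_response = {"error": "not found", "code": 404}

assert check_shape(ok_response, USER_SCHEMA) == []    # happy path holds
assert check_shape(err_response, ERROR_SCHEMA) == []  # error path holds
```

One such check per endpoint per documented response is the rule; the malformed-input case is the same check run against the documented error shape.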

Integration tests

Wire several modules together. Often spin up real dependencies — a real database, a real Redis, a stubbed third-party. Catch the wiring mistakes that no unit test sees.

Integration tests are slower. The corpus pattern: write enough of them to cover the Epic's main flows; do not write so many that the pipeline becomes painful.
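A sketch of the wiring idea, with an in-memory SQLite database standing in for "a real database." The functions and table are illustrative; the point is that the data crosses a real persistence boundary that no unit test with mocks would exercise.

```python
import sqlite3

def save_name(conn, user_id, name):
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))

def load_name(conn, user_id):
    row = conn.execute("SELECT name FROM users WHERE id = ?",
                       (user_id,)).fetchone()
    return row[0] if row else None

def test_name_round_trips_through_real_database():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
    save_name(conn, "u1", "מירה")           # Hebrew name, per the story
    assert load_name(conn, "u1") == "מירה"  # survives SQL encode/decode
    conn.close()
```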

End-to-end / Gherkin

The Gherkin scenarios from amigos (Volume III Part 5) become e2e tests. Run via a browser driver against the real frontend, real backend, real database. They are the slowest, most fragile, and most valuable.

The corpus pattern: every story has at least the required-for-prediction Gherkin scenarios as e2e. Negative cases as e2e where they cross system boundaries. The Gherkin lives next to the story in source control; the test code is generated from or aligned with the Gherkin.
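A scenario in that shape might look like the following. The feature and wording are hypothetical, modeled on the GRD-142 story that appears in the QA report below.

```gherkin
Feature: Hebrew name support
  Scenario: Hebrew name renders correctly on first load
    Given a user whose display name is "מירה"
    When the queue page loads for the first time
    Then the name is rendered right-to-left without substitution glyphs
```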

Visual regression

For UI work, the rendered output is compared against an approved baseline. Failures surface as image diffs. The Designer is the approver — "yes, that is the intended change" or "no, that is a regression."

The baselines come from Figma frames. Every named state has a baseline image. The Designer can update baselines when they intended the change.
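The comparison itself can be sketched as a diff ratio against a threshold. Real tools diff rendered PNGs with perceptual tolerances; this byte-level version, with a made-up threshold, only shows the shape of the check.

```python
def diff_ratio(baseline: bytes, rendered: bytes) -> float:
    """Fraction of differing bytes between two equal-sized framebuffers."""
    if len(baseline) != len(rendered):
        return 1.0  # a size change is always a diff
    if not baseline:
        return 0.0
    changed = sum(a != b for a, b in zip(baseline, rendered))
    return changed / len(baseline)

THRESHOLD = 0.001  # hypothetical: fail if more than 0.1% of pixels changed

baseline = bytes([200] * 1000)
unchanged = bytes([200] * 1000)
regressed = bytes([200] * 990 + [10] * 10)

assert diff_ratio(baseline, unchanged) == 0.0
assert diff_ratio(baseline, regressed) > THRESHOLD  # surfaces as an image diff
```

A failure above the threshold is what lands on the Designer's desk as an image diff to approve or reject.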

Accessibility tests

Automated checks for contrast, ARIA, keyboard reachability. Manual checks for screen reader output and keyboard flow. Both run pre-merge for any UI change.
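The contrast check is fully automatable. This sketch implements the WCAG 2.x relative-luminance formula; the AA threshold for normal text is a contrast ratio of at least 4.5:1.

```python
def _channel(c: int) -> float:
    """Linearize one sRGB channel (0-255) per the WCAG 2.x formula."""
    s = c / 255.0
    return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4

def luminance(rgb) -> float:
    r, g, b = (_channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    lighter, darker = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# Black on white is the maximum, 21:1.
assert round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1) == 21.0
# #767676 on white just clears the 4.5:1 AA bar for normal text.
assert contrast_ratio((118, 118, 118), (255, 255, 255)) >= 4.5
```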

The corpus rule: accessibility failures block merge unless explicitly accepted as a known issue with a remediation date. "We'll fix it later" without a date is the chain failing the discipline.

Performance / load

For Epics where ility selection (Volume III Part 8) names performance as material, a load test runs against staging. The test is shaped against the expected load profile.

The corpus pattern: load tests are prediction-checked, like everything else. "We expect the new endpoint to hold p95 under 200ms at 200 RPS" — checked.
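The prediction check itself reduces to a percentile over collected samples. The latency numbers below are fabricated purely to show the check; a real run would collect them from staging under the expected load profile.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(n * p / 100)
    return ordered[int(rank) - 1]

# Hypothetical per-request latencies (ms) from a staging run at 200 RPS.
latencies_ms = ([120] * 5 + [140] * 5 + [160] * 5 + [180] * 3 + [190, 250])

p95 = percentile(latencies_ms, 95)
assert p95 <= 200, f"SLO breach: p95={p95}ms exceeds predicted 200ms"
```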

Exploratory testing

The QA, with the brief and the journey map open, uses the feature like the named person would. Outside the scenarios. Looking for the moments that nobody named.

The corpus pattern: every pre-merge QA includes 30+ minutes of exploratory testing on the major stories. The output is the QA report (next section).

Pre-merge QA verification

Before a PR can merge, the QA verifies on the branch:

  • Gherkin scenarios from amigos pass.
  • Both flag-on and flag-off paths work.
  • Edge cases the QA imagined during exploration.
  • Accessibility baseline holds.

The verification is a checked artifact, not a Slack message. The QA writes a short report — what was tested, what was explored, what surprised. The report lives next to the PR.

QA report

The artifact at the end of pre-merge QA.

QA Report — PR #482 (GRD-142 Hebrew name support)
QA: Mira
Date: 2026-05-22

Tested (Gherkin):
  ✅ Hebrew name renders correctly on first load
  ✅ Mixed-form Hebrew name renders correctly
  ✅ Rare unicode form falls back gracefully
  ✅ Edit attempt is now disabled (story explicitly removes the workaround)

Explored:
  - Tried 12 names with various unicode forms; all rendered
  - Tried with extreme name length (84 chars); rendered with truncation
  - Tried Hebrew name in queue search (passes; bonus discovery)
  - Tried with screen reader; name read correctly

Surprises:
  - Unicode-fallback log line is duplicated when the same name is
    rendered twice on the same page. Filed GRD-148 (P3, cosmetic).

Not tested:
  - Bulk export view (out of scope per brief)

Accessibility:
  ✅ Contrast unchanged
  ✅ Keyboard nav unchanged
  ✅ Screen reader reads names correctly (NVDA, VoiceOver)

Visual regression:
  ✅ One intended change accepted (queue row height +2px for RTL names)

Pre-merge: APPROVED

The report is what the chain reads later — at signal reading, at postmortem, at retrospective. It is the QA's structured record of what they witnessed.

Test maintenance

Tests that are flaky, slow, or wrong are themselves chain debt. The corpus pattern: when a test is repeatedly failing for the wrong reasons, the test is fixed or deleted, not retried.
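"Repeatedly failing for the wrong reasons" is detectable from CI history: a test that both passes and fails across recent runs of the same code is flaky. The data shape here is hypothetical.

```python
def flaky_tests(history: dict) -> list:
    """Return tests whose recent runs contain both passes and failures."""
    return sorted(
        name for name, runs in history.items()
        if True in runs and False in runs
    )

# Hypothetical pass/fail history for the last four CI runs of one commit.
history = {
    "test_hebrew_name_renders": [True, True, True, True],      # solid
    "test_queue_search":        [True, False, True, False],    # flaky: fix or delete
    "test_bulk_export":         [False, False, False, False],  # honestly failing
}
assert flaky_tests(history) == ["test_queue_search"]
```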

A test suite is part of the codebase. It is reviewed, maintained, and pruned. A 4,000-test suite that nobody trusts is worse than a 400-test suite that is solid.

Part 6 — Release Gate →

200apps · How We Work · NWIRE