
A brief that didn't witness

The most common Volume II failure. The brief looks structurally correct. The prediction has all five fields. And reality, when it answers, says the team measured the wrong thing.

The artefact

Excerpt — Feature Brief, "Grading Flow v2", March 2026

Experience snapshot: Graders today are slowed down by repetitive friction in the LMS. The new grading flow will reduce this friction and improve their grading experience.

Prediction:

  • Baseline: 45 minutes per grading cycle (estimated from internal usage data)
  • Target: under 20 minutes per cycle
  • Check date: 2026-06-15
  • Check method: query the time-on-task analytics dashboard
  • Owner: Alex (PO)

Sign-off: PO ✅ · Designer ✅ · Tech Lead ✅

The brief looks structurally complete. Every Volume II checkbox is ticked. The trio signed off. The cycle ran. Six weeks later, the dashboard query returned 18 minutes — better than the target. The team celebrated. Two weeks after that, support tickets started coming in: graders saying the new flow was worse. CSAT was down. The customer's head of department asked for a meeting.

What's wrong?

Stop. Read the artefact again. Three things are wrong. Find at least two before you open the diagnosis.

Diagnosis (open when ready)

1. The experience snapshot is generic

Graders today are slowed down by repetitive friction in the LMS.

This is not an experience snapshot. It is a sentence about a category of person, not a moment in a named person's day. Compare with the corpus standard:

It is Wednesday morning. Gal sits down at 08:50 with her coffee and opens the LMS to grade the morning's batch of CS101 finals. The fourth submission is from a student named Yael Rosenberg-Hayut. The system displays her name with the hyphen and accents broken…

The corpus rule, from Volume II · Person & Moment: every brief begins with a named person whose life will change. Without a named person, the brief is a hypothesis about an aggregate. The team is now solving for an average that no one experiences.

2. The baseline was not witnessed

Baseline: 45 minutes per grading cycle (estimated from internal usage data)

"Estimated from internal usage data" is the warning sign. The dashboard's time-on-task metric counts foreground time: the minutes the LMS tab is the active window. It does not measure the work. A grader who alt-tabs to a spreadsheet to fix a Hebrew name is off-task to the dashboard but fully on-task in their actual job.
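
To make the gap concrete, here is a minimal sketch of what a foreground-based time-on-task metric computes. The event names, timestamps, and summing logic are assumptions for illustration, not the LMS's actual telemetry; the point is that a metric of this shape drops alt-tab minutes by construction.

from datetime import datetime, timedelta

# One grading session as (timestamp, event) pairs; 'focus' / 'blur'
# mark the LMS tab gaining or losing the foreground. Hypothetical data.
t0 = datetime(2026, 4, 22, 8, 50)
m = lambda mins: t0 + timedelta(minutes=mins)

events = [
    (m(0),  "focus"),   # opens the LMS, starts grading
    (m(6),  "blur"),    # alt-tab to the spreadsheet: fix a Hebrew name
    (m(10), "focus"),   # back to the LMS
    (m(20), "blur"),    # alt-tab again
    (m(25), "focus"),
    (m(47), "blur"),    # cycle ends
]

def dashboard_time_on_task(events):
    # Sum only the intervals in which the LMS tab held the foreground.
    total, focus_start = timedelta(), None
    for ts, kind in events:
        if kind == "focus":
            focus_start = ts
        elif kind == "blur" and focus_start is not None:
            total += ts - focus_start
            focus_start = None
    return total

print(dashboard_time_on_task(events))    # 0:38:00, what the dashboard reports
print(events[-1][0] - events[0][0])      # 0:47:00, what the stopwatch sees

The stopwatch sees a 47-minute cycle; the metric reports 38, because nine minutes happened in a spreadsheet. That nine-minute blind spot is exactly where the grading pain lives.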

The brief should have witnessed the activity directly: sit next to named graders for full sessions, stopwatch in hand. What the brief should say is "47 minutes per cycle (mean), n=12, captured 2026-04-22 by direct observation of three named graders". The 45-minute number is a query result, not a measurement of the activity.

When the cycle ran, the dashboard reported 18 minutes, but it was now reporting on a different pattern: graders had stopped alt-tabbing because the new flow handled Hebrew names. The dashboard time fell. The actual focused-grading time also fell, but by less than the dashboard suggested. And graders disliked the new flow for an unrelated reason the brief never witnessed: the spreadsheet workaround had been doing double duty, and losing it meant losing the typo-correction step graders had been performing as a side effect.

3. The check method is the same as the baseline method

Check method: query the time-on-task analytics dashboard.

Same dashboard, same metric. This is the third silent failure. A check method that re-uses the baseline's measurement infrastructure is not a check — it's a recursion. Reality cannot disagree with the prediction because the dashboard cannot disagree with itself.

The check method should have been the same shape as the baseline witnessing: three observation sessions across three named graders, in the field. If the cycle had run that check, the third grader would have surfaced the typo-correction loss in the first 90 seconds.

The fix

Experience snapshot:
  It is Wednesday morning. Gal sits down at 08:50 with her coffee
  and opens the LMS to grade the morning's batch of CS101 finals.
  Of the seven submissions in front of her, four have Hebrew names.
  Each one she opens, she reads, copies the name to a spreadsheet
  to verify it, types her grade, types feedback. By 09:14 she has
  graded one. By 10:00 she has graded four. By 10:30 she stops
  for water and the morning is gone.

Prediction:    Gal completes a grading cycle in under 15 minutes
               of focused work.

Baseline:      47 minutes per cycle (mean), n=12, captured
               2026-04-22 by direct observation of three named
               graders. Note: time-on-task dashboard reports 38 min
               for the same sessions — the gap is alt-tab time
               graders spend correcting Hebrew names.

Target:        <15 minutes (mean) of focused-grading time across
               n>=8 observed cycles.

Check date:    2026-06-15.

Check method:  Three observation sessions across three named
               graders, in the field, with a stopwatch. Time-on-task
               event log used as cross-check, NOT as primary signal.

Owner:         Alex (PO).
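
The check method's last line matters: the event log cross-checks the observation, never the other way around. A minimal sketch of what that cross-check could compute, with hypothetical session numbers and a hypothetical tolerance:

def cross_check(observed, dashboard, tolerance=0.15):
    # Stopwatch minutes are the primary signal; the dashboard is only
    # trusted when its mean tracks the observed mean within `tolerance`.
    obs_mean = sum(observed) / len(observed)
    dash_mean = sum(dashboard) / len(dashboard)
    gap = abs(dash_mean - obs_mean) / obs_mean
    return obs_mean, dash_mean, gap, gap <= tolerance

# Hypothetical post-change sessions, in minutes, n=8 observed cycles.
observed  = [14, 16, 13, 15, 14, 17, 12, 15]
dashboard = [14, 15, 13, 15, 13, 16, 12, 14]
obs_mean, dash_mean, gap, ok = cross_check(observed, dashboard)
print(f"observed {obs_mean:.1f} min, dashboard {dash_mean:.1f} min, "
      f"gap {gap:.0%}, dashboard usable as cross-check: {ok}")

At baseline the same computation would have failed loudly: 47 observed against 38 reported is a 19% gap, which is the dashboard telling you it cannot see the work.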

The diagnosis names three changes. A witnessed baseline alone would have caught the typo-correction loss, the issue behind the support tickets, in the first observation session. It would have been a Discovery finding before code was written.

Where this comes from in the chain

This failure traces to Discovery (Level 2). The brief looks like a Volume II artefact and ticks Volume II's checkboxes, but the underlying observation work was skipped. The cost of that skip was paid downstream: a full Volume IV execution, then a Volume V signal reading that announced success while the customer relationship said otherwise.

The remedy is Volume II discipline:

  • Observation — witness, do not survey.
  • Person & Moment — name the person, name the moment.
  • Assumption Surfacing — the brief should have listed "the LMS time-on-task dashboard time is the graders' actual time" as a not-witnessed assumption.

See also

200apps · How We Work · NWIRE