Michi v2026.05.20

Scenario Testing

Michi’s scenario approach is inspired by Cem Kaner’s scenario testing methodology. Scenarios are stories about users getting benefits — not feature checklists. They test the relationships among features and end-to-end benefit delivery, which is where integration-boundary bugs live.

Kaner defines four criteria:

  1. Motivating — the story matters to a real user
  2. Credible — it could actually happen
  3. Complex — it exercises multiple features or components together
  4. Easy to evaluate — pass/fail is unambiguous

“A user syncs their Claude Code chats and finds them searchable” is a scenario. “POST /api/chat returns 200” is a test case. The scenario tests the whole chain; the test case tests one link.

Each scenario is decomposed into executable steps:

Scenario: User syncs chats and searches for them
Given the API server and data store are running
And no previous test data exists
When a user submits 3 chats via the API
And one chat contains the word "refactoring" in a message
Then all 3 chats appear in the data store within 5 seconds
When the user searches for "refactoring"
Then exactly 1 result is returned
And it is the chat containing that word

The Given-When-Then format bridges from story to executable verification. The steps are concrete enough for the agent to run them autonomously (Level A) or for you to run them manually (Level C).
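The steps above could be turned into an automated check roughly as follows. This is a minimal Python sketch, not Michi's implementation: `FakeChatStore`, `submit_chat`, `search`, and the chat IDs are hypothetical stand-ins for the real API server and data store, kept in memory so the example runs on its own.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeChatStore:
    """Hypothetical in-memory stand-in for the API server plus data store."""
    chats: dict[str, list[str]] = field(default_factory=dict)

    def submit_chat(self, chat_id: str, messages: list[str]) -> None:
        self.chats[chat_id] = list(messages)

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match over every message in every chat.
        term = term.lower()
        return [
            chat_id
            for chat_id, messages in self.chats.items()
            if any(term in message.lower() for message in messages)
        ]


def run_sync_and_search_scenario(store: FakeChatStore, timeout_s: float = 5.0) -> bool:
    # Given: no previous test data exists
    store.chats.clear()

    # When: a user submits 3 chats, one of which mentions "refactoring"
    submitted = {
        "chat-1": ["How do I parse JSON?"],
        "chat-2": ["Let's plan the refactoring of the sync module."],
        "chat-3": ["What is a good commit message?"],
    }
    for chat_id, messages in submitted.items():
        store.submit_chat(chat_id, messages)

    # Then: all 3 chats appear in the store within the time limit
    deadline = time.monotonic() + timeout_s
    while set(store.chats) != set(submitted):
        if time.monotonic() > deadline:
            return False
        time.sleep(0.05)

    # When the user searches; Then exactly 1 result, and it is the right chat
    return store.search("refactoring") == ["chat-2"]
```

The function returns a single boolean, which keeps the outcome easy to evaluate. Against a real deployment, the fake store would be replaced by calls to the actual API, and the polling loop is where the 5-second limit would matter.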

Scenarios are living project assets, not one-time artifacts:

  1. Generate — during planning, co-design with the agent
  2. Prioritize — which scenarios matter most for this milestone?
  3. Allocate — task spec (agent sees them) vs. holdout (agent doesn’t)
  4. Execute — during verification, post-milestone
  5. Assess — during debrief. Did they catch anything? Were they useful?
  6. Evolve — error analysis produces new scenarios; changed behavior updates existing ones
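The Assess step can be supported by keeping a small run history per scenario. The sketch below is one possible shape for that record, assuming a hypothetical `ScenarioRun` type and treating any failed run as a sign that the scenario caught something, which in practice would still need a human look to rule out flaky steps.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioRun:
    scenario: str   # scenario title
    milestone: str  # milestone the run belongs to
    passed: bool    # outcome of the run


def assess(runs: list[ScenarioRun]) -> dict[str, dict[str, int]]:
    """Count runs and failures per scenario."""
    summary: dict[str, dict[str, int]] = defaultdict(lambda: {"runs": 0, "failures": 0})
    for run in runs:
        summary[run.scenario]["runs"] += 1
        if not run.passed:
            summary[run.scenario]["failures"] += 1
    return dict(summary)


def never_caught_anything(runs: list[ScenarioRun]) -> list[str]:
    """Scenarios that have only ever passed: candidates for review, not automatic removal."""
    return sorted(name for name, counts in assess(runs).items() if counts["failures"] == 0)
```

A scenario that has never failed is not necessarily useless, since it may guard behavior that simply has not regressed yet. The list is a prompt for the debrief question, not a verdict.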

The most valuable scenarios come from actual failures. For each bug found, ask: “what scenario would have caught this?” Then write it.

Scenarios can be used in three patterns:

  1. Scenarios as task specifications. Included in the plan doc, so the agent works toward them. This improves quality, but the agent can overfit to them.

  2. Scenarios as holdout verification. The agent never sees them. Run after the agent declares “done.” Catches “teaching to the test.”

  3. Scenarios as exploration guides. Used before any work to discover what tests should exist. Feeds into both patterns 1 and 2.
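The split between patterns 1 and 2 comes down to which scenarios reach the plan doc. A minimal sketch of that allocation, assuming hypothetical `Scenario` and `Allocation` types and a plain-text plan doc (none of these names come from Michi itself):

```python
from dataclasses import dataclass
from enum import Enum


class Allocation(Enum):
    TASK_SPEC = "task_spec"  # included in the plan doc, visible to the agent
    HOLDOUT = "holdout"      # withheld until the agent declares "done"


@dataclass(frozen=True)
class Scenario:
    title: str
    steps: tuple[str, ...]
    allocation: Allocation


def render_plan_section(scenarios: list[Scenario]) -> str:
    """Build the scenario text for the plan doc from task-spec scenarios only."""
    lines: list[str] = []
    for scenario in scenarios:
        if scenario.allocation is Allocation.TASK_SPEC:
            lines.append(f"Scenario: {scenario.title}")
            lines.extend(scenario.steps)
            lines.append("")  # blank line between scenarios
    return "\n".join(lines).rstrip()


def holdout_scenarios(scenarios: list[Scenario]) -> list[Scenario]:
    """Scenarios to run only after the agent reports completion."""
    return [s for s in scenarios if s.allocation is Allocation.HOLDOUT]
```

Keeping the filter in one function makes it harder for a holdout scenario to leak into the agent's context by accident.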

Kaner’s technique #3 asks: “how do disfavored users want to abuse your system?” The same question applies to the agent itself, where the abuse is structural rather than malicious:

  • Circular validation (testing its own assumptions)
  • Minimal compliance (passing the letter but not the spirit)
  • Silent scope reduction (doing less than asked without flagging it)
  • Plausible-looking output (results that look right but aren’t)

Scenarios designed with this lens catch a different class of problem than scenarios designed for end users.
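As one illustration, silent scope reduction can be checked mechanically if the requested work is itemized. The sketch below assumes hypothetical item names and a hypothetical function `find_silent_scope_reduction`. It also assumes the `completed` set comes from independent verification rather than the agent's own report, since relying on the report would reintroduce circular validation.

```python
def find_silent_scope_reduction(
    requested: set[str],
    completed: set[str],
    flagged_as_skipped: set[str],
) -> set[str]:
    """Return requested items that were neither completed nor explicitly flagged."""
    return requested - completed - flagged_as_skipped


# Hypothetical milestone: three items requested, one dropped without a word.
requested = {"sync endpoint", "search endpoint", "pagination"}
completed = {"sync endpoint", "search endpoint"}
missing = find_silent_scope_reduction(requested, completed, flagged_as_skipped=set())
```

Here `missing` contains `"pagination"`. Had the agent flagged pagination as skipped, the result would be empty, which is the distinction the scenario is meant to enforce.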

“35% of bugs IBM found in the field had been exposed by tests but not recognized as bugs by the testers.”

Agents are even more susceptible to accepting plausible-looking output. This is why “easy to evaluate” is paramount: if pass/fail is ambiguous, the agent is likely to judge a wrong result as a pass.

For more on the methodology, see Cem Kaner’s “An Introduction to Scenario Testing” (2003).