Scenario Testing
Michi’s scenario approach is inspired by Cem Kaner’s scenario testing methodology. Scenarios are stories about users getting benefits — not feature checklists. They test the relationships among features and end-to-end benefit delivery, which is where integration-boundary bugs live.
What makes a good scenario
Kaner defines four criteria:
- Motivating — the story matters to a real user
- Credible — it could actually happen
- Complex — it exercises multiple features or components together
- Easy to evaluate — pass/fail is unambiguous
“A user syncs their Claude Code chats and finds them searchable” is a scenario. “POST /api/chat returns 200” is a test case. The scenario tests the whole chain; the test case tests one link.
Given-When-Then decomposition
Each scenario is decomposed into executable steps:

    Scenario: User syncs chats and searches for them
      Given the API server and data store are running
      And no previous test data exists
      When a user submits 3 chats via the API
      And one chat contains the word "refactoring" in a message
      Then all 3 chats appear in the data store within 5 seconds
      When the user searches for "refactoring"
      Then exactly 1 result is returned
      And it is the chat containing that word

The Given-When-Then format bridges from story to executable verification. The steps are concrete enough for the agent to run them autonomously (Level A) or for you to run them manually (Level C).
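One way to make these steps executable is a plain Python function per scenario. The sketch below is illustrative only: `InMemoryChatStore`, `submit_chat`, and `search` are hypothetical stand-ins for the real API server and data store, and the "within 5 seconds" check is left out because the fake store is synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryChatStore:
    """Hypothetical stand-in for the real API server plus data store."""

    chats: dict[str, list[str]] = field(default_factory=dict)

    def submit_chat(self, chat_id: str, messages: list[str]) -> None:
        self.chats[chat_id] = list(messages)

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match over every message in each chat.
        needle = term.lower()
        return [
            chat_id
            for chat_id, messages in self.chats.items()
            if any(needle in message.lower() for message in messages)
        ]


def run_sync_and_search_scenario(store: InMemoryChatStore) -> bool:
    # Given: no previous test data exists
    store.chats.clear()

    # When: a user submits 3 chats, one of which mentions "refactoring"
    store.submit_chat("chat-1", ["How do I write a unit test?"])
    store.submit_chat("chat-2", ["Let's start refactoring the parser."])
    store.submit_chat("chat-3", ["What does this stack trace mean?"])

    # Then: all 3 chats appear in the data store
    if len(store.chats) != 3:
        return False

    # When: the user searches for "refactoring"
    results = store.search("refactoring")

    # Then: exactly 1 result is returned, and it is the chat with that word
    return results == ["chat-2"]


if __name__ == "__main__":
    passed = run_sync_and_search_scenario(InMemoryChatStore())
    print("PASS" if passed else "FAIL")
```

Each comment maps to one Given, When, or Then line, so a failure points back to a specific step of the story. Against a real deployment, the store calls would be replaced by API requests and the first Then step would poll until the 5-second deadline.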
Scenario lifecycle
Scenarios are living project assets, not one-time artifacts:
- Generate — during planning, co-design with the agent
- Prioritize — which scenarios matter most for this milestone?
- Allocate — task spec (agent sees them) vs. holdout (agent doesn’t)
- Execute — during verification, post-milestone
- Assess — during debrief. Did they catch anything? Were they useful?
- Evolve — error analysis produces new scenarios; changed behavior updates existing ones
The most valuable scenarios come from actual failures. For each bug found, ask: “what scenario would have caught this?” Then write it.
Three usage patterns
1. Scenarios as task specifications. Included in the plan doc. The agent works toward them. Improves quality but the agent can over-fit to them.
2. Scenarios as holdout verification. The agent never sees them. Run after the agent declares “done.” Catches “teaching to the test.”
3. Scenarios as exploration guides. Used before any work to discover what tests should exist. Feeds into both patterns 1 and 2.
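The split between task-spec and holdout scenarios can be made explicit in data, so that holdout scenarios cannot leak into the plan doc by accident. This is a minimal sketch, not Michi's actual format: `Allocation`, `Scenario`, and both helper functions are hypothetical names, and the second example scenario is invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Allocation(Enum):
    TASK_SPEC = "task_spec"  # included in the plan doc; the agent sees it
    HOLDOUT = "holdout"      # withheld; run after the agent declares "done"


@dataclass(frozen=True)
class Scenario:
    name: str
    allocation: Allocation


def scenarios_for_plan_doc(scenarios: list[Scenario]) -> list[str]:
    """Return only the scenario names the agent is allowed to see."""
    return [s.name for s in scenarios if s.allocation is Allocation.TASK_SPEC]


def holdout_scenarios(scenarios: list[Scenario]) -> list[str]:
    """Return the scenario names reserved for verification after "done"."""
    return [s.name for s in scenarios if s.allocation is Allocation.HOLDOUT]


# Hypothetical example data.
SCENARIOS = [
    Scenario("User syncs chats and finds them searchable", Allocation.TASK_SPEC),
    Scenario("Search ignores chats deleted before sync", Allocation.HOLDOUT),
]

if __name__ == "__main__":
    print("Plan doc:", scenarios_for_plan_doc(SCENARIOS))
    print("Holdout:", holdout_scenarios(SCENARIOS))
```

Keeping the allocation on the scenario itself means the same list can serve both patterns: one filter builds the plan doc, the other builds the post-milestone verification run.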
The agent as “disfavored user”
Kaner’s technique #3 asks: “how do disfavored users want to abuse your system?” Applied to the agent itself, the abuse is structural rather than malicious:
- Circular validation (testing its own assumptions)
- Minimal compliance (passing the letter but not the spirit)
- Silent scope reduction (doing less than asked without flagging it)
- Plausible-looking output (results that look right but aren’t)
Scenarios designed with this lens catch a different class of problem than scenarios designed for end users.
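As one illustration of this lens, silent scope reduction can be checked mechanically when the requested work is written down as a list. The sketch below assumes three hypothetical inputs (what was requested, what the agent reports as completed, and what it explicitly flagged as skipped); the function name and task labels are invented for the example.

```python
def find_silent_scope_reduction(
    requested: set[str],
    completed: set[str],
    flagged_skipped: set[str],
) -> list[str]:
    """Return requested items that were neither completed nor explicitly flagged."""
    return sorted(requested - completed - flagged_skipped)


if __name__ == "__main__":
    # Hypothetical task labels.
    requested = {"add endpoint", "write migration", "update docs"}
    completed = {"add endpoint"}
    flagged_skipped = {"update docs"}  # the agent said it skipped this one

    missing = find_silent_scope_reduction(requested, completed, flagged_skipped)
    print(missing)  # items dropped without being flagged
```

Skipping work is acceptable in this check as long as it is flagged; only work that disappears without a note is reported.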
Key Kaner warning
“35% of bugs IBM found in the field had been exposed by tests but not recognized as bugs by the testers.”
Agents are even more susceptible to accepting plausible-looking output. This is why “easy to evaluate” is paramount — if pass/fail is ambiguous, the agent will get it wrong.
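The difference between an ambiguous and an unambiguous check can be shown in a few lines. In this sketch, `weak_check` and `strict_check` are hypothetical names and the chat IDs are invented; the point is that a plausible-looking result passes the first and fails the second.

```python
def weak_check(results: list[str]) -> bool:
    # Ambiguous: any non-empty result list "looks right".
    return len(results) > 0


def strict_check(results: list[str], expected: list[str]) -> bool:
    # Unambiguous: the results must match the expected chat IDs exactly.
    return results == expected


if __name__ == "__main__":
    # The search should return only chat-2, but also returns chat-1.
    plausible_but_wrong = ["chat-1", "chat-2"]

    print(weak_check(plausible_but_wrong))                # True
    print(strict_check(plausible_but_wrong, ["chat-2"]))  # False
```

A scenario whose Then step is written as "exactly 1 result is returned, and it is the chat containing that word" corresponds to the strict form.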
For more on the methodology, see Cem Kaner’s “An Introduction to Scenario Testing” (2003).