Scenario Testing

Michi’s scenario approach is inspired by Cem Kaner’s scenario testing methodology. Scenarios are stories about users getting benefits — not feature checklists. They test the relationships among features and end-to-end benefit delivery, which is where integration-boundary bugs live.

What makes a good scenario

Beyond being a story, Kaner names four characteristics of a good scenario:

Motivating — the story matters to a real user
Credible — it could actually happen
Complex — it exercises multiple features or components together
Easy to evaluate — pass/fail is unambiguous

“A user syncs their Claude Code chats and finds them searchable” is a scenario. “POST /api/chat returns 200” is a test case. The scenario tests the whole chain; the test case tests one link.

Given-When-Then decomposition

Each scenario is decomposed into executable steps:

Scenario: User syncs chats and searches for them
  Given the API server and data store are running
  And no previous test data exists
  When a user submits 3 chats via the API
  And one chat contains the word "refactoring" in a message
  Then all 3 chats appear in the data store within 5 seconds
  When the user searches for "refactoring"
  Then exactly 1 result is returned
  And it is the chat containing that word

The Given-When-Then format bridges the gap between story and executable verification. The steps are concrete enough for the agent to run them autonomously (Level A) or for you to run them manually (Level C).
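As a rough illustration, the scenario above could be driven by a script like the one below. This is a minimal sketch under stated assumptions, not Michi's implementation: `FakeChatService`, `submit_chat`, `search`, and `wait_until` are hypothetical names, and the service is an in-memory stand-in for the real API server and data store so that the example runs on its own. The 5-second limit is modeled by polling.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeChatService:
    """Hypothetical in-memory stand-in for the API server plus data store."""
    chats: dict = field(default_factory=dict)

    def reset(self):
        self.chats.clear()

    def submit_chat(self, chat_id, messages):
        self.chats[chat_id] = list(messages)

    def count(self):
        return len(self.chats)

    def search(self, word):
        # Case-insensitive substring match over every message of every chat.
        word = word.lower()
        return sorted(
            chat_id
            for chat_id, messages in self.chats.items()
            if any(word in message.lower() for message in messages)
        )


def wait_until(predicate, timeout_s=5.0, interval_s=0.05):
    """Poll until predicate() is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return predicate()


def run_sync_and_search_scenario(service):
    """Run the scenario and return one (step, passed) pair per Then/And check."""
    # Given the services are running and no previous test data exists
    service.reset()
    # When a user submits 3 chats, one of which mentions "refactoring"
    service.submit_chat("chat-1", ["How do I write a loop?"])
    service.submit_chat("chat-2", ["Let's plan the refactoring of the parser."])
    service.submit_chat("chat-3", ["Explain this stack trace."])
    results = []
    # Then all 3 chats appear in the data store within 5 seconds
    results.append(("all 3 chats stored within 5 s",
                    wait_until(lambda: service.count() == 3)))
    # When the user searches for "refactoring"
    hits = service.search("refactoring")
    # Then exactly 1 result is returned, and it is the matching chat
    results.append(("exactly 1 search result", len(hits) == 1))
    results.append(("result is the matching chat", hits == ["chat-2"]))
    return results


if __name__ == "__main__":
    for step, passed in run_sync_and_search_scenario(FakeChatService()):
        print("PASS" if passed else "FAIL", step)
```

Each Then/And line maps to exactly one boolean, which keeps the result easy to evaluate. Against a real deployment, the fake would be replaced by a client for the actual API and data store.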

Scenario lifecycle

Scenarios are living project assets, not one-time artifacts:

Generate — during planning, co-design with the agent
Prioritize — which scenarios matter most for this milestone?
Allocate — task spec (agent sees them) vs. holdout (agent doesn’t)
Execute — during verification, post-milestone
Assess — during debrief. Did they catch anything? Were they useful?
Evolve — error analysis produces new scenarios; changed behavior updates existing ones

The most valuable scenarios come from actual failures. For each bug found, ask: “what scenario would have caught this?” Then write it.
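One way to keep track of this lifecycle is a small record per scenario. The sketch below is illustrative only: `Stage`, `Scenario`, `scenario_from_bug`, and the bug ID `BUG-17` are hypothetical, and the choice to loop from Evolve back to Generate is an assumption rather than something the methodology prescribes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Stage(Enum):
    GENERATE = 1
    PRIORITIZE = 2
    ALLOCATE = 3
    EXECUTE = 4
    ASSESS = 5
    EVOLVE = 6


@dataclass
class Scenario:
    title: str
    stage: Stage = Stage.GENERATE
    source_bug: Optional[str] = None  # set when the scenario came from a real failure
    bugs_caught: list = field(default_factory=list)

    def advance(self):
        """Move to the next stage. Assumption: EVOLVE wraps around to GENERATE."""
        self.stage = Stage(self.stage.value % len(Stage) + 1)


def scenario_from_bug(bug_id, story):
    """Answer 'what scenario would have caught this?' with a new record."""
    return Scenario(title=story, source_bug=bug_id)


# Hypothetical usage: a field bug becomes a new scenario at the Generate stage.
new_scenario = scenario_from_bug(
    "BUG-17", "User edits a synced chat and search reflects the new text"
)
```

Recording `source_bug` makes it possible, during a later debrief, to see how many scenarios came from actual failures and which of them caught something again.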

Three usage patterns

1. Scenarios as task specifications. Included in the plan doc. The agent works toward them. This improves quality, but the agent can overfit to them.
2. Scenarios as holdout verification. The agent never sees them. Run after the agent declares “done.” Catches “teaching to the test.”
3. Scenarios as exploration guides. Used before any work to discover what tests should exist. Feeds into both patterns 1 and 2.
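The split between task-spec and holdout scenarios can be sketched as a small allocation step. The code below is an assumption-laden example, not a prescribed procedure: `allocate`, `render_plan_section`, the 30% default, and the seeded random split are all hypothetical choices, and a team may prefer to pick holdouts by hand.

```python
import random


def allocate(scenarios, holdout_fraction=0.3, seed=0):
    """Split scenario titles into (task_spec, holdout) lists.

    Assumption: a seeded random split, so the allocation is reproducible.
    """
    if not 0.0 <= holdout_fraction <= 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(scenarios)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def render_plan_section(task_spec):
    """Only task-spec scenarios are written into the plan doc the agent reads."""
    return "\n".join(f"- {title}" for title in task_spec)


scenarios = [
    "User syncs chats and searches for them",
    "User deletes a chat and it disappears from search",
    "User syncs the same chat twice without duplicates",
    "User searches with no matching chats",
]
task_spec, holdout = allocate(scenarios, holdout_fraction=0.5)
plan_text = render_plan_section(task_spec)
```

The holdout list stays outside anything the agent reads and is run only after the agent declares the work done.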

The agent as “disfavored user”

Kaner’s technique #3 asks: “how do disfavored users want to abuse your system?” Apply the same question to the agent itself, treating the abuse as structural rather than malicious:

Circular validation (testing its own assumptions)
Minimal compliance (passing the letter but not the spirit)
Silent scope reduction (doing less than asked without flagging it)
Plausible-looking output (results that look right but aren’t)

Scenarios designed with this lens catch a different class of problem than scenarios designed for end users.
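Two of these failure modes lend themselves to mechanical checks. The sketch below is only an illustration: `find_silent_scope_reduction` and `independently_verified` are hypothetical helpers, and the item names are made up. The first compares what was asked for with what was delivered or explicitly flagged; the second guards against circular validation by recomputing a value outside the agent.

```python
def find_silent_scope_reduction(requested, delivered, flagged_skipped):
    """Return requested items that were neither delivered nor flagged as skipped."""
    missing = set(requested) - set(delivered) - set(flagged_skipped)
    return sorted(missing)


def independently_verified(agent_reported, recompute):
    """Compare the agent's reported value with one recomputed outside the agent."""
    return agent_reported == recompute()


# Hypothetical usage: the agent was asked for three features and delivered two.
dropped = find_silent_scope_reduction(
    requested=["sync", "search", "delete"],
    delivered=["sync", "search"],
    flagged_skipped=[],
)

# The agent reports 3 stored chats; the count is recomputed from separate data.
stored_chat_ids = ["chat-1", "chat-2", "chat-3"]
count_ok = independently_verified(3, lambda: len(stored_chat_ids))
```

Minimal compliance and plausible-looking output usually need scenario-specific checks, so they are not covered here.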

Key Kaner warning

“35% of bugs IBM found in the field had been exposed by tests but not recognized as bugs by the testers.”

Agents are even more susceptible to accepting plausible-looking output. This is why “easy to evaluate” is paramount: if pass/fail is ambiguous, the agent is likely to get it wrong.
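One way to remove that ambiguity is to make every check return an explicit verdict and to treat missing evidence as a failure. This is a minimal sketch; `Verdict` and `evaluate` are hypothetical names, and exact equality is an assumption that will not suit every kind of output.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


_MISSING = object()  # sentinel: no observation was supplied at all


def evaluate(expected, actual=_MISSING):
    """Strict check: pass only on an exact match; missing evidence is a failure."""
    if actual is _MISSING:
        return Verdict.FAIL, "no observed value was supplied"
    if actual == expected:
        return Verdict.PASS, f"observed {actual!r} as expected"
    return Verdict.FAIL, f"expected {expected!r}, observed {actual!r}"
```

Because there is no "looks fine" outcome, a result the agent cannot back with an observed value is recorded as a failure instead of being waved through.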

For more on the methodology, see Cem Kaner’s “An Introduction to Scenario Testing” (2003).