A/B/C Automatability Levels
Not all verification can be automated. Categorizing scenarios by who evaluates them helps you plan what the agent can do alone, what needs judgment, and what needs a human.
The three levels
| Level | Evaluator | What it covers | Example |
|---|---|---|---|
| A | Agent runs autonomously | Concrete assertions with binary pass/fail | API returns 200, data lands in the store, search results match expected count |
| B | Agent runs, judgment evaluates | Output quality, clarity, reasonableness | Error message is helpful, output format is sensible, generated content is coherent |
| C | Human evaluates | UX feel, visual design, workflow naturalness | The UI feels right, the workflow makes sense, the tool is pleasant to use |
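As a rough illustration of planning by evaluator, the sketch below tags each scenario with a level and groups the scenarios so the workload per evaluator is visible. The `Level`, `Scenario`, and `group_by_level` names are hypothetical and not part of any existing tool; the scenario names are taken from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    A = "agent runs autonomously"
    B = "agent runs, judgment evaluates"
    C = "human evaluates"


@dataclass(frozen=True)
class Scenario:
    name: str
    level: Level


def group_by_level(scenarios):
    """Return {Level: [scenario names]}, preserving the input order within each level."""
    plan = {level: [] for level in Level}
    for scenario in scenarios:
        plan[scenario.level].append(scenario.name)
    return plan


scenarios = [
    Scenario("API returns 200", Level.A),
    Scenario("Error message is helpful", Level.B),
    Scenario("The UI feels right", Level.C),
    Scenario("Data lands in the store", Level.A),
]
plan = group_by_level(scenarios)
for level, names in plan.items():
    print(f"Level {level.name} ({level.value}): {names}")
```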
Level A: deterministic verification
The agent executes the scenario and checks the result. Pass or fail, no ambiguity.
Level A scenarios become mandatory post-milestone verification steps. They run as part of the michi-session verification checklist. If they fail, the milestone isn’t done.
What works well at Level A:
- API endpoint returns expected status codes and payloads
- Data appears in the correct store with the correct shape
- Search/query returns expected results in expected order
- CLI tool produces expected output for given input
- Build succeeds, tests pass, linting clean
What doesn’t work at Level A:
- “The error message is helpful” (who decides “helpful”?)
- “The output looks correct” (what does “looks” mean?)
- “The workflow is intuitive” (says who?)
If you find yourself writing a Level A scenario with vague criteria, it probably belongs at Level B or C.
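A minimal sketch of what Level A checks might look like, assuming the agent has already captured a response, a stored record, and a search result as plain Python data. All function names and the sample data are hypothetical stand-ins. Each check returns a plain boolean, and the gate treats any single failure as "milestone not done".

```python
def check_status(response, expected=200):
    """Pass only if the status code matches exactly."""
    return response["status"] == expected


def check_record_shape(record, required_fields):
    """Pass only if every required field is present with the expected type."""
    return all(
        name in record and isinstance(record[name], expected_type)
        for name, expected_type in required_fields.items()
    )


def check_result_order(results, expected_ids):
    """Pass only if the results carry the expected ids in the expected order."""
    return [r["id"] for r in results] == expected_ids


def milestone_done(check_results):
    """Level A is a gate: one failing check means the milestone is not done."""
    return all(check_results.values())


# Stand-in data; a real run would capture these from the system under test.
response = {"status": 200, "body": {"id": 7, "title": "example record"}}
stored = {"id": 7, "title": "example record"}
search = [{"id": 7}, {"id": 3}]

results = {
    "api returns 200": check_status(response),
    "record has correct shape": check_record_shape(stored, {"id": int, "title": str}),
    "search returns expected order": check_result_order(search, [7, 3]),
}
print(results, "milestone done:", milestone_done(results))
```

None of these checks needs an opinion to evaluate, which is what makes them suitable as mandatory gates.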
Level B: judgment-assisted verification
The agent runs the scenario, but evaluating the result requires judgment: either an LLM judge or heuristic assessment.
Level B is an active research area. Approaches include:
- LLM-as-judge: A separate model evaluates the output against criteria. Known biases include position, verbosity, and self-enhancement. Recent research (Agent-as-a-Judge, ICML 2025) reports that equipping the judge with tool use reduces disagreement with human evaluation from 31% to 0.3%.
- Heuristic checks: Output length, format compliance, presence of expected sections. Not judgment per se, but structured enough to catch gross problems.
Level B scenarios are noted in the plan doc for debrief review. They’re not mandatory gates — they’re signals.
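The heuristic checks described above could be sketched roughly as follows. The function names, the 2000-character limit, and the section titles are illustrative assumptions. The checker returns a list of warnings for the debrief rather than a pass/fail result, in keeping with Level B outputs being signals rather than gates.

```python
import json


def heuristic_signals(output, required_sections, max_chars=2000):
    """Return human-readable warnings; an empty list means no gross problem was found."""
    signals = []
    if not output.strip():
        signals.append("output is empty")
    if len(output) > max_chars:
        signals.append(f"output is {len(output)} chars, over the {max_chars} limit")
    for section in required_sections:
        if section not in output:
            signals.append(f"missing expected section: {section}")
    return signals


def is_valid_json(text):
    """Format compliance check: does the text parse as JSON?"""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Stand-in output; a real run would use whatever the scenario produced.
report = "Summary\nAll imports resolved.\n"
signals = heuristic_signals(report, required_sections=["Summary", "Next steps"])
print(signals)
```

An empty list here does not mean the output is good, only that none of the coarse checks fired.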
Level C: human evaluation
You evaluate the result. There’s no substitute for a human using the software and assessing whether it actually works as intended.
Level C catches problems no other level can:
- UX friction that’s technically correct but feels wrong
- Visual layout issues
- Workflow gaps where the tool works but the sequence is unnatural
- Integration with external systems that can’t be simulated
In the trace tool project, Level C testing found five bugs after all Level A tests passed. Every one was the kind of thing only a human using real hardware would notice.
Promoting scenarios between levels
Over time, scenarios should move from C → B → A by making their evaluation criteria more specific:
- “The search results are relevant” (Level C) → “The top 3 results contain the search term in their title” (Level A)
- “The error message is helpful” (Level B) → “The error message includes the field name and the constraint that was violated” (Level A)
Each promotion makes verification more automatable and more reliable. But some things — “does this feel right?” — stay at Level C. That’s fine. Know which is which.
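Once a criterion is specific enough, it can be written as a deterministic check. The sketch below encodes the two promoted criteria from the list above. The function names and sample data are hypothetical, and two details are assumptions the source does not specify: the term match is case-insensitive, and fewer than three results counts as a failure.

```python
def top_results_contain_term(results, term, n=3):
    """Promoted from 'the search results are relevant'."""
    top = results[:n]
    # Assumption: fewer than n results fails, and matching ignores case.
    return len(top) == n and all(term.lower() in r["title"].lower() for r in top)


def error_names_field_and_constraint(message, field, constraint):
    """Promoted from 'the error message is helpful'."""
    return field in message and constraint in message


# Stand-in data for illustration only.
results = [
    {"title": "Trace export formats"},
    {"title": "Reading a trace file"},
    {"title": "Trace viewer shortcuts"},
    {"title": "Release notes"},
]
message = "Invalid value for 'email': must be at most 254 characters"

print(top_results_contain_term(results, "trace"))
print(error_names_field_and_constraint(message, "email", "at most 254 characters"))
```

The promoted checks are narrower than the original criteria: a message can name the field and constraint and still be unhelpful, so the vaguer version may be worth keeping at its original level alongside the new check.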
Project-type variation
What’s automatable varies by project type:
| Project type | Strong at Level A | Hard to automate |
|---|---|---|
| REST API + DB | API calls, data store queries, schema validation | Auth flows, async processing |
| Web app | API layer, some UI with headless browser | Visual design, UX feel |
| CLI tool | Command execution, output checking | Interactive prompts, env-specific behavior |
| Browser extension | Build succeeds, unit tests | Real browser interaction, content injection |
| Distributed system | Individual service tests | Timing, eventual consistency, multi-service coordination |
The verification approach must be project-aware, not a universal template.