A/B/C Automatability Levels

Not all verification can be automated. Categorizing scenarios by who evaluates them helps you plan what the agent can do alone, what needs judgment, and what needs a human.

The three levels

Level	Evaluator	What it covers	Example
A	Agent runs autonomously	Concrete assertions with binary pass/fail	API returns 200, data lands in the store, search results match expected count
B	Agent runs, judgment evaluates	Output quality, clarity, reasonableness	Error message is helpful, output format is sensible, generated content is coherent
C	Human evaluates	UX feel, visual design, workflow naturalness	The UI feels right, the workflow makes sense, the tool is pleasant to use

Level A: deterministic verification

The agent executes the scenario and checks the result. Pass or fail, no ambiguity.

Level A scenarios become mandatory post-milestone verification steps. They run as part of the michi-session verification checklist. If they fail, the milestone isn’t done.

What works well at Level A:

API endpoint returns expected status codes and payloads
Data appears in the correct store with the correct shape
Search/query returns expected results in expected order
CLI tool produces expected output for given input
Build succeeds, tests pass, lint is clean
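As a rough illustration of what a binary Level A check might look like, the sketch below verifies a status code and that data landed in a store with the expected shape. The `Response` type, the `check_create_item` function, and the `id`/`name` fields are hypothetical stand-ins, not part of any real client or of the project described here.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response; not a real client type."""
    status: int
    payload: dict


def check_create_item(response: Response, store: dict) -> list[str]:
    """Return a list of failures; an empty list means the scenario passed."""
    failures = []
    if response.status != 200:
        failures.append(f"expected status 200, got {response.status}")
    item_id = response.payload.get("id")
    if item_id is None:
        failures.append("payload is missing 'id'")
    elif item_id not in store:
        failures.append(f"item {item_id!r} did not land in the store")
    elif set(store[item_id]) != {"id", "name"}:
        # Assumed shape for illustration: every stored item has exactly these keys.
        failures.append(f"item {item_id!r} has unexpected shape: {sorted(store[item_id])}")
    return failures


# Simulated run: in a real scenario the response and store come from the system under test.
store = {"42": {"id": "42", "name": "widget"}}
ok = check_create_item(Response(200, {"id": "42"}), store)
bad = check_create_item(Response(500, {}), store)
print("PASS" if not ok else f"FAIL: {ok}")
print("PASS" if not bad else f"FAIL: {bad}")
```

Each condition is either met or not, so the agent can decide pass or fail without interpretation.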

What doesn’t work at Level A:

“The error message is helpful” (who decides “helpful”?)
“The output looks correct” (what does “looks” mean?)
“The workflow is intuitive” (says who?)

If you find yourself writing a Level A scenario with vague criteria, it probably belongs at Level B or C.

Level B: judgment-assisted verification

The agent runs the scenario, but evaluating the result requires judgment, supplied by an LLM judge or approximated by heuristic checks.

Level B is an active research area. Approaches include:

LLM-as-judge: A separate model evaluates the output against stated criteria. Known biases include position bias, verbosity bias, and self-enhancement bias. Recent research (Agent-as-a-Judge, ICML 2025) reports that equipping the judge with tool use reduces disagreement with human evaluation from 31% to 0.3%.
Heuristic checks: Output length, format compliance, presence of expected sections. Not judgment per se, but structured enough to catch gross problems.
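The heuristic checks listed above could be sketched roughly as follows. The function names, the default thresholds, and the choice of JSON as the "format" are assumptions made for illustration; a real project would pick its own limits and formats.

```python
import json


def is_valid_json(text: str) -> bool:
    """Format compliance check, assuming the expected format is JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def heuristic_signals(output: str, required_sections: list[str],
                      min_chars: int = 50, max_chars: int = 5000,
                      expect_json: bool = False) -> dict[str, bool]:
    """Cheap structural checks on generated output. True means the check passed."""
    signals = {
        "length_in_range": min_chars <= len(output) <= max_chars,
        "has_required_sections": all(s in output for s in required_sections),
    }
    if expect_json:
        signals["valid_json"] = is_valid_json(output)
    return signals


report = (
    "Summary\nAll checks ran.\n\n"
    "Details\nThree endpoints were exercised and every one responded."
)
signals = heuristic_signals(report, ["Summary", "Details"])
print(signals)
```

These checks only catch gross problems such as an empty or truncated output. They say nothing about whether the content is actually clear or coherent.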

Level B scenarios are noted in the plan doc for debrief review. They’re not mandatory gates — they’re signals.
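One way to picture the difference between a gate and a signal is the minimal sketch below. The `Level`, `Result`, and `milestone_status` names are hypothetical, as is the sample judge note; this is not how the michi-session checklist is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    A = "agent runs autonomously"
    B = "agent runs, judgment evaluates"
    C = "human evaluates"


@dataclass
class Result:
    name: str
    level: Level
    passed: Optional[bool]  # None means not yet evaluated
    note: str = ""


def milestone_status(results: list[Result]) -> tuple[bool, list[str]]:
    """Level A results gate the milestone; Level B results become debrief notes."""
    # An unevaluated Level A scenario (passed=None) also blocks the milestone.
    done = all(r.passed for r in results if r.level is Level.A)
    debrief = [f"{r.name}: {r.note}" for r in results if r.level is Level.B]
    return done, debrief


results = [
    Result("API returns 200", Level.A, True),
    Result("error message is helpful", Level.B, False, "judge scored 2/5"),
]
done, debrief = milestone_status(results)
print(done, debrief)
```

In this sketch a poor Level B score never flips `done` to false. It only shows up in the list of notes carried into the debrief.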

Level C: human evaluation

You evaluate the result. There’s no substitute for a human using the software and assessing whether it actually works as intended.

Level C catches problems no other level can:

UX friction that’s technically correct but feels wrong
Visual layout issues
Workflow gaps where the tool works but the sequence is unnatural
Integration with external systems that can’t be simulated

In the trace tool project, Level C testing found five bugs after all Level A tests passed. Every one was the kind of thing only a human using real hardware would notice.

Promoting scenarios between levels

Over time, move scenarios from C → B → A by making their evaluation criteria more specific:

“The search results are relevant” (Level C) → “The top 3 results contain the search term in their title” (Level A)
“The error message is helpful” (Level B) → “The error message includes the field name and the constraint that was violated” (Level A)
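Once criteria are this specific, they translate almost directly into code. The sketch below encodes the two promoted criteria above; the function names, the `title` key, and the sample data are assumptions for illustration.

```python
def top_results_contain_term(results: list[dict], term: str, k: int = 3) -> bool:
    """True if there are at least k results and each of the first k has the term in its title."""
    top = results[:k]
    return len(top) == k and all(term.lower() in r["title"].lower() for r in top)


def error_names_field_and_constraint(message: str, field: str, constraint: str) -> bool:
    """True if the error message mentions both the field and the violated constraint."""
    return field in message and constraint in message


results = [
    {"title": "Trace viewer setup"},
    {"title": "Reading a trace"},
    {"title": "Trace export formats"},
    {"title": "FAQ"},
]
print(top_results_contain_term(results, "trace"))
print(error_names_field_and_constraint(
    "email: must be at most 254 characters", "email", "at most 254 characters"))
```

The title match is case-insensitive and fails when fewer than `k` results come back. Both are design choices a real scenario would need to state explicitly, which is part of what makes it Level A.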

Each promotion makes verification more automatable and more reliable. But some things — “does this feel right?” — stay at Level C. That’s fine. Know which is which.

Project-type variation

What’s automatable varies by project type:

Project type	Strong at Level A	Hard to automate
REST API + DB	API calls, data store queries, schema validation	Auth flows, async processing
Web app	API layer, some UI with headless browser	Visual design, UX feel
CLI tool	Command execution, output checking	Interactive prompts, env-specific behavior
Browser extension	Build succeeds, unit tests	Real browser interaction, content injection
Distributed system	Individual service tests	Timing, eventual consistency, multi-service coordination

The verification approach must be project-aware, not a universal template.