Michi v2026.05.20

A/B/C Automatability Levels

Not all verification can be automated. Categorizing scenarios by who evaluates them helps you plan what the agent can do alone, what needs judgment, and what needs a human.

| Level | Evaluator | What it covers | Example |
|-------|-----------|----------------|---------|
| A | Agent runs autonomously | Concrete assertions with binary pass/fail | API returns 200, data lands in the store, search results match expected count |
| B | Agent runs, judgment evaluates | Output quality, clarity, reasonableness | Error message is helpful, output format is sensible, generated content is coherent |
| C | Human evaluates | UX feel, visual design, workflow naturalness | The UI feels right, the workflow makes sense, the tool is pleasant to use |

At Level A, the agent executes the scenario and checks the result. Pass or fail, no ambiguity.

Level A scenarios become mandatory post-milestone verification steps. They run as part of the michi-session verification checklist. If they fail, the milestone isn’t done.

What works well at Level A:

  • API endpoint returns expected status codes and payloads
  • Data appears in the correct store with the correct shape
  • Search/query returns expected results in expected order
  • CLI tool produces expected output for given input
  • Build succeeds, tests pass, linting clean
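As an illustration of the "CLI tool produces expected output" case, here is a minimal sketch of a Level A check in Python. The function name `check_cli_output` is hypothetical, and the Python interpreter stands in for a real CLI so the example runs anywhere. The only point is that the result is a plain boolean with no room for interpretation.

```python
import subprocess
import sys


def check_cli_output(argv: list[str], expected_stdout: str) -> bool:
    """Return True only if the command exits 0 and prints exactly the expected text."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    # Both conditions are binary: exit status and exact (whitespace-trimmed) output.
    return result.returncode == 0 and result.stdout.strip() == expected_stdout


# Stand-in for a real CLI: the Python interpreter printing a fixed string.
passed = check_cli_output([sys.executable, "-c", "print('ok')"], "ok")
print("PASS" if passed else "FAIL")
```

A check like this can gate a milestone because two runs against the same build should not disagree about the verdict.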

What doesn’t work at Level A:

  • “The error message is helpful” (who decides “helpful”?)
  • “The output looks correct” (what does “looks” mean?)
  • “The workflow is intuitive” (says who?)

If you find yourself writing a Level A scenario with vague criteria, it probably belongs at Level B or C.

At Level B, the agent runs the scenario, but evaluating the result requires judgment: either an LLM judge or a heuristic assessment.

Level B is an active research area. Approaches include:

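  • LLM-as-judge: A separate model evaluates the output against criteria. Known biases include position bias (favoring an answer because of where it appears), verbosity bias (favoring longer answers), and self-enhancement bias (favoring the judge model's own output). Recent research (Agent-as-a-Judge, ICML 2025) reports that equipping the judge with tool use reduces disagreement with human evaluation from 31% to 0.3%.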
  • Heuristic checks: Output length, format compliance, presence of expected sections. Not judgment per se, but structured enough to catch gross problems.
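The heuristic checks above can be sketched roughly as follows. This assumes, purely for illustration, that the output under review is a JSON object and that "expected sections" means top-level keys. `HeuristicReport`, `heuristic_check`, and the 2000-character limit are hypothetical choices, not part of any existing tool.

```python
import json
from dataclasses import dataclass


@dataclass
class HeuristicReport:
    within_length: bool
    valid_format: bool
    missing_sections: list[str]

    @property
    def looks_ok(self) -> bool:
        """True when no gross problem was detected; says nothing about quality."""
        return self.within_length and self.valid_format and not self.missing_sections


def heuristic_check(output: str, required_keys: list[str], max_chars: int = 2000) -> HeuristicReport:
    """Check length, format compliance (a JSON object), and presence of expected keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None
    valid = isinstance(parsed, dict)
    sections = parsed if valid else {}
    missing = [key for key in required_keys if key not in sections]
    return HeuristicReport(len(output) <= max_chars, valid, missing)


report = heuristic_check('{"summary": "done", "steps": []}', ["summary", "steps"])
print(report.looks_ok, report.missing_sections)
```

A passing report only means the output is well-formed enough to be worth a closer look from a judge or a person.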

Level B scenarios are noted in the plan doc for debrief review. They’re not mandatory gates — they’re signals.
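One way to picture the difference between a gate and a signal is the sketch below. The types `Level` and `ScenarioResult` and the function `summarize` are hypothetical and are not part of michi-session. The sketch also assumes that a Level A scenario that has not run yet counts as blocking, which the text above does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Level(Enum):
    A = "agent evaluates, binary pass/fail"
    B = "agent runs, judgment evaluates"
    C = "human evaluates"


@dataclass
class ScenarioResult:
    name: str
    level: Level
    passed: Optional[bool]  # None means "not evaluated yet"


def summarize(results: list[ScenarioResult]) -> dict:
    """Level A failures block the gate; Level B failures are only collected for debrief."""
    # Assumption: an unevaluated Level A scenario (passed is None) also blocks.
    blocking = [r.name for r in results if r.level is Level.A and r.passed is not True]
    signals = [r.name for r in results if r.level is Level.B and r.passed is False]
    return {
        "level_a_gate_passed": not blocking,
        "blocking": blocking,
        "debrief_signals": signals,
    }


results = [
    ScenarioResult("API returns 200", Level.A, True),
    ScenarioResult("Error message is helpful", Level.B, False),
]
print(summarize(results))
```

In this sketch the failed Level B scenario shows up under `debrief_signals` but leaves `level_a_gate_passed` untouched.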

At Level C, you evaluate the result. There’s no substitute for a human using the software and assessing whether it actually works as intended.

Level C catches problems no other level can:

  • UX friction that’s technically correct but feels wrong
  • Visual layout issues
  • Workflow gaps where the tool works but the sequence is unnatural
  • Integration with external systems that can’t be simulated

In the trace tool project, Level C testing found five bugs after all Level A tests passed. Every one was the kind of thing only a human using real hardware would notice.

Over time, scenarios should move from C → B → A by making their evaluation criteria more specific:

  • “The search results are relevant” (Level C) → “The top 3 results contain the search term in their title” (Level A)
  • “The error message is helpful” (Level B) → “The error message includes the field name and the constraint that was violated” (Level A)
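The two promoted criteria above are specific enough to write as assertions. The sketch below is one possible reading of them. The function names are hypothetical, and the case-insensitive matching and the requirement that at least `k` results exist are assumptions added for the example.

```python
def top_results_contain_term(titles: list[str], term: str, k: int = 3) -> bool:
    """True if there are at least k results and each of the top k titles contains the term."""
    top = titles[:k]
    return len(top) == k and all(term.lower() in title.lower() for title in top)


def error_names_field_and_constraint(message: str, field: str, constraint: str) -> bool:
    """True if the message mentions both the offending field and the violated constraint."""
    text = message.lower()
    return field.lower() in text and constraint.lower() in text


titles = ["Trace viewer setup", "Reading a trace file", "Trace export formats", "FAQ"]
print(top_results_contain_term(titles, "trace"))

message = "Field 'email' is invalid: must be at most 254 characters"
print(error_names_field_and_constraint(message, "email", "at most 254 characters"))
```

Neither function judges whether the results are relevant or the message is helpful. Each one checks a narrower, concrete proxy, and that narrowing is what the promotion consists of.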

Each promotion makes verification more automatable and more reliable. But some things — “does this feel right?” — stay at Level C. That’s fine. Know which is which.

What’s automatable varies by project type:

| Project type | Strong at Level A | Hard to automate |
|--------------|-------------------|------------------|
| REST API + DB | API calls, data store queries, schema validation | Auth flows, async processing |
| Web app | API layer, some UI with headless browser | Visual design, UX feel |
| CLI tool | Command execution, output checking | Interactive prompts, env-specific behavior |
| Browser extension | Build succeeds, unit tests | Real browser interaction, content injection |
| Distributed system | Individual service tests | Timing, eventual consistency, multi-service coordination |

The verification approach must be project-aware, not a universal template.