Michi v2026.05.20
Save the Tokens

Verification

Verification is the most important concern in Michi — and an unsolved problem. Not unsolved in the sense that we have no answer, but in the sense that the answer keeps evolving. Here’s where we are.

In the first experiment: 221 tests passed. Zero failures. Two bugs escaped. Both were caught by a human running the code for real, not by any automated test.

Both bugs were at integration boundaries — the seams between packages where one part of the system assumes something about another. The agent’s tests validated each piece in isolation. The failure existed in the space between pieces.

Dry-run is not verification. A dry-run flag that skips the real API call tests everything except the most important thing. The CLI scanner’s dry-run passed perfectly while the real submission would have failed on schema validation.
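To make the gap concrete, here is a minimal sketch of how a dry-run path can pass while the real submission fails. `FakeApi`, `REQUIRED_FIELDS`, `run_scan`, and the field names are hypothetical stand-ins for illustration; they are not the actual CLI scanner or its schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema enforced by the remote service (an assumption for this sketch).
REQUIRED_FIELDS = {"path", "severity", "rule_id"}


class SchemaError(ValueError):
    """Raised by the fake API when a payload is missing required fields."""


@dataclass
class FakeApi:
    """Stands in for the real service: validation only happens on submit."""

    received: list = field(default_factory=list)

    def submit(self, payload: dict) -> None:
        missing = REQUIRED_FIELDS - set(payload)
        if missing:
            raise SchemaError(f"missing fields: {sorted(missing)}")
        self.received.append(payload)


def run_scan(api: FakeApi, dry_run: bool) -> str:
    # The payload uses "rule" where the server expects "rule_id".
    # Nothing on the client side notices; only the server-side schema does.
    payload = {"path": "src/app.py", "severity": "high", "rule": "R001"}
    if dry_run:
        # Everything above ran, but the one call that validates is skipped.
        return "dry-run ok"
    api.submit(payload)
    return "submitted"
```

With `dry_run=True` the function returns normally; with `dry_run=False` the same payload raises `SchemaError`. The dry run exercised every line except the one that would have exposed the bug.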

Autonomous agent verification needs layers the agent doesn’t control:

| Layer | What | Catches | Who writes |
| --- | --- | --- | --- |
| 1. Unit tests | Fast, after every change | Logic errors, regressions | Agent |
| 2. Integration tests | Real-ish deps (e.g., in-memory DB) | Operator conflicts, query bugs | Agent |
| 3. Smoke tests | Real API call, real data store | Schema mismatches, validation gaps | Human (pre-written) |
| 4. Holdout tests | Agent-invisible behavioral tests | Assumption errors, cross-package gaps | Human (pre-written) |
| 5. Human verification | Manual testing, exploratory | UX issues, visual bugs, workflow gaps | Human |

Layers 1-2 are the agent’s domain. Layer 5 is yours. Layers 3-4 are the gap: verification that runs against real systems and that the agent can’t game. A single real API call per milestone would have caught both escaped bugs immediately.
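One possible shape for a Layer 3 smoke test is a single round trip: send one record through the real client, read it back from the real store, and report any difference. The sketch below is an assumption about how that could look, not Michi’s implementation. `smoke_round_trip` and its `submit` and `fetch` parameters are hypothetical names; in practice a human would wire them to the project’s real API client and data store before the agent starts work.

```python
from typing import Callable


def smoke_round_trip(
    submit: Callable[[dict], str],
    fetch: Callable[[str], dict],
    sample: dict,
) -> list[str]:
    """Send one record through the system and read it back.

    Returns a list of human-readable problems; an empty list means pass.
    """
    try:
        record_id = submit(sample)
    except Exception as exc:  # a smoke test reports the failure, it does not diagnose it
        return [f"submit failed: {exc!r}"]

    stored = fetch(record_id)
    problems: list[str] = []
    for key, expected in sample.items():
        # Compare what was sent with what the store actually kept.
        if stored.get(key) != expected:
            problems.append(f"{key}: sent {expected!r}, stored {stored.get(key)!r}")
    return problems
```

Passing the client and store in as callables keeps the check itself trivial to review, which matters when a human has to trust it without reading the agent’s code. The check is only meaningful when those callables hit the real systems; pointing them at mocks turns it back into a Layer 1 or 2 test.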

The full reference covers each layer in detail, including smoke test design and holdout test principles.

This is a progressive relationship: stronger verification enables more autonomy, which enables more ambitious work, which demands stronger verification. The agent should get only as much autonomy as your verification can cover.

“All tests pass” is necessary but not sufficient — the agent grades its own homework. Verification needs layers the agent didn’t author.

Verification isn’t a post-implementation check; it’s a planning artifact. During the planning phase, you and the agent co-design verification scenarios:

  • Each scenario is a story about a user getting a benefit — not a feature checklist. “A user syncs their chats and finds them searchable” rather than “POST /api/chat returns 200.”
  • Scenarios are decomposed into Given-When-Then steps that the agent can execute.
  • Each is categorized by who can evaluate it:
    • Level A: Agent runs autonomously — API calls, data checks, binary pass/fail
    • Level B: Agent runs, but evaluation needs judgment — output quality, error message clarity
    • Level C: Human evaluates — UX feel, visual design, workflow naturalness
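As a rough illustration of the list above, a scenario could be recorded as a small data structure: a user story, its Given-When-Then steps, and an evaluation level. Everything here is a sketch under assumptions. The class names, the `group_by_level` helper, and the step wording are hypothetical; only the sync story and the three levels come from the text, and attaching the level to the whole scenario rather than to each step is one possible reading.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    A = "agent runs autonomously, binary pass/fail"
    B = "agent runs, evaluation needs judgment"
    C = "human evaluates"


@dataclass(frozen=True)
class Step:
    given: str
    when: str
    then: str


@dataclass(frozen=True)
class Scenario:
    story: str  # a user getting a benefit, not a feature checklist
    level: Level
    steps: tuple[Step, ...]


def group_by_level(scenarios: list[Scenario]) -> dict[Level, list[str]]:
    """Index scenario stories by who can evaluate them."""
    grouped: dict[Level, list[str]] = {level: [] for level in Level}
    for scenario in scenarios:
        grouped[scenario.level].append(scenario.story)
    return grouped


# Illustrative plan; the step details are invented for the example.
plan = [
    Scenario(
        story="A user syncs their chats and finds them searchable",
        level=Level.A,
        steps=(
            Step(
                given="a user with chats that have not been synced",
                when="the user runs a sync",
                then="searching for a phrase from one chat returns that chat",
            ),
        ),
    ),
    Scenario(
        story="A user who hits an error understands how to fix it",
        level=Level.B,
        steps=(),
    ),
    Scenario(
        story="A user finds the sync workflow natural on first use",
        level=Level.C,
        steps=(),
    ),
]
```

Calling `group_by_level(plan)` makes it visible at planning time how much of the plan lands in each level, including how much still depends on a human.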

In the first experiment, all five bugs were found by Level C testing (a human actually using the tool). The lesson: even when Level A coverage is thorough, Level C catches a different class of problem entirely.

More on the A/B/C framework and scenario testing methodology in the reference section.

Designing for testability isn’t polish — it’s what makes verification viable, which makes autonomy viable.

Concrete example: in a verification experiment on a web app, the agent spent significant time discovering how to target UI elements — seven identical “open menu” buttons, custom dropdowns that aren’t native HTML, items visible only when scrolled. Adding data-testid attributes to interactive elements would have eliminated that friction entirely.
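The targeting problem can be shown with a small sketch. The markup below is hypothetical (seven list rows, each with an identically labelled menu button), and `ButtonIndex` is an illustrative helper built on the standard library’s `html.parser`, not a tool from the experiment. It counts how many buttons match a given `aria-label` versus a given `data-testid`.

```python
from html.parser import HTMLParser


class ButtonIndex(HTMLParser):
    """Counts <button> elements by aria-label and by data-testid."""

    def __init__(self) -> None:
        super().__init__()
        self.by_label: dict[str, int] = {}
        self.by_testid: dict[str, int] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "button":
            return
        attr = dict(attrs)
        label = attr.get("aria-label")
        testid = attr.get("data-testid")
        if label:
            self.by_label[label] = self.by_label.get(label, 0) + 1
        if testid:
            self.by_testid[testid] = self.by_testid.get(testid, 0) + 1


# Hypothetical markup: the test id scheme "row-<n>-menu" is an assumption.
rows = "".join(
    f'<li><button aria-label="open menu" data-testid="row-{i}-menu"></button></li>'
    for i in range(7)
)
index = ButtonIndex()
index.feed(f"<ul>{rows}</ul>")

print(index.by_label["open menu"])    # 7: the label alone is ambiguous
print(index.by_testid["row-3-menu"])  # 1: the test id picks out one button
```

A selector that matches seven elements forces the agent to guess or to reason about position; a selector that matches one does not.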

If you can’t write a straightforward test for it, that’s a design signal — not a reason to skip the test.