Verification

Verification is the most important concern in Michi — and an unsolved problem. Not unsolved in the sense that we have no answer, but in the sense that the answer keeps evolving. Here’s where we are.

The core finding

In the first experiment: 221 tests passed. Zero failures. Two bugs escaped. Both were caught by a human running the code for real, not by any automated test.

Both bugs were at integration boundaries — the seams between packages where one part of the system assumes something about another. The agent’s tests validated each piece in isolation. The failure existed in the space between pieces.

Dry-run is not verification. A dry-run flag that skips the real API call tests everything except the most important thing. The CLI scanner’s dry-run passed perfectly while the real submission would have failed on schema validation.
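To make the gap concrete, here is a minimal sketch of how a dry-run flag can pass while the real path fails. Every name in it (`submit_scan`, `fake_server_submit`, `REQUIRED_FIELDS`) is hypothetical and chosen for illustration; it is not the CLI scanner's actual code or schema.

```python
# Hypothetical sketch: none of these names come from Michi or the CLI scanner.

REQUIRED_FIELDS = {"id", "source", "findings"}  # assumed server-side schema


class SchemaError(Exception):
    """Raised by the stand-in server when required fields are missing."""


def fake_server_submit(payload: dict) -> dict:
    """Stand-in for the real API, which validates the payload schema."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise SchemaError(f"missing fields: {sorted(missing)}")
    return {"status": "accepted"}


def submit_scan(payload: dict, dry_run: bool = False) -> dict:
    """Client-side submit. In dry-run mode the API call is skipped entirely."""
    if dry_run:
        # Everything before this point ran, but schema validation never did.
        return {"status": "dry-run ok"}
    return fake_server_submit(payload)


payload = {"id": "scan-1", "source": "cli"}  # "findings" is missing

dry_result = submit_scan(payload, dry_run=True)  # succeeds

try:
    submit_scan(payload, dry_run=False)
    real_failed = False
except SchemaError:
    real_failed = True  # the real submission is rejected
```

The dry run exercises argument handling and payload construction, yet the one check that rejects the payload lives on the other side of the call that was skipped.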

The five-layer strategy

Autonomous agent verification needs layers the agent doesn’t control:

Layer	What it is	What it catches	Who writes it
1. Unit tests	Fast, after every change	Logic errors, regressions	Agent
2. Integration tests	Real-ish deps (e.g., in-memory DB)	Operator conflicts, query bugs	Agent
3. Smoke tests	Real API call, real data store	Schema mismatches, validation gaps	Human (pre-written)
4. Holdout tests	Agent-invisible behavioral tests	Assumption errors, cross-package gaps	Human (pre-written)
5. Human verification	Manual testing, exploratory	UX issues, visual bugs, workflow gaps	Human

Layers 1-2 are the agent’s domain. Layer 5 is yours. Layers 3-4 are the gap. They’re verification that uses real systems and that the agent can’t game. A single real API call per milestone would have caught both escaped bugs immediately.
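A layer 3 smoke test can be very small. The sketch below assumes a JSON-over-HTTP API; the function names, the injectable `post` parameter, and the success criterion (any 2xx status) are illustrative assumptions, not part of Michi.

```python
import json
import urllib.request
from typing import Callable


def http_post_json(url: str, payload: dict) -> tuple[int, dict]:
    """Real transport: POST JSON and return (status code, parsed JSON body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


def smoke_test(
    url: str,
    payload: dict,
    post: Callable[[str, dict], tuple[int, dict]] = http_post_json,
) -> list[str]:
    """Make one real submission and return a list of problems (empty = pass)."""
    try:
        status, body = post(url, payload)
    except Exception as exc:  # urllib raises HTTPError on 4xx/5xx responses
        return [f"request failed: {type(exc).__name__}: {exc}"]
    if not 200 <= status < 300:
        return [f"unexpected HTTP status {status}: {body}"]
    return []
```

The `post` parameter exists only so the check itself can be exercised without a network. A smoke test run at a milestone would use the default transport against the real endpoint, since a stubbed transport turns it back into a layer 1 test.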

The full reference covers each layer in detail, including smoke test design and holdout test principles.

Verification governs autonomy

This is a progressive relationship: stronger verification enables more autonomy, which enables more ambitious work, which demands stronger verification. You can only give the agent as much autonomy as your verification can backstop.

“All tests pass” is necessary but not sufficient — the agent grades its own homework. Verification needs layers the agent didn’t author.
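One way to encode this rule is a milestone gate that refuses to count agent-authored results alone. This is a sketch under assumptions: the `LayerResult` record and the `milestone_verified` function are hypothetical names, and requiring at least one non-agent layer is one possible policy rather than a prescribed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerResult:
    layer: str    # e.g. "unit", "integration", "smoke", "holdout"
    author: str   # "agent" or "human"
    passed: bool


def milestone_verified(results: list[LayerResult]) -> bool:
    """True only if every layer passed and at least one was not agent-authored."""
    if not results or not all(r.passed for r in results):
        return False
    return any(r.author != "agent" for r in results)


agent_only = [
    LayerResult("unit", "agent", True),
    LayerResult("integration", "agent", True),
]
with_smoke = agent_only + [LayerResult("smoke", "human", True)]

agent_only_ok = milestone_verified(agent_only)  # all green, still not verified
with_smoke_ok = milestone_verified(with_smoke)  # an independent layer passed
```

Under this policy a fully green agent-written suite is treated as necessary but not sufficient, which matches the point above.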

Co-designed scenarios

Verification isn’t a post-implementation check; it’s a planning artifact. During the planning phase, you and the agent co-design the verification scenarios:

- Each scenario is a story about a user getting a benefit, not a feature checklist: “A user syncs their chats and finds them searchable” rather than “POST /api/chat returns 200.”
- Scenarios are decomposed into Given-When-Then steps that the agent can execute.
- Each is categorized by who can evaluate it:
  - Level A: Agent runs autonomously (API calls, data checks, binary pass/fail)
  - Level B: Agent runs, but evaluation needs judgment (output quality, error message clarity)
  - Level C: Human evaluates (UX feel, visual design, workflow naturalness)
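A scenario with its Given-When-Then steps and evaluation level could be recorded along these lines. The `Scenario` and `Level` types and the individual steps are illustrative assumptions; only the story wording and the three levels come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    A = "agent runs and evaluates"
    B = "agent runs, evaluation needs judgment"
    C = "human evaluates"


@dataclass
class Scenario:
    story: str        # the user benefit, in plain language
    given: list[str]  # preconditions
    when: list[str]   # actions
    then: list[str]   # expected outcomes
    level: Level


def agent_runnable(scenarios: list[Scenario]) -> list[Scenario]:
    """Scenarios the agent can execute itself (Levels A and B)."""
    return [s for s in scenarios if s.level in (Level.A, Level.B)]


scenarios = [
    Scenario(
        story="A user syncs their chats and finds them searchable",
        given=["an account with existing chats"],  # hypothetical step
        when=["the user runs a sync", "the user searches for a known phrase"],
        then=["the matching chat appears in the results"],
        level=Level.A,
    ),
    Scenario(
        story="A user finds the sync workflow natural",
        given=["a first-time user"],               # hypothetical step
        when=["the user sets up sync without documentation"],
        then=["the user completes setup without confusion"],
        level=Level.C,
    ),
]

runnable = agent_runnable(scenarios)
```

Keeping the level on the scenario itself makes it explicit, at planning time, which checks still need a human.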

In the scenarios from the first experiment, all five bugs were found by Level C testing (a human actually using the tool). The lesson: even when Level A coverage is thorough, Level C catches a different class of problem entirely.

More on the A/B/C framework and scenario testing methodology in the reference section.

Testability is infrastructure

Designing for testability isn’t polish — it’s what makes verification viable, which makes autonomy viable.

Concrete example: in a verification experiment on a web app, the agent spent significant time discovering how to target UI elements: seven identical “open menu” buttons, custom dropdowns that aren’t native HTML, and items visible only after scrolling. Adding data-testid attributes to interactive elements would have eliminated that friction entirely.
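The effect of data-testid can be shown with nothing but Python's standard-library HTML parser. The markup, the test ID values, and the `ElementIndex` class are made up for illustration, and this is not the tooling used in the experiment; a browser automation tool would use its own selector for the same attribute.

```python
from html.parser import HTMLParser


class ElementIndex(HTMLParser):
    """Indexes start tags by data-testid and records their aria-labels."""

    def __init__(self) -> None:
        super().__init__()
        self.by_testid: dict[str, str] = {}  # data-testid -> tag name
        self.labels: list[str] = []          # every aria-label seen

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "aria-label" in attr_map:
            self.labels.append(attr_map["aria-label"])
        testid = attr_map.get("data-testid")
        if testid is not None:
            self.by_testid[testid] = tag


# Hypothetical markup: two rows whose menu buttons look identical by label.
PAGE = """
<ul>
  <li><button aria-label="open menu" data-testid="chat-1-menu">...</button></li>
  <li><button aria-label="open menu" data-testid="chat-2-menu">...</button></li>
</ul>
"""

index = ElementIndex()
index.feed(PAGE)

ambiguous = index.labels.count("open menu")  # the label alone matches both
target = index.by_testid["chat-2-menu"]      # the test ID matches exactly one
```

The label identifies a kind of control, while the test ID identifies one specific control, which is what a test needs to target.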

If you can’t write a straightforward test for it, that’s a design signal — not a reason to skip the test.