The Problem Michi Solves

The agent grades its own homework

Here’s what happened in the first michi (dark factory) experiment: the agent wrote code across nine milestones. It wrote 221 tests. All tests passed. Zero failures.

Then a human ran the code for real. Two bugs.

Bug 1: The agent added a new service type (claude-code) in a CLI scanner, but didn’t add it to the Zod validation enum in a downstream API package. Unit tests passed because they were mocked — they validated the agent’s code against the agent’s understanding of the schema.

Bug 2: A source field was hardcoded to chat-extension in a shared service. When the CLI scanner submitted data, bookmarks appeared with the wrong source. Same pattern — a new caller, a downstream assumption about a single caller.

Both were one-line fixes. Neither was hard. But they illustrate a systemic problem: tests written by the agent validate the agent’s implementation against the agent’s understanding. This is circular. Adding more tests doesn’t break the circle — it just makes it a bigger circle.

A single real API call — not mocked, not dry-run — would have caught both bugs immediately.

Why CLAUDE.md isn’t enough

CLAUDE.md is important. It tells the agent your conventions, your preferences, your project structure. But rules without enforcement degrade.

After the first milestone’s unit tests passed, the agent declared the work complete. The human pushed back: “this is an MVP for using a dark software factory, please continue till M1 is complete.” The agent then found five gaps it had skipped: database indexes, integration tests, doc accuracy issues, an outdated plan doc, and a missing auth test scenario.

The agent’s definition of “done” was “code written + tests pass.” The human’s definition was “all verification scenarios covered + docs updated + manual check completed.”

CLAUDE.md can say “run all verification before claiming done.” But in an 8-hour session, under context compaction, the agent’s incentive is to finish — not to re-read the rules and check each one. Structure enforces what rules suggest.

The human’s role changes

Michi doesn’t eliminate the human. It changes the role from writer to editor, from driver to navigator.

Analysis of the first experiment revealed five categories of human intervention, ranked by value:

Course corrections — single-sentence redirections that saved entire cycles. “Use the 100 most recent by bookmark date” prevented loading 4,898 documents. “Continue till M1 is complete” caught five gaps.
Process direction — setting autonomy level, communication channels, review checkpoints. The instructions for the autonomous phase worked because scope, process, escalation path, and review were all explicit in one message.
External knowledge — pointing to tools and resources outside the agent’s search scope. A conversation exporter in a sibling repo solved exactly the problem the agent was building from scratch. The agent never looked outside its workspace.
Visual QA — the human was the visual verification layer. Each screenshot led to a specific fix. (The agent couldn’t authenticate with the app’s auth provider.)
Git operations — branch management needed ~15 exchanges and still resulted in a swap. Scripts would have eliminated this entirely.

The highest-value interventions were brief, specific, and came at moments where the agent had made a wrong assumption. That’s the pattern: the human catches what the agent can’t see, at the moment it matters.

What Michi does about it

Michi addresses these problems with structure:

Verification layers beyond what the agent writes — so “all tests pass” isn’t the only gate. Read more →
Plan docs as contracts — so the definition of “done” is explicit before work begins, not negotiated after. Read more →
Decision logging — so autonomous choices are visible, reviewable, and reversible.
An iteration cycle — so each session builds on the last, and verification improves as the project matures. Read more →
Modes of practice — so you can calibrate how much autonomy the agent gets based on demonstrated reliability. Read more →

None of this is magic. It’s discipline — applied consistently, enforced by structure, improved through iteration.