Why and How I Built Michi

The Question

Here’s the thing about AI coding agents: they’re powerful and unreliable at scale, in the same breath.

For years, the workflow was closely paired: human drives, agent assists. Cursor, then Codex, then Claude Code. Each generation shifted the balance a little. You could trust the agent with bigger pieces. But “trust” still meant “I’m watching.”

In late 2025, something shifted. Claude got substantially better, not in one area but everywhere at once: model quality, skills, plugins, subagents. AI help flipped from friction to clear leverage. Custom skills for common patterns. A sync tool to keep configurations consistent across machines. Daily use, across projects, always in paired mode.

Then in March 2026, StrongDM published their approach to running autonomous AI agents — software “dark factories” that run without human supervision. Not a platform org with dedicated tooling teams; a regular engineering team. They’d done it. That made it feel possible for a solo practitioner.

The question wasn’t “can AI agents write code?” They can. The question was: what would need to be true to trust one to work without you watching?

That question — run through a structured exercise of “what would the ideal look like, and what would need to be true to get there?” — produced a set of sub-questions about isolation, verification, session management, and progressive autonomy. Not a spec. A set of conditions that would need to hold.

The decision: run an experiment. A real project, a real codebase, real features. See what happens.

The Experiment

The target was Junkdrawer — a personal bookmark indexer: Node.js API, React web app, Chrome extension, backed by Elasticsearch and MongoDB. The task: merge an AI chat sync feature from a separate project into the monorepo. Nine planned milestones, from data model to migration. A dedicated Mac Studio. The agent gets a spec, a codebase, and a plan.

Three sessions. Planning (~1 hour), environment setup (~2 hours), then the long one: milestones 1 through 9, over 8+ hours.

What Worked

The codebase had clear patterns — route conventions, test structure, repository layout. The agent followed plugins-web.js to build plugins-chat.js, matched raw-content-repository.js for chat-raw-repository.js, replicated test fixtures. A well-patterned codebase turned out to be the single best accelerator for autonomous agent quality. The agent amplifies whatever conventions exist — good or bad.

A 400ms unit test suite was fast enough to run after every change. Plans written into markdown files survived context compaction better than conversation memory. Across three sessions and nine milestones, the human intervened about 15 times total.

The First Crack

During milestone 1, the agent declared the work complete after unit tests passed. The human pushed back: “This is an MVP for using a dark software factory. Please continue till M1 is complete, with verification and testing.”

The agent found five gaps it had missed: MongoDB indexes that were documented as needed but never created, missing integration tests for key scenarios, a missing auth test, inaccurate verification docs, and an outdated file list in the plan. The agent’s own response: “You’re right — those should have been done before I said M1 was complete.”

The agent’s “done” was narrower than the human’s “done.” Tests pass ≠ milestone complete.

The Escalation

By milestone 5, confidence had built. The human said:

“I want to try something. Plan the remaining milestones with the model we’ve been using. Consider verification, refining the plan. Then execute all of them, M6-M9. Work autonomously. We will then review together. If you are blocked, make a best effort judgment call, document it, and we will review tomorrow.”

Clear scope. Clear judgment protocol. Clear review point. And quality dropped.

Context compaction degraded tool reliability. Early milestones: the agent reads a file once, edits confidently — 2 tool calls per modification. By milestone 7: the agent re-reads files it already read, falls back to shell commands for edits — 4+ tool calls per modification. The tight feedback loop that made milestones 1-5 clean was fraying.

The Founding Insight

The final count: 9 milestones. ~80 files. 221 tests. 0 test failures. And 2 bugs found by manual testing after the agent declared everything complete.

Both bugs were integration-boundary failures. One: a SERVICE_NAMES enum that didn’t include the new service type — the agent’s tests validated within the package, but downstream validation schemas rejected the unknown value. Two: a source field hardcoded to an existing caller’s identity — the agent’s package worked perfectly; the shared pipeline assumed only one caller existed.

Neither was a coding error. Both were assumptions nobody challenged. The agent wrote tests that validated its own understanding against its own implementation. Of course they passed.

The agent grades its own homework.

This became the founding insight. Not “agents write bugs” — all developers do. The insight is that the verification is circular. The agent tests what it knows about. Integration boundaries, schema mismatches, cross-package assumptions — these live outside what the agent tested, because they live outside what the agent knows to question.
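To make the circularity concrete, here is a minimal sketch of the first bug's shape. Only the `SERVICE_NAMES` name comes from the incident. The enum values, `buildChatRecord`, `validateDownstream`, and both test functions are hypothetical stand-ins, not Junkdrawer's actual code.

```javascript
// Hypothetical shared enum, owned by a different package than the new feature.
// Assumption for illustration: the new "chat" service type was never added here.
const SERVICE_NAMES = Object.freeze(["web", "extension"]);

// New package code written by the agent (hypothetical shape).
function buildChatRecord(text) {
  return { service: "chat", text };
}

// The agent's own test: it checks the record against the agent's own expectations.
function agentAuthoredTest() {
  const record = buildChatRecord("hello");
  return record.service === "chat" && record.text === "hello";
}

// Downstream validation that the package-local test never exercises (hypothetical).
function validateDownstream(record) {
  return SERVICE_NAMES.includes(record.service);
}

// A boundary check written outside the package: it pushes the record through
// the downstream validator instead of the package's own assumptions.
function boundaryTest() {
  return validateDownstream(buildChatRecord("hello"));
}

console.log("agent-authored test passes:", agentAuthoredTest()); // true
console.log("boundary test passes:", boundaryTest()); // false
```

Both functions exercise the same record. The difference is whose assumptions the record is checked against.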

The Response

The week after R1, the Junkdrawer experiment, was intense. There was no pause-and-redesign. The toolkit emerged through the same cycle it would eventually teach: explore the problem, brainstorm approaches, plan a response, execute, verify, document. The toolkit started being built with itself.

Verification as Governing Constraint

A five-layer verification strategy emerged:

Layer	Name	What it covers	Who writes it
1	Unit tests	Fast, after every change	The agent
2	Integration tests	Real-ish dependencies	The agent
3	Smoke tests	Real API calls, real data stores	The human
4	Holdout tests	Agent-invisible behavioral checks	The human
5	Human verification	Manual testing, exploratory	The human

Layers 3 and 4 (smoke and holdout tests) were missing in R1. That’s where both bugs lived. The smoke tests would have caught the schema mismatch. The holdout tests would have caught the hardcoded identity.

The principle: you can only give the agent as much autonomy as your verification can cover. Better verification → more autonomy → more ambitious work → stronger verification needed. It is a progressive relationship, not a static threshold.
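One way to picture the layers listed above is as data plus a gate. This is a sketch under assumptions, not Michi's implementation: the layer names mirror the table, while `canClaimDone`, `isCircular`, and the result format are hypothetical.

```javascript
// The five layers, with who authors each one.
const LAYERS = [
  { name: "unit", author: "agent" },
  { name: "integration", author: "agent" },
  { name: "smoke", author: "human" },
  { name: "holdout", author: "human" },
  { name: "human-verification", author: "human" },
];

// Hypothetical gate: a milestone may be claimed "done" only when every
// layer has a passing result. A missing result counts as not passing.
function canClaimDone(results) {
  const missing = LAYERS.filter((l) => results[l.name] !== "pass").map((l) => l.name);
  return { done: missing.length === 0, missing };
}

// Hypothetical circularity check: true when no passing layer was
// authored by someone other than the agent.
function isCircular(results) {
  return !LAYERS.some((l) => l.author === "human" && results[l.name] === "pass");
}

// Assumed state at the moment the agent declared everything complete:
// only the agent-authored layers had run.
const declaredDone = { unit: "pass", integration: "pass" };
const gate = canClaimDone(declaredDone);

console.log(gate.done, gate.missing.join(", ")); // false smoke, holdout, human-verification
console.log(isCircular(declaredDone)); // true
```

The gate is deliberately strict: widening autonomy means adding passing layers, not relaxing the check.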

The Discipline Question

R1 showed that quality drops when discipline relaxes. The autonomous stretch (milestones 6-9) produced both escaped bugs. Not because autonomy is bad — because verification wasn’t enforced.

The response was deliberate rigidity. An “Iron Law”: no claiming done without running full verification. Not “I believe it passes” — run the command, read the output, then state the result. Red flags to watch for (“this decision is too minor to log” — log it). Common rationalizations the agent might use to skip steps (“verification is redundant, tests passed” — unit tests are one layer, scenarios catch what tests miss). These weren’t theoretical. They were patterns observed in R1.
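The "run the command, read the output, then state the result" rule can be sketched as a small wrapper. This is an illustration only: `verify` and `claimDone` are hypothetical names, and the commands are stand-ins so the example runs anywhere Node does. The only real API used is `spawnSync` from Node's `child_process` module.

```javascript
const { spawnSync } = require("node:child_process");

// Run a verification command and report only what was actually observed.
function verify(command, args) {
  const result = spawnSync(command, args, { encoding: "utf8" });
  const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
  // A command that could not be started has `error` set and a null status.
  const passed = !result.error && result.status === 0;
  return { passed, status: result.status, output };
}

// Hypothetical "done" claim: it cannot be produced without running the
// command in this call and seeing a clean exit.
function claimDone(milestone, command, args) {
  const check = verify(command, args);
  return check.passed
    ? `${milestone}: verified (exit ${check.status})`
    : `${milestone}: NOT done (exit ${check.status})`;
}

// Stand-in commands: a Node one-liner that exits 0, then one that exits 1.
console.log(claimDone("M1", process.execPath, ["-e", "process.exit(0)"])); // M1: verified (exit 0)
console.log(claimDone("M1", process.execPath, ["-e", "process.exit(1)"])); // M1: NOT done (exit 1)
```

The design choice is that there is no code path from "I believe it passes" to a done claim. The claim is derived from an exit status observed in the same call.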

But rigidity for everything would be its own problem. A bug fix doesn’t need nine milestones and a verification plan. This tension — between discipline and ceremony — produced two tools. The full session skill: rigid, every step matters, because the rigidity is the point. And the workshop: same core questions, no paperwork.

The workshop’s design crystallized in a single exchange. After working through what “lightweight Michi” should keep versus drop, the agent said: “That’s the pattern for the whole skill, isn’t it? Workshop encodes the questions Michi asks, not the artifacts Michi produces.”

One word back: “Yes.”

The discipline stays. The ceremony scales to the work.

The Cold-Start Problem

Long sessions — 2000+ turns over hours — turned out to preserve something that documents can’t: the spirit of the collaboration. By turn 500, the agent has absorbed design philosophy well enough to anticipate preferences. It follows the rules with understanding, not just compliance.

But switch computers or start a fresh session, and that understanding evaporates. Rules survive in configuration files. Working context doesn’t. The experience of starting fresh:

“Like the movie Memento. The tattoos are the permanent rules. The polaroids are the working context — perishable. The polaroids are what’s missing.”

Three things get lost at cold start:

Collaboration calibration. Configuration files read as rules. A long session follows them with understanding. A new session follows them mechanically.
Active mental model. Status documents capture the state of the work. They don’t capture the state of thinking about the work.
Established trust. After 500 turns, there’s a model of when to just act versus when to check. A fresh session is either too cautious or too presumptuous. Neither is right.

This is now mitigated by a memory file that captures “what a new session would be pained to lose.” It holds not project state (that’s in status docs) and not code patterns (those are in the code), but collaboration patterns, corrections, confirmed approaches, and non-obvious context. It helps. It doesn’t fully solve the problem.

The Meta-Loop

Something worth naming: the toolkit was being built by applying itself. Skills designed through structured exploration. Principles refined through the same assumption-surfacing they teach. A writing pass that tightened 22 files — cut filler, strengthened leads, varied sentence length. Not a redesign; a refinement of what was already emerging.

Each iteration made the next one better. Not circles — spirals. The process learned from its own application.

Real Tests

A toolkit that only works on the project that created it isn’t really a toolkit. It’s a rationalization. The real test: does it hold up on different projects, different contexts, different types of work?

The First Application

A CLI tool project — four milestones, 56 tests. The first spec referenced an npm package that didn’t exist. The agent dutifully tried to implement against it until the human intervened: “The spec is wrong. Consider it notes.”

This produced a principle: first-epic specs are hypotheses, not contracts. The accumulated Michi context that makes later specs accurate doesn’t exist yet. Validate dependencies. Surface assumptions about external APIs. When the spec is wrong on facts, correct and log the deviation.
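A dependency check of this kind might look like the sketch below. Everything in it is hypothetical: the function names, the package names, and the deviation log format are invented for illustration. The lookup is injected so the example runs offline. A real lookup could shell out to `npm view <name> version` and treat a failure as "not found".

```javascript
// Return the spec's dependencies that the lookup cannot confirm.
function findUnverifiedDependencies(specDependencies, exists) {
  return specDependencies.filter((name) => !exists(name));
}

// Record a factual correction to the spec instead of silently working around it.
function logDeviation(log, dependency) {
  log.push({
    type: "spec-deviation",
    dependency,
    note: "package not found; spec treated as notes",
  });
  return log;
}

// Hypothetical registry contents and spec, for illustration only.
const knownPackages = new Set(["cli-kit-example", "arg-parser-example"]);
const specDependencies = ["cli-kit-example", "audio-magic-example"];

const missing = findUnverifiedDependencies(specDependencies, (name) => knownPackages.has(name));
const deviations = missing.reduce(logDeviation, []);

console.log(missing.join(", ")); // audio-magic-example
console.log(deviations.length); // 1
```

Running a check like this before the first milestone turns "the spec is wrong" from a mid-implementation surprise into a logged correction.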

Pace calibration. The agent came in fast — too fast. The human pulled it back: “Only take the single smallest next step.” Like a new pair programming partner, the first 30 minutes establish working style. Speed is earned through demonstrated understanding, not assumed.

By the fourth milestone, the agent had absorbed the design philosophy enough to propose dropping planned work before being told. Long sessions preserve alignment; the codebase accumulates examples of applied judgment. The spiral working.

Verification Vindicated

That same CLI tool project. The first epic (audio narration) passed all unit tests across four milestones. Then human testing — what the toolkit calls Level C — found five bugs. Orphaned processes. Oversized files. SDK version mismatches. Things the test suite structurally can’t catch because they only manifest in real execution, not mocked environments.

The second epic was planned after that debrief. The agent wrote in its plan: “From the first epic: unit tests with mocks caught zero of five bugs. For this epic, we can do real integration tests.” The second epic had real integration tests from the start, caught a permissions bug during the initial spike, and had zero post-verification bugs.

Not because the agent got smarter. Because the process learned what to test.

Beyond Code

A personal dev toolkit needed a research-informed roadmap. Not code — analysis, prioritization, recommendations. The agent started scoping coding milestones. The human redirected: “We need to remap your milestones. I’m not looking for code in this epic.”

The planning skill assumed code deliverables. The session skill’s verification checklist assumed tests. When the deliverable is a research document, “all tests pass” is meaningless. The toolkit’s code-centrism was an unstated assumption — surfaced by trying to use it for something else.

The response wasn’t a separate non-code toolkit. It was a calibration: the same cycle (explore → synthesize → checkpoint), the same discipline (surface assumptions, verify before claiming done, log decisions), different verification mechanics (exit criteria instead of test suites, discovery questions as legitimate output rather than blockers).
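For a non-code deliverable, exit criteria can play the role a test suite plays for code. The sketch below is an assumption about how that could be expressed, not how Michi does it: the criteria names, the heading conventions, and the `evaluate` function are all hypothetical.

```javascript
// Hypothetical exit criteria for a research document: each one is a
// named predicate over the document text.
const exitCriteria = [
  { name: "has recommendations", check: (doc) => /^Recommendations/m.test(doc) },
  { name: "has prioritization", check: (doc) => /^Priorit/m.test(doc) },
  // Open questions count as output here, not as a blocker.
  { name: "open questions listed", check: (doc) => /^Open questions/m.test(doc) },
];

// Same discipline as a test run: report what is unmet before claiming done.
function evaluate(doc, criteria) {
  const unmet = criteria.filter((c) => !c.check(doc)).map((c) => c.name);
  return { done: unmet.length === 0, unmet };
}

// A draft with two of the three sections present.
const draft = [
  "Recommendations",
  "Adopt tool X first.",
  "",
  "Open questions",
  "Does X scale?",
].join("\n");

const result = evaluate(draft, exitCriteria);
console.log(result.done); // false
console.log(result.unmet.join(", ")); // has prioritization
```

Heading checks like these only confirm that a section exists. Judging whether the analysis is any good still falls to human review.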

More projects followed. A personal knowledge base — exploration and analysis, not implementation. Infrastructure work. A project integrating with Notion. Each tested the toolkit in a different context. Each revealed adaptations. None broke the core cycle.

What’s Still Open

The core cycle holds: explore, plan, execute, verify, document. It works across code and non-code, across projects, across the toolkit itself. Verification as governing constraint is the load-bearing insight. Paired mode — human and agent thinking together — produces better work than either alone.

But honest accounting:

The cold-start problem is mitigated, not solved. Memory files help with project context. Collaboration calibration — the understanding behind the rules — still degrades at session boundaries.

The autonomy question is open. R1 showed that quality drops when the human steps back. Later projects showed that quality holds when the human stays engaged. Is truly autonomous operation possible, or is “paired with decreasing involvement” the actual ceiling? The verification infrastructure is better now. Whether it’s good enough to run a dark software factory with the lights off — that’s an experiment that hasn’t been run yet.

Scale is untested. This works for a solo practitioner on projects of moderate complexity. Teams, large codebases, multi-agent coordination — open questions, not claims.

The question from the beginning: what would need to be true to trust an autonomous agent? The answer is clearer now. You need verification the agent didn’t author. You need discipline that survives the agent’s tendency to rationalize. You need a human who knows when to slow down. And you need the honesty to say what you don’t know yet.

Michi is that honesty, embodied in a toolkit. It’s pre-1.0. It works. And the things it doesn’t know yet are documented as carefully as the things it does.