Tutorial: Your First Milestone
This is the core Michi loop. Every milestone — whether it’s a feature, a refactor, or a research task — follows the same three-phase cycle: plan → execute → debrief. Here’s what each phase looks like in practice.
Before you start
- Your project is bootstrapped (CLAUDE.md, PROJECT.md, STATUS.md exist)
- You have a small, well-scoped piece of work — a feature, bug fix, or research task
- You’re in Paired mode (human present, tight loop). The first epic is always Paired.
Phase 1: Plan (/michi-planning)
Invoke the planning skill. The agent will:
- Explore the codebase — map existing patterns, trace data flows, understand conventions
- Surface assumptions — explicitly state what it’s assuming, then ask you to confirm or correct
- Ask clarifying questions — about scope, contracts, integration points
- Co-design verification scenarios — stories about users getting benefits, decomposed into Given-When-Then steps
- Write the plan doc — steps, acceptance criteria, scenarios, and empty sections for decisions and notes
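As a rough illustration of how a co-designed scenario might look once it is decomposed into Given-When-Then steps, here is a minimal Python sketch. The `Scenario` class, its field names, and the example story are hypothetical; this tutorial does not show the format Michi actually uses for scenarios.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical container for one verification scenario."""
    story: str  # who benefits, and how
    given: list[str] = field(default_factory=list)
    when: list[str] = field(default_factory=list)
    then: list[str] = field(default_factory=list)

    def is_verifiable(self) -> bool:
        """A scenario needs at least one action and one observable outcome."""
        return bool(self.when) and bool(self.then)

    def to_markdown(self) -> str:
        """Render the scenario as Given-When-Then bullets."""
        lines = [f"### {self.story}"]
        groups = (("Given", self.given), ("When", self.when), ("Then", self.then))
        for keyword, steps in groups:
            for i, step in enumerate(steps):
                # The first step in a group gets the keyword, the rest get "And".
                lines.append(f"- {keyword if i == 0 else 'And'} {step}")
        return "\n".join(lines)


# Illustrative story only, not taken from a real Michi project.
export_scenario = Scenario(
    story="A user exports a report without opening the app",
    given=["a project with one saved report"],
    when=["the user runs the export command"],
    then=["a file is written", "the command exits with status 0"],
)
print(export_scenario.to_markdown())
```

Keeping the "then" steps observable is what makes the later review question ("Are the scenarios verifiable?") answerable.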
What to watch for
Assumption surfacing is the most valuable step. In one session, the agent said “add minimum needed” when the human had actually meant something broader. The human caught it: “that’s what I meant, though not what I said. Good catch.” Without the explicit assumption step, that gap between intent and expression would have become a gap in the implementation.
Review the plan before moving on. Does the scope match your intent? Are the scenarios verifiable? Is anything missing?
The output
A plan doc at docs/epics/<epic>/plans/<milestone>.md with:
- ## Steps: numbered implementation steps
- ## Scenarios: co-designed verification scenarios
- ## Decisions: empty (filled during execution)
- ## Notes: empty (filled during execution)
- ## Discussion: empty (filled during execution)
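To make the layout concrete, the sketch below renders a skeleton with the five sections listed above. The function name `render_plan_skeleton` and the top-level title line are assumptions for illustration; the planning skill writes the real document, and its exact output may differ.

```python
def render_plan_skeleton(title: str, steps: list[str], scenarios: list[str]) -> str:
    """Build a plan doc with filled Steps/Scenarios and empty remaining sections."""
    parts = [f"# {title}", "", "## Steps", ""]
    parts += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    parts += ["", "## Scenarios", ""]
    parts += [f"- {scenario}" for scenario in scenarios]
    # These sections start empty and are filled during execution.
    for heading in ("Decisions", "Notes", "Discussion"):
        parts += ["", f"## {heading}"]
    parts.append("")  # trailing newline
    return "\n".join(parts)


print(render_plan_skeleton(
    "Example milestone",
    steps=["Add the parser", "Wire it into the CLI"],
    scenarios=["User exports a report from the command line"],
))
```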
Phase 2: Execute (/michi-session)
Invoke the session skill. This skill is rigid; the discipline is the point.
The core loop is tight:
Change file → Run tests → See result → Iterate
The agent implements against the plan, running the test suite after every file change. A fast test suite (under 5 seconds) makes this natural. A slow suite breaks the rhythm.
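One way to keep an eye on that rhythm is to time each run of the suite. The sketch below is only an illustration: `run_suite`, `SuiteRun`, and `breaks_rhythm` are hypothetical names, and the command passed in is a stand-in for whatever your project uses to run its tests.

```python
import subprocess
import sys
import time
from dataclasses import dataclass

FAST_SUITE_SECONDS = 5.0  # the "under 5 seconds" guideline from the text


@dataclass
class SuiteRun:
    passed: bool
    seconds: float
    output: str


def run_suite(command: list[str]) -> SuiteRun:
    """Run the test command once, capturing exit status, output, and wall time."""
    start = time.perf_counter()
    proc = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return SuiteRun(proc.returncode == 0, elapsed, proc.stdout + proc.stderr)


def breaks_rhythm(run: SuiteRun) -> bool:
    """True when the suite is too slow to run after every file change."""
    return run.seconds >= FAST_SUITE_SECONDS


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; substitute your real test command.
    result = run_suite([sys.executable, "-c", "print('1 passed')"])
    print(result.output.strip(), f"({result.seconds:.2f}s)")
    if breaks_rhythm(result):
        print("Suite is slow; consider speeding it up before continuing.")
```

Capturing the output rather than just the exit code also supports the "evidence before claims" habit: the result can be shown, not merely asserted.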
What to expect
- Tests running constantly. The agent should run the full suite after every change rather than batching changes between runs.
- Decisions logged in real time. When the agent makes a choice (a library, an approach, a tradeoff), it logs it in the plan doc’s ## Decisions section with what was decided, what alternatives existed, and why.
- Scope discipline. If the work is expanding beyond the plan, the agent should log it and check with you rather than silently expanding.
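A logged decision carries three pieces of information: what was decided, which alternatives existed, and why. The sketch below appends such an entry under the `## Decisions` heading of a plan doc held as a string. The entry layout and the `log_decision` helper are assumptions for illustration; the session skill may format entries differently.

```python
def log_decision(plan_text: str, decided: str, alternatives: list[str], why: str) -> str:
    """Append one decision entry to the end of the '## Decisions' section."""
    lines = plan_text.splitlines()
    try:
        start = lines.index("## Decisions")
    except ValueError:
        raise ValueError("plan doc has no '## Decisions' section") from None

    # The section ends at the next level-2 heading, or at the end of the doc.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Drop trailing blank lines in the section so spacing stays consistent.
    while end > start + 1 and not lines[end - 1].strip():
        del lines[end - 1]
        end -= 1

    entry = [
        "",
        f"- Decided: {decided}",
        f"  - Alternatives: {', '.join(alternatives) or 'none considered'}",
        f"  - Why: {why}",
        "",
    ]
    lines[end:end] = entry
    return "\n".join(lines) + "\n"


plan = "## Steps\n\n1. Add the parser\n\n## Decisions\n\n## Notes\n"
print(log_decision(plan, "use argparse", ["click"], "avoids a new dependency"))
```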
What to watch for
Premature “done.” This is the most common failure mode. In the first experiment, the agent declared a milestone complete after unit tests passed and missed five gaps. The fix: explicit acceptance criteria in the plan doc, checked one by one.
“I believe it passes.” If the agent says something works without showing the output, push back. Evidence before claims. Run the command, read the output, then state the result.
Verification at the end. After all steps are complete, the agent runs the full verification checklist — not just unit tests, but the scenarios from the plan, a scope check (did the implementation stay within the plan?), and any cross-package checks.
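The "checked one by one" and scope-check ideas can be sketched as a small gate that refuses to report "done" while any criterion lacks evidence or any changed file falls outside the plan. Everything here is hypothetical: the `Criterion` class, the idea that a plan names the files it expects to touch, and the `ready_to_declare_done` helper are illustrative assumptions, not part of Michi.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str
    evidence: str = ""  # command output or observation that demonstrates it

    @property
    def met(self) -> bool:
        # "I believe it passes" is not evidence: no recorded output means unmet.
        return bool(self.evidence.strip())


def out_of_scope(planned_files: set[str], changed_files: set[str]) -> set[str]:
    """Files that were changed but never mentioned in the plan."""
    return changed_files - planned_files


def ready_to_declare_done(
    criteria: list[Criterion],
    planned_files: set[str],
    changed_files: set[str],
) -> tuple[bool, list[str]]:
    """Return (ok, problems); ok is True only when nothing is outstanding."""
    problems = [f"unverified: {c.text}" for c in criteria if not c.met]
    problems += [
        f"out of scope: {path}"
        for path in sorted(out_of_scope(planned_files, changed_files))
    ]
    return (not problems, problems)


criteria = [
    Criterion("unit tests pass", evidence="12 passed in 0.8s"),
    Criterion("export scenario works end to end"),  # no evidence yet
]
ok, problems = ready_to_declare_done(
    criteria,
    planned_files={"cli.py", "export.py"},
    changed_files={"cli.py", "export.py", "config.py"},
)
print(ok, problems)
```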
A real example
In a project building a CLI tool, the agent followed strict TDD throughout: 43 tests across 4 test files, all passing. A sustainability check mid-session caught duplicated code between formatters. The agent extracted the shared logic and re-ran tests. The discipline produced a clean, working tool in one session.
But tests aren’t everything. In another project, the agent completed four milestones with all unit tests passing. Then the human did Level C testing (actually running the tool with real hardware). Five bugs emerged that unit tests couldn’t have caught: an orphaned microphone process, oversized audio files, an SDK version mismatch, duplicate validators, and a path expansion issue. Every one was invisible to the test suite.
That’s why Michi doesn’t stop at “tests pass.”
Phase 3: Review (/michi-debrief)
Invoke the debrief skill after the milestone is committed.
The debrief reviews what happened:
- Delivery assessment — what was planned vs. what was delivered? Acceptance criteria met?
- Decision review — were the logged decisions reasonable? Any to reverse or codify?
- Discussion triage — resolve items the agent flagged, defer what’s not ready, promote project-level questions
- Knowledge capture — learnings go to the journal, patterns to patterns.md, rules to CLAUDE.md
- Trust calibration — did trust increase or decrease? What autonomy level for next session?
What makes the debrief valuable
The debrief closes the loop. Without it, learnings stay in the conversation (which gets compacted) instead of flowing into durable docs (which persist across sessions).
In one session, a comparison between old and new implementations — something neither human nor agent planned during the session — caught two real bugs: missing tool call summaries and broken attribution. The human’s debrief note: “any port should always test old vs new.” That learning, captured in the journal, now informs future sessions.
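The "test old vs new" learning translates naturally into a differential check: run both implementations over the same inputs and report every disagreement. The sketch below is a generic illustration under that assumption; `compare_implementations` and the two toy functions are hypothetical and unrelated to the port described above.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def compare_implementations(
    old: Callable[[T], R],
    new: Callable[[T], R],
    inputs: Iterable[T],
) -> list[tuple[T, R, R]]:
    """Return (input, old_result, new_result) for each input where the two differ."""
    mismatches = []
    for item in inputs:
        old_result, new_result = old(item), new(item)
        if old_result != new_result:
            mismatches.append((item, old_result, new_result))
    return mismatches


# Toy example: the "port" forgot to strip whitespace.
def old_label(name: str) -> str:
    return name.strip().title()


def new_label(name: str) -> str:
    return name.title()


for item, expected, actual in compare_implementations(old_label, new_label, ["ada", " grace "]):
    print(f"input={item!r}: old={expected!r}, new={actual!r}")
```

The value of the check depends on the inputs: realistic samples from the old system tend to expose gaps that hand-written unit tests for the new one do not.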
After your first milestone
You now have:
- A plan doc with decisions and notes (the record of what happened)
- Committed code with passing tests
- A journal entry capturing what was learned
The next milestone’s planning will be better — you have context from this one. The agent knows the codebase patterns. The scenarios catalog has its first entries. This is the spiral: each iteration makes the next one richer.
If the work felt like it needed less ceremony, try michi-workshop next time — same discipline, lighter weight. If it felt like it needed more, the process scales up naturally with more scenarios, deeper sustainability checks, and explicit verification levels.