Troubleshoot and Fix a "Thing"

In short

Something is wrong — a test that used to pass now fails, a service won’t start, a feature behaves strangely. The job is to find the root cause before you touch any code.

The skill is /michi-workshop bugfix <description>. The iteration cycle is weighted toward Explore: reproduce the problem first (a failing test is the target), then trace from symptom to root cause. Brainstorm is usually quick once you have the root cause, which often points directly at the fix. Plan states the scope explicitly. Execute makes the fix and runs the reproduction test. Verify checks real behavior, not just the test. Document closes with a journal entry and the regression test, which is what proves the bug stays fixed.

The key discipline: diagnose before you fix. A test that reliably reproduces the problem marks the end of Explore and drives the fix.

A sample session

This is from a session diagnosing a todo app that kept reporting the same set of items as needing a sync refresh, no matter how many times you ran it. The symptom pointed at state tracking; the root cause turned out to be a unit mismatch buried in the API service.

You: /michi-explore peek

Agent: Here’s where things stand. The project is a todo app — a frontend client and an API service with a small database layer. One epic closed recently. Branch main, clean working tree, no epic mid-flight.

That’s the orientation. What do you want to work on?

You: I want to follow up on the todo app — the sync keeps reporting the same items as needing a refresh, no matter how many times I run it. Always the same 47 items.

Agent: This is a real state-tracking bug worth investigating properly. The symptom — the “needs sync” count never decreasing — points at the timestamp bookkeeping on the API side. Let me read the actual code before forming a hypothesis.

[The agent reads the route that writes synced_at, the query that compares it against updated_at, and the helper that produces the updated_at value — tracing the unit each field is written in.]

Found it. It’s the API service, not the client — a unit mismatch.

The query compares updated_at against synced_at. Those two fields are written in different units: updated_at is epoch milliseconds (~1.7e12), synced_at is written in epoch seconds (~1.7e9). So after every sync, updated_at > synced_at is always true — every item re-qualifies, every time. 47 forever.

The client is working correctly. It asks the API what needs refresh and faithfully gets 47 back each time.

The fix is one line in the sync route — multiply by 1000 when writing synced_at so it stores milliseconds to match. The 47 existing entries self-heal: each item, once fetched after the fix, gets the correct value and drops out.

This is small and self-contained — a good /michi-workshop fix. Want me to take it that way?

You: /michi-workshop — let’s fix it.

Agent: Confirmed the diagnosis — and found something worth flagging: there’s an existing test that locks in the bug. It was written to assert epoch seconds, so it’s been passing against the wrong behavior. And the integration test never caught the mismatch because its fixtures used small consistent numbers for both fields — same scale, so the seconds-vs-milliseconds drift was invisible.

Plan:

1. Fix the sync route: write synced_at in milliseconds, correct the misleading comment.
2. Rewrite the bug-locking test to assert milliseconds.
3. Add a cross-layer regression test with realistic millisecond timestamps, so the unit drift can’t hide again.
4. Run the full suite.
5. Capture in the journal.

Verification: the route test fails before the fix, passes after. Full suite green. Good to proceed?

You: Just deploy?

Agent: No — deploy alone won’t fix it. The bug is in the source, so redeploying current code changes nothing. Order is: fix the source → verify → then deploy. Deploy is the last step, and it’s yours to run. Shall I do steps 1–4 first?

You: Sorry, I thought you already fixed it. Please proceed.

[The agent applies the one-line fix, rewrites the unit-asserting test, adds the cross-layer regression test with realistic ms-scale timestamps, then temporarily reverts the fix to confirm the new test genuinely fails against the unfixed code — then restores the fix and runs the full suite.]

Agent: All green — 84 unit + 22 integration. Bug-fix gate honored: the route test failed against the unfixed code (1748563200 seconds vs. expected 1748563200000 ms), passes with the fix.

Journal and bug ledger captured.

What was fixed: The sync route wrote synced_at in epoch seconds while the items collection’s updated_at is epoch milliseconds. The “needs sync” comparison (updated_at > synced_at) was therefore always true, so every item re-qualified on every fetch. The data-model spec had always said milliseconds; the route had drifted from it.

Your turn:

Commit the fix.
Deploy — the live API service keeps the bug until this runs.

After deploy, the 47 items self-heal over normal sync passes. No data migration needed.
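
The mismatch the agent found can be sketched in a few lines. This is an illustration under assumptions, not the app’s actual code: the function names and the item dictionary are hypothetical, and only the two field names, the comparison, and the multiply-by-1000 fix come from the session.

```python
# Hypothetical stand-ins for the sync route and the "needs sync" query.
# Only updated_at, synced_at, and the x1000 fix come from the session.

def needs_sync(item: dict) -> bool:
    """An item qualifies when it was updated after it was last synced."""
    return item["updated_at"] > item["synced_at"]


def mark_synced_buggy(item: dict, now_seconds: int) -> None:
    """Unfixed route: stores epoch seconds in a field compared against milliseconds."""
    item["synced_at"] = now_seconds


def mark_synced_fixed(item: dict, now_seconds: int) -> None:
    """Fixed route: multiply by 1000 so both fields hold epoch milliseconds."""
    item["synced_at"] = now_seconds * 1000


SYNC_TIME_S = 1748563200  # epoch seconds, the value quoted in the session's test output

# An item last updated one minute before the sync, in epoch milliseconds.
item = {"updated_at": SYNC_TIME_S * 1000 - 60_000, "synced_at": 0}

mark_synced_buggy(item, SYNC_TIME_S)
still_stale = needs_sync(item)   # ~1.7e12 > ~1.7e9, so the item re-qualifies

mark_synced_fixed(item, SYNC_TIME_S)
healed = not needs_sync(item)    # both values in ms, so the item drops out
```

With the unfixed write, the item still qualifies immediately after being synced; with the fixed write it drops out, which is the self-healing the agent describes.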

The full guide

Something is wrong. A test fails that used to pass. A service won’t start. A build succeeds locally but fails in CI. A feature behaves strangely and you’re not sure why.

This guide is for that situation: something is broken, and you need to diagnose the root cause and fix it.

The fix itself is usually the small part. Most of the work is diagnosis — finding out exactly what broke and why before you touch any code. Getting that order right is what separates a clean fix from an afternoon of flailing.

The iteration cycle for a fix

Michi’s iteration cycle runs every kind of work: Explore → Brainstorm → Plan → Execute → Verify → Document. For a bug fix, the cycle is weighted toward Explore. Diagnosis is most of the job.

The skill that carries this is /michi-workshop in bug-fix mode: /michi-workshop bugfix <description>.

Explore — diagnose

Reproduce the problem first. A bug you can’t reproduce reliably is a bug you can’t verify you’ve fixed. Build the reproduction case before anything else — ideally as a failing test.

Then trace to the root cause, not the symptom. A 404 from an API endpoint is a symptom. The root cause might be a missing route, a misconfigured build step that strips the route in production, or a proxy rule that never got updated. Don’t fix the symptom; find what broke.

Useful questions to drive Explore:

When did this start? (What changed?)
Is it consistent or intermittent?
Does it happen in all environments or just one?
What’s the smallest input that reproduces it?

If the failing test doesn’t exist yet, write it now. The reproduction case is the test — it defines “broken” precisely. This is the end of Explore: you can reliably reproduce the problem and you understand why it’s happening.
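
As a rough sketch of what “the reproduction case is the test” can look like, here is a reproduction for the seconds-versus-milliseconds bug from the sample session. The route stand-in and both function names are hypothetical; in a real project the check would live in whatever test runner the project already uses.

```python
def write_synced_at(now_seconds: int) -> int:
    """Hypothetical stand-in for the unfixed sync route: returns the value it would store."""
    return now_seconds  # the bug under test: seconds where milliseconds are expected


def reproduce_sync_bug() -> None:
    """Reproduction case: raises AssertionError for as long as the bug is present."""
    expected_ms = 1748563200000
    stored = write_synced_at(1748563200)
    if stored != expected_ms:
        raise AssertionError(f"{stored} seconds vs. expected {expected_ms} ms")


def is_reproduced() -> bool:
    """True when the reproduction case fails, meaning the bug is still there."""
    try:
        reproduce_sync_bug()
    except AssertionError:
        return True
    return False
```

The check states “broken” as one exact expectation, so there is no ambiguity later about whether the fix worked.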

Brainstorm — settle the fix

Once you have the root cause, figure out how to address it. Usually this is quick — the root cause often points directly to the fix. The key question: does your proposed fix address the root cause, or just mask the symptom?

If the root cause reveals a design problem rather than a bug, escalate. A bug fix should be scoped to this bug. Design work belongs in /michi-planning, not a bug-fix session.

Plan — agree before you code

State what you’re going to change and confirm it before touching any code. For a simple fix this is brief — a few sentences. The goal is to make the scope explicit so you don’t drift.

Execute — fix and test

Make the fix. Then run the reproduction test: it should now pass. Run the full suite: nothing else should have broken.

This is TDD applied to bugs — the reproduction case you wrote in Explore is the test that drives the fix.

Verify — real verification

“My test passes” does not mean done. Verify against the actual behavior that was broken:

Does the thing that was broken now work?
Does the full suite pass?
Is there anything the tests can’t catch — a deploy, a live service, an environment dependency — that needs a manual check?

False-pass verification (a test that reports success without exercising the real behavior) is a specific failure mode here. Make sure your test actually exercises the broken path.
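
One way a false pass can arise is through fixtures, as in the sample session, where small same-scale numbers hid a unit mismatch. The sketch below assumes a hypothetical unfixed route and a hypothetical integration-style assertion; the names and fixture values are illustrative only.

```python
def buggy_sync(item: dict, now_seconds: int) -> dict:
    """Hypothetical unfixed route: writes synced_at in epoch seconds."""
    return {**item, "synced_at": now_seconds}


def needs_sync(item: dict) -> bool:
    """An item qualifies when it was updated after it was last synced."""
    return item["updated_at"] > item["synced_at"]


def item_clears_after_sync(updated_at: int, now_seconds: int) -> bool:
    """What an integration test would assert: a freshly synced item stops qualifying."""
    synced = buggy_sync({"updated_at": updated_at, "synced_at": 0}, now_seconds)
    return not needs_sync(synced)


# Small, same-scale fixture values: the assertion holds even though the route is buggy.
false_pass = item_clears_after_sync(updated_at=100, now_seconds=200)

# Realistic values (updated_at in epoch ms, clock in epoch seconds): the bug is exposed.
caught = not item_clears_after_sync(updated_at=1748563140000, now_seconds=1748563200)
```

Both checks run against the same buggy code. Only the second exercises the broken path, which is why fixture values deserve the same scrutiny as assertions.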

Document — close the loop

Write a journal entry: what broke, the root cause, what the fix was, and that a regression test was added. For trivial fixes this is one or two sentences. The regression test is the most important artifact: it is what proves the bug stays fixed across future changes.

Sub-cases

A code bug

The clearest case. A function returns the wrong value, an edge case isn’t handled, a state machine reaches an invalid state.

Build the reproduction test first. When it’s reliable, trace to the root cause in the code. Fix, verify the test passes, run the suite.

A broken build or deploy

Works locally, fails in CI or production. The usual suspects: environment differences (Node version, env vars, file paths), steps that only run in the pipeline (minification, tree-shaking, asset compilation), or configuration that differs between environments.

Explore means comparing environments — what’s different between where it works and where it doesn’t? The answer is usually in that diff. Check the pipeline logs carefully; the actual error is often buried a few lines before the failure line.
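
Comparing environments can be as simple as diffing two snapshots of settings. The helper below is a minimal sketch; the function name and every key and value in the two snapshots are made up for illustration.

```python
from typing import Optional


def diff_environments(
    works: dict[str, str], fails: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return each key whose value differs, or that exists on only one side.

    Values are (value_where_it_works, value_where_it_fails); None means the key is absent.
    """
    keys = sorted(set(works) | set(fails))
    return {
        key: (works.get(key), fails.get(key))
        for key in keys
        if works.get(key) != fails.get(key)
    }


# Hypothetical snapshots; in practice these might be captured from each environment.
local = {"NODE_VERSION": "20.11.0", "API_URL": "http://localhost:3000", "MINIFY": "false"}
ci = {"NODE_VERSION": "18.19.0", "API_URL": "http://localhost:3000", "MINIFY": "true", "CI": "true"}

differences = diff_environments(local, ci)
```

Here `differences` contains only NODE_VERSION, MINIFY, and CI, which gives you a short list of suspects. One way to capture a snapshot is to dump `dict(os.environ)` in each environment.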

A broken environment or service

A service won’t start, won’t connect, or no longer survives a reboot. Environment drift is the common cause: something changed outside the code (a system package, a port conflict, a credential that expired, a path that moved).

Explore means isolating what changed. Check logs first — startup errors and connection failures usually say exactly what’s wrong. If logs don’t help, narrow the environment: does the problem reproduce on a fresh machine? In a container? With a specific dependency pinned?

Confusing behavior

Something behaves unexpectedly and you’re not sure if it’s a bug or a misunderstanding of intended behavior. This sub-case starts with a question rather than a reproduction case: is this actually broken?

Explore here means understanding the intended behavior first (the spec, the docs, the tests that define the expected output) before deciding whether the actual behavior diverges from it. Sometimes the answer is “the behavior is correct, the understanding was wrong.” That’s still a fix, just to the documentation or the mental model rather than the code.

Pitfalls

Chasing symptoms. You find the proximate cause — the line that throws — and fix it, only to have the same problem recur from a different path. The root cause is upstream. Keep asking “why” until the answer is the actual problem, not just the nearest one.

Flailing. You try one fix, it doesn’t work. You try another. After the third failed attempt, stop. Reframe. Usually: you don’t have a reliable reproduction case yet, which means the Explore phase isn’t done. Build a solid repro first, then fix.

Skipping the regression test. You found the bug, you fixed it, it works now. Without a test, the same bug can return in a future change and you won’t know until it surfaces again. The test is not optional — it’s what closes the loop.

False-pass verification. A test that passes without actually exercising the broken behavior. Check that your reproduction test genuinely fails against the unfixed code and genuinely passes against the fixed code.
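
The fail-then-pass check can be made mechanical. The sketch below assumes you can run the same check against both the unfixed and the fixed code; all names are hypothetical, and the two routes stand in for the before and after versions of a real change.

```python
from typing import Callable

Route = Callable[[int], int]  # hypothetical: takes epoch seconds, returns the stored value


def unfixed_route(now_seconds: int) -> int:
    return now_seconds          # stores seconds (the bug)


def fixed_route(now_seconds: int) -> int:
    return now_seconds * 1000   # stores milliseconds (the fix)


def regression_check(route: Route) -> None:
    """Raises AssertionError unless the route stores epoch milliseconds."""
    stored = route(1748563200)
    if stored != 1748563200000:
        raise AssertionError(f"{stored} != 1748563200000")


def vacuous_check(route: Route) -> None:
    """A check that never exercises the route, so it can never fail."""
    return None


def passes(check: Callable[[Route], None], route: Route) -> bool:
    try:
        check(route)
    except AssertionError:
        return False
    return True


def gate_honored(check: Callable[[Route], None], unfixed: Route, fixed: Route) -> bool:
    """True only if the check fails on the unfixed code and passes on the fixed code."""
    return (not passes(check, unfixed)) and passes(check, fixed)
```

`gate_honored(regression_check, unfixed_route, fixed_route)` holds, while the same call with `vacuous_check` does not, because a check that passes against the unfixed code proves nothing about the fix.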

Scope creep in a fix. “While I’m in here, I should also clean up…” Bug-fix mode means this bug. Related improvements belong in a separate, explicit piece of work.

Escalating too late. If the reproduction and root cause analysis reveal that the fix requires changing a contract, redesigning a component, or touching multiple systems in non-obvious ways — that’s a design problem, not a bug fix. Escalate to /michi-planning before going further.

A worked example

A web API endpoint starts returning 404 in production. It works in local development.

Explore: The error is consistent in production, absent locally. Check the build logs — the deployed bundle doesn’t include the /api/users route handler. Trace: the build uses tree-shaking, and the route handler file isn’t imported anywhere directly; it’s loaded dynamically via a glob. The glob pattern doesn’t match the new file structure introduced in last week’s refactor.

Reproduction case: a test that runs the production build configuration and asserts the route responds — fails against the current code.

Brainstorm: Fix the glob pattern in the build config to match the new file structure. Simple, addresses the root cause.

Plan: Update the build config glob. Confirm the route handler is included in the output bundle. No other changes.

Execute: Update the glob. The reproduction test passes. Full suite green.

Verify: Deploy to a staging environment. Hit the endpoint: it returns 200. The test that was previously failing now passes against the production build configuration.

Document: Journal entry: production 404 traced to build config glob not matching refactored file structure. Fixed glob, added build-integration test to catch this class of problem.
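
The root cause in this example, a glob that no longer matches after a refactor, can be illustrated with a small sketch. Every path and pattern below is invented for illustration, and Python’s pathlib matching stands in for the build tool’s glob, whose exact semantics may differ.

```python
from pathlib import PurePosixPath


def handlers_matched(pattern: str, files: list[str]) -> list[str]:
    """Return the files the pattern picks up; '*' does not cross a '/' here."""
    return [f for f in files if PurePosixPath(f).match(pattern)]


# Hypothetical layouts before and after the refactor.
old_layout = ["src/routes/users.js", "src/routes/items.js"]
new_layout = ["src/routes/users/handler.js", "src/routes/items/handler.js"]

OLD_PATTERN = "src/routes/*.js"          # written for the old layout
NEW_PATTERN = "src/routes/*/handler.js"  # updated to follow the refactor

before_refactor = handlers_matched(OLD_PATTERN, old_layout)  # both handlers found
after_refactor = handlers_matched(OLD_PATTERN, new_layout)   # nothing found: handlers drop out
after_fix = handlers_matched(NEW_PATTERN, new_layout)        # both handlers found again
```

The stale pattern fails silently: it raises no error and simply matches nothing. That is why the reproduction test asserts on the built output rather than on the config itself.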

When a fix reveals a larger design problem, see Run an Epic or Project and /michi-planning. The underlying cycle is described in full at The Iteration Cycle.