Michi v2026.05.20
Save the Tokens

The 5-Layer Strategy

Each layer catches a different class of bug. No single layer is sufficient.

Layer 1: Unit tests (agent-written)

Fast; run after every file change. They catch regressions, validate logic, and confirm data shapes. Necessary but not sufficient: the agent is testing what it built against what it thinks is correct.

Run: After every file change
Catches: Logic errors, regressions, typos

Layer 2: Integration tests (agent-written)


Use real-ish dependencies (e.g., mongodb-memory-server, test database instances). Catch operator conflicts, query bugs, and data flow issues that mocked tests miss.

Run: After each milestone
Catches: Database operator conflicts, query construction errors, service wiring

In the first experiment, integration tests caught a $set/$setOnInsert operator conflict that mocked tests missed entirely.
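To make this class of bug concrete, here is a minimal sketch of a static check for overlapping paths between `$set` and `$setOnInsert`. The source does not say which fields collided, so the `status` field and the `findOperatorConflicts` helper are hypothetical. MongoDB rejects an update when the same path (or a parent and child path) is targeted by more than one operator, which is why a mocked collection accepts such an update while a real or in-memory database does not.

```typescript
// Sketch only: field names and helper names are illustrative assumptions.
type UpdateDoc = {
  $set?: Record<string, unknown>;
  $setOnInsert?: Record<string, unknown>;
};

// Two paths conflict when they are equal or one is a dotted prefix of the
// other (for example "meta" and "meta.source").
function pathsConflict(a: string, b: string): boolean {
  return a === b || a.startsWith(b + ".") || b.startsWith(a + ".");
}

// Returns every [$set path, $setOnInsert path] pair that would collide.
function findOperatorConflicts(update: UpdateDoc): Array<[string, string]> {
  const setPaths = Object.keys(update.$set ?? {});
  const insertPaths = Object.keys(update.$setOnInsert ?? {});
  const conflicts: Array<[string, string]> = [];
  for (const s of setPaths) {
    for (const i of insertPaths) {
      if (pathsConflict(s, i)) conflicts.push([s, i]);
    }
  }
  return conflicts;
}

// Hypothetical upsert that a mock would accept but a real database rejects:
// "status" is written by both operators.
const badUpsert: UpdateDoc = {
  $set: { status: "active", updatedAt: new Date(0) },
  $setOnInsert: { status: "new", createdAt: new Date(0) },
};

// Same intent with the overlap removed.
const goodUpsert: UpdateDoc = {
  $set: { updatedAt: new Date(0) },
  $setOnInsert: { status: "new", createdAt: new Date(0) },
};
```

A check like this can run as a unit test, but it only covers the one conflict it was written for. The integration test against a real-ish database remains the layer that finds conflicts nobody anticipated.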

Layer 3: Smoke tests

The agent runs at least one real API call with real data through the full stack: not mocked, not a dry run.

Run: After each milestone, before marking complete
Catches: Schema mismatches, validation gaps, cross-package inconsistencies

This is what was missing in the first experiment. The dry-run test passed because it skipped the API call. A real call would have hit the Zod validation error immediately.

Each milestone should have at least one smoke test that:

  1. Uses real running services (not mocked, not in-memory)
  2. Submits real data through the actual API endpoint
  3. Verifies the data appears in the data store
  4. Cleans up after itself (or uses test-namespaced data)
  5. Is runnable with a single command
  6. Exits non-zero on failure
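The requirements above can be sketched as a single function. This is an illustration under stated assumptions, not the project's actual script: the `/traces` endpoint, the payload fields, and the `smokeTest` and `FetchLike` names are all hypothetical, and reading the record back through the API stands in for a direct query against the data store.

```typescript
// Minimal structural type so the function accepts the global fetch in
// Node 18+ (real services) or a stub (when testing the script itself).
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical smoke test: rejects with an Error on any failure.
async function smokeTest(baseUrl: string, fetchImpl: FetchLike): Promise<void> {
  // Test-namespaced id so leftovers are easy to identify and remove.
  const id = `smoke-${Date.now()}-${Math.random().toString(36).slice(2)}`;

  // Submit real data through the actual API endpoint (path is an assumption).
  const created = await fetchImpl(`${baseUrl}/traces`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      id,
      source: "smoke-test",
      messages: [{ role: "user", content: "ping" }],
    }),
  });
  if (!created.ok) {
    throw new Error(`submit failed: HTTP ${created.status}`);
  }

  try {
    // Verify the record can be read back after the write.
    const found = await fetchImpl(`${baseUrl}/traces/${id}`);
    if (!found.ok) {
      throw new Error(`lookup failed: HTTP ${found.status}`);
    }
    const record = (await found.json()) as { id?: string };
    if (record.id !== id) {
      throw new Error(`lookup returned wrong record: ${String(record.id)}`);
    }
  } finally {
    // Clean up even when verification fails.
    await fetchImpl(`${baseUrl}/traces/${id}`, { method: "DELETE" });
  }
}
```

To satisfy the single-command and exit-code requirements, an entry point could call `smokeTest(baseUrl, fetch).catch((err) => { console.error(err); process.exit(1); })`, with `baseUrl` read from an environment variable of your choosing. Where the API has no read endpoint, the verification step would instead query the data store directly.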

Layer 4: Holdout tests (human-written, agent-invisible)


Behavioral tests the agent doesn’t see during development. Stored separately from the codebase. Run as a post-development verification step.

Run: After the agent declares the milestone complete
Catches: Assumptions the agent made that don’t match real-world usage

When writing holdout tests:

  1. Test integration boundaries. Each test should cross at least one package/service boundary.
  2. Test new entry points against existing schemas. When the agent adds a new source/caller, verify it against all downstream validation.
  3. Test the “other caller” scenario. If the agent builds for caller A, test from caller B’s perspective.
  4. Store outside the agent’s workspace. Options: separate directory, separate branch, separate repo.

| Scenario | What it validates | Why the agent might miss it |
| --- | --- | --- |
| Submit new service type via API | Schema enum includes new type | Agent updated adapter but not schema |
| Submit from CLI, check source field | Source isn’t hardcoded | Agent didn’t check shared service |
| Submit with empty messages array | Edge case validation | Agent only tested happy path |
| Submit from two services, search by filter | Multi-service search works | Agent tested each in isolation |
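
As one possible shape for a holdout test, the first scenario (an adapter that accepts a service type the shared schema does not) can be approximated by comparing the two lists directly. This is a sketch under assumptions: the type names and the `missingFromSchema` and `assertSchemaCoversAdapters` helpers are hypothetical, and a real holdout test would import the lists from the adapter and schema packages, or submit through the API, rather than declare them inline.

```typescript
// Hypothetical stand-ins for values that would normally be imported from
// the shared schema package and the adapter registry.
const schemaServiceTypes: readonly string[] = ["chat", "cli"];
const adapterServiceTypes: readonly string[] = ["chat", "cli", "voice"];

// Returns every type that some adapter accepts but the schema would reject.
function missingFromSchema(
  adapterTypes: readonly string[],
  schemaTypes: readonly string[]
): string[] {
  const allowed = new Set(schemaTypes);
  return adapterTypes.filter((t) => !allowed.has(t));
}

// Holdout check: throws when adapters and schema have drifted apart.
function assertSchemaCoversAdapters(
  adapterTypes: readonly string[],
  schemaTypes: readonly string[]
): void {
  const missing = missingFromSchema(adapterTypes, schemaTypes);
  if (missing.length > 0) {
    throw new Error(
      `Schema enum is missing service types: ${missing.join(", ")}`
    );
  }
}
```

With the inline values above, `assertSchemaCoversAdapters(adapterServiceTypes, schemaServiceTypes)` throws because `voice` is handled by an adapter but absent from the schema. The check crosses a package boundary, which is what the agent's own per-package tests tend not to do.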

Layer 5: Human testing

You run the code and use it: manual testing, exploratory testing, visual QA.

Run: After all automated layers pass
Catches: UX issues, visual bugs, workflow gaps, edge cases

In the trace tool project, all unit tests passed across four milestones. Then the human tested with real hardware. Five bugs: an orphaned microphone process, oversized audio files, an SDK version mismatch, duplicate validators, and a path expansion issue. None could have been caught by any automated test.

Before marking a milestone complete:

[ ] Unit tests pass
[ ] Integration tests pass
[ ] At least one smoke test with real API call succeeds
[ ] Result verified in data store
[ ] Holdout tests pass (if they exist)
[ ] No new hardcoded assumptions about callers/sources
[ ] Cross-package schemas updated if new types/enums introduced