The 5-Layer Strategy
Each layer catches a different class of bug. No single layer is sufficient.
Layer 1: Unit tests (agent-written)
Fast tests that run after every file change. They catch regressions, validate logic, and confirm data shapes. They are necessary but not sufficient: the agent is testing what it built against what it thinks is correct.
Run: After every file change
Catches: Logic errors, regressions, typos
Layer 2: Integration tests (agent-written)
Use real-ish dependencies (e.g., mongodb-memory-server, test database instances). These tests catch operator conflicts, query bugs, and data-flow issues that mocked tests miss.
Run: After each milestone
Catches: Database operator conflicts, query construction errors, service wiring
In the first experiment, integration tests caught a $set/$setOnInsert operator conflict that mocked tests missed entirely.
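The source does not say which fields were involved, but one common form of this conflict is an upsert that writes the same field path under both `$set` and `$setOnInsert`. MongoDB rejects such an update on the server, while a mocked collection usually accepts it without complaint. The sketch below reproduces that overlap check on a plain update document; `findOperatorConflicts`, `pathsOverlap`, and the example fields are hypothetical names chosen for illustration, not code from the experiment.

```typescript
// Hypothetical helper: lists field paths that appear under both $set and
// $setOnInsert, or where one path is a parent of the other. The real check
// happens inside MongoDB; this only mirrors the rule for illustration.
type UpdateDoc = {
  $set?: Record<string, unknown>;
  $setOnInsert?: Record<string, unknown>;
};

function pathsOverlap(a: string, b: string): boolean {
  // "meta" overlaps "meta" and "meta.source", but not "metadata".
  return a === b || a.startsWith(b + ".") || b.startsWith(a + ".");
}

function findOperatorConflicts(update: UpdateDoc): string[] {
  const setPaths = Object.keys(update.$set ?? {});
  const insertPaths = Object.keys(update.$setOnInsert ?? {});
  const conflicts: string[] = [];
  for (const s of setPaths) {
    for (const i of insertPaths) {
      if (pathsOverlap(s, i)) conflicts.push(`${s} <-> ${i}`);
    }
  }
  return conflicts;
}

// Illustrative upsert: createdAt is written by both operators.
const badUpsert: UpdateDoc = {
  $set: { status: "active", createdAt: new Date(0) },
  $setOnInsert: { createdAt: new Date(0) },
};

// Same intent with each field owned by exactly one operator.
const goodUpsert: UpdateDoc = {
  $set: { status: "active" },
  $setOnInsert: { createdAt: new Date(0) },
};
```

A static check like this is only a complement. The point of the integration layer is that the update runs against a database engine that enforces the rule itself.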
Layer 3: Smoke tests (real API calls)
The agent runs at least one real API call with real data through the full stack. Not mocked, not dry-run.
Run: After each milestone, before marking complete
Catches: Schema mismatches, validation gaps, cross-package inconsistencies
This is what was missing in the first experiment. The dry-run test passed because it skipped the API call. A real call would have hit the Zod validation error immediately.
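The failure mode can be shown in miniature. The sketch below assumes a hypothetical `submit` client with a `dryRun` option and uses a hand-rolled `validateAtApi` check as a stand-in for the Zod schema, so it runs without dependencies. The service names and payload are invented for illustration.

```typescript
// Stand-in for the API's schema validation. The original used Zod; this
// hand-rolled check only keeps the sketch dependency-free.
const ALLOWED_SERVICES = ["chat", "email"] as const;

type Submission = { service: string; messages: string[] };
type Result = { ok: true; skipped: boolean } | { ok: false; error: string };

function validateAtApi(body: Submission): string | null {
  const allowed: readonly string[] = ALLOWED_SERVICES;
  return allowed.includes(body.service)
    ? null
    : `invalid service: ${body.service}`;
}

// Hypothetical client. In dry-run mode it returns before the API is reached,
// so server-side validation never runs.
function submit(body: Submission, opts: { dryRun: boolean }): Result {
  if (opts.dryRun) return { ok: true, skipped: true };
  const error = validateAtApi(body);
  return error === null ? { ok: true, skipped: false } : { ok: false, error };
}

// Illustrative payload: a new "voice" service the schema does not know about.
const payload: Submission = { service: "voice", messages: ["hello"] };

const dryRunResult = submit(payload, { dryRun: true }); // reports success
const realResult = submit(payload, { dryRun: false }); // fails validation
```

Under these assumptions the dry run reports success for a payload the real endpoint would reject, which is why a smoke test has to make the actual call.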
Smoke test design
Each milestone should have at least one smoke test that:
- Uses real running services (not mocked, not in-memory)
- Submits real data through the actual API endpoint
- Verifies the data appears in the data store
- Cleans up after itself (or uses test-namespaced data)
- Is runnable with a single command
- Exits non-zero on failure
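The properties above can be sketched as a small runner. Everything here is an assumption made for illustration: the `SmokeDeps` interface, the record shape, and the `smoke-test-` id prefix are hypothetical. In a real smoke test the three functions would wrap the live API endpoint and the real data store; they are injected here only so the runner itself stays independent of any particular stack.

```typescript
// Hypothetical dependency interface. Real implementations would call the
// running API and query the actual data store.
interface SmokeDeps {
  submit(record: { id: string; body: string }): Promise<{ status: number }>;
  findById(id: string): Promise<{ id: string; body: string } | null>;
  remove(id: string): Promise<void>;
}

// Returns a process exit code: 0 on success, 1 on any failure.
async function runSmokeTest(deps: SmokeDeps, runId: string): Promise<number> {
  // Test-namespaced id so leftover data is easy to spot and delete.
  const id = `smoke-test-${runId}`;
  try {
    // Submit through the API endpoint.
    const res = await deps.submit({ id, body: "smoke test payload" });
    if (res.status < 200 || res.status >= 300) {
      throw new Error(`API returned status ${res.status}`);
    }
    // Verify the data actually reached the data store.
    const stored = await deps.findById(id);
    if (stored === null) {
      throw new Error(`record ${id} not found in data store`);
    }
    return 0;
  } catch (err) {
    console.error("smoke test failed:", err);
    return 1;
  } finally {
    // Clean up even when a check fails; cleanup errors are ignored.
    await deps.remove(id).catch(() => undefined);
  }
}
```

An entry script could then call `runSmokeTest(realDeps, String(Date.now())).then((code) => process.exit(code))`, which makes the test runnable with a single command and gives it a non-zero exit status on failure.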
Layer 4: Holdout tests (human-written, agent-invisible)
Behavioral tests the agent doesn’t see during development. Stored separately from the codebase. Run as a post-development verification step.
Run: After the agent declares the milestone complete
Catches: Assumptions the agent made that don’t match real-world usage
Holdout test principles
- Test integration boundaries. Each test should cross at least one package/service boundary.
- Test new entry points against existing schemas. When the agent adds a new source/caller, verify it against all downstream validation.
- Test the “other caller” scenario. If the agent builds for caller A, test from caller B’s perspective.
- Store outside the agent’s workspace. Options: separate directory, separate branch, separate repo.
Example holdout scenarios
| Scenario | What it validates | Why the agent might miss it |
|---|---|---|
| Submit new service type via API | Schema enum includes new type | Agent updated adapter but not schema |
| Submit from CLI, check source field | Source isn’t hardcoded | Agent didn’t check shared service |
| Submit with empty messages array | Edge case validation | Agent only tested happy path |
| Submit from two services, search by filter | Multi-service search works | Agent tested each in isolation |
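As one illustration, the "Submit from CLI, check source field" row might look like the sketch below. It assumes a hypothetical shared `saveSubmission` service in which the source was hardcoded while building for a web caller; the function names, the `Stored` shape, and the `"web"` and `"cli"` values are invented for the example.

```typescript
type Input = { source: string; messages: string[] };
type Stored = { source: string; messages: string[] };

// Hypothetical shared service as the agent might leave it: built while
// working on the web caller, so the source ended up hardcoded.
function saveSubmission(input: Input): Stored {
  return { source: "web", messages: input.messages };
}

// Holdout check written from another caller's perspective. Returns null when
// the stored source matches what the caller sent, or a failure message.
function checkSourceIsPreserved(
  save: (input: Input) => Stored,
  source: string
): string | null {
  const stored = save({ source, messages: ["hi"] });
  return stored.source === source
    ? null
    : `expected source "${source}", got "${stored.source}"`;
}

const fromWeb = checkSourceIsPreserved(saveSubmission, "web"); // passes
const fromCli = checkSourceIsPreserved(saveSubmission, "cli"); // fails
```

A test written only for the web caller passes here, so the hardcoded value is visible only from the second caller's side. That is the gap the "other caller" principle targets.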
Layer 5: Human verification
You run the code and use it. Manual testing, exploratory testing, visual QA.
Run: After all automated layers pass
Catches: UX issues, visual bugs, workflow gaps, edge cases
In the trace tool project, all unit tests passed across four milestones. Then the human tested with real hardware. Five bugs: an orphaned microphone process, oversized audio files, an SDK version mismatch, duplicate validators, and a path expansion issue. None could have been caught by any automated test.
Verification checklist (per milestone)
- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] At least one smoke test with real API call succeeds
- [ ] Result verified in data store
- [ ] Holdout tests pass (if they exist)
- [ ] No new hardcoded assumptions about callers/sources
- [ ] Cross-package schemas updated if new types/enums introduced