The 5-Layer Strategy

Each layer catches a different class of bug. No single layer is sufficient.

Layer 1: Unit tests (agent-written)

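Fast enough to run after every file change. They catch regressions, validate logic, and confirm data shapes. They are necessary but not sufficient: the agent is testing what it built against what it thinks is correct, so a misunderstanding shared by the code and its tests goes unnoticed.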

Run: After every file change
Catches: Logic errors, regressions, typos

Layer 2: Integration tests (agent-written)

Use real-ish dependencies (e.g., mongodb-memory-server or a test database instance). These catch operator conflicts, query bugs, and data-flow issues that mocked tests miss.

Run: After each milestone
Catches: Database operator conflicts, query construction errors, service wiring

In the first experiment, integration tests caught a $set/$setOnInsert operator conflict that mocked tests missed entirely.
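
To illustrate why a mock lets this class of bug through: MongoDB rejects an update in which the same field path is targeted by two update operators, for example by both $set and $setOnInsert, while a mock that only records its arguments never performs that check. The sketch below is self-contained TypeScript. `findOperatorConflicts` is a hypothetical helper that only approximates the server-side rule, and the `status`, `updatedAt`, and `createdAt` fields are made up; the source does not say which fields conflicted in the experiment.

```typescript
// Shape of a MongoDB-style update document (only the two operators discussed here).
type UpdateDoc = {
  $set?: Record<string, unknown>;
  $setOnInsert?: Record<string, unknown>;
};

// Hypothetical helper: approximates the server-side rule that one field path
// (or a parent/child pair of paths) may not be targeted by two update operators.
function findOperatorConflicts(update: UpdateDoc): string[] {
  const setPaths = Object.keys(update.$set ?? {});
  const insertPaths = Object.keys(update.$setOnInsert ?? {});
  const overlaps = (a: string, b: string): boolean =>
    a === b || a.startsWith(b + ".") || b.startsWith(a + ".");
  return setPaths.filter((p) => insertPaths.some((q) => overlaps(p, q)));
}

// A typical mock: records the call and reports success, so the conflict is invisible.
const mockCalls: UpdateDoc[] = [];
const mockUpdateOne = (update: UpdateDoc): { acknowledged: boolean } => {
  mockCalls.push(update);
  return { acknowledged: true };
};

// Illustrative upsert with "status" under both operators (field names are assumptions).
const upsert: UpdateDoc = {
  $set: { status: "active", updatedAt: 2 },
  $setOnInsert: { status: "new", createdAt: 1 },
};

const mockResult = mockUpdateOne(upsert);        // accepted without complaint
const conflicts = findOperatorConflicts(upsert); // ["status"]
```

A mocked unit test asserting on `mockResult` passes, whereas a real or in-memory MongoDB instance would reject the same update. That gap is what the integration layer closes.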

Layer 3: Smoke tests (real API calls)

The agent runs at least one real API call with real data through the full stack. Not mocked, not dry-run.

Run: After each milestone, before marking complete
Catches: Schema mismatches, validation gaps, cross-package inconsistencies

This is what was missing in the first experiment. The dry-run test passed because it skipped the API call. A real call would have hit the Zod validation error immediately.
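
The failure mode is easy to reproduce in miniature. In the sketch below, `validatePayload` is a hand-rolled stand-in for the API's Zod schema, and `submit`, the `dryRun` option, and the service names are hypothetical; the source does not show the real code. The dry-run path returns before the validation boundary is reached, so it reports success for a payload that the real path rejects.

```typescript
type Payload = { service: string; messages: string[] };
type Result = { ok: true } | { ok: false; error: string };

// Stand-in for server-side schema validation (the real project uses Zod).
const ALLOWED_SERVICES = ["web", "cli"]; // hypothetical enum values
function validatePayload(p: Payload): Result {
  if (!ALLOWED_SERVICES.includes(p.service)) {
    return { ok: false, error: `invalid service: ${p.service}` };
  }
  return { ok: true };
}

// Hypothetical client: a dry run returns before the API (and its validation) is reached.
function submit(p: Payload, opts: { dryRun: boolean }): Result {
  if (opts.dryRun) return { ok: true };
  return validatePayload(p); // in a real client this would be the HTTP call
}

const payload: Payload = { service: "desktop", messages: ["hello"] };
const dryRunResult = submit(payload, { dryRun: true });  // ok: true
const realResult = submit(payload, { dryRun: false });   // ok: false, invalid service
```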

Smoke test design

Each milestone should have at least one smoke test that:

Uses real running services (not mocked, not in-memory)
Submits real data through the actual API endpoint
Verifies the data appears in the data store
Cleans up after itself (or uses test-namespaced data)
Is runnable with a single command
Exits non-zero on failure
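
One way to satisfy these properties is to keep the smoke test as a single script whose entry point turns the outcome into an exit code. The following is a minimal sketch, not the project's actual script: `SmokeDeps`, `runSmokeTest`, and the record shape are hypothetical names. A real script would implement `submit` with an HTTP request to the running API and `find`/`remove` with the actual data store client. The in-memory fake at the bottom exists only to show the control flow and is not itself a smoke test.

```typescript
type SmokeRecord = { id: string; body: string };

// Hypothetical seams: a real script backs these with the running API and data store.
interface SmokeDeps {
  submit(record: SmokeRecord): Promise<number>; // resolves to the HTTP status
  find(id: string): Promise<SmokeRecord | null>;
  remove(id: string): Promise<void>;
}

// Returns a process exit code: 0 on success, 1 on any failure.
async function runSmokeTest(deps: SmokeDeps, runId: string): Promise<number> {
  const id = `smoke-test-${runId}`; // test-namespaced so it cannot collide with real data
  try {
    const status = await deps.submit({ id, body: "hello" });
    if (status < 200 || status >= 300) throw new Error(`API returned ${status}`);

    const stored = await deps.find(id);
    if (stored === null || stored.body !== "hello") {
      throw new Error("record not found in data store");
    }
    return 0;
  } catch (err) {
    console.error("smoke test failed:", err);
    return 1;
  } finally {
    await deps.remove(id).catch(() => undefined); // clean up even when a step failed
  }
}

// In-memory fake, used here only to demonstrate the flow.
function makeFakeDeps(apiStatus: number): SmokeDeps & { store: Map<string, string> } {
  const store = new Map<string, string>();
  return {
    store,
    submit: async (r: SmokeRecord) => {
      if (apiStatus < 300) store.set(r.id, r.body);
      return apiStatus;
    },
    find: async (id: string) => {
      const body = store.get(id);
      return body === undefined ? null : { id, body };
    },
    remove: async (id: string) => {
      store.delete(id);
    },
  };
}

// Entry point for a real run under Node (realDeps is hypothetical):
//   process.exitCode = await runSmokeTest(realDeps, String(Date.now()));
```

Returning the exit code rather than calling `process.exit` inside the function keeps the cleanup in `finally` reliable and lets a single command wrap the whole run.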

Layer 4: Holdout tests (human-written, agent-invisible)

Behavioral tests the agent doesn’t see during development. Stored separately from the codebase. Run as a post-development verification step.

Run: After the agent declares the milestone complete
Catches: Assumptions the agent made that don’t match real-world usage

Holdout test principles

Test integration boundaries. Each test should cross at least one package/service boundary.
Test new entry points against existing schemas. When the agent adds a new source/caller, verify it against all downstream validation.
Test the “other caller” scenario. If the agent builds for caller A, test from caller B’s perspective.
Store outside the agent’s workspace. Options: separate directory, separate branch, separate repo.

Example holdout scenarios

| Scenario | What it validates | Why the agent might miss it |
| --- | --- | --- |
| Submit new service type via API | Schema enum includes new type | Agent updated adapter but not schema |
| Submit from CLI, check source field | Source isn’t hardcoded | Agent didn’t check shared service |
| Submit with empty messages array | Edge case validation | Agent only tested happy path |
| Submit from two services, search by filter | Multi-service search works | Agent tested each in isolation |
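
Scenarios like these translate naturally into a table-driven test file kept outside the agent's workspace. In this sketch, `SubmitFn`, `Scenario`, `runHoldout`, and the `buggySubmit` stand-in are all hypothetical, as are the "voice" and "chat" service names. A real holdout suite would call the actual API instead of a local function. The sketch also assumes that an empty messages array should be rejected, which the table does not state. The stand-in reproduces the first row: the agent added a new service type to its adapter but not to the shared schema enum.

```typescript
type Submission = { service: string; source: string; messages: string[] };
type Outcome = { accepted: boolean; storedSource?: string };
type SubmitFn = (s: Submission) => Outcome;

interface Scenario {
  name: string;
  input: Submission;
  expect: (o: Outcome) => boolean;
}

// Each scenario mirrors one row of the table above.
const scenarios: Scenario[] = [
  {
    name: "new service type is accepted by the schema",
    input: { service: "voice", source: "api", messages: ["hi"] },
    expect: (o) => o.accepted,
  },
  {
    name: "CLI submissions keep their source",
    input: { service: "chat", source: "cli", messages: ["hi"] },
    expect: (o) => o.accepted && o.storedSource === "cli",
  },
  {
    name: "empty messages array is rejected", // assumed expected behavior
    input: { service: "chat", source: "api", messages: [] },
    expect: (o) => !o.accepted,
  },
];

// Runs every scenario and returns the names of the ones that failed.
function runHoldout(submit: SubmitFn, cases: Scenario[]): string[] {
  return cases.filter((c) => !c.expect(submit(c.input))).map((c) => c.name);
}

// Stand-in system with the bug from row one: the schema enum was never updated.
const SCHEMA_SERVICES = ["chat"]; // "voice" is missing
const buggySubmit: SubmitFn = (s) =>
  SCHEMA_SERVICES.includes(s.service) && s.messages.length > 0
    ? { accepted: true, storedSource: s.source }
    : { accepted: false };

const failures = runHoldout(buggySubmit, scenarios);
// failures → ["new service type is accepted by the schema"]
```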

Layer 5: Human verification

You run the code and use it. Manual testing, exploratory testing, visual QA.

Run: After all automated layers pass
Catches: UX issues, visual bugs, workflow gaps, edge cases

In the trace tool project, all unit tests passed across four milestones. Then the human tested with real hardware and found five bugs: an orphaned microphone process, oversized audio files, an SDK version mismatch, duplicate validators, and a path expansion issue. None had been caught by the automated tests.

Verification checklist (per milestone)

[ ] Unit tests pass
[ ] Integration tests pass
[ ] At least one smoke test with a real API call succeeds
[ ] Result verified in data store
[ ] Holdout tests pass (if they exist)
[ ] No new hardcoded assumptions about callers/sources
[ ] Cross-package schemas updated if new types/enums introduced
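
The automated items on this checklist can be chained into a single gate that stops at the first failing layer. The sketch below is one possible shape, not a prescribed tool: `Check` and `verifyMilestone` are hypothetical names, and the `run` functions are stubbed with fixed results where a real gate would shell out to each test command. The last two checklist items remain a manual review.

```typescript
// One automated gate per checklist item; run() would normally invoke a test command.
type Check = { name: string; run: () => boolean };

// Runs checks in order and stops at the first failure, so slower layers
// only run once the faster ones are green.
function verifyMilestone(checks: Check[]): { passed: string[]; failedAt: string | null } {
  const passed: string[] = [];
  for (const check of checks) {
    if (!check.run()) return { passed, failedAt: check.name };
    passed.push(check.name);
  }
  return { passed, failedAt: null };
}

// Stubbed results for illustration only.
const report = verifyMilestone([
  { name: "unit", run: () => true },
  { name: "integration", run: () => true },
  { name: "smoke", run: () => false },
  { name: "holdout", run: () => true },
]);
// report → { passed: ["unit", "integration"], failedAt: "smoke" }
```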