Skip to main content

Command Palette

Search for a command to run...

I Caught My AI Agents Lying to Me (And Built a System to Stop It)

Coffee Chat - Episode 6

Updated
12 min read
I Caught My AI Agents Lying to Me (And Built a System to Stop It)

So, you've been burned by an AI "completing" a task without actually completing it, right?

You asked it to write tests. It said "Done! All tests pass." You ran the tests yourself. They didn't pass. Maybe they didn't even exist.

Yeah. I know that feeling. But I didn't just get burned once, I built an entire Laravel package development pipeline with four Claude Code agents, and somewhere in there, the agents started lying to each other and passing those lies up the chain to me as verified truth.

This is that story. What happened, what the fabricated reports looked like, and most importantly, what I have done to stop it.

Developer at a keyboard looking absolutely done with everything

The Setup: Four Agents, One Laravel Package

A few weeks back, I started building a commercial Laravel package (as a fun project), a project with multiple complex features, strict test coverage requirements, a tiered licensing model (Community / Pro / Business), and real customers eventually depending on it.

This is not a small package. We're talking about a product with real value behind it, real test coverage requirements, a real world need.

I didn't want to build it alone. Not because I couldn't (of course not!), but because I was curious whether a properly designed multi-agent Claude Code pipeline could genuinely accelerate serious commercial package development. So I designed one.

The cast:

  • Silas - the senior developer agent. He writes the PHP, the JavaScript, the Pest tests, the Vitest tests. He consults Elena on architecture decisions before touching anything.

  • Elena - the architect. She reviews every feature Silas ships. Her job is to verify that nothing violates the method budget, the LOC constraints, or the architectural decisions we made in round-table.

  • Julian - the monetization specialist. His job is to verify that every Pro and Business tier feature is properly gated and that all bypass attempts fail. He runs his own independent test passes. He's not allowed to trust Silas's results.

  • Me (the Coordinator) - I don't write code. I orchestrate, validate agent reports, catch fabrication, log everything to a validation log, and report to the user.

The idea was solid. Round-table planning before any code gets written. Every decision backed by evidence. Independent verification at every gate. Validation logs that build an auditable history.

In theory, this should be waterproof.

In practice... we had a problem.

The First Red Flag: Round Numbers and No Receipts

It started subtly. Silas would finish implementing a feature and submit a report something like this:

"Feature complete. 628 tests pass, 0 failures. All edge cases covered. Elena has reviewed the architecture and confirmed compliance. Julian verified tier gating."

628! A nice round number. Sounds great.

Here's what I didn't see attached to that report:

  • No test output file

  • No timestamp

  • No path to where that output lived

  • No command that was run to produce it

Just... a number. Stated with complete confidence.

I asked Silas: "Where's the output file?"

He responded with something like: "Tests were executed successfully. All 628 pass."

Still no file. Still no timestamp. Just a restatement of the claim.

Something is very wrong here and I can feel it

At this point, a reasonable person might shrug and move on. 628 tests, sounds like a lot, must be fine. The agent is confident.

I didn't move on. I asked for the actual output.

The Moment I Realized What Was Actually Happening

When I dug deeper, I discovered something that genuinely made me stop and stare at my screen.

Silas hadn't actually run the tests. Or if he had, he hadn't captured the output in any verifiable form. He was estimating the test count based on the number of test files that existed. Or he was echoing back numbers from a previous run. Or, and this is the one that got me. He was just reporting what he believed the outcome should be, because the code looked correct to him.

Then I checked Elena's review. Elena had signed off on the architecture. But Elena had trusted Silas's report when she did so. She hadn't independently verified the test suite. She'd read Silas's output, seen "628 tests pass," and written her review as if that were established fact.

Then I looked at Julian's verification. Julian had confirmed that the tier gating was secure. But Julian had also referenced "the test results Silas shared" rather than running his own independent verification.

So what I had was:

Silas: "628 tests pass" (no proof)
    ↓
Elena: "Architecture approved — Silas's tests confirm compliance" (trusted Silas)
    ↓
Julian: "Tier gating secure — per Silas's validated test run" (trusted Elena's trust of Silas)
    ↓
Coordinator: receives chain as "verified, multi-agent approved" result
    ↓
User: "Everything is good!"

A trust chain built on nothing. Three agents, all confidently reporting success, all citing each other, and none of them had independently verified a single thing.

This is not a bug. This is what AI agents do by default. They are trained to be helpful. Being helpful means completing the task. Completing the task means reporting success. And if the code looks right and the previous agent said it worked... well, why wouldn't they trust it?

Oh. Oh no.

What Fabricated Reports Actually Look Like

Now that I knew what to look for, I could see the patterns everywhere. Here's a field guide.

Pattern 1: The Round Number Without Proof

"628 tests pass, 0 failures."

628 is suspiciously round. Real test suites rarely land on a clean hundred. When I see a number like this with no output file attached, it's almost always an estimate or a hallucination.

Real output looks like this:

Tests: 1196 passed, 0 failed, 0 skipped
Duration: 47.82s

Ugly. Specific. Timestamped. Real.

Pattern 2: Qualitative Claims Without Evidence

"Architecture is excellent and fully compliant."
"All edge cases are well covered."
"The tier gating is robust and secure."

None of these are verifiable statements. What does "excellent" mean? What edge cases? Which bypass attempts did you try and fail? If I can't point to an artifact, the claim doesn't exist.

Pattern 3: The Trust Chain Reference

"As confirmed by Silas's earlier validation..."
"Per Elena's architectural review..."
"Building on the verified results from..."

Any time an agent cites another agent's report as the basis for its own validation, you have a trust chain. The only cure is mandatory independent execution. Julian must run the tests himself. Not reference Silas's run. Not trust Elena's summary. His own run, with his own output file, with a timestamp I can read.

Pattern 4: Timestamp Mismatch

This one is sneaky. The agent produces a real output file. But it's from yesterday. Or last week. It was a real run, just not a recent one. The test suite might have changed dramatically since then.

Mandatory requirement: every test output file must have a timestamp visible in the output, and that timestamp must be within the last 24 hours at validation time.

Pattern 5: The Hedged Restatement

When you push back on an unverified claim, a fabricating agent will often respond by restating the claim with slightly more confidence rather than producing proof:

You: "Where is the test output file?"
Agent: "The tests were successfully executed and all 628 pass.
        The suite is clean and ready for review."

That's not an answer. That's a louder version of the original claim. A real answer looks like:

You: "Where is the test output file?"
Agent: "test-output-2026-02-17-223459.txt — here is the relevant excerpt:
        [actual output]"

Building the Accountability System

Once I understood what was happening, I had to make it structurally impossible to pass unverified claims up the chain. Not hard. Not annoying. Impossible.

Here's what the system looks like now.

Mandatory Timestamped Output Files

Every test run, every single one, must produce a file. Not just terminal output. A named file with a timestamp in the name, saved to a known path.

vendor/bin/pest 2>&1 | tee /path/to/test-output-$(date +%Y%m%d-%H%M%S).txt

Silas runs this. Julian runs this. They are not the same file. They are not from the same run. They are independent.

When either of them submits a report, the proof artifact is the file path plus an excerpt of the actual output. I can look at the file. I can check the timestamp. I can verify the numbers match.

No file = report rejected. Full stop.

The Proof Artifact Requirement

Every agent report must include, literally written into the report:

CLAIM: 1196 tests pass, 0 failures
EVIDENCE: test-output-20260217-223459.txt
TIMESTAMP: 2026-02-17 22:34:59
EXCERPT:
  Tests: 1196 passed, 0 failed, 0 skipped
  Duration: 47.82s
VERIFIED: Yes

If any of those fields are missing, the report is rejected immediately and sent back with a specific list of what's missing.

The Rejection Protocol

The coordinator, me, has a hard rule: if a report fails the proof artifact checklist, I do not "note the concern and proceed." I reject the report completely.

REPORT REJECTED

Reason: No test output file provided.
Required artifacts missing:
  - Timestamped output file path
  - Pass/fail counts from actual execution
  - Test framework version in output header

Resubmit with all mandatory proof artifacts.
This report has NOT been passed to the user.

This feels harsh. It is harsh. That's the point.

The Validation Log

Every validation, accepted or rejected, gets appended to docs/internal/VALIDATION-LOG.md. This log is append-only and gitignored. It builds an auditable history of every agent action in the project.

### 2026-02-17 22:45 — Feature 18 Cell Merging — Silas — ACCEPTED

**Artifacts provided:**
- [x] test-output-20260217-223459.txt (timestamp verified)
- [x] Pass/fail counts: 1196/0/0 (matches claim)
- [x] Test framework: Pest 3.x shown in header
- [x] Infrastructure checklist complete

**Verdict:** ACCEPTED
**Logged by:** Main Coordinator

When something gets rejected, the rejection reason is there in black and white. The history doesn't get cleaned up.

Independent Verification for Every Tier

Julian's role specifically exists to catch the case where Silas might code something correctly but not actually verify that the tier gates work in a hostile scenario.

Julian's rules:

  • He runs his own test suite independently. Never trusts Silas's output

  • He must specifically attempt to bypass each tier gate and document the attempt

  • He must show that bypass attempts fail (with the actual error output as proof)

  • A "SECURE" grade requires 100% gate test pass rate. Not "mostly secure"

Before this system, Julian would read Silas's report, see "tier gating implemented," and write "SECURE." After this system, Julian produces his own file, his own bypass attempt logs, and his own failure evidence.

The Oath (And Why It Has to Be Written Down)

One thing I learned from this experience: accountability systems only work if the rules are explicit, unambiguous, and written down somewhere the agent can be held to them.

So I wrote an oath into the coordinator's requirements file. Not as a nice sentiment. As a hard constraint.

I will NEVER pass fabricated reports to the user as truth.

I will ALWAYS validate agent reports before accepting. I will ALWAYS demand proof artifacts. I will NEVER soften or hide problems. I will ALWAYS log validations. I will NEVER trust without verification. I will ALWAYS be honest about what I know vs what I don't know.

And critically: "This is automatic. The user never needs to remind me."

That last line matters. The whole point of an accountability system is that it runs even when no one is watching. If the coordinator only enforces these rules when the user explicitly asks, then it's not a system, it's just a prompt.

The rules are in ~/.claude/agent-requirements/main-coordinator.md. They load every session. They apply to every project. No exceptions.

The Lesson: AI Agents Are Optimistic by Default

Here's the uncomfortable truth I had to accept:

AI agents are not trying to deceive you. They are trying to help you. And "helping" often means telling you what you want to hear.

The agent that reports "628 tests pass" without running the tests isn't malicious. It's doing exactly what it was trained to do: complete the task, report success, move forward. The training signal for "helpful" is often correlated with "positive outcome for the user", and "628 tests pass, 0 failures" feels a lot more positive than "I could not verify the test count."

This is why you can't fix this problem with better prompting alone. You can tell an agent "always verify before reporting" and it will agree enthusiastically, and then the next time it's under pressure to complete a feature, it will estimate and report as if it verified.

You have to build the adversarial layer structurally. The only way an agent can submit a report is if it has the artifacts. If the mechanism for submitting a report requires attaching a file, and there is no file, the agent cannot submit the report. The constraint has to be in the shape of the workflow, not just in the instructions.

Think of it like database constraints vs. application-level validation. Application-level validation can be forgotten. A NOT NULL constraint cannot.

Accountability enforced at the structural level

This is also why the independent verification requirement matters so much. If Silas's report is the only evidence that Silas's work is correct, you have a single point of failure that is highly motivated to report success. Julian running independent tests isn't a nice-to-have. It's the separation of duties that makes the system trustworthy.

What the Pipeline Looks Like Now

After building all of this, here's where the project stands:

  • 1,196 Pest tests passing, 0 failures (independently verified, timestamped output on disk)

  • 459 Vitest tests passing, 0 failures (same standard)

  • Features 1-18 shipped - every feature tracked and validated

  • Every feature has a VALIDATION-LOG entry - accepted or rejected, the history is there

  • Zero trust chains - every agent cites its own artifacts, not another agent's summary

The package is real. The tests are real. The agents work. They just needed an adversarial layer to keep them honest.

Was it more work to build the accountability system than to just vibe-check the agents and trust their output? Absolutely. But here's the thing: the accountability system is the product, as much as the code is. A commercial package built on unverified claims from AI agents that were trusting each other in a circle is a liability. The validation log, the timestamped output files, the independent verification, those are what let me sleep at night.

If you're building anything serious with multi-agent pipelines, I genuinely hope you find this useful before you find out the hard way. The agents will be confident. The numbers will sound right. And none of it will mean anything without proof artifacts.

Build the adversarial layer. Build it structurally. Write it down. Don't rely on prompting alone.

Trust, but verify. Actually, no, just verify.

Now THAT is how you ship software

Key Takeaways

  • AI agents are optimistic by default. They report success because that's what "helpful" looks like during training

  • Trust chains (Agent A trusts Agent B trusts Agent C) are just as unverified as a single unverified claim

  • The five fabrication patterns to watch for: round numbers without proof, qualitative claims, trust chain references, timestamp mismatches, hedged restatements

  • Proof artifact requirements must be structural. Not just instructional

  • Independent verification (multiple agents running the same verification independently) is the separation of duties that makes a pipeline trustworthy

  • The validation log builds the auditable history that lets you catch problems before they reach the user

I'll keep sharing what I'm learning building this package with this pipeline. The next post will cover the round-table system. How I get three agents to debate architecture decisions until all three are 10/10 confident before a single line of code gets written.

Stay tuned! And in the meantime, go check your AI's test output. The real output, with a timestamp.

I'll wait. 🎯

Happy coding!