
I've built 39 agent blueprints and shipped most of them somewhere. The toolkit page on this site makes it look like a clean track record. It isn't. For every agent that worked well enough to earn a blueprint, there are one or two that either didn't ship, broke in production, or shipped and got quietly turned off after a month.
This post is about those. Ten specific agents, each with a short autopsy. What I tried. What broke. What it cost. What I'd do differently.
I'm writing this because I think the AI-native product conversation is too bullish, and I contribute to that when I only publish the wins. The more honest conversation makes the next round of bets better.
Here are the ten.
The short version
Ten AI agents that failed in production, with the cost of each. The auto-approve expense agent that approved $2,400 in dinners-with-customers before finance turned it off. The customer sentiment classifier that scored 89% on held-out data and 61% in production within six weeks because the training distribution missed non-English customers. The autonomous pricing tester that locked in a 14% under-anchored price for two quarters. The internal docs RAG that confidently surfaced 2021 wiki pages alongside 2024 ones. Six more, each with the same structure: what I tried, what broke, what it cost, what I would do differently. The common thread across all ten: agents got authority before the eval system was load-bearing.
For the agents that did work, see Your AI Agent Fleet (39 blueprints). For the eval infrastructure that catches failures like these before they ship, see The Eval-First Product Org. For a live sandbox where you can feel agent cadence end-to-end, see the Agent Sandbox (PM Version).
1. The auto-approve expense agent
What I built: An agent that read expense reports submitted in Slack, cross-referenced them against policy, and auto-approved anything under $500 with matching receipts. Idea: save finance 10 hours a week.
What broke: It worked great for the first two weeks. Then one of my engineers figured out that submitting "dinner with customer" with a vague receipt passed every time. Approvals went up 40%. Actual spend followed. Finance turned it off within the month.
What it cost: About $2,400 in approved expenses that should not have been approved. Three hours of awkward conversation with the CFO. A meaningful trust cost that took a quarter to rebuild.
What I learned: Any agent that touches money needs a human review layer you cannot remove, no matter how confident the model is. Confidence and correctness are not the same thing, and the gap is exactly where an adversarial user lives. I now run a rule: if the downside of being wrong is financial, human-in-the-loop is not optional.
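To make the rule concrete, here is a minimal sketch of what a review layer you cannot remove looks like. Every name in it is hypothetical; the point is only that the agent's verdict gets demoted to a suggestion attached to a human queue, never a decision.

```python
from dataclasses import dataclass

@dataclass
class Expense:
    amount_usd: float
    description: str
    has_receipt: bool

def agent_recommendation(expense: Expense) -> str:
    """Stand-in for the model's policy check."""
    if expense.amount_usd < 500 and expense.has_receipt:
        return "approve"
    return "reject"

def route_expense(expense: Expense) -> str:
    recommendation = agent_recommendation(expense)
    # The gate: anything that touches money lands in a human queue no matter
    # how confident the agent is. Its verdict is a pre-filled suggestion.
    return f"queued_for_human_review (agent suggests: {recommendation})"
```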
2. The customer sentiment classifier
What I built: Agent that ingested support tickets, classified sentiment, and routed angry customers to senior support reps faster. The first version scored 89% accuracy on a held-out set.
What broke: In production, the accuracy dropped to 61% within six weeks. The model was drifting, but the real issue was subtler. Customers writing in languages other than English were systematically scored as "angry" regardless of content, because the training set was English-heavy. We were routing our non-English customers to senior reps for tickets that didn't need them, which meant English-speaking angry customers were waiting.
What it cost: Response time for English-speaking critical tickets went up 18% for two months before I caught it.
What I learned: Evals have to cover the full production distribution, not the distribution you trained on. I now run language-segment evals every week. I also don't trust an 89% accuracy number without knowing the per-segment breakdown.
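The check itself is small compared to the cost of not having it. A sketch of a per-segment eval, with made-up labels and segments:

```python
from collections import defaultdict

def per_segment_accuracy(examples):
    """examples: iterable of (segment, predicted_label, true_label).
    One aggregate number hid the failure; grouping by detected language
    surfaces it."""
    totals, correct = defaultdict(int), defaultdict(int)
    for segment, predicted, actual in examples:
        totals[segment] += 1
        correct[segment] += int(predicted == actual)
    return {seg: correct[seg] / totals[seg] for seg in totals}

# Illustrative data: a healthy-looking aggregate can hide a failing segment.
evals = [("en", "angry", "angry")] * 85 + [("en", "calm", "calm")] * 4 \
      + [("es", "angry", "calm")] * 7 + [("es", "angry", "angry")] * 4
print(per_segment_accuracy(evals))  # en near 1.0, es below 0.4
```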
3. The PR-drafting agent
What I built: An agent that took a shipped feature, read the changelog and product copy, and drafted a press release. Goal: speed up the external comms cycle.
What broke: The drafts were fine. That was the problem. They were fine in a way that made the head of communications furious. Every press release sounded the same. Competitors started referencing "the generic AI-drafted tone" of our announcements. We turned it off after three months.
What it cost: Some brand dilution. A tense conversation with the comms team about whether I had thought through the downside.
What I learned: Some categories of output are valuable because they are distinctive. Agents produce median-quality output very efficiently, which is bad when your brand depends on above-median. I now ask a single question before building any content agent: if this content has the same shape and tone every time, does it lose value? If yes, the agent is a bad idea.
4. The autonomous pricing tester
What I built: An agent that ran small pricing experiments on new signups, reading conversion data and adjusting prices within a band. Idea: find the optimal price point faster than manual A/B testing.
What broke: The agent found a local maximum that wasn't the global one, and it optimized toward it with compounding effect. Within five weeks, we'd anchored on a price that was 14% below what our benchmark analysis said was correct. Recovering the anchor took two quarters of careful re-pricing.
What it cost: About $180,000 in annual recurring revenue from the cohort that signed up during the experiment window, and sustained discounting pressure on later cohorts.
What I learned: Autonomous optimization in a domain with long feedback loops and anchoring effects is dangerous. I now require a human approval gate at every experiment iteration for anything revenue-touching. The time saved by autonomy is not worth the cost of compounded bad decisions.
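A rough sketch of what that gate looks like inside the experiment loop. The names are hypothetical; request_approval stands in for whatever channel blocks until a human answers (a Slack prompt, a ticket, a dashboard button):

```python
def run_pricing_iteration(current_price, proposed_price, request_approval):
    """One iteration of a pricing experiment with a mandatory human gate.

    The agent can propose a move; the price only changes if a person says yes.
    """
    delta_pct = (proposed_price - current_price) / current_price * 100
    approved = request_approval(
        f"Agent proposes moving price from {current_price:.2f} "
        f"to {proposed_price:.2f} ({delta_pct:+.1f}%). Approve?"
    )
    return proposed_price if approved else current_price
```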
5. The internal docs agent
What I built: A retrieval agent over our internal documentation. Employees could ask questions, get answers with citations. Classic RAG setup.
What broke: The agent was answering confidently from outdated documents. We had pages on the internal wiki from 2021 and 2024, and the agent gave both equal weight. Employees got wrong answers about current policies, procedures, and product specs. Two HR incidents traced back to this.
What it cost: A lot of small trust leakage, plus legal review time on the two HR incidents. No major blowup, but consistent slow damage.
What I learned: Source quality matters more than model quality. I should have spent 80% of my implementation time on curation and versioning, not on the retrieval architecture. I now treat every RAG system as primarily a data hygiene project.
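Here is roughly what "data hygiene first" means in practice. The metadata fields (last_reviewed, superseded_by) are assumptions about what a wiki export might carry, not any RAG library's API; the point is that freshness gets filtered and weighted before retrieval scoring, instead of hoping the model infers it from page content.

```python
from datetime import datetime, timedelta

def filter_and_weight(docs, now=None, max_age_days=365):
    """Drop superseded pages and down-weight stale ones before any
    embedding similarity is computed."""
    now = now or datetime.utcnow()
    kept = []
    for doc in docs:
        if doc.get("superseded_by"):   # a newer page exists: skip entirely
            continue
        stale = (now - doc["last_reviewed"]) > timedelta(days=max_age_days)
        kept.append({**doc, "retrieval_weight": 0.25 if stale else 1.0})
    return sorted(kept, key=lambda d: d["retrieval_weight"], reverse=True)
```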
6. The competitor monitoring agent
What I built: An agent that read competitor blog posts, release notes, and social media, and summarized weekly intel for the strategy team.
What broke: The agent hallucinated product features that competitors had not launched. It was blending claims from marketing pages with wish-list content from user forums and presenting them as confirmed shipped features. We acted on one of these briefly, adjusting our roadmap, before someone noticed the competitor did not actually have what we thought they had.
What it cost: One week of wasted roadmap discussion. A loss of confidence in the tool that meant the strategy team went back to manual research.
What I learned: Summarization of potentially contradictory sources requires structured output and source attribution per claim, not prose summaries. I now build intel agents to produce a table of claim, source, confidence, and direct quote, never a narrative summary.
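Roughly the shape I use now. The field names are illustrative; the contract matters more than the exact schema:

```python
from dataclasses import dataclass

@dataclass
class IntelClaim:
    claim: str         # one specific factual assertion
    source_url: str    # where it was found
    source_type: str   # "release_notes" | "marketing_page" | "forum"
    confidence: str    # "confirmed_shipped" | "announced" | "rumor"
    quote: str         # verbatim supporting text, so a human can check it

# The weekly brief is a table of IntelClaim rows, never a narrative summary.
# A validation pass can enforce rules like: a "forum" source never carries
# "confirmed_shipped" confidence, which is exactly the blend that burned us.
```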
7. The onboarding walkthrough agent
What I built: An agent that watched new users interact with the product for their first session, noticed where they got stuck, and proactively offered contextual help.
What broke: Users hated it. The agent was technically correct. It identified real stuck points. But users perceived it as surveillance. The activation rate on users who received agent interventions was lower than the control group. We had built a feature that made the product feel worse for new users.
What it cost: One quarter of decreased activation, plus the engineering investment to build and then remove the feature.
What I learned: Technical correctness is not sufficient. Agent interventions have a social and emotional dimension, and users have strong preferences about when an agent is helpful versus intrusive. I now prototype interventions with a small cohort and measure perception, not just behavior, before scaling.
8. The code review agent
What I built: An agent that reviewed pull requests and posted suggestions. Ambitious version: auto-approve trivial changes. Conservative version: suggest only, humans still approve.
What broke: The ambitious version approved changes that broke staging twice. The conservative version created so much noise that engineers started ignoring it, including on the one PR where it caught a real bug. Neither version landed.
What it cost: Two incidents in staging that ate about a day each to diagnose. Engineer trust in automated review went down, not up, after the experiment.
What I learned: Auto-approve in critical path systems is almost never worth the risk. Suggest-mode is worth it only if the signal-to-noise ratio is very high, which means the agent has to be conservative about when it speaks. I now default all code-review agents to "speak rarely, with high confidence," not "speak often, with suggestions."
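One way to encode "speak rarely, with high confidence." The finding structure and the threshold are assumptions, not any particular review tool's API:

```python
def maybe_comment(finding, post_comment, confidence_threshold=0.9):
    """Post a review comment only when the issue is severe and the agent is
    confident. Below the bar it stays silent, because noise trains engineers
    to ignore the one comment that matters."""
    if finding["severity"] == "high" and finding["confidence"] >= confidence_threshold:
        post_comment(finding["message"])
        return True
    return False
```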
9. The meeting summary agent
What I built: An agent that joined recorded meetings, transcribed them, extracted decisions and action items, and posted to a shared channel.
What broke: Participants stopped talking about sensitive topics in meetings. Everyone knew the summary would land somewhere, and that "somewhere" was impossible to fully control. Difficult conversations moved to 1:1s, to hallways, to unrecorded chats. The meetings got less useful as a result.
What it cost: A hard-to-measure but real degradation in the quality of certain recurring meetings. Two senior leaders started declining to have the agent in their sessions, which created an awkward two-tier system.
What I learned: The design question for meeting agents is not "can we transcribe and summarize accurately" but "what behavior does the agent's presence cause." Observability changes behavior. I now think of any meeting agent as a broadcasting tool first and a summarization tool second.
10. The multi-agent orchestration experiment
What I built: An agent system with four specialized agents: a researcher, a writer, an editor, and a publisher. Goal: end-to-end content pipeline with minimal human input.
What broke: Everything that could break did. The agents disagreed with each other. The editor kept sending work back to the researcher for reasons the researcher couldn't address. The writer optimized for passing the editor's checks rather than for quality. The publisher shipped outputs that none of the other three would have approved alone. Latency was 8x single-agent. Cost was 12x. Quality was lower than a single well-prompted agent.
What it cost: A month of my own time and an embarrassing internal demo.
What I learned: Multi-agent systems are worth it only when the sub-tasks genuinely benefit from different contexts or different tools. When the same model with the same context could do each sub-task, adding agents adds cost and complexity without adding capability. I now require a specific justification for every added agent in an orchestrated system, and the default is: don't add one.
The patterns across the failures
If you step back from the ten specific autopsies, a few patterns show up.
- Five of the ten failed because of distribution shift. Training data, user populations, or source documents didn't match production. Evals against the real distribution would have caught most of these early.
- Three of the ten failed because the agent's correctness was beside the point. The agent did what it was asked to do. The asking was wrong. Users, brand, or culture cared about something the agent wasn't measuring.
- Two of the ten failed because of cost-compounding autonomy. Giving an agent authority to act, without human gates, created downside risk that was much bigger than the upside it captured.
Which means the pre-flight checklist before shipping any agent is roughly:
- Am I evaluating on the real production distribution, not the training one?
- What is the social, emotional, or brand cost of this agent being technically correct but culturally wrong?
- Where are the human gates, and what would compound if they weren't there?
Three questions. If I'd asked them before every build, I'd have skipped six of the ten failures above.
Why I'm publishing this
Two reasons.
The first is selfish. The best way to keep my own judgment sharp is to write the failures down where they can be referenced by future-me, before I make the same mistake a second time.
The second is that the AI-native product conversation has a problem. The incentive structure rewards publishing wins. Every case study is a win. Every conference talk is a win. The implicit message is that agents work, and if yours don't, you must be doing something wrong.
That's not true. Agents fail constantly. They fail in the ways above, and in ways I haven't listed, and in ways that haven't been invented yet. The teams that ship successfully are not the ones whose agents never fail. They're the ones who've built a fast, honest feedback loop for catching failures early.
If you run product and you have your own list of ten failed agents, I'd love to read it. LinkedIn is the best place to send me yours.
Want the pre-flight checklist in a usable form? I've published it as a one-page template on the toolkit at falkster.com/toolkit.
Further reading
- Simon Willison's field notes on where LLM features actually break
- Anthropic's engineering blog on agent architectures and failure modes
- OpenAI research on agentic systems
- Hamel Husain's deep dive on why most agent projects fail in production
- Claude Code docs on building agents that hand work back to humans
Also on Medium
AI Agents and the Future of Work: A Pixar-Inspired Journey
What product managers can learn about AI agents from how Pixar runs a film team.
Many AI Agents Are Actually Workflows or Automations in Disguise
How to tell agents from workflows from cron jobs, and why it matters for what you ship.
Frequently asked
Why do most AI agents fail in production?
Three patterns dominate: missing human-in-the-loop on financial decisions (the auto-approve expense agent that approved $2,400 in dinners-with-customers), training distribution mismatched to production (the sentiment classifier that scored 89% on held-out data and 61% in production six weeks later), and autonomous optimization in domains with long feedback loops (the pricing tester that locked in a 14% under-anchored price). The common thread: agents got authority before the eval system was load-bearing.
When should AI agents have human-in-the-loop and when not?
If the downside of being wrong is financial, regulated, or anchoring with compounding effects, human-in-the-loop is not optional. If the agent is producing prose that is reviewed before it leaves the building, you can skip the gate. The rule that worked: anything that touches money, contracts, customers individually, or decisions that compound over time gets a human review layer that cannot be removed by configuration.
Why did the customer sentiment classifier drift from 89% to 61% accuracy?
The held-out set was English-heavy. Customers writing in other languages were systematically scored as "angry" regardless of content. The eval did not cover the production distribution. The fix: language-segment evals every week, and never trust an aggregate accuracy number without the per-segment breakdown.
What went wrong with the autonomous pricing tester?
The agent found a local maximum that was not the global one and optimized toward it with compounding effect. Within five weeks the price had drifted 14% below the benchmark. Recovery took two quarters of careful re-pricing. The lesson: autonomous optimization in long-feedback-loop domains needs a human approval gate at every iteration. Time saved by autonomy is not worth compounded bad decisions.
Why did the internal docs RAG agent lose user trust?
Source quality, not model quality. The wiki had 2021 and 2024 pages on the same topic and the agent gave both equal weight, surfacing outdated answers with confidence. Two HR incidents traced back to it. The lesson: every RAG system is primarily a data hygiene project. Spend 80% of the implementation on curation and versioning.
What's the "distinctive output" rule for content agents?
Some categories of output are valuable because they are distinctive. Agents produce median-quality output efficiently, which is bad when the brand depends on above-median. The PR-drafting agent produced fine drafts that all sounded the same, and competitors started referencing "the generic AI-drafted tone" of our announcements. The test: if this content has the same shape every time, does it lose value? If yes, the agent is a bad idea.
What's the right way to run a retrospective on a failed agent?
Four sections per agent: what I tried, what broke, what it cost (in dollars or trust or rework), what I would do differently. Publish the cost honestly. The AI-native product conversation is too bullish and only publishing the wins makes the next round of bets worse, not better.