AI Agents · Falk Gottlob · 20 min read

39 PM AI Agents Deployed: What Stuck, What Died, and Why

An honest accounting of 39 PM AI agents across 4 product orgs in 80 days. Stage skew, cadence patterns, and the failure mode I kept repeating.

Tags: AI agents for PMs, agent fleet, PM AI, field research, product operating model, agent failure modes, Falkster

This is an honest accounting of every PM AI agent in my fleet across four product orgs. Forty files start with agent- in this site's content folder. One of them turns out to be the user manual for the sandbox itself, not an agent, which is its own tiny finding (about how much hand-curation a fleet needs to stay coherent), so the real number is 39 product manager AI agents. They were written across 80 days, between February 6 and April 27, 2026. Eight in February, ten in March, twenty-two in April. The acceleration is real, and so is the regret.

I'm publishing this for two reasons. First, almost nothing public exists on what an agent fleet for product managers actually looks like at scale, written by someone running one. Most of the writing on AI for product management is either "here's how to build one agent" (useful, but doesn't help with the portfolio question) or "here's our prediction for AI agents in 2027" (not useful at all). I want a piece in the world that says "here's exactly what I built, here's the stage distribution, here's the cadence pattern, here's what died and why." Second, I'm trying to figure out, on the page, what I'd do differently next time. Writing it out is the cheapest way to find the answer.

The short version

Forty files in my agent fleet, of which 39 are real agents. Built across 80 days. The fleet is structurally biased toward Decide (7) and Build (7) over Discover (5), which contradicts the popular framing that AI for PMs equals customer interview synthesis. Thirteen agents fire in the 7-to-9 a.m. morning brief slot, which is now the most-contested real estate in the system. Slack is mentioned in 25 of 39 agents, the universal substrate.

Thirteen agents are completely orphaned, with zero references from non-agent content anywhere on the site, which is the cleanest "we built it but it didn't stick" signal I have. The single biggest predictor of agent survival is whether a named human owns it. The single biggest predictor of agent death is scope drift, where an agent starts focused and ends as a Frankenstein doing five jobs badly.

If I were starting over today I would build seven agents first, all on a single workflow surface, and refuse to add agent number eight until the first seven were fully integrated into a real team's week. I built thirty-nine.

What I'm calling an agent

Some terminology, because the word "agent" is the most overloaded term in product management in 2026.

For this piece, an agent is a long-running process that takes data on a schedule (or on a trigger), uses an LLM to produce a structured artifact, and delivers that artifact into a workflow surface where a human actually consumes it. By that definition, a one-off prompt is not an agent. A scheduled prompt that emails me a digest is an agent. A multi-step prompt that calls tools, reads from a database, and writes back to Slack is an agent. The line between an agent and a workflow is the schedule and the artifact.
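To make that definition concrete, here is a minimal sketch of the shape I mean, assuming the Anthropic Python SDK and a Slack incoming webhook. The webhook variable, prompt, and model pin are illustrative, not one of the fleet's actual agents.

```python
# A minimal "agent" by the definition above: a scheduled run takes data,
# an LLM turns it into a structured artifact, and the artifact lands on a
# workflow surface. Scheduled via cron, e.g.: 0 8 * * * python morning_brief.py
# SLACK_WEBHOOK_URL and the prompt are illustrative assumptions.
import os

import requests
from anthropic import Anthropic

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical webhook
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_morning_brief(raw_signals: str) -> None:
    """One scheduled invocation: signals in, brief out, posted to Slack."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # pin the model per agent
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize these signals as a five-bullet morning brief:\n"
                       + raw_signals,
        }],
    )
    brief = response.content[0].text
    # Delivery into the surface a human already reads -- in this fleet, Slack.
    requests.post(SLACK_WEBHOOK_URL, json={"text": brief}, timeout=10)
```

Everything in the rest of this piece fits that shape: a schedule or trigger, an LLM step, an artifact, a surface.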

This is a tighter definition than what most vendors use, which tends to include everything from a chat interface to an entire SaaS product. I prefer the tighter line because it forces honesty about what's actually running unattended. There is a related piece on this distinction at Agents vs. Workflows vs. Automations; if you want the longer argument, that's where it lives.

The fleet I'm describing is structured around the PM Operating System I work from, which has seven stages: Sense, Discover, Decide, Build, Ship, Measure, Amplify. Every agent is tagged to one stage. The full fleet is browsable at Your AI Agent Fleet.

Five findings I didn't expect before I counted

Finding 1, The fleet is biased toward Decide and Build, not Discover

Stage distribution across the 39 agents:

Stage     Agents  Avg words written about it
Sense     5       1,231
Discover  5       998
Decide    7       1,226
Build     7       1,123
Ship      5       1,321
Measure   6       1,535
Amplify   4       1,159

I expected Discover to dominate, because the writing about AI in product management is dominated by interview-synthesis examples. The data says otherwise. Decide and Build are tied at the top with 7 each; Discover sits at 5 alongside Sense and Ship, and Amplify brings up the rear at 4. The average word count is also revealing: Discover agents are the shortest (998 words), Measure agents the longest (1,535). The shorter average for Discover almost certainly reflects that I wrote those agents less carefully, which is its own finding about where my attention went.

The implication, which surprised me: the highest-leverage agent surface for an experienced PM is not "help me find the next idea" but "help me make the seven decisions a week that compound" (Decide) and "help me reduce the friction between idea and shipped code" (Build). Discovery agents are valuable but harder to get right. Decision and execution agents are easier to instrument and produce more visible weekly leverage.

Finding 2, The morning brief slot is the most-contested real estate in the system

Thirteen of the 39 agents fire daily between 7 and 9 a.m. They are:

agent-auto-bugfix, agent-daily-focus, agent-documentation-gaps, agent-gtm-monitoring, agent-instant-prototype, agent-nps-csat-analysis, agent-pm-issues, agent-product-ops, agent-red-flag-detection, agent-roadmap-tracker, agent-signal-to-ship, agent-support-signal-processing, agent-team-triage.

I did not plan this. Each of these agents was designed independently, and each independently gravitated to the early-morning window because that is when the PM consumer of the brief actually wants the answer. The consequence is that on any given Monday, my Slack receives thirteen briefs in roughly the same hour, and I read maybe four of them carefully and skim the rest. The remaining nine are doing work in the dark, running fine, producing fine output, going unread.

This is a fleet design failure I would correct on a redo. The right architecture is one of two patterns. Either a meta-agent that consolidates the thirteen morning briefs into a single read with click-through to detail. Or a deliberate spread of cadences so the briefs hit at different times, each tied to a moment in the workday when that signal is actually actionable. I default to the meta-agent pattern; the agent-daily-focus was a first attempt at this and lives in the Amplify stage, but it doesn't yet pull from the other twelve morning briefs. That's the next build.
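For what it's worth, the consolidation step itself is small. A sketch of the brief-of-briefs pattern, assuming the twelve briefs have already been collected upstream; the helper shape and prompt are mine, not agent-daily-focus as it exists today.

```python
# Brief-of-briefs sketch: a meta-agent merges the morning briefs the other
# agents already produced into one read with click-throughs. The helper shape
# and prompt are illustrative assumptions.
from anthropic import Anthropic

client = Anthropic()

def consolidate_morning_briefs(briefs: dict[str, str]) -> str:
    """briefs maps agent slug -> that agent's raw morning brief text."""
    tagged = "\n\n".join(f"[{slug}]\n{text}" for slug, text in briefs.items())
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": (
                "These are this morning's briefs from separate agents. Merge them "
                "into a single brief of at most seven bullets. End each bullet "
                "with the [slug] of its source agent for click-through.\n\n" + tagged
            ),
        }],
    )
    return response.content[0].text
```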

Finding 3, The fleet is internally disconnected

Across all 39 agents, I count two agent-to-agent links in the entire body of writing. That is, the agents almost never reference each other inside their own posts. Average outgoing links per agent: 0.05. The most-connected agent is agent-signal-to-ship, which links to two others. Every other agent in the fleet, all 38 of them, exists as a standalone document that doesn't acknowledge the rest of the fleet's existence.

This is invisible to readers, but Google sees it. A topic cluster ranks for an entire field when the posts in the cluster reference each other with intentional anchors. My agent fleet, by this measure, is not yet a cluster, it's 39 standalone posts that happen to share a folder. The fix is mechanical: every agent post should link to two or three sibling agents in the same stage, plus the Your AI Agent Fleet hub. That's an evening of editing that I have not done.
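The audit half of that evening of editing is scriptable. A sketch, assuming the posts are markdown files carrying an agentStage frontmatter field and internal links shaped like (/agents/&lt;slug&gt;); both are guesses at this site's layout, not its confirmed structure.

```python
# Audit half of the fix: flag agent posts that link to fewer than two siblings
# in their own stage. Frontmatter field name and link format are assumptions.
import re
from pathlib import Path

def audit_sibling_links(content_dir: str = "content") -> None:
    posts = {}  # slug -> (stage, body)
    for path in Path(content_dir).glob("agent-*.md"):
        text = path.read_text()
        match = re.search(r"^agentStage:\s*(\w+)", text, re.MULTILINE)
        posts[path.stem] = (match.group(1) if match else None, text)

    for slug, (stage, body) in posts.items():
        siblings = [s for s, (st, _) in posts.items() if st == stage and s != slug]
        linked = [s for s in siblings if f"(/agents/{s})" in body]
        if len(linked) < 2:
            print(f"{slug}: {len(linked)} sibling links in stage {stage}; wants 2-3")
```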

This finding is doing double duty: it's a meta-finding about the writing, and it's also a real finding about the fleet itself. Agents that don't reference each other in writing tend to be agents that don't reference each other in execution either. The orphan-agent problem (Finding 5, below) is partly a downstream consequence of this writing pattern.

Finding 4, Slack is the universal substrate; everything else is a wrapper around Slack

Counting integration mentions across all 39 agent posts:

Integration  Agents that mention it
Slack        25
Segment      17
Jira         17
Salesforce   14
Notion       13
Linear       12
Mixpanel     9
GitHub       9
Amplitude    9
Gong         8

Slack is the delivery surface for 64% of agents. Notion and Linear, the two most-credible "alternative agent surfaces," are at 33% and 31%. Looker, FullStory, Heap, Datadog, PagerDuty all show up in single-digit counts.

The implication for fleet design is that any agent that does not deliver into Slack starts at a disadvantage in this shop. There is a counter-argument that agents that deliver into Linear (where the work actually happens) compound differently than agents that deliver into Slack (where the work is talked about). I think both architectures are valid; my fleet has clearly chosen Slack. If I were rebuilding from scratch I would deliberately push more agents into Linear and Notion to test whether the lift on completion-rate is worth the lower notification visibility.

Finding 5, Thirteen of 39 agents are orphans

This is the finding I had to look hardest to find, and it's the one I wish I'd surfaced earlier. By scanning every non-agent piece of content on the site (handbook chapters, non-agent blog posts, app code, sandbox configurations) for references to each agent slug, I can produce a "structural adoption" proxy: how many places in the broader system know this agent exists.
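The scan itself is simple enough to reconstruct from that description. A sketch, not the real extraction script; the root path, content directory, and extension list are assumptions.

```python
# Structural-adoption proxy: for each agent slug, count how many non-agent
# files anywhere in the repo mention it. Zero references == orphan.
from collections import Counter
from pathlib import Path

def count_references(root: str = ".") -> Counter:
    agent_slugs = [p.stem for p in Path(root, "content").glob("agent-*.md")]
    refs = Counter({slug: 0 for slug in agent_slugs})
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".md", ".ts", ".tsx", ".json"}:
            continue
        if path.stem.startswith("agent-"):
            continue  # agent posts don't count as "the broader system"
        text = path.read_text(errors="ignore")
        for slug in agent_slugs:
            if slug in text:
                refs[slug] += 1
    return refs

orphans = sorted(slug for slug, n in count_references().items() if n == 0)
```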

The distribution:

  • 3 references: agent-competitive-intel, agent-instant-prototype, agent-product-ops
  • 2 references: agent-daily-focus, agent-executive-report, agent-interview-synthesis, agent-kpi-watchdog, agent-launch-comms, agent-product-health, agent-red-flag-detection, agent-signal-to-ship
  • 1 reference: 14 agents
  • 0 references: 13 agents

A third of the fleet, 13 agents, has zero references anywhere in the broader system. They are not linked from any handbook chapter. They are not mentioned in any non-agent blog post. They are not registered in the sandbox or referenced by any landing page. They exist as files, not as load-bearing parts of anything.

The list, for transparency:

agent-customer-segmentation, agent-feature-adoption, agent-journey-mapping, agent-nps-csat-analysis, agent-opportunity-prioritization, agent-release-documentation, agent-retrospective-synthesis, agent-sandbox-user-manual (this is the meta-document, expected), agent-sprint-planning, agent-stakeholder-communication, agent-support-signal-processing, agent-tech-debt-analyzer, agent-win-loss-analysis.

Some of these are recent additions that haven't been wired into anything yet, which is fair. But several have been live for over a month and are still orphans, which is the actual signal: I built them, I shipped them, and then the rest of the system never absorbed them. By contrast, the three-reference agents (agent-competitive-intel, agent-instant-prototype, agent-product-ops) are mentioned in landing pages, prototype demos, and handbook chapters. The broader system has decided those three are load-bearing. It has not decided the same about the other 13.

If I were forced to triage the fleet to half its size, the orphan list is where I would start. Not because those agents are inherently worse, but because the rest of the system has voted with its silence.

Three case studies, one win, one death, one ambiguous

Win, agent-instant-prototype

agent-instant-prototype is a daily on-demand agent in the Build stage that turns a one-page spec into a working Claude Artifacts prototype. Three references from non-agent content, including the prototype landing page and the Instant Prototyping handbook chapter.

What it does well: it's load-bearing for an actual workflow that the team runs every week (the Ten-Day Dev Loop explicitly depends on it). When the agent breaks, multiple humans notice within hours. Its scope is narrow, one input shape, one output shape, no feature creep. It has a clear DRI. It has been migrated through three model versions without breaking.

What it would do better if rebuilt: deliver a side-by-side diff against the previous prototype version, so iteration is visible. Right now each invocation produces a fresh artifact and the iteration history is lost.

Death, agent-tech-debt-analyzer

This is one of the orphan agents. Hourly cadence. Build stage. Zero references from anywhere in the broader system. I built it during the April acceleration thinking that "always-on tech-debt monitoring" would compound into a clean codebase. It hasn't.

What killed it: scope was too broad from the start. It tried to identify, prioritize, and recommend fixes for tech debt across the entire codebase every hour. The output was too long for anyone to read, too noisy to act on, and not integrated into a workflow that anyone actually ran. Engineers ignored it because it wasn't tied to a sprint. PMs ignored it because it wasn't tied to a roadmap commitment. It posted, and posted, and posted, and the only person who read its output was me, occasionally, when I was procrastinating.

What I'd build instead: a single-purpose agent that runs once per sprint planning, scopes itself to one specific module the team is about to touch, and produces a single recommendation. Same domain, fundamentally different design. The mistake was building the agent before the workflow.

Ambiguous, agent-okr-tracker

agent-okr-tracker is a Decide-stage agent firing daily at 9 a.m. One reference from the broader system. Tracks progress against a quarter's OKRs and surfaces drift early.

The ambiguous part: I cannot tell from the structural data whether this agent is working or dying. It fires reliably. It produces clean output. But OKRs themselves are an evergreen point of contention in product organizations (see Kill the OKR (coming May 13) for why I think the format is broken), and the agent is downstream of a planning artifact that itself may not be load-bearing. If the team stops setting useful OKRs, the agent has nothing to track, and it will become an orphan over time.

The lesson here: an agent's survival is not just about the agent. It's about whether the planning surface it depends on is itself healthy. Agents that depend on fragile inputs inherit that fragility. The right move with agent-okr-tracker is probably to refactor it to track not OKRs but commitments, something more durable than the format of the quarterly planning doc.

The pattern that predicts whether an agent sticks

Across the 39 agents, the agents that survive and become load-bearing share four traits. The agents that die or orphan share the inverse.

A surviving agent has a named human DRI who notices when it breaks within hours, not weeks. The orphan agents I cataloged above don't have this; their failure is silent.

A surviving agent has a narrow, durable scope, one input shape, one output shape, one workflow surface. The dead agent-tech-debt-analyzer violated this. The surviving agent-instant-prototype honored it.

A surviving agent delivers into a workflow that already exists, not a new workflow it's trying to create. The Ten-Day Dev Loop existed before the prototype agent. The agent slotted into it. That is the right direction. A new agent attached to a hypothetical future workflow is a coin-flip at best.

A surviving agent has upstream inputs that are themselves stable. The fragile-input problem with agent-okr-tracker is the opposite of this. Agents that depend on stable artifacts (Slack messages, Salesforce records, GitHub PR titles) outlive agents that depend on freshly-curated planning docs.

If you are about to build an agent and any of these four traits is missing, do not build it yet. Fix the missing trait first. The agent is downstream of all four; building the agent in their absence is the same mistake as shipping a feature without an outcome metric.
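The four traits compress into a pre-build gate you can write down. A sketch; the field names are mine, the rule is the one above.

```python
# Pre-build gate for the four survival traits. Field names are illustrative;
# the rule is: any missing trait means "do not build it yet."
from dataclasses import dataclass

@dataclass
class AgentProposal:
    name: str
    dri: str | None                # named human who notices breakage in hours
    input_shape: str | None        # one input shape...
    output_shape: str | None       # ...one output shape
    existing_workflow: str | None  # the ritual it slots into (must predate it)
    stable_inputs: bool            # upstream artifacts durable across quarters?

def missing_traits(p: AgentProposal) -> list[str]:
    """Empty list means build it; anything else means fix the trait first."""
    missing = []
    if not p.dri:
        missing.append("no named DRI")
    if not (p.input_shape and p.output_shape):
        missing.append("scope not narrowed to one input and one output shape")
    if not p.existing_workflow:
        missing.append("no existing workflow to slot into")
    if not p.stable_inputs:
        missing.append("fragile upstream inputs")
    return missing
```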

What I'd do differently if I were starting today

If I were rebuilding this fleet from a blank folder on April 28, 2026, I would build seven agents and stop. The full design (names, signal stacks, surfaces, cadences, and build order) is in the companion post: The Seven-Agent Reset. The short version of the seven is below; the long version is in that piece.

The seven would be:

  1. One Sense agent that reads my Slack DMs, support tickets, and customer Gong calls overnight and produces a single morning red-flag brief at 8 a.m.
  2. One Discover agent that runs weekly on Monday, takes the past week's customer interviews, and produces a Jobs-to-Be-Done update plus a list of questions for the next interview round.
  3. One Decide agent that runs daily at 4 p.m. and produces tomorrow's "the three decisions on your plate" brief based on the day's signals.
  4. One Build agent, the prototype agent. On-demand. Same as agent-instant-prototype is today.
  5. One Ship agent that runs on every production deploy and produces the release note plus the rollback brief in one artifact.
  6. One Measure agent that runs daily and reports the three metrics that matter against last week's baseline. No vanity metrics. Three numbers.
  7. One Amplify agent that runs weekly on Friday and produces the next week's bets, the previous week's wins, and the monthly compounding-lessons doc.

Seven agents. One per stage. Each one consolidating the leverage of what is currently three to seven agents in the existing fleet. I would refuse to build agent number eight until the first seven were fully wired into actual team rituals and producing measurable lift in someone's week.

The version I'd hold myself to: I would not allow any agent into production without a named DRI, a sentence describing the existing workflow it slots into, and a one-week kill-switch test ("if I turn this off for a week, who notices?"). Any agent that fails the kill-switch test gets retired or rebuilt.
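Encoded as data, that gate looks something like the registry below. The shape and field names are illustrative, and the one entry shown is the Sense agent from the list above.

```python
# The production gate, encoded as data: no agent ships without a DRI, the
# one-sentence workflow it slots into, and a scheduled kill-switch test.
FLEET = {
    "sense-red-flag-brief": {
        "stage": "Sense",
        "cadence": "daily 08:00",
        "surface": "slack",
        "dri": "falk",                     # a named human, not a team alias
        "workflow": "read before standup; anything red goes straight to triage",
        "kill_switch_test": "2026-05-04",  # turn it off this week; log who notices
    },
    # ...one entry per stage. Any entry with a missing field doesn't ship.
}
```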

This is roughly the inverse of how I built the actual fleet, which grew from 8 to 18 to 40 in three months mostly through enthusiasm and the wrong kind of optimism about what compounds in a system. The 7-agent fleet would compound. The 39-agent fleet partially does, partially doesn't, and the doesn't is mostly invisible until you stop and count.

What this means for the field

A handful of implications, for the broader question of how product organizations should be thinking about agent fleets in 2026.

The default fleet size you should design for is probably between 7 and 14, not between 30 and 50. The compounding leverage of an agent comes from how deeply it integrates into a team ritual, not from how many other agents are in the fleet. Past about 14, the cognitive overhead of knowing which agent does what starts costing more than the agents save.

The job of "eval engineer" is becoming load-bearing for any org with more than 10 agents. The orphan-agent problem in my fleet is partly a quality problem (some agents aren't good enough to be adopted) and partly a calibration problem (we don't have a clean way to measure whether an agent's output is improving or drifting over model migrations). Both are eval engineering questions. There is more on this in the upcoming The Eval-First Product Org.

The right way to grow a fleet is not "ship one agent per week and let the strongest survive." It's "ship one agent per quarter, instrument it relentlessly, kill it if it doesn't earn a place in someone's week, and only then start the next one." Slow agent fleets win. Fast agent fleets accumulate orphans.

The morning-brief slot will become a contested resource at every org over the next year, and the right move is a "brief of briefs" architecture before the contention bites you. I missed this and built thirteen agents that all want the same six minutes of attention. The lesson is to design for attention as the scarce resource, not for output as the scarce resource.

Methodology and caveats

The structural data in this piece was extracted programmatically from the 40 agent-prefixed files in this site's content folder. Specifically: the agentStage and agentSchedule frontmatter fields, the body word counts, the integration mentions matched against a list of 25 known PM tools, and the network of internal links between agent posts and the rest of the site. The extraction script is reproducible and lives outside the published repo.
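The method reconstructs easily even without that script. A sketch, assuming standard YAML-style frontmatter delimited by ---; the tool list shows 10 of the 25 matched, and everything here is a reconstruction of the approach, not the actual code.

```python
# Per-agent profile behind the tables above: frontmatter fields, body word
# count, and integration mentions. Frontmatter format is an assumption.
import re
from pathlib import Path

KNOWN_TOOLS = ["Slack", "Segment", "Jira", "Salesforce", "Notion",
               "Linear", "Mixpanel", "GitHub", "Amplitude", "Gong"]

def profile(path: Path) -> dict:
    text = path.read_text()
    front, body = {}, text
    if text.startswith("---"):
        _, raw_front, body = text.split("---", 2)
        front = dict(re.findall(r"^(\w+):\s*(.+)$", raw_front, re.MULTILINE))
    return {
        "slug": path.stem,
        "stage": front.get("agentStage"),
        "schedule": front.get("agentSchedule"),
        "words": len(body.split()),
        "integrations": [t for t in KNOWN_TOOLS if t.lower() in body.lower()],
    }

profiles = [profile(p) for p in Path("content").glob("agent-*.md")]
```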

Sample size: 39 real agents (40 minus the sandbox user manual, which was tagged as an agent file but is documentation, not an agent, see Finding 0, below).

Time window: 80 days of writing, between February 6 and April 27, 2026. Some agents have been running longer than that: they were written before this folder existed, and the post was a write-up of an existing system. Others were greenfield builds during the writing window.

What this data does not include: actual user-level adoption rates inside the orgs where these agents run, time-to-first-value measurements, retention curves, or model migration costs. Those numbers exist (in spreadsheets, in chat threads, in my head) but they are organizational data that I cannot publish at the per-agent level. The structural data above (references from the broader system, scheduling cadences, integration concentration) is a proxy for adoption, and the inferences it supports are weaker than the inferences I could make if I could publish raw adoption metrics. A future version of this piece, written when I'm allowed to publish the per-org numbers, will be sharper.

What this data also does not include: comparison to other companies' agent fleets. I don't know how agent-product-ops performs at companies that aren't the ones I run. The findings above are about one fleet across four orgs over one operator's career; they may generalize, they may not. If you have a fleet of comparable size and want to compare notes, I would genuinely like to read your version of this post, even if it's a paragraph in a Slack DM.

A note on Finding 0, which I left out of the main numbered list because it's a method finding, not a substantive one: of the 40 files in the agent folder, one (agent-sandbox-user-manual) is the user manual for the sandbox tool itself, not an agent. It got the agent- prefix because it lives in the same folder and got the same naming convention. Of the 39 real agents, 38 have the required agentStage frontmatter and one is missing it. That's a 97% compliance rate on a process I documented for myself, which is roughly the rate I would expect, and which is its own small lesson about how much hand-curation a system this size requires to stay honest.

What you do with this piece if you're building your own fleet

Three things are genuinely portable from one organization to another, and three things probably aren't.

Portable: the four traits that predict survival (DRI, narrow scope, existing workflow, stable inputs). The 7-stage minimum-viable fleet. The cost of the morning-brief slot.

Not portable: my specific stage distribution. Whether Slack is the right substrate for your org. Which agents should be daily versus weekly. The integration concentration. Those are functions of your team's existing tools, rituals, and writing culture. Don't copy my distribution. Build your own and compare it to mine; that comparison itself is a useful diagnostic.

If you're at the start of an agent fleet build, I would skip directly to the 7-agent design above and start there. If you're past 15 agents and feeling like things are getting noisy, do the orphan-count exercise on your own fleet: for each agent, ask how many places in the broader system reference it, and consider retiring the bottom third. If you're at 30+ and you can no longer remember what every agent does without looking, congratulations, you are now me at the start of April.

Pick one thing to try this week: count your fleet's orphans. Either you're surprised by the count (most likely) or you're not (in which case you already have an eval engineer on the case, or you're a smaller fleet than you think). Either result is information you can act on.

Sources and further reading: Your AI Agent Fleet, The AI Product Operating Model, Ten-Day Dev Loop, Agents vs. Workflows vs. Automations, Kill the OKR (coming May 13), The Eval-First Product Org.


Frequently asked

How many AI agents make sense for a single product manager?

Based on a fleet of 39 agents I've deployed across four product orgs over 80 days, the right number for one PM is somewhere between 8 and 14. Below 8 and you're not getting compounding leverage. Above 14 and the cognitive overhead of knowing which agent does what starts costing you more than the agents save. The 39 in this piece were built for a multi-PM fleet across multiple stages of the PM operating system, not for one operator.

Which AI agent should a PM build first?

Build the agent for the task you do every Monday morning that you wish you didn't. For 80% of PMs that's either the weekly status update (replaced by an executive-report agent) or the weekly competitive scan (replaced by a competitive-intel agent). Both pay back within the first week. Do not start with discovery agents (interview synthesis, etc.), those are higher value but harder to get right, and a bad discovery agent damages your judgment in ways a bad ops agent doesn't.

What kills an AI agent inside a product organization?

Three things, in roughly equal measure. First, no integration into a workflow people already run, the agent posts beautiful Slack briefs that nobody reads because they don't fit a habit. Second, scope drift, the agent starts focused, gets feature requests from teammates, becomes a Frankenstein that does five jobs poorly. Third, no owner, when the agent breaks (it will, every model migration), nobody is on the hook to fix it, so it sits dead for weeks until it's abandoned by default. The single biggest predictor of agent death in my fleet is the absence of a named DRI.

What stage of the PM job has the highest agent leverage?

The data from my 39-agent fleet says Decide and Build, with 7 agents each. That contradicts the popular take that AI for PMs equals customer interview synthesis (Discover). Discovery in my fleet is the most under-instrumented stage, only 5 agents and the lowest average word count, suggesting I myself wrote them less carefully. Decide is where the daily decisions are; Build is where the execution friction is. Both compound.

How long does it take to build a useful PM AI agent?

First version: half a day if you have Claude Code or a similar coding agent and a clear prompt. Production version that runs reliably on a schedule with proper Slack integration: three to five days. Version that survives a year of model migrations and team turnover: ongoing, budget two hours a month per agent for maintenance. Most agents that die in my fleet die because nobody budgeted that maintenance time.

What's the morning-brief slot and why do so many agents target it?

Across my fleet, 13 of 39 agents fire between 7 and 9 a.m. local time. That's a third of all agents converging on the same one-hour window. The pattern emerged accidentally, every agent I built with the prompt 'tell me the most important thing in [X] from yesterday' gravitated to that slot because that's when PMs actually want the answer. The morning brief slot is the most-contested real estate in the agent fleet; the implication is that the next generation of fleet design needs a meta-agent that consolidates the morning briefs into one read instead of thirteen.

Is 39 agents a lot for a product organization?

It's a lot for one PM. It's about right for a 12-person product org running an aggressive AI-first model. Most of the agent fleets I've seen in 2026 are between 5 and 25 agents per org. Beyond about 30, you start seeing diminishing returns from cognitive overhead, plus the maintenance burden becomes its own job (which is why eval engineering is becoming a dedicated role). For more on this, see The Eval-First Product Org.

How do you measure whether an agent is actually working?

Three signals, in priority order. First, did anyone open the agent's most recent output? (Trackable via Slack reactions, link clicks, doc views.) Second, did anyone act on it within 48 hours? (Visible in PR titles, Slack replies, Linear ticket creation.) Third, would the team notice if you turned the agent off for a week? (Reliable answer: turn it off for a week and see who complains.) An agent that produces beautiful output nobody reads is dead even if it's still running.
