This is the agent every PM I talk to wishes existed, and most do not realize they can build it today.
The job is simple: take a 60-minute customer call, return a tagged and clustered artifact, recommend a next action. The whole loop should take about 20 minutes: roughly 90 seconds of agent runtime plus 18 minutes of human review.
The pieces are sitting in your workflow already. The only missing piece is the recipe.
The short version
Stack: any transcript source (Granola, Fireflies, Gong) + Claude with the customer-call-notes skill + whatever discovery ledger you already use.
Inputs: a single call transcript.
Outputs: a verbatim quote bank (8-15 quotes with timestamps), a theme rollup (3-5 themes with frequency), a one-line recommended next action, and a row formatted for the discovery ledger.
Time: about 90 seconds of agent runtime, then 18 minutes of human review.
Eval: quote accuracy, theme legibility, action specificity.
Highest-ROI use: running it across your entire backlog of unprocessed calls in a batch, which takes roughly four hours of human time for a 40-call backlog.
For the broader practice this slots into, see the handbook chapters Interview Guide and Continuous Listening. For what the agent stack looks like at scale, see The PM Agent Stack.
The recipe
Inputs
One call transcript. Plain text or markdown.
If your transcript tool exports JSON or has speaker labels, leave them in. The agent uses them. If you only have a flat text dump, that works too, but speaker attribution will be weaker.
Three pieces of context the agent needs (write them into a short brief file the agent reads):
- The customer's role and the account context (tier, stage, recent renewal status)
- Three to five things you are currently considering shipping (so the agent can connect quotes to active bets)
- One sentence about what you want out of the call (not "everything," something specific)
If you skip the brief, the agent reverts to generic synthesis. Generic synthesis is not useful.
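As a minimal sketch of what such a brief file could look like, assuming a plain-text layout: the names `CallBrief` and `render_brief`, the field labels, and the sample account are all hypothetical and not part of the customer-call-notes skill.

```python
from dataclasses import dataclass


@dataclass
class CallBrief:
    """Hypothetical container for the three pieces of context."""
    customer_role: str
    account_context: str      # tier, stage, recent renewal status
    active_bets: list[str]    # three to five things you are considering shipping
    call_goal: str            # one specific sentence, not "everything"

    def validate(self) -> None:
        if not 3 <= len(self.active_bets) <= 5:
            raise ValueError("List three to five active bets.")
        if self.call_goal.strip().lower() in {"", "everything"}:
            raise ValueError("Name one specific goal for the call.")


def render_brief(brief: CallBrief) -> str:
    """Render the brief as plain text the agent can read."""
    brief.validate()
    bets = "\n".join(f"- {bet}" for bet in brief.active_bets)
    return (
        f"Customer role: {brief.customer_role}\n"
        f"Account context: {brief.account_context}\n"
        f"Active bets:\n{bets}\n"
        f"Goal for this call: {brief.call_goal}\n"
    )


# Made-up example values, for illustration only.
example = CallBrief(
    customer_role="Head of Operations",
    account_context="Enterprise tier, renewed last quarter",
    active_bets=["bulk export", "single sign-on", "audit log"],
    call_goal="Learn why export usage dropped after renewal",
)
print(render_brief(example))
```

The validation step is a design choice rather than a requirement: it refuses an empty goal or a bet list outside the three-to-five range, which are the two ways a brief tends to slide back toward generic synthesis.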
The agent
Claude, with the customer-call-notes skill loaded. The skill provides the structured output template. If you do not have access to the skill, the equivalent is a system prompt that names the four outputs you want.
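If you go the system-prompt route, one way to name the four outputs is sketched below. This is an assumption about what such a prompt might contain, not the skill's actual template; `FOUR_OUTPUTS` and `build_system_prompt` are hypothetical names, and the sketch only builds a string without calling any model API.

```python
# Hypothetical wording for the four outputs described in this recipe.
FOUR_OUTPUTS = [
    ("Verbatim quote bank",
     "8-15 quotes with timestamps, organized by topic. Verbatim, never paraphrased."),
    ("Theme rollup",
     "3-5 named themes, each with a frequency (quote count) and a confidence."),
    ("Recommended next action",
     "One sentence, concrete enough to put on the PM's calendar this week."),
    ("Discovery ledger row",
     "Account, date, theme tags, opportunity link, owner."),
]


def build_system_prompt(brief_text: str) -> str:
    """Assemble a system prompt from the brief and the four named outputs."""
    lines = [
        "You are triaging one customer call transcript.",
        "",
        "Context brief:",
        brief_text.strip(),
        "",
        "Return exactly these four outputs, in this order:",
    ]
    for number, (name, spec) in enumerate(FOUR_OUTPUTS, start=1):
        lines.append(f"{number}. {name}: {spec}")
    return "\n".join(lines)


print(build_system_prompt("Customer role: Head of Operations"))
```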
The agent runs once per call. About 90 seconds.
Outputs
Four artifacts, in a single response.
1. Verbatim quote bank. Eight to fifteen quotes, with timestamps, organized by topic. Verbatim, not paraphrased. If the agent paraphrases, the artifact decays fast and you lose the ability to point at evidence in a planning meeting.
2. Theme rollup. Three to five themes the call surfaced. Named, not just listed. Each theme has a frequency (how many quotes belong to it) and a confidence (how clear the customer was).
3. Recommended next action. One sentence. Concrete enough to put on the PM's calendar this week. Bad: "Follow up with the customer." Good: "Ship a 90-minute prototype of the bulk-export flow and send it to this customer's champion by Thursday."
4. Discovery ledger row. A pre-formatted entry that maps to whatever ledger format you use. Account, date, theme tags, opportunity link, owner. Drop into the ledger without re-typing.
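To make the ledger row concrete, here is a minimal sketch that assumes your ledger accepts a CSV line. The `LedgerRow` class, the column order, and the semicolon-joined tags are illustrative assumptions; adapt them to whatever format your ledger actually uses.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class LedgerRow:
    """Hypothetical shape for one discovery ledger entry."""
    account: str
    date: str                 # ISO date string, e.g. "2024-01-15"
    theme_tags: list[str]
    opportunity_link: str
    owner: str

    def to_csv(self) -> str:
        """Serialize the row as a single CSV line (no trailing newline)."""
        buffer = io.StringIO()
        csv.writer(buffer).writerow([
            self.account,
            self.date,
            ";".join(self.theme_tags),   # assumption: tags share one column
            self.opportunity_link,
            self.owner,
        ])
        return buffer.getvalue().rstrip("\r\n")


# Made-up example values, for illustration only.
row = LedgerRow(
    account="Example Co",
    date="2024-01-15",
    theme_tags=["bulk-export", "permissions"],
    opportunity_link="https://example.com/opportunities/1",
    owner="Sample PM",
)
print(row.to_csv())
```

Using the `csv` module rather than joining strings by hand means an account name that contains a comma is quoted correctly instead of silently shifting columns.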
Human review
Eighteen minutes.
Read the quote bank first. Spot-check three quotes against the transcript timestamps. Verbatim, not paraphrased. If even one is paraphrased, the whole quote bank is suspect. Re-run the agent with a stricter instruction.
Read the themes. Ask yourself: would a senior PM who did not sit on the call understand what this customer cares about from these theme names alone? If not, rewrite the theme names in your own voice. The agent gets the clustering right and the naming wrong about 30% of the time.
Read the recommended action. Is it something you can do this week? If it says "investigate further," "circle back," or "schedule a follow-up," the agent did not produce an action. Rewrite it.
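The non-action phrases named above are easy to flag mechanically. This is a rough sketch with a hypothetical phrase list and function name; passing the check does not prove an action is specific, it only catches the obvious misses.

```python
# Phrases this recipe treats as "not an action". Extend the tuple as you find more.
VAGUE_PHRASES = (
    "investigate further",
    "circle back",
    "schedule a follow-up",
    "follow up with the customer",
)


def vague_phrases_in(action: str) -> list[str]:
    """Return every vague phrase found in the recommended action."""
    lowered = action.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]


bad = "Follow up with the customer."
good = ("Ship a 90-minute prototype of the bulk-export flow and send it "
        "to this customer's champion by Thursday.")
print(vague_phrases_in(bad))
print(vague_phrases_in(good))
```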
Drop the ledger row in.
The eighteen minutes are the work. The agent is a force multiplier, not a replacement.
The eval
Three checks. Run them on the first ten calls you process, then weekly.
Quote accuracy. Open the transcript. Pick three quotes from the bank. Find them at the timestamp the agent claimed. They must be verbatim. Across ten calls that is 30 spot-checks; if fewer than 95% pass (more than one miss), the agent is not load-bearing.
Theme legibility. Print the theme rollup. Show it to a senior PM who did not sit on the call. They should be able to describe what the customer cares about in 30 seconds. If they can't, the themes are too generic or too granular.
Action specificity. Read the recommended action out loud. If it could apply to any customer call, the agent did not produce a useful action. Rewrite it or re-run with a tighter instruction.
The eval is the difference between an agent and a feature you turned on once.
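The quote-accuracy bar is simple arithmetic, sketched here under the assumption that you record each spot-check as a pass or fail. `quote_accuracy` and `is_load_bearing` are hypothetical names.

```python
def quote_accuracy(results_by_call: list[list[bool]]) -> float:
    """Fraction of spot-checked quotes that were verbatim, across all calls."""
    checks = [passed for call in results_by_call for passed in call]
    if not checks:
        raise ValueError("No spot-checks recorded.")
    return sum(checks) / len(checks)


def is_load_bearing(results_by_call: list[list[bool]], bar: float = 0.95) -> bool:
    """Apply the 95% bar from the eval."""
    return quote_accuracy(results_by_call) >= bar


# Ten calls, three spot-checks each. One miss keeps you above the bar (29 of 30).
one_miss = [[True, True, True]] * 9 + [[True, True, False]]
# Two misses drop you below it (28 of 30).
two_misses = [[True, True, True]] * 8 + [[True, True, False]] * 2
print(is_load_bearing(one_miss), is_load_bearing(two_misses))
```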
The batch run
The highest-ROI use of this agent is not the next call. It is the backlog of unprocessed calls you already have.
Most PMs I talk to have six to twelve weeks of recorded calls they have never synthesized. Sales reviews. Customer success check-ins. Win-loss interviews that got recorded and forgotten.
Run the agent across the whole backlog in a batch.
Read only the recommended actions and the theme rollups. Skip the quote banks for now (you can pull them later when you need evidence).
Pull the three to five themes that recur across calls. These are your real opportunities, surfaced from data you already had.
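Finding the recurring themes can start as a simple count. The sketch below assumes you have collected each call's theme names into a dictionary keyed by call; `recurring_themes` is a hypothetical helper. It matches theme names exactly after lowercasing, so the same idea phrased two ways counts as two themes, and merging those is still your job.

```python
from collections import Counter


def recurring_themes(themes_by_call: dict[str, list[str]],
                     top_n: int = 5,
                     min_calls: int = 2) -> list[tuple[str, int]]:
    """Count how many calls mention each theme and keep the recurring ones."""
    counts: Counter[str] = Counter()
    for themes in themes_by_call.values():
        # A set, so a theme listed twice in one call still counts once.
        counts.update({theme.strip().lower() for theme in themes})
    recurring = [(theme, n) for theme, n in counts.most_common() if n >= min_calls]
    return recurring[:top_n]


# Made-up rollups from three calls, for illustration only.
calls = {
    "call-1": ["Bulk export", "Permissions"],
    "call-2": ["bulk export", "Onboarding"],
    "call-3": ["Bulk Export ", "permissions"],
}
print(recurring_themes(calls))
```

Counting calls rather than total mentions is deliberate: one customer repeating a complaint five times is weaker evidence than five customers raising it once each.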
Ship a prototype against the most common theme within a week.
Total human time on a 40-call backlog: roughly four hours, almost all of which is reading recommended actions and clustering themes across calls.
This is the use case that pays for the whole stack in one afternoon.
Where it breaks
Three failure modes worth naming.
Cross-talk and multiple speakers. If the customer side of the call has more than one speaker, especially if they are talking over each other, speaker attribution gets confused and the agent will assign quotes to the wrong person. Fix: use a transcript tool with strong speaker diarization (Otter, Fireflies, and Gong are all decent) and skim the speaker labels before running the agent.
Customer diplomacy. A customer being polite is not a customer giving you signal. The agent takes statements at face value. If the customer says "your product is great, we'd love to use it more," the agent will tag it as a positive signal. The PM has to know that "we'd love to use it more" usually means "we are not currently using it much." This is judgment the agent does not have.
Brand-new use cases. If the call covers a use case you have never briefed the agent on, the theme clustering reverts to generic categories ("product feedback," "feature request," "user experience"). Fix: when you encounter a new use case, add it to the brief file. The agent improves over time as your brief gets richer.
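For the first failure mode, skimming speaker labels can be partly scripted. This sketch assumes a transcript where each turn starts with an optional bracketed timestamp followed by `Name:`; real exports vary by tool, so treat the pattern and the `speaker_turns` helper as hypothetical starting points.

```python
import re
from collections import Counter

# Assumed line shape: "[00:01:02] Dana: text" or "Dana: text".
SPEAKER_LINE = re.compile(r"^(?:\[[\d:]+\]\s*)?([^:\[\]]{1,40}):\s")


def speaker_turns(transcript: str) -> Counter:
    """Count turns per speaker label so unexpected speakers stand out."""
    counts: Counter = Counter()
    for line in transcript.splitlines():
        match = SPEAKER_LINE.match(line.strip())
        if match:
            counts[match.group(1).strip()] += 1
    return counts


# Made-up transcript, for illustration only.
sample = (
    "[00:01:02] Dana: We export weekly.\n"
    "[00:01:09] Sam: Right, every Friday.\n"
    "Dana: It takes two hours.\n"
    "We meet at 10:30 usually."
)
print(speaker_turns(sample))
```

If the count shows more customer-side names than you expected, or a label like "Speaker 3" with only a handful of turns, that is the cue to check attribution by hand before trusting the quote bank.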
What to do this week
Pick the call you most recently had with a customer that mattered. Run the agent on it.
Read the four outputs. Spot-check the quotes. Rewrite the recommended action in your own voice.
Drop the ledger row in.
Then ask the harder question: how many calls just like this one are sitting on your hard drive unprocessed?
That is the batch run. That is the four-hour afternoon that pays for itself.
The full system prompt, the brief template, the four-output schema, and the eval rubric are in the downloadable recipe.
Frequently asked
What does the triage agent actually do?
Takes a transcript of a 60-minute customer call as input. Returns four artifacts: a verbatim quote bank (8-15 quotes, with timestamps), a theme rollup (3-5 themes the call surfaced, with frequency), a one-line recommended next action for the PM, and a structured row ready to drop into the discovery ledger. Total time: about 20 minutes including human review.
What stack does it run on?
The recording layer is whatever you already use: Granola, Fireflies, Gong, Otter, or Loom transcripts. The agent layer is Claude with the customer-call-notes skill loaded. The output layer is wherever you keep discovery (Notion database, Linear project, a Google Doc, a Markdown file in a repo). No proprietary tooling. The point is that the inputs already exist and are being wasted.
Why 20 minutes and not faster?
The agent runtime is about 90 seconds. The other 18 minutes are human review: skim the quote bank for ones the agent misclassified, sanity-check the theme names, write a one-line decision in your own voice. Skip the review and the artifact decays into noise inside a month. The review is where the agent's output becomes load-bearing.
What's the eval that tells me the agent is working?
Three checks. (1) Quote accuracy: spot-check three quotes against the transcript timestamp. They must be verbatim, not paraphrased. (2) Theme legibility: a senior PM who didn't sit on the call reads the themes and can describe what the customer cares about. (3) Action specificity: the recommended action is something the PM can do this week, not 'follow up with the customer' or 'investigate further.'
Where does this agent break?
Three places. (1) Calls with cross-talk or multiple speakers from the customer side, because the speaker-attribution often confuses the agent about who said what. (2) Calls where the customer was lying or being diplomatic, because the agent takes statements at face value. (3) Calls about a brand-new use case the PM hasn't briefed the agent on, because the theme clustering reverts to generic categories.
Can I run this on the whole backlog of calls I never processed?
Yes. That is the highest-ROI use of the agent. Run it across your last six weeks of calls in a batch. Read only the recommended actions and the theme rollups. Pull the three to five themes that recur across calls. Ship a prototype against the most common one within a week. The whole backlog-clear takes about four hours of human time on top of agent runtime.
What's the prompt scaffold?
Available in the downloadable recipe. The short version: a system prompt that names the customer's role, the product context, and the four outputs the agent has to produce. A user prompt that contains the transcript. A scoring rubric the agent runs against its own output before returning. Self-scoring does not eliminate scorer collusion (the same model grading its own work), but it catches the most obvious misses.
