Ship Story: The Discovery Week I Ran With Three Agents and No Calls

I ran a full AI agent discovery week last month without scheduling a single live customer call. The synthesis was sharper than any sprint I shipped in 2025.

This is not a recommendation that you stop talking to customers. It is a recommendation that you stop running live-call discovery sprints when the actual bottleneck is synthesis, not signal. AI agent customer discovery is the right move when you have unprocessed signal already sitting in Gong, support, and sales notes, which is most of the time.

The short version

Three agents (transcript triage, theme synthesizer, opportunity ranker) ran across six weeks of Gong calls, three months of support tickets, and a sales-notes export. Four hours per day for four days. The output was a ranked opportunity list with a quote bank and four prototype briefs. A human beat the agents in two places that week: low-volume, high-severity signal, and tie-breaks between strategically different themes. Everywhere else, the agents were faster, more consistent, and produced better artifacts than my last live-call discovery sprint.

For the broader argument, see the handbook chapter Continuous Discovery on Autopilot and the companion Continuous Listening. For why this is a synthesis problem more than a signal problem, see Your First OST.

The setup

I had a backlog problem, not a signal problem.

Six weeks of unprocessed Gong calls. Three months of support tickets nobody had tagged. A sales-notes export I had been promising to read since February. The team was about to run another live-call discovery sprint, and I realized we were about to layer new signal on top of an unread pile of existing signal.

So I cancelled the sprint, blocked four days, and built three agents.

Agent one: Transcript triage

Input: a folder of Gong call transcripts.

Output: every call tagged with topic, sentiment, product surface, account tier, and a short summary. About 90 seconds per call.

The eval was a 12-call gold set I had hand-tagged. The agent had to match my tags within one category-distance on 10 of 12 before I let it run on the full backlog. It hit 11 of 12 on the second iteration.
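The gate described above can be sketched in a few lines. This is a minimal illustration, not the template from the toolkit: the post does not define "category-distance", so the sketch assumes each tag sits on an ordered scale and measures distance in steps along it. The scale, the function names, and the toy data are all hypothetical.

```python
# Hypothetical ordered scale; the post does not define "category-distance".
SENTIMENT_SCALE = ["negative", "mixed", "neutral", "positive"]


def category_distance(gold_label: str, predicted_label: str, scale: list[str]) -> int:
    """Steps between two labels on an ordered scale."""
    return abs(scale.index(gold_label) - scale.index(predicted_label))


def gold_set_gate(gold, predicted, scale, max_distance=1, required=10):
    """Return (passed, matches) for a hand-tagged gold set.

    A call counts as a match when the agent's label is within
    max_distance steps of the human label.
    """
    matches = sum(
        1
        for call_id, gold_label in gold.items()
        if category_distance(gold_label, predicted[call_id], scale) <= max_distance
    )
    return matches >= required, matches


# Toy 12-call gold set: one near miss (still a match) and one real miss.
gold = {f"call-{i}": "neutral" for i in range(12)}
predicted = dict(gold)
predicted["call-0"] = "positive"  # one step away: counts as a match
predicted["call-1"] = "negative"  # two steps away: counts as a miss

passed, matches = gold_set_gate(gold, predicted, SENTIMENT_SCALE)
print(passed, matches)  # True 11
```

Tags with no natural order (topic, product surface) would need a different distance, for example exact match only.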

Agent two: Theme synthesizer

Input: the tagged segments from agent one.

Output: clusters of segments grouped by named themes (not just keywords), with a frequency count, an account-tier weighting, and a quote bank of three to five verbatim quotes per theme.
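To make that output shape concrete, here is a rough sketch of the bookkeeping around the clusters. It assumes the theme name has already been assigned to each segment (that is the agent's job) and only shows the aggregation: a frequency count, a per-tier breakdown standing in for the account-tier weighting, and a capped quote bank. The class names, the tier encoding, and the sample quotes are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

MAX_QUOTES = 5  # the post keeps three to five verbatim quotes per theme


@dataclass
class Segment:
    theme: str         # theme name assigned upstream (by the agent, in the post)
    account_tier: int  # hypothetical encoding: 1 = largest accounts
    quote: str         # verbatim customer quote


@dataclass
class ThemeSummary:
    name: str
    frequency: int = 0
    by_tier: Counter = field(default_factory=Counter)
    quotes: list = field(default_factory=list)


def summarize(segments):
    """Group segments by theme and return summaries, most frequent first."""
    summaries = {}
    for seg in segments:
        summary = summaries.setdefault(seg.theme, ThemeSummary(name=seg.theme))
        summary.frequency += 1
        summary.by_tier[seg.account_tier] += 1
        if len(summary.quotes) < MAX_QUOTES:
            summary.quotes.append(seg.quote)
    return sorted(summaries.values(), key=lambda s: s.frequency, reverse=True)


segments = [
    Segment("slow exports", 3, "Exports take forever."),
    Segment("slow exports", 1, "We gave up waiting on the export."),
    Segment("confusing invites", 2, "I could not find the invite button."),
]
for theme in summarize(segments):
    print(theme.name, theme.frequency, dict(theme.by_tier), theme.quotes)
```

Keeping the per-tier counts separate from the raw frequency makes it easier to see later whether a theme is common or merely concentrated in a few important accounts.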

The eval: a senior PM (not me) read the themes blind and told me whether each theme name described a real pattern or a hallucinated category. The first run produced four themes I could not defend. The second run produced eleven, and the senior PM agreed with nine of them.

Agent three: Opportunity ranker

Input: the themes from agent two, plus a small context file about our product strategy and active bets.

Output: ranked opportunities as falsifiable statements. Format: "We could move [metric] by addressing [theme] for [segment], with a confidence of [low/med/high] based on [reach × severity]." Twelve opportunities, ranked.
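One way to picture that template is as a small record that renders the sentence and derives the confidence label from reach × severity. This is a sketch under stated assumptions: the 1-to-5 severity scale and the score cut-offs are placeholders, not the values the ranker actually used, and the example opportunities are made up.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    metric: str
    theme: str
    segment: str
    reach: int     # e.g. number of calls that mention the theme
    severity: int  # hypothetical 1-5 scale

    @property
    def score(self) -> int:
        return self.reach * self.severity

    @property
    def confidence(self) -> str:
        # Hypothetical cut-offs; tune them against your own backlog.
        if self.score >= 60:
            return "high"
        if self.score >= 20:
            return "med"
        return "low"

    def statement(self) -> str:
        return (
            f"We could move {self.metric} by addressing {self.theme} "
            f"for {self.segment}, with a confidence of {self.confidence} "
            f"based on reach {self.reach} × severity {self.severity}."
        )


opportunities = [
    Opportunity("export completion rate", "slow exports", "mid-market admins", 31, 2),
    Opportunity("renewal rate", "integration regression", "enterprise accounts", 3, 5),
]
for opp in sorted(opportunities, key=lambda o: o.score, reverse=True):
    print(opp.statement())
```

Because every field is explicit, a statement that cannot name a metric or a segment fails to fill the template, which is a cheap first check on falsifiability.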

The eval was that at least two of the ranked opportunities had to map to bets already on our outcome ledger. If the agent could not connect the new ranking to the existing strategy, the ranking was probably hallucinated relevance. It connected three.
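A check like this is easy to automate once each opportunity carries the ledger bet the agent claims it maps to. The sketch below assumes that shape; the field names, bet identifiers, and opportunity names are hypothetical. It also separates out opportunities that cite a bet missing from the ledger, since a nonexistent bet is a stronger sign of invented relevance than no bet at all.

```python
def ledger_gate(opportunities, ledger_bets, required=2):
    """Return (passed, linked, unknown).

    linked:  opportunities whose claimed bet exists on the ledger.
    unknown: opportunities citing a bet that is not on the ledger.
    """
    linked = [o["name"] for o in opportunities if o["bet"] in ledger_bets]
    unknown = [
        o["name"]
        for o in opportunities
        if o["bet"] is not None and o["bet"] not in ledger_bets
    ]
    return len(linked) >= required, linked, unknown


# Hypothetical ledger and ranker output.
ledger_bets = {"BET-activation", "BET-retention", "BET-expansion"}
opportunities = [
    {"name": "self-serve activation", "bet": "BET-activation"},
    {"name": "enterprise onboarding", "bet": "BET-retention"},
    {"name": "export speed", "bet": None},
    {"name": "partner portal", "bet": "BET-marketplace"},
]

passed, linked, unknown = ledger_gate(opportunities, ledger_bets)
print(passed, linked, unknown)
```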

The week itself

Monday morning, the triage agent ran across 84 Gong calls and 312 support tickets. Total runtime, including my reviews: about three hours.

Tuesday, the theme synthesizer ran on the tagged corpus. I spent two hours reviewing themes and killing the four that did not survive the blind read. The remaining 11 themes went into the ranker.

Wednesday, the opportunity ranker produced 12 ranked opportunities. I spent three hours editing the framing, cutting two that were clearly the same opportunity stated twice, and merging two into one. Final list: nine.

Thursday, I sat with the engineering lead and turned the top four opportunities into prototype briefs. Each brief has a hypothesis, a metric to move, an acceptance criterion, and a small eval set. No PRD. Roughly two hours of work.

Friday, I wrote the synthesis memo. Four hours.

Total human time: about 20 hours. A comparable live-call discovery week, for reference, is 25 to 35 hours of call execution alone, before synthesis even starts.

Where the agents got beaten

Two places, both worth naming.

Low-volume, high-severity signal

The ranker's confidence score was driven heavily by frequency. A theme that showed up in 31 calls beat a theme that showed up in three. But the three calls were our largest enterprise account talking about an integration regression that was about to put their renewal at risk.

The agent was technically correct (low frequency, low confidence) and strategically wrong. A human had to pull that theme up the stack.

The fix for next time is a weighting layer: tier-1 accounts get a frequency multiplier. I have not built it yet. Right now I just review the bottom of the ranker's list before I trust the top.
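Since that layer does not exist yet, the following is only one possible shape for it. Each mention contributes its tier's multiplier instead of a flat count of one. The multiplier value is a placeholder chosen so the three-call enterprise theme outranks the 31-call theme, and the sketch assumes, purely for illustration, that all 31 of those calls came from lower-tier accounts.

```python
# Hypothetical multipliers; tiers not listed count as 1.0.
TIER_MULTIPLIER = {1: 12.0}


def weighted_frequency(call_tiers, multipliers=TIER_MULTIPLIER):
    """Sum one weighted mention per call, keyed by the caller's account tier."""
    return sum(multipliers.get(tier, 1.0) for tier in call_tiers)


# Theme -> account tier of each call that mentioned it (illustrative data).
themes = {
    "high-frequency theme": [3] * 31,
    "integration regression": [1] * 3,
}

ranked = sorted(themes, key=lambda t: weighted_frequency(themes[t]), reverse=True)
for name in ranked:
    print(name, weighted_frequency(themes[name]))
```

A multiplier this blunt will over-promote every tier-1 complaint, so reviewing the bottom of the list by hand is still worth keeping as a backstop.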

Tie-breaks between strategically different themes

Two themes came in with nearly identical scores: a new-user activation regression in the self-serve flow, and a different activation regression in the enterprise onboarding flow.

The agents could not tell me which one mattered more, because the answer depended on what we were planning to charge for in Q3 (a question the agents did not have context on, and that I had not put into the strategy context file because the decision was not final).

I made the call. Self-serve. The agent had ranked enterprise slightly higher.

The lesson is that the strategy context file needs to include the unsettled questions, not just the settled answers. The agent needs to know what is in play, so it can flag uncertainty instead of pretending to know.
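Here is a rough sketch of what that could look like, assuming the context file is loaded into a simple structure with a section for open questions. Each open question lists the segments whose ranking depends on its answer, so any opportunity touching those segments is flagged rather than silently ranked. The structure, field names, and example entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyContext:
    settled: list = field(default_factory=list)
    # open question -> segments whose ranking depends on the answer
    open_questions: dict = field(default_factory=dict)


def flag_open_questions(ranked, context):
    """Attach to each (opportunity, segment) pair the open questions that touch it."""
    flagged = []
    for name, segment in ranked:
        touching = [
            question
            for question, segments in context.open_questions.items()
            if segment in segments
        ]
        flagged.append({"opportunity": name, "segment": segment, "depends_on": touching})
    return flagged


context = StrategyContext(
    settled=["example settled bet"],
    open_questions={"What will we charge for in Q3?": ["self-serve", "enterprise"]},
)
ranked = [
    ("enterprise onboarding activation regression", "enterprise"),
    ("self-serve activation regression", "self-serve"),
    ("export speed", "mid-market"),
]
for row in flag_open_questions(ranked, context):
    note = "; ".join(row["depends_on"]) or "no open questions"
    print(f'{row["opportunity"]}: {note}')
```

With the pricing question on file, both activation regressions come back flagged as depending on an unresolved decision, which is the signal to make the call yourself.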

The artifact

The deliverable from the week was a single page.

Top seven opportunities, ranked, each one falsifiable
A 40-quote evidence bank linked by opportunity
Four prototype briefs ready to hand to engineering
A short list of three account-specific risks the ranker under-weighted, flagged for separate handling

Compared to the deliverable from my last live-call sprint (a slide deck, a 60-page synthesis doc, and a list of recommendations I had to defend in a 90-minute review), this is structurally better. Shorter, falsifiable, with the dissent already on the page.

What live calls are still for

Three things, in my experience.

First, testing a new hypothesis. The agent week is great for synthesis of existing signal. It is bad at exploring a question nobody has asked customers yet. For that, you need to schedule the call and ask.

Second, rebuilding trust with a specific account. A live conversation has a relational function the agent week cannot replace. If a champion at a top-five account feels unheard, a synthesis memo will not fix it. A 45-minute call with the CPO will.

Third, validating a finding that surprised you. If the agent week surfaces a theme you did not expect and that contradicts your prior model, run two live calls to pressure-test it before you commit to a prototype.

The frame I have landed on: synthesis is for agents, exploration and trust are for humans. Most discovery weeks I have run in the past two years were synthesis disguised as exploration. That is the work the agent week replaces.

What to try this week

Pick one of these.

Pull six weeks of Gong calls and run a triage agent against them. Even just topic tags and sentiment is enough to surface what you have been ignoring. Two hours of work.

Or: dump your last quarter of support tickets into a theme synthesizer and ask it for the top five themes. Compare against what your team thinks the top five are. The gap is the story.

Or: take your most recent opportunity solution tree and ask an agent to grade each opportunity for falsifiability. Anything not falsifiable goes back to the drawing board.

The discovery week without calls is not a permanent operating mode. It is the move you make when the bottleneck is synthesis, which, for most product teams I have worked with in the last 18 months, is most of the time.

The three-agent stack, the gold-set evaluation template, and the prototype-brief format are in the toolkit. For the full fleet of agents this stack is a subset of, see Your AI Agent Fleet.

Sources: Teresa Torres on continuous discovery habits, Marty Cagan on the discovery work that matters, Gong, Lenny Rachitsky, Claude customer-call-notes skill.
