Discovery · Falk Gottlob · 7 min read

Ship Story: The Discovery Week I Ran With Three Agents and No Calls

A full discovery week run with three agents and zero live customer calls. The synthesis was sharper than any sprint I shipped in 2025. Here's the stack, the inputs, and what broke.

I ran a full discovery week last month without scheduling a single live customer call. The synthesis was sharper than any sprint I shipped in 2025.

This is not a recommendation that you stop talking to customers. It is a recommendation that you stop running live-call discovery sprints when the actual bottleneck is synthesis, not signal.

The short version

Three agents (transcript triage, theme synthesizer, opportunity ranker) ran across six weeks of Gong calls, three months of support tickets, and a sales-notes export. Four hours per day for four days. The output was a ranked opportunity list with a quote bank and four prototype briefs. A human beat the agents in two places that week: low-volume, high-severity signal, and tie-breaks between strategically different themes. Everywhere else, the agents were faster, more consistent, and produced better artifacts than my last live-call discovery sprint.

For the broader argument, see the handbook chapter "Continuous Discovery on Autopilot" and its companion, "Continuous Listening". For why this is a synthesis problem more than a signal problem, see "Your First OST".

The setup

I had a backlog problem, not a signal problem.

Six weeks of unprocessed Gong calls. Three months of support tickets nobody had tagged. A sales-notes export I had been promising to read since February. The team was about to run another live-call discovery sprint, and I noticed that we were going to layer new signal on top of an unread pile of existing signal.

So I cancelled the sprint, blocked four days, and built three agents.

Agent one: Transcript triage

Input: a folder of Gong call transcripts.

Output: every call tagged with topic, sentiment, product surface, account tier, and a short summary. About 90 seconds per call.

The eval was a 12-call gold set I had hand-tagged. The agent had to match my tags within one category-distance on 10 of 12 before I let it run on the full backlog. It hit 11 of 12 on the second iteration.
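A minimal sketch of what that gate could look like in code. The post does not define "category-distance", so this assumes a hypothetical two-level topic taxonomy (`TOPIC_PARENT`) where an exact match is distance 0, a sibling under the same parent is distance 1, and anything else is distance 2. The tag names and function names are illustrative, not the ones used in the actual run.

```python
# Hypothetical taxonomy: each topic tag maps to a parent category.
TOPIC_PARENT = {
    "sso": "integrations",
    "webhooks": "integrations",
    "invoicing": "billing",
    "refunds": "billing",
}


def category_distance(predicted: str, gold: str) -> int:
    """0 = exact match, 1 = same parent category, 2 = unrelated or unknown."""
    if predicted == gold:
        return 0
    parent = TOPIC_PARENT.get(predicted)
    if parent is not None and parent == TOPIC_PARENT.get(gold):
        return 1
    return 2


def passes_gold_set(
    predicted: dict[str, str],
    gold: dict[str, str],
    max_distance: int = 1,
    required: int = 10,
) -> bool:
    """True when at least `required` gold calls are tagged within `max_distance`."""
    hits = sum(
        1
        for call_id, gold_tag in gold.items()
        if call_id in predicted
        and category_distance(predicted[call_id], gold_tag) <= max_distance
    )
    return hits >= required
```

With a 12-call gold set, the default `required=10` mirrors the 10-of-12 bar; the agent only gets the full backlog once `passes_gold_set` returns `True`.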

Agent two: Theme synthesizer

Input: the tagged segments from agent one.

Output: clusters of segments grouped by named themes (not just keywords), with a frequency count, an account-tier weighting, and a quote bank of three to five verbatim quotes per theme.
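To make that output shape concrete, here is a rough sketch of the bookkeeping around the clustering step. It assumes the agent has already assigned a theme name to each segment; the code only aggregates. The field names (`theme`, `tier`, `quote`) are hypothetical, and the account-tier weighting is represented here as raw per-tier counts, which is an assumption about how it could be stored rather than how it was.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Theme:
    name: str
    frequency: int = 0
    # Per-tier segment counts; a stand-in for the account-tier weighting.
    tier_counts: Counter = field(default_factory=Counter)
    # Verbatim quotes, capped at max_quotes per theme.
    quotes: list[str] = field(default_factory=list)


def build_themes(segments: list[dict], max_quotes: int = 5) -> dict[str, Theme]:
    """Aggregate already-clustered segments into themes with counts and quotes."""
    themes: dict[str, Theme] = {}
    for seg in segments:
        theme = themes.setdefault(seg["theme"], Theme(name=seg["theme"]))
        theme.frequency += 1
        theme.tier_counts[seg["tier"]] += 1
        if len(theme.quotes) < max_quotes:
            theme.quotes.append(seg["quote"])
    return themes
```

Keeping the per-tier counts next to the raw frequency makes it possible to see later whether a theme is common across the base or concentrated in a few large accounts.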

The eval: a senior PM (not me) read the themes blind and told me whether each theme name described a real pattern or a hallucinated category. The first run produced four themes I could not defend. The second run produced eleven, and the senior PM agreed with nine of them.

Agent three: Opportunity ranker

Input: the themes from agent two, plus a small context file about our product strategy and active bets.

Output: ranked opportunities as falsifiable statements. Format: "We could move [metric] by addressing [theme] for [segment], with a confidence of [low/med/high] based on [reach × severity]." Twelve opportunities, ranked.
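The statement format translates naturally into a small data structure. The sketch below assumes reach and severity are each scored on a 1 to 5 scale and that confidence is bucketed from their product; both the scale and the thresholds (15 and 8) are hypothetical, since the post only says confidence is based on reach × severity.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    metric: str
    theme: str
    segment: str
    reach: int     # assumed 1-5 scale
    severity: int  # assumed 1-5 scale

    @property
    def score(self) -> int:
        return self.reach * self.severity

    @property
    def confidence(self) -> str:
        # Hypothetical cut-offs; tune against your own gold set.
        if self.score >= 15:
            return "high"
        if self.score >= 8:
            return "med"
        return "low"

    def statement(self) -> str:
        return (
            f"We could move {self.metric} by addressing {self.theme} "
            f"for {self.segment}, with a confidence of {self.confidence} "
            f"based on reach {self.reach} × severity {self.severity}."
        )


def rank(opportunities: list[Opportunity]) -> list[Opportunity]:
    """Highest reach × severity first."""
    return sorted(opportunities, key=lambda o: o.score, reverse=True)
```

Forcing every opportunity through `statement()` is what keeps it falsifiable: if you cannot name the metric, the theme, and the segment, you do not have an opportunity yet.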

The eval was that at least two of the ranked opportunities had to map to bets already on our outcome ledger. If the agent could not connect the new ranking to the existing strategy, the ranking was probably hallucinated relevance. It connected three.
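As a sketch, the ledger check reduces to counting how many ranked opportunities point at a bet that actually exists. The `linked_bet` field and the bet IDs are hypothetical; deciding which bet an opportunity maps to is still a judgment made by the agent and reviewed by a human.

```python
def passes_ledger_check(
    opportunities: list[dict],
    ledger_bet_ids: set[str],
    required: int = 2,
) -> bool:
    """True when at least `required` opportunities map to a bet on the ledger."""
    mapped = [
        opp for opp in opportunities
        if opp.get("linked_bet") in ledger_bet_ids
    ]
    return len(mapped) >= required
```

An opportunity that links to a bet ID missing from the ledger does not count, which guards against the agent inventing a plausible-sounding bet.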

The week itself

Monday morning, the triage agent ran across 84 Gong calls and 312 support tickets. Total runtime, including my reviews: about three hours.

Tuesday, the theme synthesizer ran on the tagged corpus. I spent two hours reviewing themes and killing the four that did not survive the blind read. The remaining 11 themes went into the ranker.

Wednesday, the opportunity ranker produced 12 ranked opportunities. I spent three hours editing the framing, cutting two that were clearly the same opportunity stated twice, and merging two into one. Final list: nine.

Thursday, I sat with the engineering lead and turned the top four opportunities into prototype briefs. Each brief has a hypothesis, a metric to move, an acceptance criterion, and a small eval set. No PRD. Roughly two hours of work.

Friday, I wrote the synthesis memo. Four hours.

Total human time: about 20 hours. A comparable live-call discovery week, for reference, is 25 to 35 hours of call execution alone, before synthesis even starts.

Where the agents got beaten

Two places, both worth naming.

Low-volume, high-severity signal

The ranker's confidence score was driven heavily by frequency. A theme that showed up in 31 calls beat a theme that showed up in three. But the three calls were our largest enterprise account talking about an integration regression that was about to put their renewal at risk.

The agent was technically correct (low frequency, low confidence) and strategically wrong. A human had to pull that theme up the stack.

The fix for next time is a weighting layer: tier-1 accounts get a frequency multiplier. I have not built it yet. Right now I just review the bottom of the ranker's list before I trust the top.
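A minimal sketch of what that weighting layer might look like, since it does not exist yet. Each call carries an account tier, and tier-1 calls count for more than one. The multiplier value of 12 is purely illustrative; it is chosen here only so that three tier-1 calls outweigh 31 ordinary ones, and a real value would need to be calibrated.

```python
TIER1_MULTIPLIER = 12.0  # hypothetical; calibrate against known renewal risks


def weighted_frequency(
    call_tiers: list[int],
    tier1_multiplier: float = TIER1_MULTIPLIER,
) -> float:
    """Sum of per-call weights: tier-1 calls count `tier1_multiplier`, others 1."""
    return sum(tier1_multiplier if tier == 1 else 1.0 for tier in call_tiers)


# The two themes from the week, with assumed tiers for illustration.
common_theme = weighted_frequency([3] * 31)        # 31 calls, none tier-1 -> 31.0
enterprise_regression = weighted_frequency([1] * 3)  # 3 tier-1 calls -> 36.0
```

A flat multiplier is the bluntest possible version of this. It still needs the manual bottom-of-list review as a backstop, because a single tier-1 call about a renewal risk would stay below most thresholds.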

Tie-breaks between strategically different themes

Two themes came in with nearly identical scores: a new-user activation regression in the self-serve flow, and a different activation regression in the enterprise onboarding flow.

The agents could not tell me which one mattered more, because the answer depended on what we were planning to charge for in Q3 (a question the agents did not have context on, and that I had not put into the strategy context file because the decision was not final).

I made the call. Self-serve. The agent had ranked enterprise slightly higher.

The lesson is that the strategy context file needs to include the unsettled questions, not just the settled answers. The agent needs to know what is in play, so it can flag uncertainty instead of pretending to know.
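One way to sketch that idea: keep the open questions in the context alongside the settled strategy, and have the ranking step flag any near-tie whose segments are touched by an open question. The data shapes, the 0.05 tie margin, and the scores in the example are all hypothetical.

```python
def flag_uncertain_ties(
    ranked: list[dict],
    open_questions: list[dict],
    tie_margin: float = 0.05,
) -> list[str]:
    """Flag adjacent opportunities with near-equal scores that depend on an open question.

    `ranked` must be sorted by score, highest first.
    """
    flags = []
    for higher, lower in zip(ranked, ranked[1:]):
        if higher["score"] - lower["score"] > tie_margin:
            continue  # clear winner, nothing to flag
        segments = {higher["segment"], lower["segment"]}
        for question in open_questions:
            if segments & set(question["affects"]):
                flags.append(
                    f"{higher['name']} vs {lower['name']}: "
                    f"depends on open question '{question['question']}'"
                )
    return flags


# Example context entry for the unsettled pricing decision.
open_questions = [
    {"question": "What are we charging for in Q3?",
     "affects": ["self-serve", "enterprise"]},
]
ranked = [
    {"name": "enterprise onboarding regression", "segment": "enterprise", "score": 0.82},
    {"name": "self-serve activation regression", "segment": "self-serve", "score": 0.80},
]
flags = flag_uncertain_ties(ranked, open_questions)
```

The flag does not break the tie. It turns a silent, slightly-higher score into an explicit "a human needs to decide this" marker, which is the behavior the week was missing.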

The artifact

The deliverable from the week was a single page.

  • Top seven opportunities, ranked, each one falsifiable
  • A 40-quote evidence bank linked by opportunity
  • Four prototype briefs ready to hand to engineering
  • A short list of three account-specific risks the ranker under-weighted, flagged for separate handling

Compared to the deliverable from my last live-call sprint (a slide deck, a 60-page synthesis doc, and a list of recommendations I had to defend in a 90-minute review), this is structurally better. Shorter, falsifiable, with the dissent already on the page.

What live calls are still for

Three things, in my experience.

First, testing a new hypothesis. The agent week is great for synthesis of existing signal. It is bad at exploring a question nobody has asked customers yet. For that, you need to schedule the call and ask.

Second, rebuilding trust with a specific account. A live conversation has a relational function the agent week cannot replace. If a champion at a top-five account feels unheard, a synthesis memo will not fix it. A 45-minute call with the CPO will.

Third, validating a finding that surprised you. If the agent week surfaces a theme you did not expect and that contradicts your prior model, run two live calls to pressure-test it before you commit to a prototype.

The frame I have landed on: synthesis is for agents, exploration and trust are for humans. Most discovery weeks I have run in the past two years were synthesis disguised as exploration. That is the work the agent week replaces.

What to try this week

Pick one of these.

Pull six weeks of Gong calls and run a triage agent against them. Even just topic tags and sentiment is enough to surface what you have been ignoring. Two hours of work.

Or: dump your last quarter of support tickets into a theme synthesizer and ask it for the top five themes. Compare against what your team thinks the top five are. The gap is the story.

Or: take your most recent opportunity solution tree and ask an agent to grade each opportunity for falsifiability. Anything not falsifiable goes back to the drawing board.

The discovery week without calls is not a permanent operating mode. It is the move you make when the bottleneck is synthesis, which, for most product teams I have worked with in the last 18 months, is most of the time.


The three-agent stack, the gold-set evaluation template, and the prototype-brief format are on the toolkit.

Frequently asked

Did you really run discovery without any live customer calls?

For one week, yes. The inputs were a Gong dump of the prior six weeks of calls, three months of support tickets, and a sales-notes export. Three agents ran across those inputs. Live calls returned the following week, but the synthesis from the agent week was sharper than any single live-call sprint I ran in 2025.

What were the three agents?

Transcript triage (tags every Gong call by topic, sentiment, and product surface), theme synthesizer (clusters tagged segments into named themes with a quote bank and a frequency count), and opportunity ranker (turns themes into ranked opportunity statements with reach, severity, and confidence scores). Each agent has a short eval suite the next agent depends on.

Where did the agents fail?

Two places. They under-weighted low-volume but high-severity issues from large accounts, because frequency dominated the ranker's signal. And they could not break a tie between two themes with similar scores but different strategic implications. A human had to make the call on which of two activation regressions was the real blocker.

How long did the week actually take?

Setup of the three agents was about six hours over a Sunday. The week itself was four hours per day across Monday through Thursday, with Friday left for the synthesis write-up. So roughly 20 hours of human time on top of the agent runtime. A comparable live-call discovery week is 25 to 35 hours of just call execution before synthesis.

What kind of artifact came out of the week?

A one-page opportunity ranking with the top seven themes, a 40-quote evidence bank, and four prototype briefs ready to hand to engineering. The prototype briefs included acceptance criteria and a short eval set, not a PRD.

Should every discovery week look like this now?

No. The agent week is the right move when you have a backlog of unprocessed signal and need synthesis. Live calls are the right move when you are testing a new hypothesis, exploring a new segment, or rebuilding trust with a specific account. The mistake is doing live calls when synthesis was the bottleneck.

What tools did you use?

Gong as the transcript source. Zendesk and Linear for support. A Notion database for the opportunity ledger. Claude with the customer-call-notes skill for the agent layer. No proprietary stack. The point is that the inputs already existed and were being ignored.

About the author

Falk Gottlob

Product Executive · Founder, Falkster.AI

Thirty years shipping product at Microsoft Research, Adobe, Salesforce (Marketing Cloud / Quip / Slack), and several startups including one $6.5B exit and one acquired by Microsoft. Now CPO at Smartcat and founder of Falkster.AI, writing this notebook from the boardroom, not the keyboard.
