Execution · Falk Gottlob · 10 min read

Build a Measurement Agent Stack: End the Dashboard Hamster Wheel

Measurement agent stack: seven open-source Claude repos that turn dashboard-watching into a daily outcome digest, a variance detector, and a stakeholder update writer.

outcome measurement · post-launch · metrics · variance detection · stakeholder updates · AI agents · Claude Code · MCP · postgres · playwright · PM agent stack · tooling

SERIES · THE PM AGENT STACK · PART 4 OF 5

One connected system, not a set of independent tools. Read in order. The recipes here only land if the framing is in place.

  1. Overview: the destination, the gap, the bridge
  2. Discovery agent stack
  3. Build agent stack
  4. Measure agent stack ← you are here
  5. The PM Agent Stack handbook chapter

The short version

This post is part 4 of 5 of the PM Agent Stack series. It is the concrete how-to for the Measure stage of the PM operating system. If you have not read the overview yet, start there. It sets up the destination (an enterprise-wide AI brain), the gap, and why this stack is what to build today.

Measurement is the stage where most PMs feel stuck on a dashboard hamster wheel. You open the same dashboards every morning, you scan for what changed, you don't quite trust your eye, and you definitely don't have time to do this for every product area. The measurement agent stack moves the work from pull to push: the agent watches the metrics on a schedule, flags variance worth your attention, and drafts the weekly stakeholder update from raw evidence so you can edit instead of compose.

Seven repos. A specific install order. Three concrete recipes: the morning brief, the post-launch impact synthesis, and the stakeholder update writer. Skim the install order, pick one of the three recipes, run it tomorrow.

The five PM problems this stack solves

The "I open the same dashboards every morning" problem. The agent opens them for you and pushes a one-page brief at 8 a.m.

The "I notice the metric drop on day three instead of day one" problem. The agent runs variance detection nightly and flags drops the morning they happen.

The "post-launch impact analysis is always rushed" problem. The agent synthesizes the launch's effect across cohorts, segments, and time windows in twenty minutes instead of three days.

The "weekly stakeholder update eats my Friday afternoon" problem. The agent drafts the update from the raw evidence. You edit, you add the narrative, you ship in 30 minutes instead of three hours.

The "I lost track of the experiment results from last quarter" problem. Memory tools make the metric history queryable. "What was the conversion rate before the redesign?" gets answered in a sentence instead of a dig.

If three of those five would meaningfully change your week, the stack is worth building.

The seven repos to install for Measure

In install order. Most PMs will not install all seven. The base three earn their keep first.

1. anthropics/claude-code. The base. npm install -g @anthropic-ai/claude-code.

2. crystaldba/postgres-mcp. Safe Postgres MCP server with a read-only default mode. The single highest-leverage MCP for measurement work. Point it at a read-only role on your metrics replica (do not point it at production read-write credentials, ever); a sketch of the role setup follows this list.

3. A scheduling primitive. Either the cowork mode scheduled-tasks tool that ships with the falkster.ai setup, or a system cron that invokes Claude Code with a fixed prompt (a crontab sketch appears in recipe one below). The mechanism matters less than having one. Without scheduling, the agent is on-demand and you forget to call it.

4. microsoft/playwright-mcp. Browser automation for dashboards that don't have an API or aren't queryable directly. Officially Microsoft-backed. Use it sparingly; many BI tools hate scraping. If your dashboard has an API, prefer the API.

5. thedotmack/claude-mem. Long-term memory. Critical for measurement specifically because metric questions are usually historical: "what was X six weeks ago." Without persistent memory, the agent can pull the number but can't tell you what it was last quarter without re-querying. Memory also stores the variance flags from prior days so the agent knows what's new.

6. continuous-claude/continuous-claude-v2. Context-preserving ledger. Useful when your measurement work spans many sessions and the context degrades. Install when you start feeling the friction. Skip on day one.

7. One subagent collection (I use wshobson/agents). Install the data-analyst subagent and the variance-detector subagent specifically. The variance detector flags week-over-week deltas that a human would have to eyeball.

Total install time: a focused half-day. The base three (Claude Code, postgres-mcp, scheduled hook) earn their keep within a week.

Recipe one: the morning brief

The recipe that buys the most calendar back and is the easiest to set up.

Setup. Identify the five to ten metrics you'd open a dashboard to check every morning. Activations, retention, revenue, error rates, NPS open-text counts, whatever your role demands. Write a SQL query for each metric (or a Playwright snippet that reads it from a dashboard). Save them in a metrics/ directory.

The hook. Schedule a daily 7 a.m. job with this prompt: "Run each query in metrics/. Compare the result to the same metric one day, seven days, and 30 days ago (read from metrics/history.json). Flag any metric that has moved more than {threshold}. Update history.json. Write a one-page Slack message: today's numbers, week-over-week deltas, and a brief explanation of any flagged variance based on the living changelog." Send to a private Slack channel or your inbox.

What you get. By 8 a.m., a one-page brief is waiting. Most days, the brief says "everything within range." Twice a month, the brief flags something you would have noticed Wednesday instead of Monday. The cost of running this is approximately zero. The cost of not running it is a category of metric drifts you learn about late.

A note on threshold tuning. Start with three percent week-over-week as the variance flag. You'll get false positives the first week. Tune up to five or seven percent. Settle into the threshold that produces one or two flags per week. More than three flags per week means the threshold is too tight. Zero flags per month means the threshold is too loose.

Recipe two: post-launch impact synthesis

The recipe that turns a three-day analysis into a 20-minute one.

Setup. After a launch, define the success cohort (users who hit the new feature) and the control cohort (matched users who didn't). Have postgres-mcp pointed at the metrics replica. Have the launch date and the eval-as-spec output (the test cases that defined what success looks like) ready.

The pass. Open Claude Code with one prompt: "I launched feature X on {date}. The success cohort is users defined by {SQL filter}. The control cohort is matched users with {filter}. Run the standard impact analysis: per-cohort means on activation, retention day-1, retention day-7, conversion to paid, and {three feature-specific metrics}. Compare against the eval-as-spec test cases at {path}. Report: did we hit the bar? What was the magnitude? What's the confidence interval? What surprised us?"

The agent runs the queries, computes the deltas, and produces a four-page synthesis. You edit, you add the narrative ("the redesign drove a 12% lift in trial-to-paid, but only in the SMB segment, which is consistent with the assumption we tested in March"), and you ship it as the post-launch readout.

The thing that makes this recipe work is the eval-as-spec output. Without it, "did we hit the bar" is a vague question. With it, the question has specific test cases the agent can answer. This is one of the highest-leverage applications of the eval-is-the-spec practice.

Recipe three: the stakeholder update writer

The recipe that ends the Friday afternoon ritual of crafting the weekly note.

Setup. Define the structure of your weekly update: the standing sections (numbers, what shipped, what's next, what's at risk). Save the structure as a template in templates/weekly-update.md. Make sure the living changelog is up to date because it feeds the "what shipped" section.

The job. Schedule a Friday 3 p.m. job with this prompt: "Read templates/weekly-update.md. Pull this week's numbers using the metrics/ queries (vs. last week and four-week average). Pull this week's shipped items from the living changelog. Pull at-risk items from the anti-backlog. Draft a stakeholder update following the template. Use the practitioner voice: short sentences, no jargon, lead with the numbers, no hedging."

You get a 70%-finished draft at 3:15. You spend 30 minutes adding the narrative voice (the part the agent can't write because it requires judgment about emphasis), and the update goes out at 4 p.m. instead of 6 p.m. on Friday. The agent does the structural work; you do the editorial work. That's the right division of labor.

A note on voice. The agent will lapse into corporate fluff if you don't push back. Read the voice guide (or your equivalent) and put a constraint in the prompt: "no hedging, no 'we believe', no 'fundamentally', no 'leverage' as a verb." Tune the constraints until the draft sounds like you.

What stays human

The agent doesn't decide which metrics matter. The agent doesn't decide whether a 12% lift is the lift you wanted or a fluke. The agent doesn't write the narrative that says "this matters because next quarter we're betting on..." The agent definitely doesn't decide whether to keep, kill, or extend a feature based on the impact analysis.

The agent flags. The PM judges. That division is the entire game in measurement work. Get it wrong (let the agent decide) and you lose the practice that makes a good PM. Get it right (let the agent surface, you decide) and you compound your judgment instead of replacing it.

Read the impact loop for the longer argument about how outcome thinking has to stay human-led.

What to do this week

Block a half day. Pick recipe one. Wire up the morning brief for five metrics. Schedule it. Watch what arrives Monday morning.

If the brief produces something useful in the first week, schedule a second half-day to set up recipe two before your next launch. If recipe one feels noisy or underwhelming, tune the metrics and the threshold before adding more recipes. The stack rewards depth over breadth.

The dashboard hamster wheel is the one to break first. Most PMs don't realize how much of their week it eats until they get the morning brief and feel the contrast.

Final part · Part 5 of 5 (canonical)

The PM Agent Stack: handbook chapter

The canonical reference. Eighteen categories of repos mapped to the seven-stage PM operating system, with the install order and the critical limitations spelled out.

Sources: anthropics/claude-code, crystaldba/postgres-mcp, microsoft/playwright-mcp, thedotmack/claude-mem, continuous-claude/continuous-claude-v2, wshobson/agents. The full taxonomy and the credit to Divyanshi Sharma's Instagram carousel of the Claude ecosystem are in the handbook chapter.


Frequently asked

What does a measurement agent stack actually do?

Three things. It pulls metrics from the systems that hold them (databases, dashboards, analytics tools) on a schedule. It compares the new numbers against historical context and flags variance worth your attention. It turns the raw numbers into a stakeholder-ready update so the PM stops handcrafting the same weekly note. The compound effect is that you learn about metric drift before the data team or the exec asks.

Why not just use the dashboards I already have?

Because dashboards are pull-based and the human attention span is the bottleneck. The agent reads the dashboards on a schedule, compares them against historical patterns, and pushes only the variances that matter. You go from 'I should check the dashboard' (which you don't, consistently) to 'the agent flagged something' (which you act on). That's a different relationship to data.

How does this work without putting customer data at risk?

The MCPs you connect determine the blast radius. postgres-mcp defaults to read-only. Playwright-mcp acts on dashboards you authenticate to, exactly as you would. Customer data stays in the systems it lives in. The agent reads, summarizes, and reports. It does not export raw rows to public destinations unless you write a tool that does that, which you should not.

Do I need a data team's permission to set this up?

You need the same permission you'd need to query the data yourself. If you have read access to a metrics database, you can wire postgres-mcp to it. If you have access to your analytics dashboards, Playwright can read them. The agent doesn't escalate your privileges. Loop in the data team early, both for collaboration and so they're not surprised by the queries that show up in the audit log.

How long does this stack take to set up?

A focused half-day for the base. Most of the time goes into wiring the data sources (database connection, dashboard auth, scheduled hooks). The agent layer on top is fast. Plan a half-day, schedule it, do the work end-to-end. Half-installs sit unfinished and don't earn their keep.

What is the smallest version that's still useful?

Three installs. Claude Code, postgres-mcp pointed at a read-only metrics replica, and a scheduled hook that runs a daily digest query. That alone replaces the morning ritual of opening four dashboards. Add Playwright and the variance subagent only when you've felt the limits of just-postgres.
