Execution · Falk Gottlob · 10 min read

Build a Measurement Agent Stack: End the Dashboard Hamster Wheel

Measurement agent stack: seven open-source Claude repos that turn dashboard-watching into a daily outcome digest, a variance detector, and a stakeholder update writer.

outcome measurement · post-launch · metrics · variance detection · stakeholder updates · AI agents · Claude Code · MCP · postgres · playwright · PM agent stack · tooling

SERIES · THE PM AGENT STACK · PART 4 OF 5

One connected system, not a set of independent tools. Read in order. The recipes here only land if the framing is in place.

  1. Overview: the destination, the gap, the bridge
  2. Discovery agent stack
  3. Build agent stack
  4. Measure agent stack ← you are here
  5. The PM Agent Stack handbook chapter

The short version

This post is part 4 of 5 of the PM Agent Stack series. It is the concrete how-to for the Measure stage of the PM operating system. If you have not read the overview yet, start there. It sets up the destination (an enterprise-wide AI brain), the gap, and why this stack is what to build today.

Measurement is the stage where most PMs feel stuck on a dashboard hamster wheel. You open the same dashboards every morning, you scan for what changed, you don't quite trust your eye, and you definitely don't have time to do this for every product area. The measurement agent stack moves the work from pull to push: the agent watches the metrics on a schedule, flags variance worth your attention, and drafts the weekly stakeholder update from raw evidence so you can edit instead of compose.

Seven repos. A specific install order. Three concrete recipes: the morning brief, the post-launch impact synthesis, and the stakeholder update writer. Skim the install order, pick one of the three recipes, run it tomorrow.

The five PM problems this stack solves

The "I open the same dashboards every morning" problem. The agent opens them for you and pushes a one-page brief at 8 a.m.

The "I notice the metric drop on day three instead of day one" problem. The agent runs variance detection nightly and flags drops the morning they happen.

The "post-launch impact analysis is always rushed" problem. The agent synthesizes the launch's effect across cohorts, segments, and time windows in twenty minutes instead of three days.

The "weekly stakeholder update eats my Friday afternoon" problem. The agent drafts the update from the raw evidence. You edit, you add the narrative, you ship in 30 minutes instead of three hours.

The "I lost track of the experiment results from last quarter" problem. Memory tools make the metric history queryable. "What was the conversion rate before the redesign?" gets answered in a sentence instead of a dig.

If three of those five would meaningfully change your week, the stack is worth building.

The seven repos to install for Measure

In install order. Most PMs will not install all seven. The base three earn their keep first.

1. anthropics/claude-code. The base. npm install -g @anthropic-ai/claude-code.

2. crystaldba/postgres-mcp. Safe Postgres MCP server with a read-only default mode. The single highest-leverage MCP for measurement work. Point it at a read-only role on your metrics replica (do not point it at production read-write credentials, ever); a sketch of the role setup follows this list.

3. A scheduling primitive. Either the cowork mode scheduled-tasks tool that ships with the falkster.ai setup, or a system cron that invokes Claude Code with a fixed prompt (a crontab sketch appears in recipe one below). The mechanism matters less than having one. Without scheduling, the agent is on-demand and you forget to call it.

4. microsoft/playwright-mcp. Browser automation for dashboards that don't have an API or aren't queryable directly. Officially Microsoft-backed. Use it sparingly; many BI tools hate scraping. If your dashboard has an API, prefer the API.

5. thedotmack/claude-mem. Long-term memory. Critical for measurement specifically because metric questions are usually historical: "what was X six weeks ago." Without persistent memory, the agent can pull the number but can't tell you what it was last quarter without re-querying. Memory also stores the variance flags from prior days so the agent knows what's new.

6. continuous-claude/continuous-claude-v2. Context-preserving ledger. Useful when your measurement work spans many sessions and the context degrades. Install when you start feeling the friction. Skip on day one.

7. One subagent collection (I use wshobson/agents). Install the data-analyst subagent and the variance-detector subagent specifically. The variance detector flags week-over-week deltas that a human would have to eyeball.

Total install time: a focused half-day. The base three (Claude Code, postgres-mcp, scheduled hook) earn their keep within a week.

Recipe one: the morning brief

The recipe that buys the most calendar back and is the easiest to set up.

Setup. Identify the five to ten metrics you'd open a dashboard to check every morning. Activations, retention, revenue, error rates, NPS open-text counts, whatever your role demands. Write a SQL query for each metric (or a Playwright snippet that reads it from a dashboard). Save them in a metrics/ directory.

The hook. Schedule a daily 7 a.m. job with this prompt: "Run each query in metrics/. Compare the result to the same metric one day, seven days, and 30 days ago (read from metrics/history.json). Flag any metric that has moved more than {threshold}. Update history.json. Write a one-page Slack message: today's numbers, week-over-week deltas, and a brief explanation of any flagged variance based on the living changelog." Send to a private Slack channel or your inbox.

What you get. By 8 a.m., a one-page brief is waiting. Most days, the brief says "everything within range." Twice a month, the brief flags something you would have noticed Wednesday instead of Monday. The cost of running this is approximately zero. The cost of not running it is a category of metric drifts you learn about late.

A note on threshold tuning. Start with three percent week-over-week as the variance flag. You'll get false positives the first week. Tune up to five or seven percent. Settle into the threshold that produces one or two flags per week. More than three flags per week means the threshold is too tight. Zero flags per month means the threshold is too loose.

Recipe two: post-launch impact synthesis

The recipe that turns a three-day analysis into a 20-minute one.

Setup. After a launch, define the success cohort (users who hit the new feature) and the control cohort (matched users who didn't). Have postgres-mcp pointed at the metrics replica. Have the launch date and the eval-as-spec output (the test cases that defined what success looks like) ready.

The pass. Open Claude Code with one prompt: "I launched feature X on {date}. The success cohort is users defined by {SQL filter}. The control cohort is matched users with {filter}. Run the standard impact analysis: per-cohort means on activation, retention day-1, retention day-7, conversion to paid, and {three feature-specific metrics}. Compare against the eval-as-spec test cases at {path}. Report: did we hit the bar? What was the magnitude? What's the confidence interval? What surprised us?"

The agent runs the queries, computes the deltas, and produces a four-page synthesis. You edit, you add the narrative ("the redesign drove a 12% lift in trial-to-paid, but only in the SMB segment, which is consistent with the assumption we tested in March"), and you ship it as the post-launch readout.

The thing that makes this recipe work is the eval-as-spec output. Without it, "did we hit the bar" is a vague question. With it, the question has specific test cases the agent can answer. This is one of the highest-leverage applications of the eval-is-the-spec practice.

Recipe three: the stakeholder update writer

The recipe that ends the Friday afternoon ritual of crafting the weekly note.

Setup. Define the structure of your weekly update: the standing sections (numbers, what shipped, what's next, what's at risk). Save the structure as a template in templates/weekly-update.md. Make sure the living changelog is up to date because it feeds the "what shipped" section.

The job. Schedule a Friday 3 p.m. job with this prompt: "Read templates/weekly-update.md. Pull this week's numbers using the metrics/ queries (vs. last week and four-week average). Pull this week's shipped items from the living changelog. Pull at-risk items from the anti-backlog. Draft a stakeholder update following the template. Use the practitioner voice: short sentences, no jargon, lead with the numbers, no hedging."

You get a 70%-finished draft at 3:15. You spend 30 minutes adding the narrative voice (the part the agent can't write because it requires judgment about emphasis), and the update goes out at 4 p.m. instead of 6 p.m. on Friday. The agent does the structural work; you do the editorial work. That's the right division of labor.

A note on voice. The agent will lapse into corporate fluff if you don't push back. Read the voice guide (or your equivalent) and put a constraint in the prompt: "no hedging, no 'we believe', no 'fundamentally', no 'leverage' as a verb." Tune the constraints until the draft sounds like you.

What stays human

The agent doesn't decide which metrics matter. The agent doesn't decide whether a 12% lift is the lift you wanted or a fluke. The agent doesn't write the narrative that says "this matters because next quarter we're betting on..." The agent definitely doesn't decide whether to keep, kill, or extend a feature based on the impact analysis.

The agent flags. The PM judges. That division is the entire game in measurement work. Get it wrong (let the agent decide) and you lose the practice that makes a good PM. Get it right (let the agent surface, you decide) and you compound your judgment instead of replacing it.

Read the impact loop for the longer argument about how outcome thinking has to stay human-led.

What to do this week

Block a half day. Pick recipe one. Wire up the morning brief for five metrics. Schedule it. Watch what arrives Monday morning.

If the brief produces something useful in the first week, schedule a second half-day to set up recipe two before your next launch. If recipe one feels noisy or underwhelming, tune the metrics and the threshold before adding more recipes. The stack rewards depth over breadth.

The dashboard hamster wheel is the one to break first. Most PMs don't realize how much of their week it eats until they get the morning brief and feel the contrast.

Final part · Part 5 of 5 (canonical)

The PM Agent Stack: handbook chapter

The canonical reference. Eighteen categories of repos mapped to the seven-stage PM operating system, with the install order and the critical limitations spelled out.

Sources: anthropics/claude-code, crystaldba/postgres-mcp, microsoft/playwright-mcp, thedotmack/claude-mem, continuous-claude/continuous-claude-v2, wshobson/agents. The full taxonomy and the credit to Divyanshi Sharma's Instagram carousel of the Claude ecosystem are in the handbook chapter.


Frequently asked

What does a measurement agent stack actually do?

Three things. It pulls metrics from the systems that hold them (databases, dashboards, analytics tools) on a schedule. It compares the new numbers against historical context and flags variance worth your attention. It turns the raw numbers into a stakeholder-ready update so the PM stops handcrafting the same weekly note. The compound effect is that you learn about metric drift before the data team or the exec asks.

Why not just use the dashboards I already have?

Because dashboards are pull-based and the human attention span is the bottleneck. The agent reads the dashboards on a schedule, compares them against historical patterns, and pushes only the variances that matter. You go from 'I should check the dashboard' (which you don't, consistently) to 'the agent flagged something' (which you act on). That's a different relationship to data.

How does this work without putting customer data at risk?

The MCPs you connect determine the blast radius. postgres-mcp defaults to read-only. Playwright-mcp acts on dashboards you authenticate to, exactly as you would. Customer data stays in the systems it lives in. The agent reads, summarizes, and reports. It does not export raw rows to public destinations unless you write a tool that does that, which you should not.

Do I need a data team's permission to set this up?

You need the same permission you'd need to query the data yourself. If you have read access to a metrics database, you can wire postgres-mcp to it. If you have access to your analytics dashboards, Playwright can read them. The agent doesn't escalate your privileges. Loop in the data team early, both for collaboration and so they're not surprised by the queries that show up in the audit log.

How long does this stack take to set up?

A focused half-day for the base. Most of the time goes into wiring the data sources (database connection, dashboard auth, scheduled hooks). The agent layer on top is fast. Plan a half-day, schedule it, do the work end-to-end. Half-installs sit unfinished and don't earn their keep.

What is the smallest version that's still useful?

Three installs. Claude Code, postgres-mcp pointed at a read-only metrics replica, and a scheduled hook that runs a daily digest query. That alone replaces the morning ritual of opening four dashboards. Add Playwright and the variance subagent only when you've felt the limits of just-postgres.
