Ship With Observability or Don't Ship

No feature leaves staging without the traces, metrics, and evals that will tell you whether it's working, before your first customer hits it.

Falk Gottlob · 6 min read

The gap where PMs fall off

The gap between your prototype and production is where most PMs fall off the Builder path.

You ship a beautiful prototype. Hand it to engineering. They wire it up. It goes live. A week later you ask "how's it doing?" and get a shrug. Nobody set up instrumentation. Nobody knows the adoption curve. Nobody can tell you what's broken because nothing's being measured. By the time dashboards are stitched together, it's three weeks post-launch, half the customers who tried it have already bounced, and the learning window is closed.

I've watched this happen at every company I've worked at. It's preventable. It's also non-negotiable.

The rule: no feature leaves staging without the traces, metrics, and evals that will tell you whether it's working, before your first customer hits it.

Why "we'll add metrics later" keeps failing

Three predictable failures, in order of severity.

You can't tell if it's working. Launch. Some customers use it. Some don't. Some use it once and never again. Without instrumentation, no idea which is which. You guess. You guess wrong. You invest in the wrong things or kill the feature based on vibes. Feedback loop closes weeks later when sales escalates a complaint or finance flags a cost spike.

Regressions go undetected. Without an eval set running against production, the silent regression (the one that happens when a model updates, a prompt drifts, a downstream tool changes) finds your customer before it finds you.
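
What "an eval set running against production" looks like in practice is small. A minimal sketch, assuming you already have a scored eval harness; run_eval_set and alert are stand-ins for that harness and your paging tool, and the set name and threshold are invented:

    # Daily eval gate. Illustrative: run_eval_set and alert stand in for
    # your eval harness and paging tool; the set name and threshold come
    # from the instrumentation contract (below).
    EVAL_SET = "summarizer-golden-120"   # hypothetical named eval set
    THRESHOLD = 0.85                     # agreed ship/hold line

    def daily_eval_gate(run_eval_set, alert) -> float:
        score = run_eval_set(EVAL_SET)   # e.g. mean pass rate over graded cases
        if score < THRESHOLD:
            # The point: the regression pages you before it reaches a customer.
            alert(f"{EVAL_SET} dropped to {score:.2f} (threshold {THRESHOLD})")
        return score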

You can't justify the next investment. Stakeholder asks "should we double down on this?" No data. Takes three weeks to instrument what you should have instrumented at launch. Moment has passed.

"Adding metrics later" isn't a slight delay. It's a multi-week feedback loop being absent during the most learnable period of the feature's life. Strategic loss disguised as tactical convenience.

The instrumentation contract

Before any code is written, PM and engineer write a one-page contract.

1. The success metric. What's the one number that tells us this feature works? Not engagement. Not click-through. The actual outcome. Specific. Quantifiable. Tied to the user job.

2. The leading indicators. What earlier signals tell us if the success metric will land? Adoption rate in week 1. Time-to-first-success. Percent of users who reach the "aha" moment. Leading indicators are what I watch in the first two weeks, before the lagging metric is reliable.

3. The cost meter. Cost per successful action for this feature. Real-time. Per user, per workspace, per surface. If this number goes wrong, the rest of the metrics don't matter. Feature is unprofitable and either gets repriced or killed.

4. The eval set. Named eval set for this feature, current score, threshold below which the feature shouldn't ship. Runs daily against production once live.

5. The trace points. What user actions, system calls, model calls, tool invocations get traced? Sample rate? Retention period? I don't need to instrument everything. I need to instrument the path that matters. I write down which path that is (see the tracing sketch after this list).

6. The dashboard URL. Where will all of this be visible, in one place, refreshed automatically? Exists before launch, even if the data is empty.

7. The kill condition. If metric X stays below threshold Y for time Z, feature is reviewed for deprecation. Written down before launch. Makes the future deprecation conversation 10x easier (see The Deprecation Playbook).

One page. Negotiated up front. Engineering builds against it. PM signs off. Nothing ships without all seven.
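
The contract is prose on one page, but its shape is simple enough to pin down in code. A minimal sketch in Python; the class and every field name are conventions invented here, not a standard:

    from dataclasses import dataclass, field

    @dataclass
    class InstrumentationContract:
        """One page, written before any code. The shape is illustrative."""
        feature: str
        success_metric: str       # 1. the one number, tied to the user job
        leading_indicators: list[str] = field(default_factory=list)  # 2. week-1 signals
        cost_meter: str = ""      # 3. e.g. spend per successful action, per workspace
        eval_set: str = ""        # 4. named set, with a ship threshold
        eval_threshold: float = 0.0
        trace_points: list[str] = field(default_factory=list)  # 5. only the path that matters
        trace_sample_rate: float = 1.0
        dashboard_url: str = ""   # 6. exists before launch, even if the data is empty
        kill_condition: str = ""  # 7. "if X stays below Y for Z, review for deprecation"

        def is_complete(self) -> bool:
            # Nothing ships without all seven.
            return all([self.success_metric, self.leading_indicators,
                        self.cost_meter, self.eval_set, self.trace_points,
                        self.dashboard_url, self.kill_condition])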
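
Item 5 is the most code-shaped of the seven. A sketch of tracing only the path that matters, at a sample rate, using OpenTelemetry's Python SDK; the span names, attributes, and the 10% rate are assumptions for a hypothetical summarizer feature:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    # Sample 10% of traces; swap ConsoleSpanExporter for your backend's exporter.
    provider = TracerProvider(sampler=TraceIdRatioBased(0.10))
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("feature.summarizer")

    def handle_request(workspace_id: str, doc: str) -> str:
        # Trace the path the contract names (user request, then model call), nothing else.
        with tracer.start_as_current_span("summarize.request") as span:
            span.set_attribute("workspace.id", workspace_id)
            with tracer.start_as_current_span("summarize.model_call") as call:
                summary = doc[:100]  # stand-in for the real model call
                call.set_attribute("model.cost_usd", 0.0042)  # feeds the cost meter
            return summary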

The Definition of Done upgrade

Most teams' Definition of Done includes "QA passed," "code reviewed," "docs updated." Add four lines:

  • Instrumentation contract written and approved.
  • Eval set in place, current score above threshold.
  • Dashboard live, populated with real data from a staging cohort.
  • Cost-per-action measurement validated end-to-end.

If any of these four is missing, the feature is not done. It is "code complete." Different state. Different decision.

This is the move that hardens the new practice into team culture. It's not a PM preference. It's a checklist item in the merge process.
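
Mechanically, that checklist item can be a CI gate. A sketch, assuming the contract is committed as a JSON file alongside the feature; the filename and field names follow the hypothetical convention above:

    import json
    import sys

    REQUIRED = ["success_metric", "leading_indicators", "cost_meter", "eval_set",
                "trace_points", "dashboard_url", "kill_condition"]

    def check_contract(path: str) -> int:
        """CI gate: a non-zero exit blocks the merge if the contract is incomplete."""
        try:
            with open(path) as f:
                contract = json.load(f)
        except FileNotFoundError:
            print(f"{path} missing: feature is code complete, not done")
            return 1
        missing = [k for k in REQUIRED if not contract.get(k)]
        if missing:
            print("contract incomplete, missing: " + ", ".join(missing))
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(check_contract("instrumentation_contract.json"))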

The "good enough" rule on instrumentation

The trap on the other side is over-instrumentation. PMs new to this discipline ask for everything to be measured, everywhere, all the time. Result: a dashboard nobody can read, a data pipeline that costs more than the feature, and instrumentation work that delays launch by weeks.

The right amount is the minimum that lets you make the next decision. That's it. You can always add more later if a question emerges the current data can't answer. You can rarely walk back the cost of over-instrumentation.

A useful test: imagine looking at the dashboard a week post-launch. What's the first decision you'll need to make? Instrument exactly enough to support that decision. Cut everything else. You'll be surprised how spare that is.

What the PM actually owns here

The PM doesn't do the instrumentation work. The PM owns what gets instrumented and why. The engineer implements.

The PM's job:

  • Writing the contract before work starts.
  • Pushing back when "we'll add it later" appears in the plan.
  • Owning the dashboard after launch.
  • Making the kill-or-double-down call when the data points clearly.

That last one is where it matters most. Most PMs have an instinct to keep features alive past their data-supported lifespan because killing them is awkward. The contract removes the awkwardness. The kill condition was agreed up front. You're not killing a baby. You're executing the plan.

Observability plus evals: the same loop

Observability and evals are the same loop from two angles. Evals tell you what the system is doing. Observability tells you what the users are experiencing. Both must run continuously. Both must be visible on the same page.

When the offline eval score is high but production usage is flat, you have a discovery problem (people aren't finding or trying the feature). When the eval score is low but usage is high, you have a quality problem (people are using it but it's failing them). When both are good, you have a winner. When both are bad, you have a learning.

Most teams in 2026 only watch one or the other. The builder PM watches both. The pair is much more informative than either alone.
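
The 2x2 is mechanical enough to write down. A sketch; both thresholds are stand-ins you'd set per feature in the instrumentation contract:

    def diagnose(eval_score: float, weekly_active_users: int,
                 eval_ok: float = 0.85, usage_ok: int = 100) -> str:
        """Map the (evals, observability) pair to one of the four diagnoses."""
        quality = eval_score >= eval_ok            # what the system is doing
        demand = weekly_active_users >= usage_ok   # what users are experiencing
        if quality and demand:
            return "winner: double down"
        if quality:
            return "discovery problem: people aren't finding or trying it"
        if demand:
            return "quality problem: people are using it and it's failing them"
        return "learning: neither quality nor demand is proven"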

Pick one thing this week

You're about to ship something. Write the contract.

  1. Open a page. Title it "[Feature Name] Instrumentation Contract."
  2. Write the seven items: success metric, leading indicators, cost meter, eval set, trace points, dashboard URL, kill condition. (A filled-in example follows these steps.)
  3. Share with your engineer and your data person. Ask them to push back on any item that's vague.
  4. Once signed, commit it to the feature's ticket in Jira or Linear. Make it part of the Definition of Done checklist.
  5. Don't ship the feature until all seven are in place.
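
To make step 2 concrete, here is the contract shape sketched earlier, filled in for a hypothetical feature; every value is invented:

    contract = InstrumentationContract(
        feature="inbox-summarizer",
        success_metric="summaries accepted without edits / summaries generated",
        leading_indicators=["week-1 adoption rate", "time-to-first-accepted-summary"],
        cost_meter="model spend per accepted summary, per workspace, real-time",
        eval_set="summarizer-golden-120",
        eval_threshold=0.85,
        trace_points=["summarize.request", "summarize.model_call"],
        trace_sample_rate=0.10,
        dashboard_url="https://dashboards.example.internal/inbox-summarizer",
        kill_condition="acceptance rate below 20% for 30 days triggers a deprecation review",
    )
    assert contract.is_complete()  # step 5: nothing ships until all seven are in place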

Do this for one feature this week. Do it for every feature next month. Within a quarter, no feature ships on your team without observability, and you stop losing the learning window on every launch.

A feature without observability is a feature you've shipped on faith. Faith is not a strategy.


Frequently asked

Why does 'we'll add metrics later' always fail?

Three reasons. You can't tell if it's working (you guess wrong). Regressions go undetected (silent ones find customers first). You can't justify investment (instrumenting after the fact takes weeks). The feedback loop is absent during the most learnable period of the feature's life.

What should the instrumentation contract cover?

Success metric. Leading indicators. Cost meter. Eval set. Trace points. Dashboard URL. Kill condition. One page, written before any code, negotiated with engineering. Nothing ships without all seven.

What's the difference between success metric and leading indicators?

Success metric is the one number proving the feature works (specific, tied to user job). Leading indicators are early signals predicting if success metric will land (adoption week 1, time-to-first-success, percent reaching aha moment).

How much instrumentation is enough?

The minimum that lets you make the next decision. Imagine looking at the dashboard a week post-launch. What's the first decision you'll need to make? Instrument exactly enough to support it. Cut everything else. You'll be surprised how spare it is.

What if I don't have the kill condition written before launch?

Then you'll keep a dead feature alive past its data-supported lifespan because killing it is awkward. Write the condition up front. It removes the awkwardness. You're not killing a baby. You're executing the plan.
