Incident Response Is a PM Ritual
An incident is a customer telling you the truth about your product, loudly, all at once. Stop letting engineering listen alone.
The cheapest discovery your company already does
Most PMs treat incidents as engineering's problem. They show up only when sales escalates, only to perform concern. They don't read the post-mortem. They don't extract a product lesson. They go back to the roadmap meeting and the cycle repeats.
This is the most expensive habit I see in PM practice. Every incident is a bundle of compressed truth: a latent assumption that was wrong, a missing guardrail, a mis-scoped user segment, a feature that worked in the demo but not in the wild, an integration nobody mapped. The infra fix is the engineer's job. The product lesson is the PM's job.
Almost nobody does the second job. That's where this chapter starts.
What changes when the PM owns the post-mortem
Engineering owns the engineering post-mortem. PM owns the product post-mortem. They're different documents. They cover different ground. They both happen, in parallel, after every Sev-2 or worse.
The PM-owned post-mortem is a one-page doc, written within 48 hours of incident close. Six sections.
1. What broke from the user's perspective. Not "service X returned 503s." User-facing description: "users in workflow Y could not complete step Z; they saw [specific behavior]; they tried [specific workarounds]."
2. The product assumption that was wrong. Every incident exposes one. "We assumed users would only upload documents under 10MB." "We assumed the model would handle ambiguous tool calls gracefully." "We assumed cost per action wouldn't spike when usage doubled." Name the assumption.
3. The eval gap. What test in the eval set should have caught this and didn't? Either the eval set was missing this case, or the rubric didn't penalize this behavior, or the eval was correct and production diverged from it. Name which.
4. What customers told us during the incident. Direct quotes. Patterns from the support flood. The five things customers said that we wouldn't have heard otherwise. This is the most valuable section. Most teams don't write it because they're too busy fighting the fire to listen.
5. The product-level fix. Separate from the infra fix. Often involves changing default behavior, adding a fallback experience, reshaping a workflow, tightening a guardrail, removing a foot-gun feature, or adjusting pricing for a high-cost flow that's now exposed.
6. The updated eval set. The eval set is updated to include the case that was missed, and the change is committed alongside the post-mortem (see the sketch below). Next time this pattern emerges, it's caught in CI before reaching production.
That's the document. It's read aloud at the team's weekly review. Action items have owners and dates. Six weeks later, the team verifies the fixes landed.
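To make section 6 concrete, here is a minimal sketch of what an eval-set addition can look like, assuming a pytest-style harness. Everything here is a hypothetical stand-in (`run_workflow`, the case schema, the incident ID); the point is that the missed case lands in CI in the same commit as the post-mortem.

```python
# Hypothetical sketch, not a real harness: the case the post-mortem
# surfaced becomes a permanent eval entry, committed in the same change.
from dataclasses import dataclass

import pytest


@dataclass
class WorkflowResult:
    status: str  # "ok" or "error"
    output: str  # user-visible result or error message


def run_workflow(payload: dict) -> WorkflowResult:
    """Stand-in for your product's real entry point; wire up before running."""
    raise NotImplementedError


# Each case names the incident that produced it, so the eval set
# doubles as an index into the post-mortem library.
EVAL_CASES = [
    # Section 2's wrong assumption: "users only upload documents under 10MB".
    {"id": "INC-000-oversized-upload", "payload": {"doc_size_mb": 48}},
]


@pytest.mark.parametrize("case", EVAL_CASES, ids=lambda c: c["id"])
def test_incident_regression(case):
    result = run_workflow(case["payload"])
    # The product-level expectation after the fix: reject explicitly,
    # with a user-facing message, instead of timing out or going silent.
    assert result.status == "error"
    assert "too large" in result.output.lower()
```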
Customer comms during the live incident
The PM owns the customer comms during a live incident. Not marketing alone. Not support alone. The PM, because the customer comms shape the experience of the incident more than the technical resolution does.
Three rules.
Acknowledge fast, even if you don't know the cause. Silence reads as ignorance or indifference. A short "we're seeing X, we're investigating, we'll update in 30 minutes" is the move.
Estimate honestly. Bad estimates erode trust faster than the incident does. "We don't know yet" is better than "should be fixed in 15 minutes" repeated for two hours.
Tell them what you learned, after. The post-incident note explaining what happened, what you're changing, what you'll do differently next time, is the single most trust-building artifact a product team produces. Most teams skip it. Send it.
This is also the moment to capture which customers were materially affected and follow up individually. The customer who reports a bug during an incident and gets a personal note a week later, explaining what was fixed and why, is the customer who upgrades next quarter.
Incident frequency as product signal
After three months of running PM-owned post-mortems, patterns become visible in the incident log. Watch for these:
Same surface, repeated. The same feature or workflow keeps producing incidents. Infra fixes have layered up. The underlying product design is wrong. Stop fixing the symptom. Redesign the surface.
Same root assumption, different surfaces. "We assumed users would tolerate latency over 5 seconds." This was wrong in feature A in March, in feature B in May, in feature C in July. The assumption is wrong globally, not locally. Change the global default.
Cost-driven incidents. A growing share of incidents are about cost spikes: runaway usage on a flow nobody priced for, infinite loops in agent calls, retry storms after a minor outage. These are not bugs. They're unit-economics design flaws. Address them as pricing or architecture issues.
Customer-facing escalations dominating. If a growing share of incidents reach customer-visible status before alerts catch them, your observability is weaker than your customers' patience. Invest in detection, not in PR.
These patterns are leading indicators of product health. NPS lags by quarters. Incident patterns lead by weeks. The PM who watches the incident log carefully knows a quarter ahead what's about to break, the way a sailor knows what the weather will be in six hours.
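None of this requires special tooling. If each product post-mortem is logged as a small structured record, the first three patterns fall out of a few lines of analysis. A sketch, with made-up field names rather than any standard schema:

```python
# Illustrative sketch: mine a structured incident log for the first
# three patterns. Field names and records are invented for the example.
from collections import Counter, defaultdict

incident_log = [
    {"surface": "bulk-upload", "assumption": "uploads stay under 10MB", "cost_driven": False},
    {"surface": "agent-runner", "assumption": "users tolerate >5s latency", "cost_driven": True},
    {"surface": "editor", "assumption": "users tolerate >5s latency", "cost_driven": False},
    {"surface": "bulk-upload", "assumption": "uploads stay under 10MB", "cost_driven": False},
]

# Same surface, repeated: redesign candidates, not another patch.
repeat_surfaces = [s for s, n in Counter(i["surface"] for i in incident_log).items() if n >= 2]

# Same assumption, different surfaces: the assumption is globally wrong.
surfaces_by_assumption = defaultdict(set)
for i in incident_log:
    surfaces_by_assumption[i["assumption"]].add(i["surface"])
global_assumptions = [a for a, s in surfaces_by_assumption.items() if len(s) >= 2]

# Cost-driven share: a rising ratio means unit-economics flaws, not bugs.
cost_share = sum(i["cost_driven"] for i in incident_log) / len(incident_log)

print(repeat_surfaces)      # ['bulk-upload']
print(global_assumptions)   # ['users tolerate >5s latency']
print(f"{cost_share:.0%}")  # 25%
```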
Incidents I've actually learned the most from
The Sev-1 I learned the most from at Smartcat involved a translation flow that hit a model timeout under specific document structures and returned an empty result, which the workflow accepted as valid output. Engineering's fix was a retry plus a non-empty validator.
My product post-mortem found three things engineering's didn't:
- We had assumed workflow steps would either succeed or fail loudly. They could also succeed silently with empty output. This assumption was wrong across at least four other surfaces.
- Customers had been working around the silent-empty failure for weeks before the incident, never reporting it because they didn't realize it was a bug. Support had 23 tickets about "translations that look empty," all marked "user error."
- Our eval set tested for translation correctness but not for non-empty output. Trivial gap. Easy fix (sketched below).
Three product changes from one incident. Engineering's post-mortem was about the timeout. Mine was about the system that hid the problem from us. Different documents. Both needed.
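The "easy fix" really is small. A hedged sketch, not Smartcat's actual code: `translate` stands in for the workflow step that timed out, and the shape check gates the output before any quality rubric runs.

```python
# Hypothetical sketch of the missing eval check: gate on output shape
# before scoring quality. translate is a stand-in for the real step.
from typing import Callable


def check_non_empty(output: str) -> None:
    """Fail loudly on the silent-empty case the incident exposed."""
    if not output or not output.strip():
        raise AssertionError("step reported success but returned empty output")


def eval_translation(source: str, translate: Callable[[str], str]) -> str:
    output = translate(source)  # the step that could time out and return ""
    check_non_empty(output)     # the gate the eval set was missing
    return output               # correctness rubrics run after this gate
```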
Pick one thing this week
There's probably an incident from the last 30 days for which you never wrote a product post-mortem. Write it now.
- Pick the most recent Sev-2-or-worse incident.
- Open a doc. Write the six sections (user-facing description, wrong assumption, eval gap, what customers said, product fix, updated eval).
- Be honest. The whole point is the assumption that was wrong, not the heroic engineering response.
- Share with your team. Schedule a 20-minute walkthrough at the next weekly review.
- Add at least one new test case to the eval set based on what you found.
Do this for every incident going forward. Within a quarter you'll have a library of compressed truths about your product that you can mine for the next several roadmap decisions. An incident is a customer telling you the truth about your product, loudly, all at once. Don't let engineering listen alone.
Frequently asked
Why should the PM own a post-mortem if engineering already did one?
Two different documents. Engineering's covers the infra fix. Yours covers the product lesson: the latent assumption that was wrong, the eval gap, what customers said, the product fix, updated eval set. Both happen. Most teams only do the engineering one.
What are the six sections of a PM post-mortem?
User-facing description of what broke. The product assumption that was wrong. The eval gap (what test should have caught this). What customers said (direct quotes, patterns). Product-level fix (not the infra fix). Updated eval set.
When should I write this?
Within 48 hours of incident close. One page. Not a deep investigation. The whole point is naming the wrong assumption, not a heroic recovery narrative.
What's the most valuable section?
What customers told you during the incident. Direct quotes. The five things they said that you wouldn't have heard otherwise. Most teams skip this because they're too busy fighting the fire to listen.
How do I use incident patterns to forecast product problems?
Track three patterns: same surface, repeated (infra fixes have layered up; the underlying design is wrong). Same assumption, different surfaces (the assumption is globally wrong, not locally). Cost-driven incidents (unit-economics design flaws, not bugs). Incident patterns lead by weeks; NPS lags by quarters.
Related reading
Deeper essays and other handbook chapters on the same thread.
Prototype Before You Spec
Why the fastest way to get alignment, test ideas, and advance your career is to build something people can touch - and exactly how to do it in 2 hours.
The Impact Loop
The daily rhythm that replaces sprints, stand-ups, and roadmap reviews. Sense what's happening, build a response, measure the impact, amplify what works.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
Ship With Observability or Don't Ship
No feature leaves staging without the traces, metrics, and evals that will tell you whether it's working. Before your first customer hits it.
The Deprecation Playbook
Feature death is the most under-written topic in PM. Kill on signal, not politics, and your team ships faster than the team that hopes politely.
Build a Prototype Agent Stack: PRD to Working Demo in a Day
Build a prototype agent stack: eight open-source Claude repos take a PM from idea to working prototype in a day, with TDD, design, and security review.