Incident Response Is a PM Ritual
An incident is a customer telling you the truth about your product, loudly, all at once. Stop letting engineering listen alone.
The cheapest discovery your company already does
Most PMs treat incidents as engineering's problem. They show up only when sales escalates, only to perform concern. They don't read the post-mortem. They don't extract a product lesson. They go back to the roadmap meeting and the cycle repeats.
This is the most expensive habit I see in PM practice. Every incident is a bundle of compressed truth: a latent assumption that was wrong, a missing guardrail, a mis-scoped user segment, a feature that worked in the demo but not in the wild, an integration nobody mapped. The infra fix is the engineer's job. The product lesson is the PM's job.
Almost nobody does the second job. That's where this chapter starts.
What changes when the PM owns the post-mortem
Engineering owns the engineering post-mortem. PM owns the product post-mortem. They're different documents. They cover different ground. They both happen, in parallel, after every Sev-2 or worse.
The PM-owned post-mortem is a one-page doc, written within 48 hours of incident close. Six sections.
1. What broke from the user's perspective. Not "service X returned 503s." User-facing description: "users in workflow Y could not complete step Z; they saw [specific behavior]; they tried [specific workarounds]."
2. The product assumption that was wrong. Every incident exposes one. "We assumed users would only upload documents under 10MB." "We assumed the model would handle ambiguous tool calls gracefully." "We assumed cost per action wouldn't spike when usage doubled." Name the assumption.
3. The eval gap. What test in the eval set should have caught this and didn't? Either the eval set was missing this case, or the rubric didn't penalize this behavior, or the eval was correct and production diverged from it. Name which.
4. What customers told us during the incident. Direct quotes. Patterns from the support flood. The five things customers said that we wouldn't have heard otherwise. This is the most valuable section. Most teams don't write it because they're too busy fighting the fire to listen.
5. The product-level fix. Separate from the infra fix. Often involves changing default behavior, adding a fallback experience, reshaping a workflow, tightening a guardrail, removing a foot-gun feature, or adjusting pricing for a high-cost flow that's now exposed.
6. The updated eval set. The eval set is updated to include the case that was missed, and the change is committed alongside the post-mortem (see the sketch below). Next time this pattern emerges, it's caught in CI before reaching production.
That's the document. It's read aloud at the team's weekly review. Action items have owners and dates. Six weeks later, the team verifies the fixes landed.
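To make section 6 concrete, here is a minimal sketch of what an eval-set addition can look like, assuming a pytest-style harness. Everything here is a hypothetical stand-in (`run_workflow`, the case schema, the incident ID); the point is that the missed case lands in CI in the same commit as the post-mortem.

```python
# Hypothetical sketch, not a real harness: the case the post-mortem
# surfaced becomes a permanent eval entry, committed in the same change.
from dataclasses import dataclass

import pytest


@dataclass
class WorkflowResult:
    status: str  # "ok" or "error"
    output: str  # user-visible result or error message


def run_workflow(payload: dict) -> WorkflowResult:
    """Stand-in for your product's real entry point; wire up before running."""
    raise NotImplementedError


# Each case names the incident that produced it, so the eval set
# doubles as an index into the post-mortem library.
EVAL_CASES = [
    # Section 2's wrong assumption: "users only upload documents under 10MB".
    {"id": "INC-000-oversized-upload", "payload": {"doc_size_mb": 48}},
]


@pytest.mark.parametrize("case", EVAL_CASES, ids=lambda c: c["id"])
def test_incident_regression(case):
    result = run_workflow(case["payload"])
    # The product-level expectation after the fix: reject explicitly,
    # with a user-facing message, instead of timing out or going silent.
    assert result.status == "error"
    assert "too large" in result.output.lower()
```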
Customer comms during the live incident
The PM owns the customer comms during a live incident. Not marketing alone. Not support alone. The PM, because the customer comms shape the experience of the incident more than the technical resolution does.
Three rules.
Acknowledge fast, even if you don't know the cause. Silence reads as ignorance or indifference. A short "we're seeing X, we're investigating, we'll update in 30 minutes" is the move.
Estimate honestly. Bad estimates erode trust faster than the incident does. "We don't know yet" is better than "should be fixed in 15 minutes" repeated for two hours.
Tell them what you learned, after. The post-incident note explaining what happened, what you're changing, what you'll do differently next time, is the single most trust-building artifact a product team produces. Most teams skip it. Send it.
This is also the moment to capture which customers were materially affected and follow up individually. The customer who reports a bug during an incident and gets a personal note a week later, explaining what was fixed and why, is the customer who upgrades next quarter.
Incident frequency as product signal
After three months of running PM-owned post-mortems, patterns become visible in the incident log. Watch for these:
Same surface, repeated. The same feature or workflow keeps producing incidents. Infra fixes have layered up. The underlying product design is wrong. Stop fixing the symptom. Redesign the surface.
Same root assumption, different surfaces. "We assumed users would tolerate latency over 5 seconds." This was wrong in feature A in March, in feature B in May, in feature C in July. The assumption is wrong globally, not locally. Change the global default.
Cost-driven incidents. A growing share of incidents are about cost spikes: runaway usage on a flow nobody priced for, infinite loops in agent calls, retry storms after a minor outage. These are not bugs. They're unit-economics design flaws. Address them as pricing or architecture issues.
Customer-facing escalations dominating. If a growing share of incidents reach customer-visible status before alerts catch them, your observability is weaker than your customers' patience. Invest in detection, not in PR.
These patterns are leading indicators of product health. NPS lags by quarters. Incident patterns lead by weeks. The PM who watches the incident log carefully knows a quarter ahead what's about to break, the way a sailor knows what the weather will be in six hours.
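None of this requires special tooling. If each product post-mortem is logged as a small structured record, the first three patterns fall out of a few lines of analysis. A sketch, with made-up field names rather than any standard schema:

```python
# Illustrative sketch: mine a structured incident log for the first
# three patterns. Field names and records are invented for the example.
from collections import Counter, defaultdict

incident_log = [
    {"surface": "bulk-upload", "assumption": "uploads stay under 10MB", "cost_driven": False},
    {"surface": "agent-runner", "assumption": "users tolerate >5s latency", "cost_driven": True},
    {"surface": "editor", "assumption": "users tolerate >5s latency", "cost_driven": False},
    {"surface": "bulk-upload", "assumption": "uploads stay under 10MB", "cost_driven": False},
]

# Same surface, repeated: redesign candidates, not another patch.
repeat_surfaces = [s for s, n in Counter(i["surface"] for i in incident_log).items() if n >= 2]

# Same assumption, different surfaces: the assumption is globally wrong.
surfaces_by_assumption = defaultdict(set)
for i in incident_log:
    surfaces_by_assumption[i["assumption"]].add(i["surface"])
global_assumptions = [a for a, s in surfaces_by_assumption.items() if len(s) >= 2]

# Cost-driven share: a rising ratio means unit-economics flaws, not bugs.
cost_share = sum(i["cost_driven"] for i in incident_log) / len(incident_log)

print(repeat_surfaces)      # ['bulk-upload']
print(global_assumptions)   # ['users tolerate >5s latency']
print(f"{cost_share:.0%}")  # 25%
```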
Incidents I've actually learned the most from
The Sev-1 I learned the most from at Smartcat involved a translation flow that hit a model timeout under specific document structures and returned an empty result, which the workflow accepted as valid output. Engineering's fix was a retry plus a non-empty validator.
My product post-mortem found three things engineering's didn't:
- We had assumed workflow steps would either succeed or fail loudly. They could also succeed silently with empty output. This assumption was wrong across at least four other surfaces.
- Customers had been working around the silent-empty failure for weeks before the incident, never reporting it because they didn't realize it was a bug. Support had 23 tickets about "translations that look empty," all marked "user error."
- Our eval set tested for translation correctness but not for non-empty output. Trivial gap. Easy fix (sketched below).
Three product changes from one incident. Engineering's post-mortem was about the timeout. Mine was about the system that hid the problem from us. Different documents. Both needed.
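The "easy fix" really is small. A hedged sketch, not Smartcat's actual code: `translate` stands in for the workflow step that timed out, and the shape check gates the output before any quality rubric runs.

```python
# Hypothetical sketch of the missing eval check: gate on output shape
# before scoring quality. translate is a stand-in for the real step.
from typing import Callable


def check_non_empty(output: str) -> None:
    """Fail loudly on the silent-empty case the incident exposed."""
    if not output or not output.strip():
        raise AssertionError("step reported success but returned empty output")


def eval_translation(source: str, translate: Callable[[str], str]) -> str:
    output = translate(source)  # the step that could time out and return ""
    check_non_empty(output)     # the gate the eval set was missing
    return output               # correctness rubrics run after this gate
```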
Pick one thing this week
There's probably an incident from the last 30 days for which you never wrote a product post-mortem. Write it now.
- Pick the most recent Sev-2-or-worse incident.
- Open a doc. Write the six sections (user-facing description, wrong assumption, eval gap, what customers said, product fix, updated eval).
- Be honest. The whole point is the assumption that was wrong, not the heroic engineering response.
- Share with your team. Schedule a 20-minute walkthrough at the next weekly review.
- Add at least one new test case to the eval set based on what you found.
Do this for every incident going forward. Within a quarter you'll have a library of compressed truths about your product that you can mine for the next several roadmap decisions. An incident is a customer telling you the truth about your product, loudly, all at once. Don't let engineering listen alone.
Frequently asked
Why should the PM own a post-mortem if engineering already did one?
Two different documents. Engineering's covers the infra fix. Yours covers the product lesson: the latent assumption that was wrong, the eval gap, what customers said, the product fix, updated eval set. Both happen. Most teams only do the engineering one.
What are the six sections of a PM post-mortem?
User-facing description of what broke. The product assumption that was wrong. The eval gap (what test should have caught this). What customers said (direct quotes, patterns). Product-level fix (not the infra fix). Updated eval set.
When should I write this?
Within 48 hours of incident close. One page. Not a deep investigation. The whole point is naming the wrong assumption, not a heroic recovery narrative.
What's the most valuable section?
What customers told you during the incident. Direct quotes. The five things they said that you wouldn't have heard otherwise. Most teams skip this because they're too busy fighting the fire to listen.
How do I use incident patterns to forecast product problems?
Track three patterns: same surface, repeated (infra fixes have layered up; the underlying design is wrong). Same assumption, different surfaces (the assumption is globally wrong, not locally). Cost-driven incidents (unit-economics design flaws, not bugs). Incident patterns lead by weeks; NPS lags by quarters.
Related reading
Deeper essays and other handbook chapters on the same thread.
Prototype Before You Spec
Why the fastest way to get alignment, test ideas, and advance your career is to build something people can touch - and exactly how to do it in 2 hours.
The Impact Loop
The daily rhythm that replaces sprints, stand-ups, and roadmap reviews. Sense what's happening, build a response, measure the impact, amplify what works.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
Ship With Observability or Don't Ship
No feature leaves staging without the traces, metrics, and evals that will tell you whether it's working. Before your first customer hits it.
The Deprecation Playbook
Feature death is the most under-written topic in PM. Kill on signal, not politics, and your team ships faster than the team that hopes politely.
Build a Prototype Agent Stack: PRD to Working Demo in a Day
Build a prototype agent stack: eight open-source Claude repos take a PM from idea to working prototype in a day, with TDD, design, and security review.