The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
Your product changed last Tuesday
A model vendor deprecated a snapshot. Or rerouted a checkpoint behind the scenes. Or pushed a safety-training update that suddenly refuses requests it used to handle. Or fixed a tokenization bug that changed the output format. You didn't get an email. You didn't get a release note. Your customers found out before you did, in the worst possible way: by complaining about a thing that used to work.
This is the new operational reality of building with AI. You shipped a product you don't fully control. Pretending otherwise is the fastest way to lose your customers' trust.
What to do about it
Run a continuous eval replay against production. The same eval set, every day, against the live system. Any delta greater than noise is treated as an incident. Even if no customer has complained yet. You assume the model providers will change things without telling you, because they will, and you build the operational layer that catches it before your customers do.
Why "test once, ship, forget" doesn't work anymore
Most teams test once: run an eval when the prompt or model changes, get a score, ship it, and never run that eval again. Meanwhile the model behind the API can shift behavior weekly without any version bump on your side.
Vendors don't always tell you. Even when major model versions are deprecated, the comms are imperfect. Snapshots get rerouted. Safety training gets updated mid-version. Quantization changes get rolled out silently. Read your model vendor's changelog and assume that 80 percent of what they actually change is not on it.
Your customers will tell you. Eventually. After it's been broken for two weeks. Via a sales escalation. After they've already mentioned it to a competitor.
And without continuous replay, when a customer says "this got worse," you can't reproduce the regression. You can't show what changed. You can't date the regression. You don't have operational ground truth.
The replay system, three components
It's simpler than it sounds.
1. The replay set. Take the production eval set you use as the spec for each surface (see The Eval Is The Spec). This is your ground truth. Each surface has one. The set is small enough to run cheaply (50 to 300 examples), large enough to detect drift across slices.
2. The daily run. Every day, automatically, the replay set runs against every model your product uses, with the prompts currently live in production. Scores are stored with timestamps. Per-example outputs are stored with timestamps. The store is small (a few MB per surface per day) and it's the only way you'll ever be able to answer "when did this start drifting?"
3. The drift alarm. Define a noise floor for each surface. You'll learn it after a week of data. Any per-day change beyond that floor triggers an alert. The alert goes to the PM and the on-call engineer. It includes: which examples changed, what the previous outputs were, what the new outputs are, and which model and prompt are currently in production.
That's the whole system. A few days to build. It pays for itself the first time it catches a silent vendor change before a customer does.
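A minimal sketch of components 2 and 3, assuming each day's run is dumped to one JSON file of per-example scores and outputs. The directory layout, the 0-to-1 score scale, the noise floor value, the model name, and the `send_alert` hook are all placeholders to swap for your own stack.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from statistics import mean

RUNS_DIR = Path("replay_runs/support_answers")  # one JSON file per day: {example_id: {"score": float, "output": str}}
NOISE_FLOOR = 0.03                   # learned from the first week of runs; anything beyond it is an incident
MODEL_IN_PROD = "gpt-4o-2024-08-06"  # in practice, read this from your prompt/config store


def load_run(day: str) -> dict:
    return json.loads((RUNS_DIR / f"{day}.json").read_text())


def send_alert(payload: dict) -> None:
    # Placeholder: wire this to Slack, PagerDuty, or email for the PM and the on-call engineer.
    print("DRIFT ALERT:", json.dumps(payload, indent=2))


def check_drift(today: str, yesterday: str) -> None:
    new, old = load_run(today), load_run(yesterday)
    delta = mean(ex["score"] for ex in new.values()) - mean(ex["score"] for ex in old.values())
    if abs(delta) <= NOISE_FLOOR:
        return  # within noise, nothing to do

    # Which examples changed, what they said before, what they say now.
    changed = {
        ex_id: {"before": old[ex_id]["output"], "after": new[ex_id]["output"]}
        for ex_id in new
        if ex_id in old and new[ex_id]["output"] != old[ex_id]["output"]
    }
    send_alert({
        "surface": RUNS_DIR.name,
        "delta": round(delta, 3),
        "model_in_prod": MODEL_IN_PROD,
        "changed_examples": changed,
    })


if __name__ == "__main__":
    today = date.today()
    check_drift(today.isoformat(), (today - timedelta(days=1)).isoformat())
```

The point is the alert payload: it already carries everything the runbook below asks for, so whoever gets paged starts with evidence instead of a hunch.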
The runbook when the alarm fires
- Confirm the regression. Look at the failing examples. Are they actually worse, or did the eval rubric drift? (Yes, the LLM-as-judge can also drift. You may need a second model to judge the judge.)
- Check the vendor. Has the model been updated? Is there a published changelog item? Is there community chatter about the same regression?
- Pin the model version. If the vendor offers pinned snapshots, switch to one immediately (see the config sketch after this list). Buy yourself time.
- Decide: prompt fix, model swap, or vendor escalation. A prompt change is fastest. A model swap is most stable. A vendor escalation is the right move if you're a major customer and this is a violation of their stability commitments.
- Communicate. If customer impact is non-zero, send a proactive note. "We detected a regression in [feature]; here's what we're doing." Doing this before customers complain is the trust differentiator. Doing it after is damage control.
- Update the eval set. The fact that this regression made it into production means your eval set was missing something. Add the failing case. Strengthen the rubric. The eval set gets better every time the alarm fires.
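For the pinning step above, the change is often a single config value: move from a floating alias the vendor can re-point at any time to a dated snapshot. The names below are illustrative; check your vendor's model list for the snapshots they actually commit to.

```python
# Before: a floating alias; the vendor can change what this resolves to at any time.
MODEL = "gpt-4o"

# After: a dated snapshot; behavior is frozen until the vendor deprecates it,
# which (if you've had the contract conversation below) comes with a notice period.
MODEL = "gpt-4o-2024-08-06"
```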
The vendor contract conversation
If you're spending real money with a model provider, you have leverage to ask for:
- Pinned model snapshots with stability commitments and notice periods before deprecation.
- Pre-release access to upcoming model versions so you can run your eval set before they go GA.
- A direct support channel for behavior regressions.
- A documented escalation path for production incidents tied to model behavior.
Most vendors will offer these to enterprise customers. Most PMs don't ask. Ask. Even if you're not yet enterprise-tier, asking signals that you operate the way they want their large customers to operate.
Your customer-facing version
Somewhere on your status page or trust center, your customers should be able to see:
- Which models you use, by surface.
- When you last validated each surface.
- How you handle regressions.
- The notice period you give when you change models.
This is the new "trust center" content for the AI era. Most teams haven't published it yet. Be early. Customers buying AI products in 2026 increasingly check this before signing, especially in regulated industries. The absence is a sales blocker more than the presence is a sales asset.
The mental model shift
You used to ship code, and the code did what it did until you changed it.
You now ship a system in which a third party can change part of it, at any time, without your permission. Your product is partially controlled by your model providers. The Living Changelog is how you take that control back at the operational layer. You can't stop the vendor from changing the model. You can detect the change before it hurts your customers.
This is uncomfortable. It's also the job.
Pick one thing this week
Pick the most important AI surface in your product. Set up its replay.
- Take the eval set you already have for it. (If you don't have one, see The Eval Is The Spec. That's your week.)
- Schedule it to run daily against production, automatically. A small Python script and a cron job are fine (see the sketch after this list).
- Store the score in a sheet, with a timestamp.
- After a week, look at the variance. That's your noise floor. Set an alert at 2x the noise floor.
- The first time the alert fires, run the runbook above. Notice that you knew about the issue before any customer told you.
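Here's a minimal sketch of that script, assuming a CSV as the "sheet" and two placeholders you fill in yourself: `call_production` (whatever serves the live prompt and model) and `score_example` (the grader you already have from your eval set).

```python
import csv
import json
from datetime import datetime
from pathlib import Path
from statistics import mean, stdev

EVAL_SET = Path("evals/support_answers.jsonl")   # the eval set you already have for this surface
SCORES_CSV = Path("replay_scores.csv")           # the "sheet": one timestamped row per day


def call_production(example: dict) -> str:
    # Placeholder: hit the same endpoint, prompt, and model that serve real traffic.
    raise NotImplementedError


def score_example(example: dict, output: str) -> float:
    # Placeholder: the grader from your eval set (exact match, rubric, or LLM-as-judge).
    raise NotImplementedError


def run_replay() -> None:
    examples = [json.loads(line) for line in EVAL_SET.read_text().splitlines() if line.strip()]
    score = mean(score_example(ex, call_production(ex)) for ex in examples)

    # Append today's score to the sheet.
    with SCORES_CSV.open("a", newline="") as f:
        csv.writer(f).writerow([datetime.now().isoformat(timespec="seconds"), round(score, 4)])

    # After a week of rows you have a noise floor; alert at 2x.
    with SCORES_CSV.open() as f:
        history = [float(row[1]) for row in csv.reader(f)]
    if len(history) >= 7:
        baseline, noise_floor = mean(history[:-1]), stdev(history[:-1])
        if abs(history[-1] - baseline) > 2 * noise_floor:
            print("ALERT: today's replay is outside 2x the noise floor")  # swap for Slack/email


if __name__ == "__main__":
    run_replay()  # schedule via cron, e.g. "0 6 * * * python replay.py"
```

One cron entry per surface; the growing CSV is your living changelog for that surface.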
Do this for one surface this week. Add another every week. Within a quarter your top 10 surfaces all have continuous replay, and you become the team that catches model drift before any of your competitors do.
You shipped a product with a third party in the loop. Either you watch the loop daily, or your customers will tell you when it broke.
Frequently asked
What's the Living Changelog?
A continuous eval replay against production. Same eval set, every day, against live system. Any delta greater than noise triggers an incident, even if no customer complained. You assume model vendors will change things without telling you because they will.
Why is 'test once, ship, forget' broken?
Model behavior shifts weekly without version bumps. Snapshots get rerouted. Safety training gets updated mid-version. Quantization changes roll out silently. You test once and ship. The vendor keeps changing. Your customers find it before you do.
What are the three components of the replay system?
The replay set (production eval set, small enough to run cheaply, 50-300 examples). The daily run (eval runs automatically every day, scores stored with timestamps). The drift alarm (define noise floor, anything beyond triggers alert with what changed).
What's the runbook when the alarm fires?
Confirm the regression in failing examples. Check vendor changelog for updates. Pin model version if available. Decide: prompt fix, model swap, or vendor escalation. Communicate proactively. Update eval set to catch this next time.
What should my customer-facing trust center say about this?
Which models you use by surface. When you last validated each. How you handle regressions. Notice period when you change models. Most teams haven't published this. Be early. Customers buying AI in 2026 increasingly check this before signing.
Related reading
Deeper essays and other handbook chapters on the same thread.
Prompt Ops
Your prompts are production code. Version, review, eval, stage, and roll back, or your product is one Notion edit away from breaking.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
When Not to Use AI
The senior PM move in 2026 isn't using AI everywhere. It's knowing when a regex, a query, or a form beats a model.
Gross Margin Is Your Job Now
Cost per successful action is the new primary PM metric. If you don't own it, your CFO will kill your product before your customers do.
Trust, Safety, and the Guardrail as a Product Decision
Every guardrail is a product decision. The PM who outsources it to legal gets a product they didn't design and a customer experience they wouldn't approve.
Pricing for AI Products
Per-seat is dead for AI. Price the work the seat is no longer doing: outcomes, usage, value units.