The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
Your product changed last Tuesday
A model vendor deprecated a snapshot. Or rerouted a checkpoint behind the scenes. Or pushed a safety-training update that suddenly refuses requests it used to handle. Or fixed a tokenization bug that changed the output format. You didn't get an email. You didn't get a release note. Your customers found out before you did, in the worst possible way: by complaining about a thing that used to work.
This is the new operational reality of building with AI. You shipped a product you don't fully control. Pretending otherwise is the fastest way to lose your customers' trust.
What to do about it
Run a continuous eval replay against production. The same eval set, every day, against the live system. Any delta greater than noise is treated as an incident. Even if no customer has complained yet. You assume the model providers will change things without telling you, because they will, and you build the operational layer that catches it before your customers do.
Why "test once, ship, forget" doesn't work anymore
Most teams test once: run an eval when the prompt or model changes, get a score, ship it, and never run that eval again. Meanwhile the model behind the API can shift behavior weekly without any version bump on your side.
Vendors don't always tell you. Even when major model versions are deprecated, the comms are imperfect. Snapshots get rerouted. Safety training gets updated mid-version. Quantization changes get rolled out silently. Read your model vendor's changelog and assume that 80 percent of what they actually change is not on it.
Your customers will tell you. Eventually. After it's been broken for two weeks. Via a sales escalation. After they've already mentioned it to a competitor.
And without continuous replay, when a customer says "this got worse," you can't reproduce the regression. You can't show what changed. You can't date the regression. You don't have operational ground truth.
The replay system, three components
It's simpler than it sounds.
1. The replay set. Take the production eval set you use as the spec for each surface (see The Eval Is The Spec). This is your ground truth. Each surface has one. The set is small enough to run cheaply (50 to 300 examples), large enough to detect drift across slices.
2. The daily run. Every day, automatically, the replay set runs against every model your product uses, with the prompts currently live in production. Scores are stored with timestamps. Per-example outputs are stored with timestamps. The store is small (a few MB per surface per day) and it's the only way you'll ever be able to answer "when did this start drifting?"
3. The drift alarm. Define a noise floor for each surface. You'll learn it after a week of data. Any per-day change beyond that floor triggers an alert. The alert goes to the PM and the on-call engineer. It includes: which examples changed, what the previous outputs were, what the new outputs are, and which model and prompt are currently in production.
That's the whole system. A few days to build. It pays for itself the first time it catches a silent vendor change before a customer does.
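A minimal sketch of components 2 and 3, assuming each day's run is dumped to one JSON file of per-example scores and outputs. The directory layout, the 0-to-1 score scale, the noise floor value, the model name, and the `send_alert` hook are all placeholders to swap for your own stack.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from statistics import mean

RUNS_DIR = Path("replay_runs/support_answers")  # one JSON file per day: {example_id: {"score": float, "output": str}}
NOISE_FLOOR = 0.03                   # learned from the first week of runs; anything beyond it is an incident
MODEL_IN_PROD = "gpt-4o-2024-08-06"  # in practice, read this from your prompt/config store


def load_run(day: str) -> dict:
    return json.loads((RUNS_DIR / f"{day}.json").read_text())


def send_alert(payload: dict) -> None:
    # Placeholder: wire this to Slack, PagerDuty, or email for the PM and the on-call engineer.
    print("DRIFT ALERT:", json.dumps(payload, indent=2))


def check_drift(today: str, yesterday: str) -> None:
    new, old = load_run(today), load_run(yesterday)
    delta = mean(ex["score"] for ex in new.values()) - mean(ex["score"] for ex in old.values())
    if abs(delta) <= NOISE_FLOOR:
        return  # within noise, nothing to do

    # Which examples changed, what they said before, what they say now.
    changed = {
        ex_id: {"before": old[ex_id]["output"], "after": new[ex_id]["output"]}
        for ex_id in new
        if ex_id in old and new[ex_id]["output"] != old[ex_id]["output"]
    }
    send_alert({
        "surface": RUNS_DIR.name,
        "delta": round(delta, 3),
        "model_in_prod": MODEL_IN_PROD,
        "changed_examples": changed,
    })


if __name__ == "__main__":
    today = date.today()
    check_drift(today.isoformat(), (today - timedelta(days=1)).isoformat())
```

The point is the alert payload: it already carries everything the runbook below asks for, so whoever gets paged starts with evidence instead of a hunch.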
The runbook when the alarm fires
- Confirm the regression. Look at the failing examples. Are they actually worse, or did the eval rubric drift? (Yes, the LLM-as-judge can also drift. You may need a second model to judge the judge.)
- Check the vendor. Has the model been updated? Is there a published changelog item? Is there community chatter about the same regression?
- Pin the model version. If the vendor offers pinned snapshots, switch to one immediately (see the config sketch after this list). Buy yourself time.
- Decide: prompt fix, model swap, or vendor escalation. A prompt change is fastest. A model swap is most stable. A vendor escalation is the right move if you're a major customer and this is a violation of their stability commitments.
- Communicate. If customer impact is non-zero, send a proactive note. "We detected a regression in [feature]; here's what we're doing." Doing this before customers complain is the trust differentiator. Doing it after is damage control.
- Update the eval set. The fact that this regression made it into production means your eval set was missing something. Add the failing case. Strengthen the rubric. The eval set gets better every time the alarm fires.
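For the pinning step above, the change is often a single config value: move from a floating alias the vendor can re-point at any time to a dated snapshot. The names below are illustrative; check your vendor's model list for the snapshots they actually commit to.

```python
# Before: a floating alias; the vendor can change what this resolves to at any time.
MODEL = "gpt-4o"

# After: a dated snapshot; behavior is frozen until the vendor deprecates it,
# which (if you've had the contract conversation below) comes with a notice period.
MODEL = "gpt-4o-2024-08-06"
```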
The vendor contract conversation
If you're spending real money with a model provider, you have leverage to ask for:
- Pinned model snapshots with stability commitments and notice periods before deprecation.
- Pre-release access to upcoming model versions so you can run your eval set before they go GA.
- A direct support channel for behavior regressions.
- A documented escalation path for production incidents tied to model behavior.
Most vendors will offer these to enterprise customers. Most PMs don't ask. Ask. Even if you're not yet enterprise-tier, asking signals that you operate the way they want their large customers to operate.
Your customer-facing version
Somewhere on your status page or trust center, your customers should be able to see:
- Which models you use, by surface.
- When you last validated each surface.
- How you handle regressions.
- The notice period you give when you change models.
This is the new "trust center" content for the AI era. Most teams haven't published it yet. Be early. Customers buying AI products in 2026 increasingly check this before signing, especially in regulated industries. The absence is a sales blocker more than the presence is a sales asset.
The mental model shift
You used to ship code, and the code did what it did until you changed it.
You now ship a system in which a third party can change part of it, at any time, without your permission. Your product is partially controlled by your model providers. The Living Changelog is how you take that control back at the operational layer. You can't stop the vendor from changing the model. You can detect the change before it hurts your customers.
This is uncomfortable. It's also the job.
Pick one thing this week
Pick the most important AI surface in your product. Set up its replay.
- Take the eval set you already have for it. (If you don't have one, see The Eval Is The Spec. That's your week.)
- Schedule it to run daily against production, automatically. A small Python script and a cron job are fine (see the sketch after this list).
- Store the score in a sheet, with a timestamp.
- After a week, look at the variance. That's your noise floor. Set an alert at 2x the noise floor.
- The first time the alert fires, run the runbook above. Notice that you knew about the issue before any customer told you.
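Here's a minimal sketch of that script, assuming a CSV as the "sheet" and two placeholders you fill in yourself: `call_production` (whatever serves the live prompt and model) and `score_example` (the grader you already have from your eval set).

```python
import csv
import json
from datetime import datetime
from pathlib import Path
from statistics import mean, stdev

EVAL_SET = Path("evals/support_answers.jsonl")   # the eval set you already have for this surface
SCORES_CSV = Path("replay_scores.csv")           # the "sheet": one timestamped row per day


def call_production(example: dict) -> str:
    # Placeholder: hit the same endpoint, prompt, and model that serve real traffic.
    raise NotImplementedError


def score_example(example: dict, output: str) -> float:
    # Placeholder: the grader from your eval set (exact match, rubric, or LLM-as-judge).
    raise NotImplementedError


def run_replay() -> None:
    examples = [json.loads(line) for line in EVAL_SET.read_text().splitlines() if line.strip()]
    score = mean(score_example(ex, call_production(ex)) for ex in examples)

    # Append today's score to the sheet.
    with SCORES_CSV.open("a", newline="") as f:
        csv.writer(f).writerow([datetime.now().isoformat(timespec="seconds"), round(score, 4)])

    # After a week of rows you have a noise floor; alert at 2x.
    with SCORES_CSV.open() as f:
        history = [float(row[1]) for row in csv.reader(f)]
    if len(history) >= 7:
        baseline, noise_floor = mean(history[:-1]), stdev(history[:-1])
        if abs(history[-1] - baseline) > 2 * noise_floor:
            print("ALERT: today's replay is outside 2x the noise floor")  # swap for Slack/email


if __name__ == "__main__":
    run_replay()  # schedule via cron, e.g. "0 6 * * * python replay.py"
```

One cron entry per surface; the growing CSV is your living changelog for that surface.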
Do this for one surface this week. Add another every week. Within a quarter your top 10 surfaces all have continuous replay, and you become the team that catches model drift before any of your competitors do.
You shipped a product with a third party in the loop. Either you watch the loop daily, or your customers will tell you when it broke.
Frequently asked
What's the Living Changelog?
A continuous eval replay against production. Same eval set, every day, against live system. Any delta greater than noise triggers an incident, even if no customer complained. You assume model vendors will change things without telling you because they will.
Why is 'test once, ship, forget' broken?
Model behavior shifts weekly without version bumps. Snapshots get rerouted. Safety training gets updated mid-version. Quantization changes roll out silently. You test once and ship. The vendor keeps changing. Your customers find it before you do.
What are the three components of the replay system?
The replay set (production eval set, small enough to run cheaply, 50-300 examples). The daily run (eval runs automatically every day, scores stored with timestamps). The drift alarm (define noise floor, anything beyond triggers alert with what changed).
What's the runbook when the alarm fires?
Confirm the regression in failing examples. Check vendor changelog for updates. Pin model version if available. Decide: prompt fix, model swap, or vendor escalation. Communicate proactively. Update eval set to catch this next time.
What should my customer-facing trust center say about this?
Which models you use by surface. When you last validated each. How you handle regressions. Notice period when you change models. Most teams haven't published this. Be early. Customers buying AI in 2026 increasingly check this before signing.
Related reading
Deeper essays and other handbook chapters on the same thread.
Prompt Ops
Your prompts are production code. Version, review, eval, stage, and roll back, or your product is one Notion edit away from breaking.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
When Not to Use AI
The senior PM move in 2026 isn't using AI everywhere. It's knowing when a regex, a query, or a form beats a model.
Gross Margin Is Your Job Now
Cost per successful action is the new primary PM metric. If you don't own it, your CFO will kill your product before your customers do.
Trust, Safety, and the Guardrail as a Product Decision
Every guardrail is a product decision. The PM who outsources it to legal gets a product they didn't design and a customer experience they wouldn't approve.
Pricing for AI Products
Per-seat is dead for AI. Price the work the seat is no longer doing: outcomes, usage, value units.