Prompt Ops
Your prompts are production code. Version, review, eval, stage, and roll back, or your product is one Notion edit away from breaking.
A bad day I'd like you to avoid
Six months ago I watched a team push a "small prompt tweak" to production on a Friday afternoon. By Monday morning, customer-facing quality scores had dropped 18 percent across three surfaces. Nobody knew exactly when the change went in. The "old" prompt only existed in a Slack thread someone had to scroll back to find. Rollback took six hours.
The team was good. They were not careless. They just hadn't built any operational layer underneath their AI product, so a single intern with edit access to a Notion page could break production for everyone at once.
If your prompts live in a Google Doc and get pasted into the codebase by hand, you don't have a product. You have a liability with a UI on top.
The thesis
Prompts deserve the same lifecycle as code: version control, code review, automated evaluation on every change, staged rollout, monitoring, and one-click rollback. A prompt change without an eval run is a deploy without a test. Treat it that way.
What I see broken on most teams
Prompts in three places at once. The "real" one in code, a "draft" in Google Docs, an older version in a Notion page someone forgot existed. When something breaks, nobody knows which prompt is actually running.
Edits are untracked. Someone tweaks a prompt to fix one customer's edge case. The fix breaks three other use cases. The post-mortem reveals "the prompt was changed at some point" and the team agrees to "be more careful."
No testing layer. Prompts ship straight to production. The only test is "does it look right when I run it manually three times." Three runs of a non-deterministic system is a vibe, not a test.
Rollback is a redeploy. If a prompt breaks, the fastest path to recovery is "find the old version, paste it back, redeploy." That takes hours. During those hours the product is broken for every user.
The five-piece Prompt Ops stack
You don't need exotic tooling. You need five things, in this order.
1. A prompt repository. Prompts live in your codebase, in version control, in a directory called prompts/. Each prompt is a file. Each file has a header with its purpose, owner, and last-evaluated-at timestamp. The prompt that runs in production is the file in main. There is one source of truth. There are no Google Docs.
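Concretely, a prompt file and its loader can be this small. A minimal sketch, assuming a markdown prompt with a comment header; the file name, header fields, and load_prompt helper are illustrative, not a standard:

```python
# prompts/support-triage.md (hypothetical file) might look like:
#
#   <!-- purpose: triage inbound support tickets into one of five queues -->
#   <!-- owner: pm-support -->
#   <!-- last-evaluated-at: 2026-01-12 -->
#   You are a support triage assistant. ...
#
from pathlib import Path

def load_prompt(name: str, prompt_dir: str = "prompts") -> str:
    """Load the prompt body from prompts/<name>.md, skipping header comments."""
    text = Path(prompt_dir, f"{name}.md").read_text(encoding="utf-8")
    body = [line for line in text.splitlines() if not line.strip().startswith("<!--")]
    return "\n".join(body).strip()

SYSTEM_PROMPT = load_prompt("support-triage")
```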
2. A prompt review workflow. Prompt changes go through pull requests, exactly like code. The PR template requires three fields: what changed, which eval set it was run against, and what the score delta was. PRs without an eval run can't merge. The reviewer's job is to look at the diff, the eval delta, and the failing examples. This is faster than the current "did Slack approve it" process and infinitely more rigorous.
3. An eval harness on every prompt change. Before merge, the prompt is automatically run against its named eval set. The score, the per-slice deltas, and any newly failing examples get posted to the PR. You merge when the eval clears the threshold. This is the AI-product version of a CI test suite. Without it, you don't have a deploy pipeline. You have hope.
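Here's what that gate can look like as a CI script. A sketch, assuming the load_prompt helper above and a run_eval(prompt, example) function from your own harness that scores one example; the thresholds, file paths, and module names are placeholders:

```python
import json
import sys
from pathlib import Path

from prompt_loader import load_prompt  # the loader sketched earlier (hypothetical module)
from eval_harness import run_eval      # your harness: scores one example 0..1 (hypothetical)

THRESHOLD = 0.90       # minimum mean score to allow merge (placeholder)
MAX_REGRESSION = 0.02  # maximum allowed drop versus the baseline recorded on main

def gate(prompt_name: str) -> int:
    prompt = load_prompt(prompt_name)
    eval_set = json.loads(Path(f"evals/{prompt_name}.json").read_text())
    scores = [run_eval(prompt, example) for example in eval_set]
    mean = sum(scores) / len(scores)

    baseline = json.loads(Path(f"evals/{prompt_name}.baseline.json").read_text())["mean"]
    failing = [ex["id"] for ex, s in zip(eval_set, scores) if s == 0]
    print(f"mean={mean:.3f} baseline={baseline:.3f} newly_failing={failing}")

    # Non-zero exit fails the CI job, which blocks the merge.
    return 1 if (mean < THRESHOLD or mean < baseline - MAX_REGRESSION) else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```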
4. Staged rollout. New prompts deploy to 1 percent of traffic first. Real-world eval scores get monitored for 24 hours against the same eval set running offline. If live agrees with offline, ramp to 10 percent, then 50 percent, then 100 percent. If they diverge, you've found a gap in your eval set. Pause, augment the eval set, try again. You'll be surprised how often offline and online disagree. That gap is the actual product risk you're managing.
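The ramp itself is a few lines: hash a stable user id into a bucket so each user consistently sees one variant, and raise the percentage as live metrics confirm the offline eval. A sketch, reusing the hypothetical load_prompt from above:

```python
import hashlib

from prompt_loader import load_prompt  # hypothetical module from the loader sketch

ROLLOUT_PCT = 1  # raise to 10, then 50, then 100 as live agrees with offline

def bucket(user_id: str) -> int:
    """Map a user id deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def prompt_for(user_id: str) -> str:
    # Users in the rollout bucket get the new prompt; everyone else gets the current one.
    name = "support-triage.v2" if bucket(user_id) < ROLLOUT_PCT else "support-triage"
    return load_prompt(name)
```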
5. One-click rollback. Every deployed prompt has a one-command revert. The on-call engineer can roll back without a ticket. The PM can roll back without an engineer. Rollback is a routine action used confidently, not a heroic act.
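One way to make rollback one command is a pointer file that names the live prompt version, so reverting is a file write, not a redeploy; a plain git revert of the prompt file works too. A sketch with hypothetical names:

```python
import sys
from pathlib import Path

# The serving code reads this file to decide which prompt version load_prompt gets.
POINTER = Path("prompts/ACTIVE")  # contains e.g. "support-triage.v2"

def rollback(version: str) -> None:
    """Point production back at a known-good prompt version. One command, no ticket."""
    POINTER.write_text(version + "\n", encoding="utf-8")

if __name__ == "__main__":
    rollback(sys.argv[1])  # e.g. python rollback.py support-triage.v1
```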
A/B testing prompts in production
You'll eventually have two viable prompts and want to know which is better. The naive approach is to trust offline evals alone, but production user behavior is the real test. The right approach (with the decision test sketched after the list):
- Define a primary metric (eval score, completion rate, customer rating, downstream conversion).
- Split traffic 50/50, or 90/10 if one variant is risky.
- Run for at least one full weekly cycle to absorb day-of-week effects.
- Decide. Pick the winner. Delete the loser. Don't carry both prompts forward "just in case." That's how you end up with seven prompts for one feature and nobody knowing which is which.
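When the primary metric is a rate (completion, thumbs-up), the decision itself is one statistical test. A sketch with hypothetical counts:

```python
from math import sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Z statistic for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error
    return (p_b - p_a) / se

# Hypothetical counts after one full weekly cycle, 50/50 split:
z = two_proportion_z(success_a=4_210, n_a=10_000, success_b=4_420, n_b=10_000)
print(f"z = {z:.2f}")  # |z| > 1.96 is significant at the 5 percent level
```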
Prompt linting
Most production prompts in 2026 have at least one of these bugs:
- Conflicting instructions ("be concise" with a six-paragraph guideline that asks for verbose output).
- Stale references (mentions a tool or capability that no longer exists).
- Untracked dependencies (assumes a variable will be passed but doesn't validate).
- Format-locking that fights newer models (overly rigid format prescriptions newer models handle natively).
- Tone collisions (one paragraph says "be friendly," another says "respond formally").
A simple linter, a small LLM call running over your prompts/ directory once a week, catches most of these. Treat the lint as advisory, not blocking. But read the report. The first time you run it, you'll be embarrassed by what you find. That's the point.
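The whole linter fits in a page. A sketch, with call_llm standing in as a placeholder for whatever model client you use (it is not a real API):

```python
from pathlib import Path

LINT_INSTRUCTIONS = """Review the prompt below for: conflicting instructions,
stale tool references, unvalidated template variables, overly rigid format
prescriptions, and tone collisions. List each finding with the offending line."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")  # placeholder

def lint_prompts(prompt_dir: str = "prompts") -> None:
    for path in sorted(Path(prompt_dir).glob("*.md")):
        report = call_llm(f"{LINT_INSTRUCTIONS}\n\n---\n{path.read_text(encoding='utf-8')}")
        print(f"== {path.name} ==\n{report}\n")  # advisory: read it, don't block on it

if __name__ == "__main__":
    lint_prompts()
```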
Who owns prompts?
The PM owns the prompt. Engineering owns the wiring around it. This is non-negotiable.
The temptation is to "let engineering write the prompt because they understand the system." That's exactly backwards. The prompt is the spec. The prompt encodes the product decision: what the system does, who it's for, when it refuses, what tone it uses, what edge cases it handles. Those are PM decisions. Engineering's job is to make sure the prompt loads cleanly, runs reliably, evaluates automatically, and rolls back fast.
If your engineers are writing your prompts, you haven't yet done the work of translating the product decision into a contract. The prompt is the contract. Write it.
Pick one thing this week
You probably have at least one prompt that lives in a Google Doc or a Notion page right now. Move it.
- Create a directory in your repo called prompts/.
- Move that one prompt into a file called prompts/feature-name.md. Add a header: purpose, owner, last evaluated.
- Wire your code to load the prompt from the file (one or two lines for most stacks).
- Make any change to the prompt. Open a PR. Notice the diff is now legible.
- Add an eval set for that prompt (see The Eval Is The Spec). Wire it into your CI. Now any prompt change runs the eval.
That's the v1 of Prompt Ops. Five things, half a day of work, and your product gets meaningfully harder to break by accident. Do this for one prompt this week. Add a prompt to the system every week after that. By the end of the quarter, your AI product has the operational maturity of your CRUD product, and the next intern can't break production.
Frequently asked
What does 'Prompt Ops' mean?
Prompts get the same lifecycle as code: version control, code review, automated eval on every change, staged rollout, monitoring, rollback. A prompt change without an eval run is a deploy without a test.
Why do most teams have broken prompt operations?
Prompts in three places at once (code, Google Docs, Notion). Edits are untracked. No testing layer (manual runs don't count). Rollback is a re-deploy, which takes hours. One intern with access to the wrong page breaks production.
What are the five pieces of the Prompt Ops stack?
Prompt repository (in version control, one source of truth). Pull request review workflow (with eval delta required). Eval harness on every change (CI test suite for prompts). Staged rollout (1 percent, 10 percent, 100 percent). One-click rollback.
Who should own the prompt, PM or engineer?
PM owns the prompt. Engineer owns the wiring. Non-negotiable. The prompt is the spec. It encodes what the system does, who it's for, when it refuses, what tone. Those are PM decisions. Engineer makes sure it loads cleanly, evals automatically, rolls back fast.
How long does v1 Prompt Ops take to set up?
Half a day. One prompt in a file. Load it from code. Make a change, open a PR, notice the diff is legible. Wire an eval set into CI. That's v1. You can do this for one prompt this week, add one per week after that.
Related reading
Deeper essays and other handbook chapters on the same thread.
The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
When Not to Use AI
The senior PM move in 2026 isn't using AI everywhere. It's knowing when a regex, a query, or a form beats a model.
Gross Margin Is Your Job Now
Cost per successful action is the new primary PM metric. If you don't own it, your CFO will kill your product before your customers do.
Trust, Safety, and the Guardrail as a Product Decision
Every guardrail is a product decision. The PM who outsources it to legal gets a product they didn't design and a customer experience they wouldn't approve.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
Pricing for AI Products
Per-seat is dead for AI. Price the work the seat is no longer doing: outcomes, usage, value units.