Prompt Ops
Your prompts are production code. Version, review, eval, stage, and roll back, or your product is one Notion edit away from breaking.
A bad day I'd like you to avoid
Six months ago I watched a team push a "small prompt tweak" to production on a Friday afternoon. By Monday morning, customer-facing quality scores had dropped 18 percent across three surfaces. Nobody knew exactly when the change went in. The "old" prompt only existed in a Slack thread someone had to scroll back to find. Rollback took six hours.
The team was good. They were not careless. They just hadn't built any operational layer underneath their AI product, so a single intern with edit access to a Notion page could break production for everyone at once.
If your prompts live in a Google Doc and get pasted into the codebase by hand, you don't have a product. You have a liability with a UI on top.
The thesis
Prompts deserve the same lifecycle as code: version control, code review, automated evaluation on every change, staged rollout, monitoring, and one-click rollback. A prompt change without an eval run is a deploy without a test. Treat it that way.
What I see broken on most teams
Prompts in three places at once. The "real" one in code, a "draft" in Google Docs, an older version in a Notion page someone forgot existed. When something breaks, nobody knows which prompt is actually running.
Edits are untracked. Someone tweaks a prompt to fix one customer's edge case. The fix breaks three other use cases. The post-mortem reveals "the prompt was changed at some point" and the team agrees to "be more careful."
No testing layer. Prompts ship straight to production. The only test is "does it look right when I run it manually three times." Three runs of a non-deterministic system is a vibe, not a test.
Rollback is a redeploy. If a prompt breaks, the fastest path to recovery is "find the old version, paste it back, redeploy." That takes hours. During those hours the product is broken for every user.
The five-piece Prompt Ops stack
You don't need exotic tooling. You need five things, in this order.
1. A prompt repository. Prompts live in your codebase, in version control, in a directory called prompts/. Each prompt is a file. Each file has a header with its purpose, owner, and last-evaluated-at timestamp. The prompt that runs in production is the file in main. There is one source of truth. There are no Google Docs.
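Concretely, a prompt file and its loader can be this small. A minimal sketch, assuming a markdown prompt with a comment header; the file name, header fields, and load_prompt helper are illustrative, not a standard:

```python
# prompts/support-triage.md (hypothetical file) might look like:
#
#   <!-- purpose: triage inbound support tickets into one of five queues -->
#   <!-- owner: pm-support -->
#   <!-- last-evaluated-at: 2026-01-12 -->
#   You are a support triage assistant. ...
#
from pathlib import Path

def load_prompt(name: str, prompt_dir: str = "prompts") -> str:
    """Load the prompt body from prompts/<name>.md, skipping header comments."""
    text = Path(prompt_dir, f"{name}.md").read_text(encoding="utf-8")
    body = [line for line in text.splitlines() if not line.strip().startswith("<!--")]
    return "\n".join(body).strip()

SYSTEM_PROMPT = load_prompt("support-triage")
```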
2. A prompt review workflow. Prompt changes go through pull requests, exactly like code. The PR template requires three fields: what changed, which eval set it was run against, and what the score delta was. PRs without an eval run can't merge. The reviewer's job is to look at the diff, the eval delta, and the failing examples. This is faster than the current "did Slack approve it" process and infinitely more rigorous.
3. An eval harness on every prompt change. Before merge, the prompt is automatically run against its named eval set. The score, the per-slice deltas, and any newly failing examples get posted to the PR. You merge when the eval clears the threshold. This is the AI-product version of a CI test suite. Without it, you don't have a deploy pipeline. You have hope.
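Here's what that gate can look like as a CI script. A sketch, assuming the load_prompt helper above and a run_eval(prompt, example) function from your own harness that scores one example; the thresholds, file paths, and module names are placeholders:

```python
import json
import sys
from pathlib import Path

from prompt_loader import load_prompt  # the loader sketched earlier (hypothetical module)
from eval_harness import run_eval      # your harness: scores one example 0..1 (hypothetical)

THRESHOLD = 0.90       # minimum mean score to allow merge (placeholder)
MAX_REGRESSION = 0.02  # maximum allowed drop versus the baseline recorded on main

def gate(prompt_name: str) -> int:
    prompt = load_prompt(prompt_name)
    eval_set = json.loads(Path(f"evals/{prompt_name}.json").read_text())
    scores = [run_eval(prompt, example) for example in eval_set]
    mean = sum(scores) / len(scores)

    baseline = json.loads(Path(f"evals/{prompt_name}.baseline.json").read_text())["mean"]
    failing = [ex["id"] for ex, s in zip(eval_set, scores) if s == 0]
    print(f"mean={mean:.3f} baseline={baseline:.3f} newly_failing={failing}")

    # Non-zero exit fails the CI job, which blocks the merge.
    return 1 if (mean < THRESHOLD or mean < baseline - MAX_REGRESSION) else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```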
4. Staged rollout. New prompts deploy to 1 percent of traffic first. Real-world eval scores get monitored for 24 hours against the same eval set running offline. If live agrees with offline, ramp to 10 percent, then 50 percent, then 100 percent. If they diverge, you've found a gap in your eval set. Pause, augment the eval set, try again. You'll be surprised how often offline and online disagree. That gap is the actual product risk you're managing.
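The ramp itself is a few lines: hash a stable user id into a bucket so each user consistently sees one variant, and raise the percentage as live metrics confirm the offline eval. A sketch, reusing the hypothetical load_prompt from above:

```python
import hashlib

from prompt_loader import load_prompt  # hypothetical module from the loader sketch

ROLLOUT_PCT = 1  # raise to 10, then 50, then 100 as live agrees with offline

def bucket(user_id: str) -> int:
    """Map a user id deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def prompt_for(user_id: str) -> str:
    # Users in the rollout bucket get the new prompt; everyone else gets the current one.
    name = "support-triage.v2" if bucket(user_id) < ROLLOUT_PCT else "support-triage"
    return load_prompt(name)
```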
5. One-click rollback. Every deployed prompt has a one-command revert. The on-call engineer can roll back without a ticket. The PM can roll back without an engineer. Rollback is a routine action used confidently, not a heroic act.
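One way to make rollback one command is a pointer file that names the live prompt version, so reverting is a file write, not a redeploy; a plain git revert of the prompt file works too. A sketch with hypothetical names:

```python
import sys
from pathlib import Path

# The serving code reads this file to decide which prompt version load_prompt gets.
POINTER = Path("prompts/ACTIVE")  # contains e.g. "support-triage.v2"

def rollback(version: str) -> None:
    """Point production back at a known-good prompt version. One command, no ticket."""
    POINTER.write_text(version + "\n", encoding="utf-8")

if __name__ == "__main__":
    rollback(sys.argv[1])  # e.g. python rollback.py support-triage.v1
```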
A/B testing prompts in production
You'll eventually have two viable prompts and want to know which is better. The naive approach is to trust offline evals alone, but production user behavior is the real test. The right approach (with the decision test sketched after the list):
- Define a primary metric (eval score, completion rate, customer rating, downstream conversion).
- Split traffic 50/50, or 90/10 if one variant is risky.
- Run for at least one full weekly cycle to absorb day-of-week effects.
- Decide. Pick the winner. Delete the loser. Don't carry both prompts forward "just in case." That's how you end up with seven prompts for one feature and nobody knowing which is which.
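When the primary metric is a rate (completion, thumbs-up), the decision itself is one statistical test. A sketch with hypothetical counts:

```python
from math import sqrt

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Z statistic for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error
    return (p_b - p_a) / se

# Hypothetical counts after one full weekly cycle, 50/50 split:
z = two_proportion_z(success_a=4_210, n_a=10_000, success_b=4_420, n_b=10_000)
print(f"z = {z:.2f}")  # |z| > 1.96 is significant at the 5 percent level
```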
Prompt linting
Most production prompts in 2026 have at least one of these bugs:
- Conflicting instructions ("be concise" with a six-paragraph guideline that asks for verbose output).
- Stale references (mentions a tool or capability that no longer exists).
- Untracked dependencies (assumes a variable will be passed but doesn't validate).
- Format-locking that fights newer models (overly rigid format prescriptions newer models handle natively).
- Tone collisions (one paragraph says "be friendly," another says "respond formally").
A simple linter, a small LLM call running over your prompts/ directory once a week, catches most of these. Treat the lint as advisory, not blocking. But read the report. The first time you run it, you'll be embarrassed by what you find. That's the point.
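The whole linter fits in a page. A sketch, with call_llm standing in as a placeholder for whatever model client you use (it is not a real API):

```python
from pathlib import Path

LINT_INSTRUCTIONS = """Review the prompt below for: conflicting instructions,
stale tool references, unvalidated template variables, overly rigid format
prescriptions, and tone collisions. List each finding with the offending line."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")  # placeholder

def lint_prompts(prompt_dir: str = "prompts") -> None:
    for path in sorted(Path(prompt_dir).glob("*.md")):
        report = call_llm(f"{LINT_INSTRUCTIONS}\n\n---\n{path.read_text(encoding='utf-8')}")
        print(f"== {path.name} ==\n{report}\n")  # advisory: read it, don't block on it

if __name__ == "__main__":
    lint_prompts()
```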
Who owns prompts?
The PM owns the prompt. Engineering owns the wiring around it. This is non-negotiable.
The temptation is to "let engineering write the prompt because they understand the system." That's exactly backwards. The prompt is the spec. The prompt encodes the product decision: what the system does, who it's for, when it refuses, what tone it uses, what edge cases it handles. Those are PM decisions. Engineering's job is to make sure the prompt loads cleanly, runs reliably, evaluates automatically, and rolls back fast.
If your engineers are writing your prompts, you haven't yet done the work of translating the product decision into a contract. The prompt is the contract. Write it.
Pick one thing this week
You probably have at least one prompt that lives in a Google Doc or a Notion page right now. Move it.
- Create a directory in your repo called prompts/.
- Move that one prompt into a file called prompts/feature-name.md. Add a header: purpose, owner, last evaluated.
- Wire your code to load the prompt from the file (one or two lines for most stacks).
- Make any change to the prompt. Open a PR. Notice the diff is now legible.
- Add an eval set for that prompt (see The Eval Is The Spec). Wire it into your CI. Now any prompt change runs the eval.
That's the v1 of Prompt Ops. Five things, half a day of work, and your product gets meaningfully harder to break by accident. Do this for one prompt this week. Add a prompt to the system every week after that. By the end of the quarter, your AI product has the operational maturity of your CRUD product, and the next intern can't break production.
Frequently asked
What does 'Prompt Ops' mean?
Prompts get the same lifecycle as code: version control, code review, automated eval on every change, staged rollout, monitoring, rollback. A prompt change without an eval run is a deploy without a test.
Why do most teams have broken prompt operations?
Prompts in three places at once (code, Google Docs, Notion). Edits are untracked. No testing layer (manual runs don't count). Rollback is a re-deploy, which takes hours. One intern with access to the wrong page breaks production.
What are the five pieces of the Prompt Ops stack?
Prompt repository (in version control, one source of truth). Pull request review workflow (with eval delta required). Eval harness on every change (CI test suite for prompts). Staged rollout (1 percent, 10 percent, 100 percent). One-click rollback.
Who should own the prompt, PM or engineer?
PM owns the prompt. Engineer owns the wiring. Non-negotiable. The prompt is the spec. It encodes what the system does, who it's for, when it refuses, what tone. Those are PM decisions. Engineer makes sure it loads cleanly, evals automatically, rolls back fast.
How long does v1 Prompt Ops take to set up?
Half a day. One prompt in a file. Load it from code. Make a change, open a PR, notice the diff is legible. Wire an eval set into CI. That's v1. You can do this for one prompt this week, add one per week after that.
Related reading
Deeper essays and other handbook chapters on the same thread.
The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
When Not to Use AI
The senior PM move in 2026 isn't using AI everywhere. It's knowing when a regex, a query, or a form beats a model.
Gross Margin Is Your Job Now
Cost per successful action is the new primary PM metric. If you don't own it, your CFO will kill your product before your customers do.
Trust, Safety, and the Guardrail as a Product Decision
Every guardrail is a product decision. The PM who outsources it to legal gets a product they didn't design and a customer experience they wouldn't approve.
The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
Pricing for AI Products
Per-seat is dead for AI. Price the work the seat is no longer doing: outcomes, usage, value units.