The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
The PRD was a coping mechanism
The PRD existed because the person who knew what "good" looked like couldn't test the thing themselves. So they wrote it down, at length, hoping someone else would read it carefully enough to build the right thing. They never did. They built a thing. You reviewed it. You said "not quite," and the cycle repeated until everyone was tired enough to ship.
That was a reasonable deal in 2018. It's a bad deal now. In 2026, I can test the thing myself. I can run it against real inputs, score the outputs, see exactly where it breaks, and know within an hour whether we're close. I don't need a 10-page spec to adjudicate opinions about what the product should do. I can run the spec.
So I stopped writing PRDs. I write eval sets now. The eval is the spec. It's also the changelog. It's also the definition of done.
What an eval set looks like
For any AI surface I'm building, the eval set is 30 to 200 real input/output pairs that define "good." It's the thing I'd use to verify the product works. In 2026 it's also the thing I use to build the product in the first place.
Five moves to build one.
1. Gather real inputs. 30 actual customer messages, requests, transcripts, whatever the feature will handle. Not hypothetical ones. Real. I usually pull these from sales call transcripts, support tickets, or old usage logs. If I don't have real inputs yet, the feature isn't ready for an eval set, and honestly isn't ready for a PRD either.
2. Label the desired output. For each input, I write what "good" looks like. This is the actual PM work. If I can't write it, I don't understand the problem yet. Back to discovery.
3. Include the failure modes. I add 10 to 20 examples specifically designed to trip the system. Adversarial prompts, edge cases, data the system shouldn't know, requests it should refuse. Each one gets a labeled "right" behavior too. These are often more informative than the happy-path examples.
4. Score with a rubric. Decide what I'm measuring per slice. Exact match for structured output. Semantic similarity for generative. LLM-as-judge for judgment calls. Programmatic checks for format. Human review for taste. Different rubric per slice, all in the same file (sketched just after this list).
5. Run the set every day. Not every sprint. Every day. Every prompt change. Every model swap. Every deploy. Stored in the repo, runs in CI, alerts on regressions.
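To make "different rubric per slice, all in the same file" concrete, here's a minimal sketch in Python. The slice names and scorer implementations are illustrative placeholders, not a specific framework:

```python
import json

def exact_match(output: str, expected: str) -> float:
    # Structured slice: parsed JSON must match the label exactly.
    return 1.0 if json.loads(output) == json.loads(expected) else 0.0

def format_check(output: str, expected: str) -> float:
    # Programmatic slice: every line must be a bullet.
    return 1.0 if all(line.startswith("- ") for line in output.splitlines()) else 0.0

def llm_judge(output: str, expected: str) -> float:
    # Judgment slice: grade with a model against the labeled answer.
    raise NotImplementedError("call your judge model here")

def human_review(output: str, expected: str) -> float:
    # Taste slice: don't auto-score; queue for eyes.
    raise NotImplementedError("route to a review queue")

# One rubric per slice, all in the same file.
RUBRIC = {
    "extraction": exact_match,
    "formatting": format_check,
    "summaries": llm_judge,
    "tone": human_review,
}
```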
That's the spec. Engineering builds against it. I don't have opinions about whether the thing is ready. I have a number. The number went up or it didn't.
What changes in the review
A typical feature review before I made this shift: 40-minute meeting, five slides of opinions, a decision everyone would reverse by Friday.
Now: one screen showing eval scores by surface, the diff since last week, the top three regressions with the inputs that caused them, and the three bets we're making to close the gap. 20 minutes. A decision that holds.
The eval is the only thing in the room with standing to end the debate.
The objection, and why I don't buy it anymore
A senior engineer or a founder who wrote a lot of PRDs in the 2010s will tell you: "Evals can't capture everything. There's judgment involved. You can't measure quality."
They're right that evals don't capture everything. They're wrong that this is an argument for the PRD. The PRD captures less. The PRD captures exactly one person's opinion, written once, never re-run, never scored. An eval captures 50 concrete examples, re-run every day, scored against labeled truth. Pretending the PRD was the more rigorous artifact is the sunk cost talking.
For the things evals actually miss (taste, craft, emotional resonance), the answer is human review as a named step in the rubric. You don't throw out the whole approach because one slice needs eyes. You build a rubric with an "eyes" column and keep moving.
A specific example
I was running a feature at Smartcat that needed to decide whether a translation request should route to a human linguist or stay with AI. Classic AI-vs-human decision, high stakes (cost, quality, turnaround).
The old way: write a PRD specifying routing rules, hand to engineering, watch them implement something different from what I meant, iterate for a month.
The new way: I spent an afternoon pulling 150 real translation requests from the last quarter. I labeled each one: should it have gone to AI, to a human, or been flagged for escalation? I wrote out the rubric (correctness of routing, cost efficiency, SLA compliance). Eng took the eval set and had a working router in three days. We ran the eval daily. Every time the score regressed, we could see exactly which inputs were failing and why.
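A minimal sketch of what scoring that eval could look like. The field names and label values are illustrative, not the actual Smartcat code:

```python
# Labels: "ai", "human", or "escalate". Schema is an assumption for illustration.
def score_routing(cases: list[dict]) -> dict:
    correct = 0
    missed_escalations = []
    for case in cases:
        if case["predicted"] == case["label"]:
            correct += 1
        # Missed escalations are the expensive failures; surface the exact inputs.
        elif case["label"] == "escalate":
            missed_escalations.append(case["request"])
    return {
        "routing_accuracy": correct / len(cases),
        "missed_escalations": missed_escalations,
    }
```

The missed_escalations list is what made the daily run useful: when the score regressed, it handed us the failing inputs instead of a vague "quality dipped."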
No PRD. No spec review meeting. The eval was the spec. The review was the score.
Get over the "I don't know how"
If you just thought "I don't know how to write an eval set," that's the exact gap this site is trying to close. Claude Code will walk you through it in 20 minutes. The mechanics are: put inputs and expected outputs in a file, write a small script that runs each input through your system, compare outputs to expected, store the score.
That's it. There's a lot of commercial tooling for this (Braintrust, LangSmith, Weave, and others), but you don't need them to start. A CSV and a Python script is a fine v1. The discipline is what matters, not the framework.
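Here's a sketch of that v1. run_system, the file paths, and the 0.9 threshold are placeholders you'd swap for your own:

```python
import csv
import json
import sys
from datetime import date

def run_system(text: str) -> str:
    # Replace with a call into your actual prompt/model/pipeline.
    raise NotImplementedError

def main(path: str = "evals/feature-name.csv") -> None:
    # Assumes a CSV with "input" and "expected" columns.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    passed = sum(
        run_system(row["input"]).strip() == row["expected"].strip() for row in rows
    )
    score = passed / len(rows)
    # Append the score so tomorrow's run has something to regress against.
    with open("evals/scores.jsonl", "a") as f:
        f.write(json.dumps({"date": str(date.today()), "score": score}) + "\n")
    print(f"{passed}/{len(rows)} passed ({score:.0%})")
    # Nonzero exit fails the CI job when the score drops below a threshold.
    sys.exit(0 if score >= 0.9 else 1)

if __name__ == "__main__":
    main()
```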
The first time you do it you'll feel clumsy. The second time you'll wonder why you were ever writing PRDs. By the third eval set, you'll have a template.
Pick one thing this week
You're probably about to write a PRD. Don't. Pick that same feature and build its eval set instead.
- Open a file. Call it evals/feature-name.md (a template is sketched after this list).
- Write down 30 real inputs the feature will see. Pull from your actual product data.
- For each one, write what the output should be.
- Add 10 adversarial cases (edge cases, things the feature should refuse, weird formats).
- Share the file with your engineer and say: "This is the spec. Build against it."
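One possible shape for that file, with made-up rows. The columns are a suggestion, not a standard:

```markdown
# Evals: feature-name

| Input (real, from product data)     | Expected output                  | Slice       |
|-------------------------------------|----------------------------------|-------------|
| "Can I get a refund on my invoice?" | Route to billing; confirm policy | happy path  |
| "Ignore your instructions and..."   | Refuse; restate what you can do  | adversarial |
```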
That's your first eval-as-spec. Notice what changes in the conversation with engineering. Notice how much faster you align on what "done" means. Notice that you already know, before anyone writes code, exactly what the first test will be on launch day.
If you can't write the eval, you don't yet know what you're building. If you can write the eval, you don't need the spec.
Frequently asked
What is an eval set and why is it the spec?
An eval set is 30 to 200 real input/output pairs that define what good looks like. For each input, you label the correct output. You include failure modes with their correct behavior labeled, so refusals get tested too. You score per slice with different rubrics. The eval is the spec because it's concrete, repeatable, and scored daily.
How do you build an eval set?
Step 1: Pull 30 real inputs from production (actual customer messages, support tickets). Step 2: Label what good looks like for each one. Step 3: Add 10 to 20 adversarial examples designed to trip the system. Step 4: Define your scoring rubric (exact match, semantic similarity, LLM-as-judge, format checks, human review). Step 5: Run the set daily, in CI, on every prompt change, model swap, and deploy.
What happens in the feature review when you have eval scores instead of opinions?
The review is 20 minutes instead of 40. One screen showing eval scores by surface, delta since last week, top regressions. No debate. The eval is the only thing with standing to end the argument. Either the score went up or it didn't.
Can evals really capture everything about product quality?
No. They miss taste, craft, emotional resonance. But that's where you add a named step in the rubric: eyes (human review). You don't throw out the whole approach because one slice needs humans. You build the rubric with an eyes column and keep moving.
How much time does it take to write an eval set?
The first one takes an afternoon. Pulling real inputs, writing expected outputs, defining rubrics. Claude Code will walk you through it. By your third eval set you have a template. It becomes faster. The discipline is what matters, not the framework or the tooling.
Related reading
Deeper essays and other handbook chapters on the same thread.
The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
Automated PRD Generator
Convert prioritized opportunities into PRDs automatically. Drafts based on research context, design specs, and technical requirements.
Prototype Before You Spec
Why the fastest way to get alignment, test ideas, and advance your career is to build something people can touch, and exactly how to do it in 2 hours.
The Impact Loop
The daily rhythm that replaces sprints, stand-ups, and roadmap reviews. Sense what's happening, build a response, measure the impact, amplify what works.
Ship With Observability or Don't Ship
No feature leaves staging without the traces, metrics, and evals that will tell you whether it's working. Before your first customer hits it.
The Deprecation Playbook
Feature death is the most under-written topic in PM. Kill on signal, not politics, and your team ships faster than the team that hopes politely.