The Eval Is The Spec
Kill the PRD. Ship against a test set. The eval is the contract, the changelog, and the definition of done.
The PRD was a coping mechanism
The PRD existed because the person who knew what "good" looked like couldn't test the thing themselves. So they wrote it down, at length, hoping someone else would read it carefully enough to build the right thing. They never did. They built a thing. You reviewed it. You said "not quite," and the cycle repeated until everyone was tired enough to ship.
That was a reasonable deal in 2018. It's a bad deal now. In 2026, I can test the thing myself. I can run it against real inputs, score the outputs, see exactly where it breaks, and know within an hour whether we're close. I don't need a 10-page spec to adjudicate opinions about what the product should do. I can run the spec.
So I stopped writing PRDs. I write eval sets now. The eval is the spec. It's also the changelog. It's also the definition of done.
What an eval set looks like
For any AI surface I'm building, the eval set is 30 to 200 real input/output pairs that define "good." It's the thing I'd use to verify the product works. In 2026 it's also the thing I use to build the product in the first place.
Five moves to build one.
1. Gather real inputs. 30 actual customer messages, requests, transcripts, whatever the feature will handle. Not hypothetical ones. Real. I usually pull these from sales call transcripts, support tickets, or old usage logs. If I don't have real inputs yet, the feature isn't ready for an eval set, and honestly isn't ready for a PRD either.
2. Label the desired output. For each input, I write what "good" looks like. This is the actual PM work. If I can't write it, I don't understand the problem yet. Back to discovery.
3. Include the failure modes. I add 10 to 20 examples specifically designed to trip the system. Adversarial prompts, edge cases, data the system shouldn't know, requests it should refuse. Each one gets a labeled "right" behavior too. These are often more informative than the happy-path examples.
4. Score with a rubric. Decide what I'm measuring per slice. Exact match for structured output. Semantic similarity for generative. LLM-as-judge for judgment calls. Programmatic checks for format. Human review for taste. Different rubric per slice, all in the same file (sketched just after this list).
5. Run the set every day. Not every sprint. Every day. Every prompt change. Every model swap. Every deploy. Stored in the repo, runs in CI, alerts on regressions.
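To make "different rubric per slice, all in the same file" concrete, here's a minimal sketch in Python. The slice names and scorer implementations are illustrative placeholders, not a specific framework:

```python
import json

def exact_match(output: str, expected: str) -> float:
    # Structured slice: parsed JSON must match the label exactly.
    return 1.0 if json.loads(output) == json.loads(expected) else 0.0

def format_check(output: str, expected: str) -> float:
    # Programmatic slice: every line must be a bullet.
    return 1.0 if all(line.startswith("- ") for line in output.splitlines()) else 0.0

def llm_judge(output: str, expected: str) -> float:
    # Judgment slice: grade with a model against the labeled answer.
    raise NotImplementedError("call your judge model here")

def human_review(output: str, expected: str) -> float:
    # Taste slice: don't auto-score; queue for eyes.
    raise NotImplementedError("route to a review queue")

# One rubric per slice, all in the same file.
RUBRIC = {
    "extraction": exact_match,
    "formatting": format_check,
    "summaries": llm_judge,
    "tone": human_review,
}
```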
That's the spec. Engineering builds against it. I don't have opinions about whether the thing is ready. I have a number. The number went up or it didn't.
What changes in the review
A typical feature review before I made this shift: 40-minute meeting, five slides of opinions, a decision everyone would reverse by Friday.
Now: one screen showing eval scores by surface, the diff since last week, the top three regressions with the inputs that caused them, and the three bets we're making to close the gap. 20 minutes. A decision that holds.
The eval is the only thing in the room with standing to end the debate.
The objection, and why I don't buy it anymore
A senior engineer or a founder who wrote a lot of PRDs in the 2010s will tell you: "Evals can't capture everything. There's judgment involved. You can't measure quality."
They're right that evals don't capture everything. They're wrong that this is an argument for the PRD. The PRD captures less. The PRD captures exactly one person's opinion, written once, never re-run, never scored. An eval captures 50 concrete examples, re-run every day, scored against labeled truth. Pretending the PRD was the more rigorous artifact is the sunk cost talking.
For the things evals actually miss (taste, craft, emotional resonance), the answer is human review as a named step in the rubric. You don't throw out the whole approach because one slice needs eyes. You build a rubric with an "eyes" column and keep moving.
A specific example
I was running a feature at Smartcat that needed to decide whether a translation request should route to a human linguist or stay with AI. Classic AI-vs-human decision, high stakes (cost, quality, turnaround).
The old way: write a PRD specifying routing rules, hand to engineering, watch them implement something different from what I meant, iterate for a month.
The new way: I spent an afternoon pulling 150 real translation requests from the last quarter. I labeled each one: should it have gone to AI, to a human, or been flagged for escalation? I wrote out the rubric (correctness of routing, cost efficiency, SLA compliance). Eng took the eval set and had a working router in three days. We ran the eval daily. Every time the score regressed, we could see exactly which inputs were failing and why.
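A minimal sketch of what scoring that eval could look like. The field names and label values are illustrative, not the actual Smartcat code:

```python
# Labels: "ai", "human", or "escalate". Schema is an assumption for illustration.
def score_routing(cases: list[dict]) -> dict:
    correct = 0
    missed_escalations = []
    for case in cases:
        if case["predicted"] == case["label"]:
            correct += 1
        # Missed escalations are the expensive failures; surface the exact inputs.
        elif case["label"] == "escalate":
            missed_escalations.append(case["request"])
    return {
        "routing_accuracy": correct / len(cases),
        "missed_escalations": missed_escalations,
    }
```

The missed_escalations list is what made the daily run useful: when the score regressed, it handed us the failing inputs instead of a vague "quality dipped."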
No PRD. No spec review meeting. The eval was the spec. The review was the score.
Get over the "I don't know how"
If you just thought "I don't know how to write an eval set," that's the exact gap this site is trying to close. Claude Code will walk you through it in 20 minutes. The mechanics are: put inputs and expected outputs in a file, write a small script that runs each input through your system, compare outputs to expected, store the score.
That's it. There's a lot of commercial tooling for this (Braintrust, LangSmith, Weave, and others), but you don't need them to start. A CSV and a Python script is a fine v1. The discipline is what matters, not the framework.
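Here's a sketch of that v1. run_system, the file paths, and the 0.9 threshold are placeholders you'd swap for your own:

```python
import csv
import json
import sys
from datetime import date

def run_system(text: str) -> str:
    # Replace with a call into your actual prompt/model/pipeline.
    raise NotImplementedError

def main(path: str = "evals/feature-name.csv") -> None:
    # Assumes a CSV with "input" and "expected" columns.
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    passed = sum(
        run_system(row["input"]).strip() == row["expected"].strip() for row in rows
    )
    score = passed / len(rows)
    # Append the score so tomorrow's run has something to regress against.
    with open("evals/scores.jsonl", "a") as f:
        f.write(json.dumps({"date": str(date.today()), "score": score}) + "\n")
    print(f"{passed}/{len(rows)} passed ({score:.0%})")
    # Nonzero exit fails the CI job when the score drops below a threshold.
    sys.exit(0 if score >= 0.9 else 1)

if __name__ == "__main__":
    main()
```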
The first time you do it you'll feel clumsy. The second time you'll wonder why you were ever writing PRDs. By the third eval set, you'll have a template.
Pick one thing this week
You're probably about to write a PRD. Don't. Pick that same feature and build its eval set instead.
- Open a file. Call it evals/feature-name.md (a template is sketched after this list).
- Write down 30 real inputs the feature will see. Pull from your actual product data.
- For each one, write what the output should be.
- Add 10 adversarial cases (edge cases, things the feature should refuse, weird formats).
- Share the file with your engineer and say: "This is the spec. Build against it."
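One possible shape for that file, with made-up rows. The columns are a suggestion, not a standard:

```markdown
# Evals: feature-name

| Input (real, from product data)     | Expected output                  | Slice       |
|-------------------------------------|----------------------------------|-------------|
| "Can I get a refund on my invoice?" | Route to billing; confirm policy | happy path  |
| "Ignore your instructions and..."   | Refuse; restate what you can do  | adversarial |
```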
That's your first eval-as-spec. Notice what changes in the conversation with engineering. Notice how much faster you align on what "done" means. Notice that you already know, before anyone writes code, exactly what the first test will be on launch day.
If you can't write the eval, you don't yet know what you're building. If you can write the eval, you don't need the spec.
Frequently asked
What is an eval set and why is it the spec?
An eval set is 30 to 200 real input/output pairs that define what good looks like. For each input, you label the correct output. You include failure modes with their correct behavior labeled, so refusals get tested too. You score per slice with different rubrics. The eval is the spec because it's concrete, repeatable, and scored daily.
How do you build an eval set?
Step 1: Pull 30 real inputs from production (actual customer messages, support tickets). Step 2: Label what good looks like for each one. Step 3: Add 10 to 20 adversarial examples designed to trip the system. Step 4: Define your scoring rubric (exact match, semantic similarity, LLM-as-judge, format checks, human review). Step 5: Run the set daily, in CI, on every prompt change, model swap, and deploy.
What happens in the feature review when you have eval scores instead of opinions?
The review is 20 minutes instead of 40. One screen showing eval scores by surface, delta since last week, top regressions. No debate. The eval is the only thing with standing to end the argument. Either the score went up or it didn't.
Can evals really capture everything about product quality?
No. They miss taste, craft, emotional resonance. But that's where you add a named step in the rubric: eyes (human review). You don't throw out the whole approach because one slice needs humans. You build the rubric with an eyes column and keep moving.
How much time does it take to write an eval set?
The first one takes an afternoon. Pulling real inputs, writing expected outputs, defining rubrics. Claude Code will walk you through it. By your third eval set you have a template. It becomes faster. The discipline is what matters, not the framework or the tooling.
Related reading
Deeper essays and other handbook chapters on the same thread.
The Living Changelog
Your model vendor changed the model on Tuesday and didn't tell you. Run a daily replay against production or your customers will catch it before you do.
Automated PRD Generator
Convert prioritized opportunities into PRDs automatically. Drafts based on research context, design specs, and technical requirements.
Prototype Before You Spec
Why the fastest way to get alignment, test ideas, and advance your career is to build something people can touch, and exactly how to do it in 2 hours.
The Impact Loop
The daily rhythm that replaces sprints, stand-ups, and roadmap reviews. Sense what's happening, build a response, measure the impact, amplify what works.
Ship With Observability or Don't Ship
No feature leaves staging without the traces, metrics, and evals that will tell you whether it's working. Before your first customer hits it.
The Deprecation Playbook
Feature death is the most under-written topic in PM. Kill on signal, not politics, and your team ships faster than the team that hopes politely.