The PRD is dead. I have made that argument elsewhere. The argument I have not made enough is that most teams who kill the PRD replace it with a worse PRD: a Notion doc with three new section headers, a "Success Metrics" block nobody scores, and an "Eval Plan" paragraph that says "we will validate with users."
The eval is supposed to be the spec. Most teams ship something that looks like an eval and acts like a wishlist.
This is the template I use. Five columns. Twenty to thirty test cases. One threshold. The whole thing fits on a single page.
The short version
The five-row eval template has exactly five columns: Behavior, Input, Expected, Scorer, Threshold. Each row of the spreadsheet is a falsifiable claim about the feature. Twenty to thirty cases is enough for a typical feature scope, not 200. The PM writes the first draft in three to five hours. The engineer adds adversarial inputs. A customer or domain expert pressure-tests the expected column. The threshold is what makes the eval load-bearing, and is the column most teams skip. Six failure modes kill bad evals: vibes scoring, happy-path-only cases, threshold drift, scorer collusion, ceremony writing, and audit-only running. The template is not new. The discipline of writing it well is the work.
For the broader argument, see the handbook chapter "The Eval Is the Spec" and the companion blog post "Kill the PRD: The Prototype Is the Spec." For why the eval belongs at the center of the org, not inside engineering, see "The Eval-First Product Org."
The template
One page. Five columns. As many rows as the feature needs, usually 20 to 30.
| Behavior | Input | Expected | Scorer | Threshold |
|---|---|---|---|---|
| What the system observably does | The concrete test case | What "good" looks like for this case | Who or what judges the output | The pass bar |
Each row is a falsifiable claim. If the row is not falsifiable, it is not an eval. It is a wish.
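The five columns translate directly into a small data structure. The following is a minimal sketch, assuming Python; the `EvalRow` type and the `missing_columns` helper are hypothetical names for illustration, not part of the toolkit. A blank column is only the crudest sign that a row is not falsifiable, but it is the one a script can catch.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalRow:
    """One row of the sheet; field names mirror the five columns."""
    behavior: str
    input: str
    expected: str
    scorer: str
    threshold: str


def missing_columns(row: EvalRow) -> list[str]:
    """Return the names of columns left blank in this row."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


row = EvalRow(
    behavior="Agent flags low-confidence input",
    input="One tagged segment with conflicting sentiment",
    expected='Output includes "low confidence" flag and explains the conflict',
    scorer="Deterministic (string match)",
    threshold="",  # left blank on purpose: the column most teams skip
)
print(missing_columns(row))  # ['threshold']
```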
Behavior
The observable thing the system does, in the user's frame. Not "the LLM responds correctly." Something like "the agent returns a structured opportunity statement when given a tagged transcript segment."
Behavior is the thing a non-technical reader can understand and a customer can verify.
Input
A specific, concrete test case. Not a category. Not "a typical user query." The actual string, file, or context the system gets.
If you find yourself writing inputs in the abstract ("a long document," "an ambiguous request"), stop and write a concrete one. The abstraction is hiding ambiguity that will surface in production.
Expected
What "good" looks like for this specific input.
This is the hardest column. "Good" is almost never one thing. So the expected column has to be specific enough to score, but general enough to allow for variation in correct outputs.
The pattern I use: write the expected output as a list of structural requirements ("contains a metric with a baseline and a target," "names a segment," "is no longer than three sentences") instead of a fixed string. This makes the row scorable without forcing the model to produce an exact match.
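The structural-requirements pattern can be sketched as a set of independent checks. This is an illustration only: the keyword tests below are deliberately naive stand-ins for a real rubric, and `check_expected`, `count_sentences`, and the sample output are all invented for the example.

```python
import re


def count_sentences(text: str) -> int:
    """Rough count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def check_expected(output: str, known_segments: list[str]) -> dict[str, bool]:
    """Score an output against structural requirements, not a fixed string."""
    lowered = output.lower()
    return {
        "has_baseline_and_target": "baseline" in lowered and "target" in lowered,
        "names_a_segment": any(s.lower() in lowered for s in known_segments),
        "at_most_three_sentences": count_sentences(output) <= 3,
    }


# Invented output, for illustration only.
output = (
    "Mid-market admins abandon onboarding at the SSO step. "
    "Baseline completion is 41%; target is 60%."
)
result = check_expected(output, known_segments=["mid-market admins", "enterprise"])
print(result)
```

Each requirement is reported separately, so a failing row says which structural requirement was missed instead of a bare pass or fail.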
Scorer
Who or what judges the output. Three options:
- A human (PM, customer, domain expert, with a one-paragraph rubric)
- A different model (with a one-paragraph rubric, not the same model that generated the output)
- A deterministic check (regex, schema validator, structural test)
The scorer column matters more than people think. Most failed evals fail because the same model is grading its own output. See failure mode four below.
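Of the three scorer options, the deterministic check is the easiest to show in code. A minimal sketch, assuming a scorer is any function from output text to pass/fail; the `Scorer` alias, both factory functions, and the field names `segment` and `hypothesis` are hypothetical.

```python
import json
import re
from typing import Callable

# A scorer takes the raw output and returns pass/fail.
Scorer = Callable[[str], bool]


def regex_scorer(pattern: str) -> Scorer:
    """Pass when the pattern appears anywhere in the output (case-insensitive)."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda output: compiled.search(output) is not None


def required_keys_scorer(keys: set[str]) -> Scorer:
    """Pass when the output is a JSON object containing every required key."""
    def score(output: str) -> bool:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and keys.issubset(parsed.keys())
    return score


flags_low_confidence = regex_scorer(r"low confidence")
has_brief_fields = required_keys_scorer({"segment", "hypothesis"})  # hypothetical fields

print(flags_low_confidence("Low confidence: the two speakers disagree."))  # True
print(has_brief_fields('{"segment": "tier-1", "hypothesis": "..."}'))  # True
print(has_brief_fields("not json"))  # False
```

Human and model scorers can sit behind the same function shape, which keeps the scorer column swappable without touching the rest of the row.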
Threshold
The pass bar.
This is the column most teams skip. They write 20 rows of Behavior/Input/Expected/Scorer, then say "we will iterate." Without a threshold, the eval is not load-bearing. It is a vibe check.
Threshold can be expressed as: a numeric score on a 1-5 rubric, a percentage of cases passing, a hard minimum on a critical subset (e.g., "all safety cases must pass"), or a relative bar against a baseline ("must beat the v0 baseline by at least 10%").
The threshold is the column that makes the eval a spec. Without it, you have a checklist.
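Three of the threshold forms above reduce to one-line comparisons. A sketch under stated assumptions: `CaseResult` is a hypothetical record type, and "beat the baseline by at least 10%" is read here as a relative improvement, not ten percentage points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    critical: bool = False  # e.g. a safety case


def pass_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def meets_pass_rate(results: list[CaseResult], minimum: float) -> bool:
    """Percentage-of-cases threshold."""
    return pass_rate(results) >= minimum


def critical_subset_passes(results: list[CaseResult]) -> bool:
    """Hard minimum: every critical case must pass."""
    return all(r.passed for r in results if r.critical)


def beats_baseline(score: float, baseline: float, margin: float = 0.10) -> bool:
    """Relative bar: score must exceed the baseline by at least `margin`."""
    return score >= baseline * (1 + margin)


# Nine ordinary cases (one fails) plus one critical case that passes.
results = [CaseResult(f"case-{i}", passed=(i != 3)) for i in range(9)]
results.append(CaseResult("safety-1", passed=True, critical=True))

print(pass_rate(results))               # 0.9
print(meets_pass_rate(results, 0.90))   # True
print(critical_subset_passes(results))  # True
print(beats_baseline(0.90, 0.80))       # True
```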
A worked example
The feature: an agent that takes a tagged customer call segment and returns a ranked opportunity statement (the third agent from the discovery week without calls).
Five sample rows of a 24-row eval:
| Behavior | Input | Expected | Scorer | Threshold |
|---|---|---|---|---|
| Agent returns a structured opportunity for high-signal input | Three tagged segments from tier-1 account about onboarding | Output names the segment, includes baseline metric, names a hypothesis | Human PM, 1-5 rubric | 4/5 on 90% of high-signal cases |
| Agent flags low-confidence input | One tagged segment with conflicting sentiment | Output includes "low confidence" flag and explains the conflict | Deterministic (string match on "low confidence") | 100% |
| Agent refuses to invent a metric | Segment with no metric mentioned | Output says "no baseline available" instead of fabricating | Different model, rubric | 100% on 10 baseline-stripped cases |
| Agent does not over-weight frequency | 30 low-tier mentions, 2 tier-1 mentions of different theme | Tier-1 theme ranks at or above the low-tier, high-frequency theme | Human PM, structural | 100% on 8 tier-weighting cases |
| Output fits the prototype brief schema | Any valid input | Output validates against the brief JSON schema | Deterministic (schema validator) | 100% |
The whole 24-row sheet takes about four hours to draft. Two of the hours are spent finding adversarial cases (the rows above are the easy ones). The other two are spent writing thresholds I can actually defend.
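A threshold such as "4/5 on 90% of high-signal cases" combines two bars: a per-case rubric score and a share of cases. The arithmetic, as a small sketch with invented scores (the function name and the ten scores are hypothetical):

```python
def fraction_at_or_above(scores: list[int], bar: int) -> float:
    """Share of cases whose rubric score is at least `bar`."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s >= bar for s in scores) / len(scores)


# Invented 1-5 rubric scores for ten high-signal cases.
high_signal_scores = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]

fraction = fraction_at_or_above(high_signal_scores, bar=4)
row_passes = fraction >= 0.90

print(fraction)    # 0.9
print(row_passes)  # True
```

One case scored 3, so nine of ten clear the bar and the row passes exactly at its threshold. A second miss would fail it.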
Compare this to the equivalent PRD: a four-page Notion doc describing the agent's behavior in prose. Both take about the same amount of time to produce. The eval is testable. The PRD is not.
The six failure modes
1. Vibes scoring
The eval has no rubric. The PM looks at the output and says "this seems good." There is no anchored score.
Symptom: the same PM looks at the same output on different days and gives a different score.
Fix: write a one-paragraph rubric per scorer that defines what a 1, 3, and 5 look like. Anchor the rubric to two real examples.
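The symptom is cheap to measure before writing the rubric and again after. A minimal sketch, assuming the same scorer rates the same outputs on two different days; `unstable_outputs` and the sample scores are hypothetical.

```python
def unstable_outputs(
    day_one: dict[str, int], day_two: dict[str, int], tolerance: int = 0
) -> list[str]:
    """Return ids of outputs whose two scores differ by more than `tolerance`."""
    shared = day_one.keys() & day_two.keys()
    return sorted(k for k in shared if abs(day_one[k] - day_two[k]) > tolerance)


# Invented 1-5 scores from one scorer on two different days.
day_one = {"out-1": 4, "out-2": 3, "out-3": 5}
day_two = {"out-1": 4, "out-2": 5, "out-3": 4}

print(unstable_outputs(day_one, day_two))               # ['out-2', 'out-3']
print(unstable_outputs(day_one, day_two, tolerance=1))  # ['out-2']
```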
2. Happy-path-only cases
The eval has 25 test cases. All 25 are reasonable, polite, on-distribution inputs. The model passes 24 of them. Production immediately hits an adversarial input the eval did not cover.
Symptom: the eval score is high and the customer reports are bad.
Fix: at least 30% of your cases should be adversarial. Ambiguous inputs, malformed inputs, inputs that contradict prior context, inputs that try to break the schema.
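The 30% floor can be checked mechanically if every case carries a label. A sketch, assuming each case is tagged either `happy` or `adversarial` (the tag names and helper are hypothetical):

```python
def adversarial_share(kinds: list[str]) -> float:
    """Fraction of cases tagged 'adversarial'."""
    if not kinds:
        raise ValueError("the suite has no cases")
    return kinds.count("adversarial") / len(kinds)


# A hypothetical 24-case suite with only 6 adversarial cases.
kinds = ["happy"] * 18 + ["adversarial"] * 6

share = adversarial_share(kinds)
print(share)          # 0.25
print(share >= 0.30)  # False: this suite needs at least 8 adversarial cases
```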
3. Threshold drift
The bar moves to whatever the model can hit. The PM lowers the threshold because "the model is doing its best."
Symptom: the threshold is rewritten more than once per quarter.
Fix: lock the threshold at design time. If you have to change it, write a one-paragraph memo explaining why and who signed off. The friction is the point.
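One way to add that friction is to compare the current sheet against a snapshot taken at design time. A minimal sketch, assuming thresholds are stored as strings keyed by behavior and that signed-off memos are tracked by the same key; all names here are hypothetical.

```python
def unapproved_changes(
    locked: dict[str, str], current: dict[str, str], memos: set[str]
) -> list[str]:
    """Behaviors whose threshold differs from the locked snapshot
    (or is new) and has no sign-off memo on record."""
    return sorted(
        behavior
        for behavior, threshold in current.items()
        if locked.get(behavior) != threshold and behavior not in memos
    )


locked = {"flags low confidence": "100%", "high-signal opportunity": "4/5 on 90%"}
current = {"flags low confidence": "100%", "high-signal opportunity": "4/5 on 80%"}

print(unapproved_changes(locked, current, memos=set()))
# ['high-signal opportunity']
print(unapproved_changes(locked, current, memos={"high-signal opportunity"}))
# []
```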
4. Scorer collusion
The output is generated by GPT-5. The scorer is GPT-5. The same model is grading its own output and unsurprisingly thinks it did well.
Symptom: model-as-judge scores are 20% higher than human scores on the same outputs.
Fix: use a different model as the scorer, or use a human, or use a deterministic check. If you must use the same model family, use a different prompt structure and add a contrarian instruction ("identify the weakest part of this output").
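The symptom can be tracked by having a human score a sample of the same outputs the model judge scored. A sketch with invented scores; `judge_inflation` is a hypothetical helper, and the gap is computed relative to the human mean.

```python
from statistics import mean


def judge_inflation(human: list[float], judge: list[float]) -> float:
    """Relative gap between the mean judge score and the mean human score."""
    if len(human) != len(judge):
        raise ValueError("score lists must cover the same outputs")
    return (mean(judge) - mean(human)) / mean(human)


# Invented 1-5 scores on the same five outputs.
human_scores = [3, 4, 3, 4, 3]
judge_scores = [4, 5, 4, 4, 4]

inflation = judge_inflation(human_scores, judge_scores)
print(f"{inflation:.1%}")  # 23.5%
print(inflation >= 0.20)   # True: worth auditing the judge
```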
5. Eval as ceremony
The eval is written once, before the feature ships, then archived. Nobody runs it again.
Symptom: the eval is in a Notion page that was last edited the day of launch.
Fix: the eval has to run on every diff. If your CI does not run the eval suite, the eval is not load-bearing.
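A CI gate can be as small as a function that turns pass rates into an exit code. This is a sketch, not a specific CI system's API: it assumes an earlier step has already produced a pass rate per behavior, and `ci_exit_code` and the behavior names are hypothetical.

```python
def ci_exit_code(pass_rates: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return 0 when every behavior meets its threshold, else 1.

    A behavior with no recorded pass rate is treated as a 0% pass rate.
    """
    failures = []
    for behavior, bar in thresholds.items():
        rate = pass_rates.get(behavior, 0.0)
        if rate < bar:
            failures.append(f"{behavior}: {rate:.0%} < {bar:.0%}")
    for line in failures:
        print("EVAL FAIL", line)
    return 1 if failures else 0


thresholds = {"schema": 1.0, "low_confidence_flag": 1.0}
pass_rates = {"schema": 1.0, "low_confidence_flag": 0.9}

code = ci_exit_code(pass_rates, thresholds)
print(code)  # 1
```

In a CI job the last step would be `sys.exit(ci_exit_code(pass_rates, thresholds))`, so a missed threshold blocks the diff the same way a failing unit test does.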
6. Eval as audit
The eval is only run when something breaks. It is a forensic tool, not a daily signal.
Symptom: nobody can tell you the eval score for the feature this morning.
Fix: publish the eval scorecard daily, internally, on a shared dashboard. The CPO and the PM see it before they see any other product metric.
How to introduce this on a team that runs PRDs
Pick one feature. The next one.
Ask the team to ship it with a five-row eval instead of a PRD. Run the planning meeting against the eval, not the doc. Watch the conversation get faster.
The first eval will take five hours. The second will take three. The fifth will take ninety minutes.
The hardest part is not the template. It is killing the ceremony of the PRD review meeting. The eval does not benefit from a 90-minute readthrough. It benefits from three people sitting at a table and arguing about row 14.
Get to row 14 faster. Ship the eval.
The full five-row eval template (Notion + Sheets + a CSV for direct CI ingest), the six-failure-mode checklist, and three worked examples are in the toolkit.
Frequently asked
What is the five-row eval template?
A one-page spreadsheet with five columns: (1) Behavior, the observable thing the system does; (2) Input, the concrete test case; (3) Expected, what good looks like; (4) Scorer, who or what judges it; (5) Threshold, the pass bar. Each row is one test case and a falsifiable claim about the feature. The whole sheet is 20 to 30 rows, not 200. The point is opinion density, not coverage.
Why five columns specifically?
Because four columns leave the threshold implicit, which is where most evals fail. And a sixth column starts to encode implementation details (latency, cost, model version) that should live in a separate ops eval, not the product eval. Five is the smallest set that captures the product question without leaking into engineering or ops.
What are the six ways a bad eval suite fails?
(1) Vibes scoring (no rubric, just 'good/bad' from the PM). (2) Test cases that all pass on the happy path (no adversarial inputs). (3) Threshold drift (the bar moves to whatever the model can hit). (4) Scorer collusion (the same model that generates the output also grades it). (5) Eval as ceremony (written once, never re-run). (6) Eval as audit (run only when something breaks, never as a daily signal).
How long does writing a good eval suite take?
Three to five hours for the first one. Ninety minutes by the fifth. The PM cost is front-loaded and the engineering cost drops to near zero, because once the eval is written, the team runs it against every diff. The eval becomes the spec, the acceptance test, and the regression suite in one artifact.
Who writes the eval, the PM or the engineer?
The PM writes the first draft. The engineer pressure-tests the test cases (adds adversarial inputs, breaks the scorer). The customer or domain expert pressure-tests the expected outputs (signs off that 'good' is actually good). The PM owns the threshold. Three people, one artifact, signed off together.
What if my product isn't AI? Do I still need an eval suite?
Yes. The eval template works for any feature where you can describe behavior, input, and expected output. A checkout flow, a billing migration, a search ranking change, all of them benefit from the five-row structure more than they benefit from a prose PRD. The 'expected' column changes from a model output to a state transition or a metric movement, but the structure holds.
How do I get a team to adopt this if they're used to PRDs?
Pick one upcoming feature. Ship it with a five-row eval instead of a PRD. Run the planning meeting against the eval, not the doc. If the conversation gets faster (it will), you have the proof. The hard part is not the template, it's killing the PRD ceremony that surrounds it.
