The Five-Row Eval Template That Replaced My PRD

The PRD is dead. The eval suite replaces it. I have made that argument elsewhere. The argument I have not made enough is that most teams who kill the PRD replace it with a worse PRD: a Notion doc with three new section headers, a "Success Metrics" block nobody scores, and an "Eval Plan" paragraph that says "we will validate with users."

The eval suite is supposed to be the spec. Most teams ship something that looks like an eval and acts like a wishlist.

This is the five-row eval template I use. Five columns. Twenty to thirty test cases. One threshold. The whole thing fits on a single page.

The short version

The five-row eval template has exactly five columns: Behavior, Input, Expected, Scorer, Threshold. Each row of the spreadsheet is a falsifiable claim about the feature. Twenty to thirty cases is enough for a typical feature scope, not 200. The PM writes the first draft in three to five hours. The engineer adds adversarial inputs. A customer or domain expert pressure-tests the expected column. The threshold is what makes the eval load-bearing, and is the column most teams skip. Six failure modes kill bad evals: vibes scoring, happy-path-only cases, threshold drift, scorer collusion, ceremony writing, and audit-only running. The template is not new. The discipline of writing it well is the work.

For the broader argument, see the handbook chapter The Eval Is the Spec and the companion blog post Kill the PRD: The Prototype Is the Spec. For why the eval belongs at the center of the org, not inside engineering, see The Eval-First Product Org.

The template

One page. Five columns. As many rows as the feature needs, usually 20 to 30.

Behavior	Input	Expected	Scorer	Threshold
What the system observably does	The concrete test case	What "good" looks like for this case	Who or what judges the output	The pass bar

Each row is a falsifiable claim. If the row is not falsifiable, it is not an eval. It is a wish.
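A minimal sketch of one row as a data structure, assuming the five columns are plain strings. The `EvalRow` class and the `missing_columns` helper are hypothetical names for illustration, not part of any published template.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalRow:
    """One row of the template; field names mirror the five columns."""
    behavior: str
    input: str
    expected: str
    scorer: str
    threshold: str


def missing_columns(row: EvalRow) -> list[str]:
    """Return the names of any columns left blank (hypothetical helper)."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


row = EvalRow(
    behavior="Agent flags low-confidence input",
    input="One tagged segment with conflicting sentiment",
    expected='Output includes "low confidence" flag and explains the conflict',
    scorer='Deterministic (string match on "low confidence")',
    threshold="",  # left blank on purpose: without a pass bar this row is still a wish
)
print(missing_columns(row))
```

A blank column is only the cheapest signal that a row is not falsifiable. A filled-in row can still be a wish if the expected column cannot be scored.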

Behavior

The observable thing the system does, in the user's frame. Not "the LLM responds correctly." Something like "the agent returns a structured opportunity statement when given a tagged transcript segment."

Behavior is the thing a non-technical reader can understand and a customer can verify.

Input

A specific, concrete test case. Not a category. Not "a typical user query." The actual string, file, or context the system gets.

If you find yourself writing inputs in the abstract ("a long document," "an ambiguous request"), stop and write a concrete one. The abstraction is hiding ambiguity that will surface in production.

Expected

What "good" looks like for this specific input.

This is the hardest column. "Good" is almost never one thing. So the expected column has to be specific enough to score, but general enough to allow for variation in correct outputs.

The pattern I use: write the expected output as a list of structural requirements ("contains a metric with a baseline and a target," "names a segment," "is no longer than three sentences") instead of a fixed string. This makes the row scorable without forcing the model to produce an exact match.
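One way to make a requirement list scorable is to pair each requirement with a predicate over the output. The sketch below assumes the output is plain text and that a segment is introduced with a literal "Segment:" label. Both are assumptions for illustration, and the keyword checks are deliberately naive.

```python
import re


def sentence_count(text: str) -> int:
    """Naive counter: splits on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return len([p for p in parts if p.strip()])


# Each requirement is a name plus a predicate over the raw output string.
# The "Segment:" label and the keyword checks are illustrative assumptions.
REQUIREMENTS = {
    "names a segment": lambda out: re.search(r"\bsegment:", out, re.IGNORECASE) is not None,
    "metric has a baseline and a target": lambda out: "baseline" in out.lower() and "target" in out.lower(),
    "no longer than three sentences": lambda out: sentence_count(out) <= 3,
}


def check_expected(output: str) -> dict[str, bool]:
    """Evaluate every structural requirement and report each result by name."""
    return {name: check(output) for name, check in REQUIREMENTS.items()}


good = "Segment: tier-1 onboarding admins. Baseline activation is 40 percent; target is 55 percent."
vague = "We should improve onboarding."
print(check_expected(good))
print(check_expected(vague))
```

Reporting each requirement by name, instead of a single pass or fail, shows which part of "good" the output missed.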

Scorer

Who or what judges the output. Three options:

A human (PM, customer, domain expert, with a one-paragraph rubric)
A different model (with a one-paragraph rubric, not the same model that generated the output)
A deterministic check (regex, schema validator, structural test)

The scorer column matters more than people think. Most failed evals fail because the same model is grading its own output. See failure mode four below.
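The deterministic option is the easiest of the three to sketch. The example below assumes the agent returns either plain text or a JSON object. The required keys are a hypothetical stand-in for a real schema, and the structural check uses only the standard library instead of a full schema validator.

```python
import json

# Hypothetical required fields; a real brief schema would define its own.
REQUIRED_KEYS = {"segment", "metric", "hypothesis"}


def score_low_confidence_flag(output: str) -> bool:
    """String-match scorer: passes if the output carries the flag text."""
    return "low confidence" in output.lower()


def score_schema(output: str) -> bool:
    """Structural scorer: passes if the output is a JSON object with all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


print(score_low_confidence_flag("Low confidence: the segment has conflicting sentiment."))
print(score_schema('{"segment": "tier-1", "metric": "activation", "hypothesis": "shorter setup"}'))
print(score_schema('{"segment": "tier-1"}'))
```

Deterministic scorers cannot judge quality, but they give the same answer every time, which makes them a natural fit for rows with a 100% threshold.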

Threshold

The pass bar.

This is the column most teams skip. They write 20 rows of Behavior/Input/Expected/Scorer, then say "we will iterate." Without a threshold, the eval is not load-bearing. It is a vibe check.

A threshold can be expressed as a numeric score on a 1-5 rubric, a percentage of cases passing, a hard minimum on a critical subset (e.g., "all safety cases must pass"), or a relative bar against a baseline ("must beat the v0 baseline by at least 10%").

The threshold is the column that makes the eval a spec. Without it, you have a checklist.
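The threshold forms described in this section each reduce to a small predicate. This is a sketch under stated assumptions: the function names are hypothetical, rubric scores are integers from 1 to 5, and "beat the baseline by 10%" is read as a relative improvement, not ten percentage points.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of cases that passed; an empty list counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def meets_rubric_bar(scores: list[int], min_score: int, min_share: float) -> bool:
    """E.g. '4/5 on 90% of cases': enough cases score at or above min_score."""
    return pass_rate([s >= min_score for s in scores]) >= min_share


def meets_hard_minimum(critical_results: list[bool]) -> bool:
    """E.g. 'all safety cases must pass'; an empty critical subset fails."""
    return bool(critical_results) and all(critical_results)


def beats_baseline(candidate: float, baseline: float, margin: float = 0.10) -> bool:
    """Relative bar: candidate must exceed the baseline by at least `margin`."""
    return candidate >= baseline * (1 + margin)


scores = [5, 4, 4, 3, 4, 5, 4, 4, 5, 4]  # nine of ten cases score 4 or higher
print(meets_rubric_bar(scores, min_score=4, min_share=0.9))
print(meets_hard_minimum([True, True, False]))
print(beats_baseline(candidate=0.80, baseline=0.70))
```

Whatever form the bar takes, it should evaluate to a plain pass or fail so nobody has to interpret the result.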

A worked example

The feature: an agent that takes a tagged customer call segment and returns a ranked opportunity statement (the third agent from the discovery week without calls).

Five sample rows of a 24-row eval:

Behavior	Input	Expected	Scorer	Threshold
Agent returns a structured opportunity for high-signal input	Three tagged segments from tier-1 account about onboarding	Output names the segment, includes baseline metric, names a hypothesis	Human PM, 1-5 rubric	4/5 on 90% of high-signal cases
Agent flags low-confidence input	One tagged segment with conflicting sentiment	Output includes "low confidence" flag and explains the conflict	Deterministic (string match on "low confidence")	100%
Agent refuses to invent a metric	Segment with no metric mentioned	Output says "no baseline available" instead of fabricating	Different model, rubric	100% on 10 baseline-stripped cases
Agent does not over-weight frequency	30 low-tier mentions, 2 tier-1 mentions of a different theme	Tier-1 theme ranks at or above the high-frequency low-tier theme	Human PM, structural	100% on 8 tier-weighting cases
Output fits the prototype brief schema	Any valid input	Output validates against the brief JSON schema	Deterministic (schema validator)	100%
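Rows like the ones above can be kept as tab-separated text and loaded with the standard library. This is a minimal sketch, assuming the sheet is exported with the five column headers in the order shown. The `load_rows` helper is a hypothetical name, and only two of the sample rows are included to keep it short.

```python
import csv
import io

COLUMNS = ["Behavior", "Input", "Expected", "Scorer", "Threshold"]

# Two of the sample rows, joined with tabs the way a sheet export would be.
SHEET = "\n".join([
    "\t".join(COLUMNS),
    "\t".join([
        "Agent flags low-confidence input",
        "One tagged segment with conflicting sentiment",
        'Output includes "low confidence" flag and explains the conflict',
        'Deterministic (string match on "low confidence")',
        "100%",
    ]),
    "\t".join([
        "Output fits the prototype brief schema",
        "Any valid input",
        "Output validates against the brief JSON schema",
        "Deterministic (schema validator)",
        "100%",
    ]),
])


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse the sheet and refuse any header other than the five expected columns."""
    # QUOTE_NONE keeps the literal double quotes inside cells untouched.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


rows = load_rows(SHEET)
print(len(rows), rows[0]["Threshold"])
```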

The whole 24-row sheet takes about four hours to draft. Two of the hours are spent finding adversarial cases (the rows above are the easy ones). The other two are spent writing thresholds I can actually defend.

Compare this to the equivalent PRD: a four-page Notion doc describing the agent's behavior in prose. Both take about the same amount of time to produce. The eval is testable. The PRD is not.

The six failure modes

1. Vibes scoring

The eval has no rubric. The PM looks at the output and says "this seems good." There is no anchored score.

Symptom: the same PM looks at the same output on different days and gives a different score.

Fix: write a one-paragraph rubric per scorer that defines what a 1, 3, and 5 look like. Anchor the rubric to two real examples.
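The symptom can be measured directly by having the same scorer grade the same outputs twice and comparing the two passes. This is a sketch only: the `retest_disagreement` helper and the sample scores are made up for illustration.

```python
def retest_disagreement(first_pass: list[int], second_pass: list[int], tolerance: int = 0) -> float:
    """Share of outputs whose two scores differ by more than `tolerance` rubric points."""
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must score the same outputs")
    if not first_pass:
        raise ValueError("need at least one scored output")
    differing = [abs(a - b) > tolerance for a, b in zip(first_pass, second_pass)]
    return sum(differing) / len(differing)


monday = [4, 5, 3, 4]    # illustrative scores, same PM, same four outputs
thursday = [4, 3, 3, 5]
print(retest_disagreement(monday, thursday))               # exact-match disagreement
print(retest_disagreement(monday, thursday, tolerance=1))  # allow a one-point wobble
```

A high disagreement rate on unchanged outputs points at the rubric, not the model.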

2. Happy-path-only cases

The eval has 25 test cases. All 25 are reasonable, polite, on-distribution inputs. The model passes 24 of them. Production immediately hits an adversarial input the eval did not cover.

Symptom: the eval score is high and the customer reports are bad.

Fix: at least 30% of your cases should be adversarial. Ambiguous inputs, malformed inputs, inputs that contradict prior context, inputs that try to break the schema.
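The 30% floor is easy to check mechanically if each case carries a tag. The sketch below assumes a hypothetical boolean `adversarial` field on each case. How a team decides what counts as adversarial is still a judgment call.

```python
MIN_ADVERSARIAL_SHARE = 0.30  # the floor suggested in the fix above


def adversarial_share(cases: list[dict]) -> float:
    """Fraction of cases tagged adversarial; an empty suite counts as 0.0."""
    if not cases:
        return 0.0
    return sum(1 for c in cases if c.get("adversarial")) / len(cases)


# Illustrative suite: 25 cases, only 5 of them adversarial.
suite = [{"id": i, "adversarial": i < 5} for i in range(25)]
share = adversarial_share(suite)
print(f"{share:.0%} adversarial, floor met: {share >= MIN_ADVERSARIAL_SHARE}")
```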

3. Threshold drift

The bar moves to whatever the model can hit. The PM lowers the threshold because "the model is doing its best."

Symptom: the threshold is rewritten more than once per quarter.

Fix: lock the threshold at design time. If you have to change it, write a one-paragraph memo explaining why and who signed off. The friction is the point.
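One way to add that friction in tooling is to fingerprint the thresholds at design time and reject any change that arrives without a memo. This is a sketch under stated assumptions: the memo is a dict with hypothetical `reason` and `signed_off_by` fields, and the fingerprint is a SHA-256 hash of the canonical JSON.

```python
import hashlib
import json
from typing import Optional


def fingerprint(thresholds: dict) -> str:
    """Stable hash of the thresholds; key order does not affect the result."""
    canonical = json.dumps(thresholds, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def threshold_change_allowed(locked: str, current: dict, memo: Optional[dict]) -> bool:
    """Unchanged thresholds always pass; changed ones need a reason and a sign-off."""
    if fingerprint(current) == locked:
        return True
    return bool(memo and memo.get("reason") and memo.get("signed_off_by"))


locked = fingerprint({"high_signal_share": 0.9, "safety": 1.0})
lowered = {"high_signal_share": 0.8, "safety": 1.0}
print(threshold_change_allowed(locked, lowered, memo=None))
```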

4. Scorer collusion

The output is generated by GPT-5. The scorer is GPT-5. The same model is grading its own output and unsurprisingly thinks it did well.

Symptom: model-as-judge scores are 20% higher than human scores on the same outputs.

Fix: use a different model as the scorer, or use a human, or use a deterministic check. If you must use the same model family, use a different prompt structure and add a contrarian instruction ("identify the weakest part of this output").
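Both the guard and the symptom can be sketched in a few lines. The example assumes models are identified by a name string and that the "higher" in the symptom means a relative gap between mean scores. The function names and the sample scores are hypothetical.

```python
from statistics import mean


def assert_independent_scorer(generator_model: str, scorer_model: str) -> None:
    """Refuse to run when the generator and the scorer are the same model."""
    if generator_model.strip().lower() == scorer_model.strip().lower():
        raise ValueError(f"{scorer_model} would be grading its own output")


def judge_inflation(judge_scores: list[float], human_scores: list[float]) -> float:
    """Relative gap between the mean judge score and the mean human score."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("both scorers must grade the same outputs")
    human_mean = mean(human_scores)
    return (mean(judge_scores) - human_mean) / human_mean


# Illustrative scores on the same four outputs.
print(judge_inflation(judge_scores=[5, 5, 5, 5], human_scores=[4, 4, 4, 4]))
```

Comparing names only catches the exact-match case. It does not detect two models from the same family, which is where the contrarian prompt comes in.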

5. Eval as ceremony

The eval is written once, before the feature ships, then archived. Nobody runs it again.

Symptom: the eval is in a Notion page that was last edited the day of launch.

Fix: the eval has to run on every diff. If your CI does not run the eval suite, the eval is not load-bearing.
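A CI gate needs very little: run the cases, compare the pass rate to the bar, and return a non-zero exit code on failure. This is a minimal sketch, assuming per-row results arrive as a dict of booleans. The `ci_exit_code` name and the row ids are hypothetical, and a real script would pass the return value to `sys.exit`.

```python
def ci_exit_code(results: dict[str, bool], minimum_pass_rate: float) -> int:
    """Return 0 when the suite clears the bar, 1 otherwise (an empty suite fails)."""
    passed = sum(results.values())
    rate = passed / len(results) if results else 0.0
    print(f"eval pass rate: {passed}/{len(results)} ({rate:.0%}), bar: {minimum_pass_rate:.0%}")
    return 0 if rate >= minimum_pass_rate else 1


# Illustrative results for four rows; one regression on this diff.
results = {"row-01": True, "row-02": True, "row-03": False, "row-04": True}
code = ci_exit_code(results, minimum_pass_rate=0.9)
print("exit code:", code)
```

Failing on an empty suite is deliberate: a build that runs zero eval cases should not pass.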

6. Eval as audit

The eval is only run when something breaks. It is a forensic tool, not a daily signal.

Symptom: nobody can tell you the eval score for the feature this morning.

Fix: publish the eval scorecard daily, internally, on a shared dashboard. The CPO and the PM see it before they see any other product metric.

How to introduce this on a team that runs PRDs

Pick one feature. The next one.

Ask the team to ship it with a five-row eval instead of a PRD. Run the planning meeting against the eval, not the doc. Watch the conversation get faster.

The first eval will take five hours. The second will take three. The fifth will take ninety minutes.

The hardest part is not the template. It is killing the ceremony of the PRD review meeting. The eval does not benefit from a 90-minute readthrough. It benefits from three people sitting at a table and arguing about row 14.

Get to row 14 faster. Ship the eval.

The full five-row eval template (Notion + Sheets + a CSV for direct CI ingest), the six-failure-mode checklist, and three worked examples are in the toolkit. For the org built around evals, see The Eval-First Product Org.

Sources: Hamel Husain on evals that work, Eugene Yan on the eval mindset, Simon Willison on LLM tests, Claude Code.
