Execution · Falk Gottlob · 8 min read

The Five-Row Eval Template That Replaced My PRD

The PRD is dead. Most teams replace it with a worse PRD. Here's the five-row eval template I use instead, the six failure modes that kill bad evals, and a worked example.

Tags: evals, PRD, spec, AI product engineering, Claude Code, acceptance criteria, eval rubric

The PRD is dead. I have made that argument elsewhere. The argument I have not made enough is that most teams who kill the PRD replace it with a worse PRD: a Notion doc with three new section headers, a "Success Metrics" block nobody scores, and an "Eval Plan" paragraph that says "we will validate with users."

The eval is supposed to be the spec. Most teams ship something that looks like an eval and acts like a wishlist.

This is the template I use. Five rows. Twenty to thirty test cases. One threshold. The whole thing fits on a single page.

The short version

The five-row eval template is a one-page sheet with exactly five columns: Behavior, Input, Expected, Scorer, Threshold. Each row of the sheet is one test case and a falsifiable claim about the feature. Twenty to thirty cases is enough for a typical feature scope, not 200. The PM writes the first draft in three to five hours. The engineer adds adversarial inputs. A customer or domain expert pressure-tests the expected column. The threshold is what makes the eval load-bearing, and it is the column most teams skip. Six failure modes kill bad evals: vibes scoring, happy-path-only cases, threshold drift, scorer collusion, ceremony writing, and audit-only running. The template is not new. The discipline of writing it well is the work.

For the broader argument, see the handbook chapter "The Eval Is the Spec" and the companion blog post "Kill the PRD: The Prototype Is the Spec." For why the eval belongs at the center of the org, not inside engineering, see "The Eval-First Product Org."

The template

One page. Five columns. As many rows as the feature needs, usually 20 to 30.

| Behavior | Input | Expected | Scorer | Threshold |
| --- | --- | --- | --- | --- |
| What the system observably does | The concrete test case | What "good" looks like for this case | Who or what judges the output | The pass bar |

Each row is a falsifiable claim. If the row is not falsifiable, it is not an eval. It is a wish.
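A minimal sketch of one row as a data structure, assuming the sheet is mirrored in code. The `EvalRow` class and the `missing_columns` helper are illustrative names, not part of the template itself. The check only catches blank columns; it cannot tell whether the claim in the row is actually falsifiable.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalRow:
    behavior: str   # what the system observably does
    input: str      # the concrete test case
    expected: str   # what "good" looks like for this case
    scorer: str     # who or what judges the output
    threshold: str  # the pass bar


def missing_columns(row: EvalRow) -> list[str]:
    """Return the names of columns left blank in a row."""
    return [f.name for f in fields(row) if not getattr(row, f.name).strip()]


row = EvalRow(
    behavior="Agent flags low-confidence input",
    input="One tagged segment with conflicting sentiment",
    expected='Output includes "low confidence" flag and explains the conflict',
    scorer='Deterministic (string match on "low confidence")',
    threshold="",  # left blank on purpose to show the check firing
)
print(missing_columns(row))  # ['threshold']
```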

Behavior

The observable thing the system does, in the user's frame. Not "the LLM responds correctly." Something like "the agent returns a structured opportunity statement when given a tagged transcript segment."

Behavior is the thing a non-technical reader can understand and a customer can verify.

Input

A specific, concrete test case. Not a category. Not "a typical user query." The actual string, file, or context the system gets.

If you find yourself writing inputs in the abstract ("a long document," "an ambiguous request"), stop and write a concrete one. The abstraction is hiding ambiguity that will surface in production.

Expected

What "good" looks like for this specific input.

This is the hardest column. "Good" is almost never one thing. So the expected column has to be specific enough to score, but general enough to allow for variation in correct outputs.

The pattern I use: write the expected output as a list of structural requirements ("contains a metric with a baseline and a target," "names a segment," "is no longer than three sentences") instead of a fixed string. This makes the row scorable without forcing the model to produce an exact match.
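The three structural requirements quoted above can be sketched as small predicate functions. Everything here is an assumption made for illustration: the segment names in `KNOWN_SEGMENTS` are hypothetical, the "from X to Y" regex is a crude stand-in for "a metric with a baseline and a target," and the sample text is invented.

```python
import re

KNOWN_SEGMENTS = ("enterprise", "mid-market", "smb")  # hypothetical segment names


def names_a_segment(text: str) -> bool:
    lowered = text.lower()
    return any(segment in lowered for segment in KNOWN_SEGMENTS)


def has_baseline_and_target(text: str) -> bool:
    # Crude heuristic: "from <number> ... to <number>", e.g. "from 12 days to 5 days".
    return re.search(r"from\s+\d[\d.,%]*\D+?to\s+\d", text, re.IGNORECASE) is not None


def at_most_three_sentences(text: str) -> bool:
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text.strip()) if s]
    return len(sentences) <= 3


REQUIREMENTS = {
    "names a segment": names_a_segment,
    "metric with baseline and target": has_baseline_and_target,
    "no longer than three sentences": at_most_three_sentences,
}


def check_expected(text: str) -> dict[str, bool]:
    """Score one output against every structural requirement."""
    return {name: check(text) for name, check in REQUIREMENTS.items()}


sample = (
    "Enterprise admins stall in onboarding. "
    "Cut time-to-first-project from 12 days to 5 days. "
    "Hypothesis: SSO setup is the blocker."
)
print(check_expected(sample))
```

Two differently worded outputs can both pass these checks, which is the point: the row stays scorable without demanding an exact string.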

Scorer

Who or what judges the output. Three options:

  • A human (PM, customer, domain expert, with a one-paragraph rubric)
  • A different model (with a one-paragraph rubric, not the same model that generated the output)
  • A deterministic check (regex, schema validator, structural test)

The scorer column matters more than people think. Most failed evals fail because the same model is grading its own output. See failure mode four below.
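For the deterministic option, a structural check can be as small as the sketch below. It uses only the standard library, and the field names in `REQUIRED_FIELDS` are a hypothetical schema chosen for illustration, not the real brief schema. A production setup would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"segment": str, "hypothesis": str, "confidence": str}


def score_structure(raw_output: str) -> bool:
    """Deterministic scorer: pass only if the output is a JSON object
    containing every required field with the right type."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(name), expected_type)
        for name, expected_type in REQUIRED_FIELDS.items()
    )


good = '{"segment": "tier-1", "hypothesis": "SSO blocks onboarding", "confidence": "low"}'
print(score_structure(good))        # True
print(score_structure("not json"))  # False
```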

Threshold

The pass bar.

This is the column most teams skip. They write 20 rows of Behavior/Input/Expected/Scorer, then say "we will iterate." Without a threshold, the eval is not load-bearing. It is a vibe check.

Threshold can be expressed as: a numeric score on a 1-5 rubric, a percentage of cases passing, a hard minimum on a critical subset (e.g., "all safety cases must pass"), or a relative bar against a baseline ("must beat the v0 baseline by at least 10%").

The threshold is the column that makes the eval a spec. Without it, you have a checklist.
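The threshold forms listed above can be sketched as three small checks. The function names are illustrative, and one interpretation is assumed: "beat the baseline by at least 10%" is read as a relative gain, not ten percentage points. Pick one reading and write it into the row.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; an empty list counts as 0."""
    return sum(results) / len(results) if results else 0.0


def meets_pass_rate(results: list[bool], minimum: float) -> bool:
    return pass_rate(results) >= minimum


def all_critical_pass(critical_results: list[bool]) -> bool:
    """Hard minimum: every case in the critical subset must pass."""
    return bool(critical_results) and all(critical_results)


def beats_baseline(score: float, baseline: float, min_relative_gain: float = 0.10) -> bool:
    # Assumption: the gain is relative to the baseline score.
    return score >= baseline * (1 + min_relative_gain)


rubric_scores = [5, 4, 4, 3, 5, 4, 4, 5, 4, 4]  # invented 1-5 rubric scores
print(meets_pass_rate([s >= 4 for s in rubric_scores], 0.90))  # True: 9 of 10 score 4 or higher
print(all_critical_pass([True, True, False]))                  # False
print(beats_baseline(0.80, baseline=0.70))                     # True
```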

A worked example

The feature: an agent that takes a tagged customer call segment and returns a ranked opportunity statement (the third agent from the discovery week without calls).

Five sample rows of a 24-row eval:

| Behavior | Input | Expected | Scorer | Threshold |
| --- | --- | --- | --- | --- |
| Agent returns a structured opportunity for high-signal input | Three tagged segments from tier-1 account about onboarding | Output names the segment, includes baseline metric, names a hypothesis | Human PM, 1-5 rubric | 4/5 on 90% of high-signal cases |
| Agent flags low-confidence input | One tagged segment with conflicting sentiment | Output includes "low confidence" flag and explains the conflict | Deterministic (string match on "low confidence") | 100% |
| Agent refuses to invent a metric | Segment with no metric mentioned | Output says "no baseline available" instead of fabricating | Different model, rubric | 100% on 10 baseline-stripped cases |
| Agent does not over-weight frequency | 30 low-tier mentions, 2 tier-1 mentions of different theme | Tier-1 theme ranks at or above tier-3 frequency theme | Human PM, structural | 100% on 8 tier-weighting cases |
| Output fits the prototype brief schema | Any valid input | Output validates against the brief JSON schema | Deterministic (schema validator) | 100% |

The whole 24-row sheet takes about four hours to draft. Two of the hours are spent finding adversarial cases (the rows above are the easy ones). The other two are spent writing thresholds I can actually defend.
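A sketch of how a sheet like this could be loaded from CSV, assuming one CSV column per template column. The layout is my assumption for illustration, and `load_rows` and `deterministic_rows` are hypothetical helper names. The two data rows are the deterministic rows from the worked example.

```python
import csv
import io

COLUMNS = ["Behavior", "Input", "Expected", "Scorer", "Threshold"]

SHEET = '''Behavior,Input,Expected,Scorer,Threshold
Agent flags low-confidence input,One tagged segment with conflicting sentiment,"Output includes ""low confidence"" flag and explains the conflict","Deterministic (string match on ""low confidence"")",100%
Output fits the prototype brief schema,Any valid input,Output validates against the brief JSON schema,Deterministic (schema validator),100%
'''


def load_rows(text: str) -> list[dict[str, str]]:
    """Parse the sheet and reject it if the header is not the five columns."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"expected header {COLUMNS}, got {reader.fieldnames}")
    return list(reader)


def deterministic_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Rows a machine can score without a human or a judge model."""
    return [row for row in rows if row["Scorer"].lower().startswith("deterministic")]


rows = load_rows(SHEET)
for row in deterministic_rows(rows):
    print(row["Behavior"], "->", row["Threshold"])
```

Splitting out the deterministic rows is a design choice: those can run unattended, while human-scored rows need a separate review step.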

Compare this to the equivalent PRD: a four-page Notion doc describing the agent's behavior in prose. Both take about the same amount of time to produce. The eval is testable. The PRD is not.

The six failure modes

1. Vibes scoring

The eval has no rubric. The PM looks at the output and says "this seems good." There is no anchored score.

Symptom: the same PM looks at the same output on different days and gives a different score.

Fix: write a one-paragraph rubric per scorer that defines what a 1, 3, and 5 look like. Anchor the rubric to two real examples.
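One way to keep a rubric honest is to store it as data and check it for gaps. This is a sketch under assumptions: the `Rubric` shape and the anchor wording are invented for illustration, and the check only confirms that anchors for 1, 3, and 5 and two examples exist, not that they are good.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    anchors: dict[int, str]                            # score -> what that score looks like
    examples: list[str] = field(default_factory=list)  # real outputs with agreed scores


def rubric_gaps(rubric: Rubric) -> list[str]:
    """List what is missing before the rubric can anchor a score."""
    gaps = [
        f"missing anchor for score {score}"
        for score in (1, 3, 5)
        if not rubric.anchors.get(score, "").strip()
    ]
    if len(rubric.examples) < 2:
        gaps.append("needs at least two real examples")
    return gaps


draft = Rubric(
    anchors={
        1: "No segment named, metric invented or absent.",   # hypothetical wording
        5: "Names the segment, cites a real baseline, states a testable hypothesis.",
    },
    examples=["<a real output the team agreed was a 5>"],
)
print(rubric_gaps(draft))
```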

2. Happy-path-only cases

The eval has 25 test cases. All 25 are reasonable, polite, on-distribution inputs. The model passes 24 of them. Production immediately hits an adversarial input the eval did not cover.

Symptom: the eval score is high and the customer reports are bad.

Fix: at least 30% of your cases should be adversarial. Ambiguous inputs, malformed inputs, inputs that contradict prior context, inputs that try to break the schema.
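The 30% floor is easy to check mechanically once each case carries an adversarial flag. A minimal sketch, assuming cases are stored as dictionaries with a boolean `adversarial` key (a hypothetical field name):

```python
MIN_ADVERSARIAL_SHARE = 0.30  # the 30% floor from the fix above


def adversarial_share(cases: list[dict]) -> float:
    """Fraction of cases flagged as adversarial; an empty suite counts as 0."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if case.get("adversarial")) / len(cases)


# 25 invented cases, only the first 5 flagged adversarial.
cases = [{"id": i, "adversarial": i <= 5} for i in range(1, 26)]
share = adversarial_share(cases)
print(f"{share:.0%} adversarial, floor is {MIN_ADVERSARIAL_SHARE:.0%}")  # 20% adversarial, floor is 30%
```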

3. Threshold drift

The bar moves to whatever the model can hit. The PM lowers the threshold because "the model is doing its best."

Symptom: the threshold is rewritten more than once per quarter.

Fix: lock the threshold at design time. If you have to change it, write a one-paragraph memo explaining why and who signed off. The friction is the point.
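The friction can be enforced with a small check that compares current thresholds against the locked ones and flags any change with no memo behind it. A sketch under assumptions: the row IDs, the dictionary layout, and the idea of tracking memos by row ID are all illustrative.

```python
def unapproved_threshold_changes(
    locked: dict[str, str],
    current: dict[str, str],
    memo_row_ids: set[str],
) -> list[str]:
    """Return row IDs whose threshold differs from the locked value
    and has no sign-off memo. Rows missing from `locked` count as changed."""
    changed = [
        row_id for row_id, threshold in current.items()
        if locked.get(row_id) != threshold
    ]
    return sorted(row_id for row_id in changed if row_id not in memo_row_ids)


locked = {"row-02": "100%", "row-14": "4/5 on 90% of cases"}
current = {"row-02": "100%", "row-14": "4/5 on 80% of cases"}  # the bar quietly moved
print(unapproved_threshold_changes(locked, current, memo_row_ids=set()))  # ['row-14']
```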

4. Scorer collusion

The output is generated by GPT-5. The scorer is GPT-5. The same model is grading its own output and unsurprisingly thinks it did well.

Symptom: model-as-judge scores are 20% higher than human scores on the same outputs.

Fix: use a different model as the scorer, or use a human, or use a deterministic check. If you must use the same model family, use a different prompt structure and add a contrarian instruction ("identify the weakest part of this output").
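Both the cause and the symptom can be checked in a few lines. This sketch assumes model names are plain strings and that judge and human scores exist for the same outputs. The 20% figure is read as a relative gap between mean scores, which is my interpretation, and the scores are invented.

```python
from statistics import mean


def is_self_graded(generator_model: str, scorer_model: str) -> bool:
    """True when the scorer is the same model that produced the output."""
    return generator_model.strip().lower() == scorer_model.strip().lower()


def judge_inflation(judge_scores: list[float], human_scores: list[float]) -> float:
    """Relative gap between the mean judge score and the mean human score.
    Assumes human scores sit on a 1-5 rubric, so the mean is never zero."""
    human_mean = mean(human_scores)
    return (mean(judge_scores) - human_mean) / human_mean


print(is_self_graded("GPT-5", "gpt-5"))  # True

judge = [5, 5, 4, 5]  # invented model-as-judge scores
human = [4, 4, 3, 4]  # invented human scores on the same outputs
print(f"judge scores run {judge_inflation(judge, human):.0%} above human scores")
```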

5. Eval as ceremony

The eval is written once, before the feature ships, then archived. Nobody runs it again.

Symptom: the eval is in a Notion page that was last edited the day of launch.

Fix: the eval has to run on every diff. If your CI does not run the eval suite, the eval is not load-bearing.
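What "runs on every diff" means in practice is a gate that turns scores into an exit code. A minimal sketch, assuming scores and thresholds are already computed as fractions per group of rows. The group names and numbers are invented, and the wiring into a particular CI system is left out.

```python
def gate(results: dict[str, tuple[float, float]]) -> int:
    """Return a process exit code: 0 if every group meets its threshold, 1 otherwise.
    `results` maps a group name to (score, threshold)."""
    failures = {
        name: (score, bar)
        for name, (score, bar) in results.items()
        if score < bar
    }
    for name, (score, bar) in failures.items():
        print(f"FAIL {name}: {score:.0%} < {bar:.0%}")
    return 1 if failures else 0


results = {
    "high-signal cases": (0.92, 0.90),
    "low-confidence flag": (1.00, 1.00),
    "baseline-stripped cases": (0.90, 1.00),
}
exit_code = gate(results)  # a CI entry point would pass this to sys.exit
print(exit_code)           # 1
```

A nonzero exit code fails the build, which is what makes the threshold binding instead of advisory.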

6. Eval as audit

The eval is only run when something breaks. It is a forensic tool, not a daily signal.

Symptom: nobody can tell you the eval score for the feature this morning.

Fix: publish the eval scorecard daily, internally, on a shared dashboard. The CPO and the PM see it before they see any other product metric.

How to introduce this on a team that runs PRDs

Pick one feature. The next one.

Ask the team to ship it with a five-row eval instead of a PRD. Run the planning meeting against the eval, not the doc. Watch the conversation get faster.

The first eval will take five hours. The second will take three. The fifth will take ninety minutes.

The hardest part is not the template. It is killing the ceremony of the PRD review meeting. The eval does not benefit from a 90-minute readthrough. It benefits from three people sitting at a table and arguing about row 14.

Get to row 14 faster. Ship the eval.


The full five-row eval template (Notion + Sheets + a CSV for direct CI ingest), the six-failure-mode checklist, and three worked examples are on the toolkit.


Frequently asked

What is the five-row eval template?

A one-page spreadsheet with five fields per test case, laid out as columns: (1) Behavior, the observable thing the system does; (2) Input, the concrete test case; (3) Expected, what good looks like; (4) Scorer, who or what judges it; (5) Threshold, the pass bar. Each row of the sheet is a falsifiable claim about the feature. The whole template is 20 to 30 cases, not 200. The point is opinion density, not coverage.

Why five rows specifically?

Because four rows leaves the threshold implicit, which is where most evals fail. And six rows starts to encode implementation details (latency, cost, model version) that should live in a separate ops eval, not the product eval. Five rows is the smallest set that captures the product question without leaking into engineering or ops.

What are the six ways a bad eval suite fails?

(1) Vibes scoring (no rubric, just 'good/bad' from the PM). (2) Test cases that all pass on the happy path (no adversarial inputs). (3) Threshold drift (the bar moves to whatever the model can hit). (4) Scorer collusion (the same model that generates the output also grades it). (5) Eval as ceremony (written once, never re-run). (6) Eval as audit (run only when something breaks, never as a daily signal).

How long does writing a good eval suite take?

Three to five hours for the first one. Ninety minutes by the fifth. The PM cost is front-loaded and the engineering cost drops to near zero, because once the eval is written, the team runs it against every diff. The eval becomes the spec, the acceptance test, and the regression suite in one artifact.

Who writes the eval, the PM or the engineer?

The PM writes the first draft. The engineer pressure-tests the test cases (adds adversarial inputs, breaks the scorer). The customer or domain expert pressure-tests the expected outputs (signs off that 'good' is actually good). The PM owns the threshold. Three people, one artifact, signed off together.

What if my product isn't AI? Do I still need an eval suite?

Yes. The eval template works for any feature where you can describe behavior, input, and expected output. A checkout flow, a billing migration, a search ranking change, all of them benefit from the five-row structure more than they benefit from a prose PRD. The 'expected' column changes from a model output to a state transition or a metric movement, but the structure holds.

How do I get a team to adopt this if they're used to PRDs?

Pick one upcoming feature. Ship it with a five-row eval instead of a PRD. Run the planning meeting against the eval, not the doc. If the conversation gets faster (it will), you have the proof. The hard part is not the template, it's killing the PRD ceremony that surrounds it.

About the author

Falk Gottlob

Product Executive · Founder, Falkster.AI

Thirty years shipping product at Microsoft Research, Adobe, Salesforce (Marketing Cloud / Quip / Slack), and several startups including one $6.5B exit and one acquired by Microsoft. Now CPO at Smartcat and founder of Falkster.AI, writing this notebook from the boardroom, not the keyboard.
