
Pick any ten product orgs at AI-native companies. Ask each one question.
"Show me the eval dashboard you ran this morning, and tell me which feature's quality is trending down this week."
Two of the ten can answer. The other eight either don't run evals in production, run them quarterly, or have evals buried inside engineering and out of sight from the product leadership that needs them.
This is the single biggest unforced error I see in 2026. The teams that will still be standing in 2028 are the ones that rebuilt the org around evals before the board forced them to. This post is the rebuild.
The short version
The Eval-First Product Org puts evals at the center, not inside engineering or data science. Four functions: a Quality Spine (5-8 people, owns rubrics), Product Builder pods (4-6 people each, own outcomes), an Economics Unit (3-4 people, owns per-outcome margin and model routing), and a Discovery Network (centralized customer research and analytics). Three functions go away: standalone Product Ops, the separate AI/ML team, and research as an isolated shared service. The smallest first step is publishing eval scorecards for your top three AI features within six weeks. That proves the model before you reorganize anyone.
For the seven-stage operating model that connects all four functions, see the PM Operating System. For what the board expects to see from this org, see The CPO Mandate 2026. For the agent fleet the Quality Spine grades, see Your AI Agent Fleet.
Why evals belong at the center, not the edge
When most orgs finally get around to building evals, they put them in one of three bad places.
- Inside engineering. Evals become test infrastructure, run by ML engineers, reviewed by nobody in product. Quality becomes an engineering problem. Product ships, engineering complains, quality drifts.
- Inside a data science team. Evals become a research artifact. They exist, they're sophisticated, and nobody can find them during a production incident.
- Inside a "trust and safety" team. Evals become a compliance function, checked before launch and never again. Drift goes undetected because nobody owns post-launch quality.
Each of these placements is wrong for the same reason. Evals are a product function. They define what "good" means for the feature, and "good" is a product decision.
The right placement is at the center of the product org, owned by product, visible to everyone, and tied to the metrics the board actually sees. That is the Eval-First Product Org.
What an Eval-First Product Org looks like
Imagine a 200-person product organization today. Typical structure: VP of Product at the top, three or four directors underneath, each running a cluster of product managers, each paired with designers and engineering leads. Product Ops sits somewhere on the side. Analytics is a shared service. Research is another shared service.
This is a 2022 org chart. It was built for a world where the scarce resource was engineering capacity and the expensive bottleneck was coordination.
Here's what I would rebuild instead.
The Quality Spine
At the center of the org, reporting directly to the CPO, a small team I call the Quality Spine. Five to eight people for a 200-person org. Their charter has three parts:
- Define what "good" means. They own the eval rubrics for every AI-driven feature in the product. They don't write the rubrics alone; they partner with the Product Builders who own each feature, but they enforce the standard.
- Run the evals. Daily automated runs on production samples. Weekly synthesis. Monthly deep dives. Quarterly audit for drift against the baseline.
- Publish the scorecard. Internally visible to every team. The scorecard is the product org's equivalent of a P&L. It gets read every week.
The Quality Spine is senior. Think L6 and L7 talent. This is not a junior function. It's the position with the most leverage in the org, because its output determines what everyone else trusts as "shipped."
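To make the scorecard concrete, here's a minimal sketch of what a rubric and a daily eval run could look like as data. The feature name, criteria, thresholds, and scores are all illustrative assumptions, not a prescribed schema.

```python
# Sketch of a Quality Spine scorecard entry. All names and thresholds
# below are made up for illustration.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Rubric:
    feature: str
    # criterion name -> minimum acceptable mean score on a 1-5 scale
    criteria: dict[str, float]

@dataclass
class EvalRun:
    rubric: Rubric
    # per-criterion scores from today's production sample, each 1-5
    scores: dict[str, list[float]] = field(default_factory=dict)

    def scorecard(self) -> dict[str, float]:
        """The published view: mean score per criterion."""
        return {c: mean(s) for c, s in self.scores.items()}

    def below_threshold(self) -> list[str]:
        """Criteria that get a named owner at Monday's quality stand-up."""
        card = self.scorecard()
        return [c for c, floor in self.rubric.criteria.items()
                if card.get(c, 0.0) < floor]

rubric = Rubric("support-agent-replies",
                {"factual_accuracy": 4.0, "tone": 3.5, "resolution": 3.0})
run = EvalRun(rubric, {"factual_accuracy": [4.2, 3.1, 4.5],
                       "tone": [4.0, 3.8, 3.9],
                       "resolution": [2.5, 3.2, 2.8]})
print(run.below_threshold())  # -> ['factual_accuracy', 'resolution']
```

The point of the sketch is the shape, not the code: a rubric is a named threshold per criterion, a run is scores against it, and "below threshold" is what turns into a Monday-morning action item.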
Product Builder pods
The product managers, designers, and front-line engineers become Product Builder pods. Four to six people per pod. Each pod owns one customer-facing outcome end-to-end: an activation rate, a retention number, a specific agent's delivery metric.
Each pod has a published eval rubric for every AI-driven step in their product area. The Quality Spine reviews and signs off on the rubric before any production ship. The pod runs the evals daily. The scorecard visibility keeps the pod honest.
Crucially, pods do not have a product manager separate from the designers and engineers. A pod is a Product Builder team. The title on the org chart may still say PM, designer, or engineer, but the work is shared.
The Economics Unit
One small team, three or four people, reporting to the CPO or jointly to the CPO and CFO. Their job is unit economics: cost per outcome, margin per outcome, token spend per customer, latency-cost tradeoffs per feature.
This team does not exist in most product orgs today. It is the single biggest org gap in 2026. Every Product Builder pod needs a partner who can tell them, in real time, what each ship is costing the business. Without this team, product leaders are flying blind on the new P&L.
The Economics Unit also owns the model routing strategy for the whole org. Which features should run on a fast, cheap model. Which need the expensive one. When to change. They publish this guidance internally.
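A minimal sketch of the two views the Economics Unit publishes: cost per outcome, and a routing rule. The model names, prices, latency figures, and thresholds are invented for illustration, not a recommendation.

```python
# Illustrative Economics Unit views. All models, prices, and budgets
# here are hypothetical.
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    usd_per_1k_tokens: float
    p50_latency_ms: int

CHEAP = ModelOption("fast-small", 0.0004, 300)
PREMIUM = ModelOption("slow-large", 0.0150, 2200)

def cost_per_outcome(token_spend_usd: float, outcomes_delivered: int) -> float:
    """The number every pod sees weekly: spend divided by outcomes shipped."""
    return token_spend_usd / max(outcomes_delivered, 1)

def route(eval_score: float, latency_budget_ms: int) -> ModelOption:
    """Published routing guidance: stay on the cheap model unless quality
    demands more AND the feature's latency budget can absorb it."""
    if eval_score >= 4.0:          # cheap model already clears the rubric
        return CHEAP
    if latency_budget_ms >= PREMIUM.p50_latency_ms:
        return PREMIUM
    return CHEAP                   # can't afford the latency; flag for redesign

print(round(cost_per_outcome(1200.0, 8000), 4))  # -> 0.15 per outcome
print(route(3.2, 5000).name)                     # -> slow-large
```

Note the routing function takes the eval score as an input: this is where the Economics Unit and the Quality Spine meet, because you cannot route on cost alone without knowing what the cheap model's quality actually is.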
The Discovery Network
Customer research and analytics get folded into a single function called the Discovery Network. Small, centralized, serving all pods. Their job shifts: less usability testing, more agent telemetry synthesis, more failure-mode interviews, more opportunity mapping.
The Discovery Network is also where European and regulated-market expertise lives, because customer discovery in regulated markets is now a specialized skill that regular product pods should not have to master individually.
What goes away
Three functions get absorbed or eliminated.
Product Ops disappears as a standalone team. The valuable pieces (tooling, templates, playbooks, process health) get distributed. Tooling goes to engineering. Templates and playbooks go to the Quality Spine. Process health gets automated into dashboards that the CPO reads directly.
The separate "AI team" or "ML team" that sits next to product is dismantled. If your ML engineers are sitting in a silo labeled "AI," the rest of the org treats AI as someone else's problem. In an Eval-First org, every pod has the capability to build and ship AI. The ML specialists embed; they don't live in a separate house.
The "PMO" or program management layer, if you have one, becomes a single coordinator who reports to the CPO. Their job is not to run program management, it is to maintain the outcome ledger that goes to the board every quarter. One person. Not a team.
The week-over-week cadence
An Eval-First org runs on a specific weekly rhythm. Here's the cadence I've used.
Monday morning, 30 minutes: Quality stand-up. Quality Spine reads the weekend scorecard aloud. Any feature below threshold gets a named owner and a deadline. Any drift trend gets discussed. This is the product org's equivalent of a production incident review.
Tuesday and Wednesday: Pod work. Product Builder pods ship. They build prototypes, they run evals locally, they push to production. The Quality Spine is available for consultation, not gatekeeping.
Thursday, 45 minutes: Economics review. Economics Unit presents the week's cost and margin trends by pod. Any negative margin gets a deadline for resolution. The CFO or finance partner attends.
Friday, 30 minutes: Ship log. Each pod posts what they shipped, what it moved, and what they're betting next week. The ship log is public. Any board member who wants to read it can.
No status meetings. No update-of-updates meetings. Nothing that exists to convey information that could have been read from a dashboard.
What this unlocks
I've seen three patterns show up consistently when orgs make this shift.
- Product leaders stop being the quality bottleneck. Most directors today are the last line of defense against shipping something broken, because there is no systemic way to catch quality drift. When the Quality Spine exists, directors stop reviewing every ship and start reviewing the systems that review every ship. Leverage goes up.
- Product and engineering stop fighting about ownership of AI features. The eval rubric is the contract. Once the rubric is signed off by the Quality Spine, engineering builds to it and product ships against it. The ambiguity that fuels the typical product-engineering friction disappears.
- The conversation with the board gets better. The board can read the scorecard. The CPO's job stops being explaining what's going on and starts being articulating what to bet on next.
The objections
Three things come up every time I propose this structure.
"We can't afford a Quality Spine." You can. You're already spending the money. It is currently distributed across every PM, every engineering team, and every incident review. Centralizing it reduces total cost, because the current distributed spending is duplicate and under-leveraged.
"Our culture won't accept a centralized quality function." Your culture is not going to accept a board member asking why your production evals dropped 15% last month and nobody noticed. Pick the culture problem you'd rather have.
"We don't have the talent to staff this." You have more than you think. The senior PMs you have who are most frustrated with the current system are usually the ones who want to run evals. The ML engineers who are tired of being on the AI Island will apply the day you post the role. This hire is easier than a lot of the others on your plate.
Where to start if you can't rebuild the whole org
Most CPOs reading this can't reorganize 200 people next quarter. Here's the minimum viable Eval-First move for a team that can't rebuild yet.
- Hire or promote one Quality Lead. One person. Reports to you. Charter: define evals for the three highest-traffic AI features in the product, get them running in production, publish the scorecard internally within 60 days.
- Pair that person with one Economics Lead. Can be a finance partner on secondment. Charter: build the cost-per-outcome view for the same three features in the same 60 days.
- Make those two people the standing audience for every product review meeting. If a feature gets reviewed, the Quality Lead and Economics Lead are in the room.
That's two hires or reassignments. 60 days. You now have the nucleus of an Eval-First org, and the rest of the structure can grow around it over the next year.
The product orgs that don't make this move are going to be re-orged by their boards in 2027. The ones that make it voluntarily get to choose the shape.
I've published the eval rubric template I use and the Quality Lead job description on the toolkit at falkster.com/toolkit. Fork them and ship them this week.
Frequently asked
Why do evals belong at the center of the product org, not inside engineering?
Evals define what 'good' means for a feature, and 'good' is a product decision. Putting evals inside engineering makes quality an engineering problem: product ships, engineering complains, quality drifts. Putting them inside data science makes them a research artifact nobody finds during incidents. Putting them inside trust and safety makes them a compliance gate that gets checked once and never re-run. Center placement is the only one where the right people own the right thing.
What is the Quality Spine in an Eval-First Product Org?
A small team (5-8 people in a 200-person org) reporting directly to the CPO. They define what good means for every AI-driven feature, run the evals daily, and publish the scorecard internally. Senior people, L6 and L7. Their output is the product org's equivalent of a P&L.
What is a Product Builder pod and how is it different from a feature team?
Four to six people who own one customer-facing outcome end-to-end (an activation rate, a retention number, a specific agent's delivery metric). Every member can build prototypes and read evals. The PM, designer, and front-line engineer titles persist on the org chart but the work is shared across the pod. The Quality Spine signs off on the eval rubric before any production ship.
What is the Economics Unit and why does the product org need one?
Three or four people reporting jointly to CPO and CFO. They own per-outcome unit economics: cost per outcome, margin per outcome, token spend per customer, latency-cost tradeoffs per feature. They also own the model routing strategy (which features run on the cheap fast model, which need the expensive one). This team does not exist in most product orgs today and is the single biggest org gap in 2026.
What goes away when you rebuild around evals?
Three things. Product Ops as a standalone function (its valuable pieces redistribute: tooling to engineering, templates to the Quality Spine, process health to dashboards). The separate AI/ML team that sits next to product (every pod gets the capability instead). And research as an isolated shared service (it folds into a Discovery Network that sits next to but not inside pods).
How long does the Eval-First rebuild take?
Roughly 90 days for the structural shifts in a mid-sized org, but the cultural shift takes a quarter beyond that. The eval scorecard has to become the artifact every weekly leadership meeting opens with. Until that happens, the rebuild has not landed.
What's the smallest first step toward an Eval-First Product Org?
Pick the three most-used AI features in your product. For each, write a one-page rubric (3-5 criteria, scored 1-5, with at least 20 test cases). Run the rubric weekly. Publish the result internally. Six weeks of doing this is the smallest version of the Quality Spine and proves the model before you reorganize anyone.