Why this workshop exists
The teams I work with have the same problem. They've shipped AI features to production. They know they should have evals. They haven't built them because building them from scratch feels like a research project, and nobody has time for research projects.
So quality drifts. Model updates change behavior. Customer complaints surface patterns that your own evals should have caught first. The team knows this is a problem. The team keeps pushing it to next quarter.
This workshop solves it in two days.
What we build together
Day 1
Design
We pick your three highest-stakes AI features. For each one, we:
- Write the quality rubric. Three to five scoring criteria, calibrated to your specific product and customer expectations.
- Build the test case library. Twenty diverse cases per feature: happy path, edge cases, adversarial inputs. Generated collaboratively, reviewed by your team.
- Define the thresholds. What scores mean “ship,” “needs work,” and “block.”
- Decide the measurement cadence. Automated on every change, sampled in production daily, or both.
By end of day one, your team has three written rubrics, sixty test cases, and a clear measurement plan. Signed off.
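To make the design deliverables concrete, here is a rough sketch, in Python, of one way a rubric and its thresholds might be written down. The feature name, criteria, and numbers are placeholders for illustration only; the rubrics we write on day one are calibrated to your product, not copied from this.

```python
# Illustrative only: one way to write down a rubric and its thresholds.
# The feature name, criteria, and numbers are placeholders, not recommendations.
from dataclasses import dataclass


@dataclass
class Rubric:
    feature: str
    criteria: dict[str, str]   # criterion name -> what the scorer checks for
    ship_threshold: float      # average score at or above this: "ship"
    block_threshold: float     # below this: "block"; in between: "needs work"


support_reply_rubric = Rubric(
    feature="support_reply_draft",
    criteria={
        "grounded": "Every factual claim is supported by the ticket or the linked docs.",
        "tone": "Matches the support voice; no over-promising.",
        "actionable": "Gives the customer a concrete next step.",
    },
    ship_threshold=0.85,
    block_threshold=0.60,
)
```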
Day 2
Build
We open Claude Code and we build. Together. Your engineers, your PMs, and me, in one room. By end of day two:
- The eval harness runs in your environment against real production samples.
- The first scoring report is generated and reviewed.
- The team has seen me walk through every part of the build and can maintain it without me.
- The handoff document, written by Claude as we build, is in your repo.
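For a sense of what "eval harness" means here, below is a stripped-down skeleton: load test cases, run the feature, score each criterion, compare the result against the thresholds. The two functions marked as assumptions stand in for your product code and your scorer; the harness we actually build on day two runs in your environment against your production samples.

```python
# A minimal harness skeleton, not the workshop deliverable. The two functions
# marked "assumption" stand in for your product code and your judge/scorer.
import json
from statistics import mean


def run_feature(case: dict) -> str:
    # Assumption: call your real feature here (API, SDK, or internal service).
    return f"drafted reply for: {case['input']}"


def score(output: str, criterion: str) -> float:
    # Assumption: replace with an LLM-as-judge or rule-based scorer per criterion.
    return 1.0 if output else 0.0


def evaluate(cases: list[dict], criteria: list[str],
             ship: float = 0.85, block: float = 0.60) -> dict:
    per_case = []
    for case in cases:
        output = run_feature(case)
        scores = {c: score(output, c) for c in criteria}
        per_case.append({"id": case["id"], "scores": scores})
    overall = mean(mean(r["scores"].values()) for r in per_case)
    verdict = "ship" if overall >= ship else "block" if overall < block else "needs work"
    return {"overall": round(overall, 3), "verdict": verdict, "cases": per_case}


if __name__ == "__main__":
    cases = [
        {"id": "happy-1", "input": "Where is my refund?"},
        {"id": "edge-1", "input": ""},
    ]
    print(json.dumps(evaluate(cases, ["grounded", "tone", "actionable"]), indent=2))
```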
What you get
Three production-ready eval rubrics for your top features
An eval harness running in your codebase
Sixty test cases, curated and reviewed
A written runbook for maintaining and extending the system
A 30-minute follow-up call two weeks after the workshop to review what's been running in production
What this is not
Not a classroom course. Your team writes code, not notes.
Not a vendor pitch for an eval platform. We use open tooling. You own the harness.
Not a one-size-fits-all framework. Every rubric and every test case is specific to your product.
Who this is for
Engineering and product leaders whose AI features are live in production and whose eval infrastructure is either missing or informal. Typical attendees:
2–3 senior engineers
2–3 product managers or Product Builders
1 technical leader (VP Engineering, CTO, or senior EM)
Total workshop group: 5 to 8 people. Smaller groups get more hands-on time.
Prerequisites
1. Your team has Claude Code or Cursor set up and working before day one.
2. Access to the production samples we'll evaluate against (anonymized is fine).
3. A willing engineer who has commit access to the relevant codebase.
4. Your team has blocked two full days (not half-days, not “as they can make it”). The workshop only works if the room is committed.
Investment
Discussed during scoping
Flat fee. Includes travel anywhere in North America or Europe and all materials. Additional travel costs may apply for locations outside North America and Europe. Payment: 50% on signing, 50% on day one.
Availability
One workshop per month.
Currently booking: Q3 2026.
“We'd been talking about building evals for 18 months. Falk got us to a running harness in 48 hours and our team owns it. Worth every dollar.”
How to start
Fill out the form below with your company, the three features you'd want to work on, and the likely date range.
Frequently asked
Can the workshop be remote?
Onsite is the default because the room being committed is half the value. Fully remote workshops haven't worked as well; the team gets pulled into other meetings. Hybrid (one onsite day + one remote) is sometimes possible; raise it on the scoping call.
What if our team doesn't use Claude Code?
Cursor works too, with minor adjustments to the harness. If your team is on a different stack entirely, we'll discuss it on the scoping call. I won't push you onto a tool that doesn't fit your workflow.
Do you offer ongoing eval support after the workshop?
Yes. There's an optional eval operations retainer, starting at three months, where I answer questions as they come up, review the team's rubric updates, and help when new features launch. Most teams keep it for 3–6 months. We discuss terms during scoping.
What if our top features change between booking and the workshop?
That's normal. Send me the updated three the week before. The shape of the workshop doesn't change.