AI Agents · Falk Gottlob · 7 min read

I Gave My AI Agents a Performance Review. Three Got Fired.

If an agent does the work of a team member, manage it like one. The scorecard, the coaching loop, and the firing criteria for the agents on your product team.


A performance scorecard for an AI agent showing four metrics: precision, escalation rate, time-to-output, and trust trend, with a verdict line of keep, coach, or fire.

The short version

If an agent does the work of a team member, it should be managed like one, including being let go when it underperforms. I run a quarterly performance review on every agent in my fleet using four metrics: precision, escalation rate, time-to-output, and trust trend. Last cycle, three agents failed it and I shut them down. Not because they never worked, but because they were confidently wrong in ways I could not catch at review time, which makes an agent worse than useless. The industry talks about agents like magic that either works or does not. The honest version is that agents are workers with track records, and most teams are keeping net-negative ones running out of sunk cost. Here is the scorecard, the coaching loop, and the criteria for when to stop coaching and fire the thing.

Last quarter I sat down and did something that felt slightly absurd. I gave my AI agents performance reviews. Same structure I would use for a person: a scorecard, a look at the trend, and a verdict. Keep, coach, or fire.

Three got fired.

I am telling you this because the way the industry talks about agents is dishonest. Agents are sold as magic. They either work or they do not, and if they do not you tweak the prompt until the demo looks good again. That is not how anything that does real work gets managed, and it is why so many teams have a fleet of agents that are quietly making them slower.

The mindset shift: an agent is a worker, not a feature

Here is the reframe that changed how I run this. An agent that drafts your release notes, triages your tickets, or summarizes your research is not a feature you shipped. It is a worker you hired. It has a job, a track record, and a cost. The cost is not the API bill. The cost is the human time spent checking its work and the damage when a check gets missed.

Once you see agents as headcount, the management questions write themselves. Is this one good at its job? Is it getting better or worse? Would I be faster without it? Those are performance questions, and they deserve performance answers, not vibes. This is the operational half of what I mean by PM as a team of AI agents: the team part is not a metaphor, it includes the part where you manage people out.

An agent that does real work is not a feature you shipped. It is a worker you hired, and a worker with no review is a worker you are not actually managing.

- The reframe

The four-metric scorecard

I grade every agent on four things. They travel across any agent, from a research summarizer to an autonomous bug-fixer.

Precision. The share of its outputs that are correct and usable without rework. This is the headline number, but on its own it lies, which is why it is one of four.

Escalation rate. How often the agent flagged its own uncertainty versus how often it should have. This is the metric nobody tracks and the one that matters most. An agent that never escalates is not confident, it is dangerous, because it is making calls it should be handing up. I would rather have an agent that escalates too much than one that has never once said "I am not sure."

Time-to-output. Wall-clock time from trigger to usable result. An agent that produces a correct answer slower than I could is a net loss no matter how impressive the output.

Trust trend. Over the quarter, are my reviewers approving more of its work untouched, or less? The trend is more informative than any single score. A 70% precision agent climbing is a keep. A 90% precision agent sliding is a problem I need to understand now.
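A minimal sketch of how these four numbers could be computed from a log of human-reviewed runs. The `ReviewedRun` schema, the field names, and the choice to measure trust trend as "later half minus earlier half" are assumptions made for illustration, not the tooling behind this post.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ReviewedRun:
    """One agent run after a human has reviewed it (hypothetical schema)."""
    correct: bool              # usable without rework
    escalated: bool            # the agent flagged its own uncertainty
    should_escalate: bool      # reviewer's judgment: should it have handed this up?
    seconds_to_output: float   # wall-clock time from trigger to usable result
    approved_untouched: bool   # reviewer approved it with no edits


def scorecard(runs):
    """Compute the four metrics over one review period, runs ordered oldest first."""
    if len(runs) < 2:
        raise ValueError("need at least two reviewed runs")

    precision = sum(r.correct for r in runs) / len(runs)

    # Escalation rate: of the runs that deserved an escalation, how many got one.
    needed = [r for r in runs if r.should_escalate]
    escalation_rate = sum(r.escalated for r in needed) / len(needed) if needed else 1.0

    time_to_output = median(r.seconds_to_output for r in runs)

    # Trust trend (assumed definition): untouched-approval rate in the later
    # half of the period minus the rate in the earlier half.
    mid = len(runs) // 2

    def untouched_rate(chunk):
        return sum(r.approved_untouched for r in chunk) / len(chunk)

    trust_trend = untouched_rate(runs[mid:]) - untouched_rate(runs[:mid])

    return {
        "precision": precision,
        "escalation_rate": escalation_rate,
        "time_to_output": time_to_output,
        "trust_trend": trust_trend,
    }
```

The `should_escalate` field is the expensive one: it only exists if a reviewer records, per run, whether the agent ought to have handed the call up. Without that label the escalation rate cannot be measured at all.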

Why I fired three

The three I shut down did not fail because they never worked. They failed on the metric that gets people hurt: they were confidently wrong in ways I could not detect at review time.

That is the disqualifying failure. An agent that is sometimes wrong and tells you when it is unsure is manageable: you put a gate where the uncertainty is. An agent that is wrong with full confidence and a clean-looking output is a trap, because the only way to catch it is to redo the work. And if I have to redo the work, I do not have an agent. I have a generator of plausible mistakes that costs me a review every time.

One of the three was a competitive-intel summarizer that produced beautiful, structured briefs with occasional invented details I only caught because I happened to know the source. The output quality is exactly what made it dangerous. A worse-looking agent would have made me check. This one earned a trust it had not actually built. Fired.

An agent that is wrong with confidence and a clean-looking output is not underperforming. It is a trap with good production values.

- The disqualifying failure

Coaching is diagnosis, not prompt-tweaking

Before you fire one, you coach it, and coaching is where most teams go wrong. They treat every failure as a prompt problem and tweak words until the demo passes. That is not coaching. That is superstition with a text box.

Real coaching is diagnosis. When an agent misses, I categorize why. Was the prompt ambiguous? Was the context I fed it incomplete? Or is the task simply wrong for an agent, something that needs judgment the model does not have? Each diagnosis fixes a different layer, and only one of them is the prompt. I run this as a weekly loop borrowed from how I run prompt operations: collect the misses, cluster them, fix the single most common failure mode, re-measure. If the trust trend does not move after two cycles of honest coaching, it was never a coaching case.
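The weekly loop can be sketched in a few lines. The three `Diagnosis` categories come from the questions above. The function names, and the reading of "does not move" as "no higher than it was two cycles ago", are assumptions for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    """Why a miss happened. Each value points at a different layer to fix."""
    AMBIGUOUS_PROMPT = "prompt"
    INCOMPLETE_CONTEXT = "context"
    WRONG_TASK = "task"


def top_failure_mode(misses) -> Optional[Diagnosis]:
    """Return the single most common diagnosis among this week's misses."""
    if not misses:
        return None
    return Counter(misses).most_common(1)[0][0]


def still_a_coaching_case(trust_by_cycle, min_gain=0.0):
    """trust_by_cycle: untouched-approval rate before coaching, then after each cycle.

    Returns False once two cycles have completed without the rate improving
    by more than min_gain (a hypothetical threshold).
    """
    if len(trust_by_cycle) < 3:
        return True  # fewer than two completed cycles: keep coaching
    return trust_by_cycle[-1] - trust_by_cycle[-3] > min_gain


week = [
    Diagnosis.INCOMPLETE_CONTEXT,
    Diagnosis.AMBIGUOUS_PROMPT,
    Diagnosis.INCOMPLETE_CONTEXT,
    Diagnosis.WRONG_TASK,
]
fix_next = top_failure_mode(week)
```

In this example `fix_next` is `Diagnosis.INCOMPLETE_CONTEXT`, so the week's fix goes into the context pipeline and the prompt stays untouched.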

Agents need an SLA, not applause

The last piece is accountability. Any agent that runs unattended and touches something that matters needs a defined service level, the same as a piece of infrastructure. What it is allowed to do alone. What it must escalate. How fast a human responds when it does. I cover the mechanics in shipping with observability, but the principle is simple: an autonomous worker with no SLA is a decision-maker with no accountability owner, and that is how one bad unattended run becomes a customer-facing incident.
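One way to make those three questions concrete is to write the SLA down as data. This is a sketch under assumptions: the `AgentSLA` class, the action names, the 30-minute response window, and the default of escalating anything unlisted are all hypothetical choices, not a description of a specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSLA:
    """Service level for one unattended agent (hypothetical structure)."""
    allowed_alone: frozenset       # actions the agent may take with no human
    must_escalate: frozenset       # actions that always go to a human
    human_response_minutes: int    # how fast a human must pick up an escalation


def route(sla, action):
    """Decide whether the agent may proceed or must hand the action up."""
    if action in sla.must_escalate:
        return "escalate"
    if action in sla.allowed_alone:
        return "proceed"
    # Assumed default: anything the SLA does not mention goes to a human.
    return "escalate"


def response_overdue(sla, minutes_waiting):
    """True when an escalation has waited longer than the SLA allows."""
    return minutes_waiting > sla.human_response_minutes


release_notes_sla = AgentSLA(
    allowed_alone=frozenset({"draft_release_notes"}),
    must_escalate=frozenset({"publish_release_notes"}),
    human_response_minutes=30,
)
```

The deny-by-default branch is the design choice that matters: an action nobody thought to list is exactly the kind of call an unattended agent should not be making alone.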

This is the unglamorous part of the agent story that the demos skip. Agents are not magic and they are not employees, but the management discipline you apply to them is closer to managing people than managing software. They have good days and bad stretches. They drift. They need clear expectations and a real review. And sometimes, after you have honestly tried to coach them, the right call is to let them go.

Run the review on your fleet this quarter. Pick your four metrics, grade every agent, and look hardest at the escalation rate, because that is where the dangerous ones hide. I will bet you have at least one agent you have been keeping alive out of sunk cost that you would never re-hire today. That is the one to fire first.

Sources: Anthropic, on agent evaluation and reliability · Hamel Husain, on evals for LLM systems · OpenAI, on building reliable agents


Frequently asked

What does it mean to give an AI agent a performance review?

It means evaluating an agent on a fixed scorecard the way you would evaluate a team member: how often it is right, how often it should have escalated but did not, how fast it produces usable output, and whether your trust in it is rising or falling over time. The point is to stop treating agents as magic that either works or does not, and start treating them as workers with a measurable track record you can coach or remove. If an agent does the work of a headcount, it earns the scrutiny of one.

What metrics should you use to evaluate an AI agent?

Four that travel across any agent. Precision, the share of its outputs that are correct and usable without rework. Escalation rate, how often it flagged uncertainty versus how often it should have, because an agent that never escalates is more dangerous than one that escalates too much. Time-to-output, the wall-clock from trigger to usable result. And trust trend, whether the human reviewers are approving more of its work untouched over time or less. The trend matters more than any single score.

When should you fire an AI agent instead of fixing it?

Fire it when the failure is structural, not promptable. If an agent is confidently wrong in ways you cannot detect at review time, if it requires so much checking that it is slower than doing the task yourself, or if its trust trend is falling despite repeated tuning, it is not a coaching case, it is a removal case. Keeping a net-negative agent running because it was hard to build is the same sunk-cost mistake as keeping a bad hire because recruiting was painful.
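Those three conditions can be written as a small decision function. This is a sketch, not a rule to adopt as-is: the parameter names, the zero-tolerance threshold for confident errors that slipped past review, and the two-cycle cutoff are assumptions drawn from the criteria in this post.

```python
def verdict(
    *,
    confident_errors_missed_in_review: int,
    agent_seconds: float,
    review_seconds: float,
    human_seconds: float,
    trust_trend: float,
    coaching_cycles: int,
) -> str:
    """Return "keep", "coach", or "fire" for one agent (illustrative thresholds)."""
    # Confidently wrong in ways review did not catch: structural, not promptable.
    if confident_errors_missed_in_review > 0:
        return "fire"

    # Running the agent plus checking its work is no faster than doing the task.
    if agent_seconds + review_seconds >= human_seconds:
        return "fire"

    # Trust is falling: coach first, fire if two cycles of coaching did not help.
    if trust_trend < 0:
        return "fire" if coaching_cycles >= 2 else "coach"

    return "keep"


example = verdict(
    confident_errors_missed_in_review=0,
    agent_seconds=10,
    review_seconds=20,
    human_seconds=100,
    trust_trend=-0.05,
    coaching_cycles=1,
)
```

Here `example` is `"coach"`: the agent is faster than doing the work by hand and has no missed confident errors, but trust is sliding and it has had only one coaching cycle.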

How is coaching an agent different from just changing the prompt?

Prompt changes are one tool, but coaching an agent means diagnosing why it failed and fixing the right layer. Sometimes that is the prompt, sometimes it is the context you are feeding it, sometimes it is the task being wrong for an agent at all. A coaching loop is structured: review the misses weekly, categorize them, fix the most common failure mode, and re-measure. Random prompt-tweaking without measurement is not coaching, it is superstition, and most teams are doing the superstition version.

Should agents have something like an on-call or SLA?

Yes, if they touch anything that matters. An agent that runs unattended needs a defined service level: what it is allowed to do alone, what it must escalate, and how fast a human responds when it does. Without that, you have a system making decisions with no accountability owner. Treating agents like infrastructure with an SLA, rather than like a clever toy, is what separates teams that scale agents safely from teams that get burned by one bad autonomous run.

About the author

Falk Gottlob

Product Executive · Founder, Falkster.AI

Thirty years shipping product at Microsoft Research, Adobe, Salesforce (Marketing Cloud / Quip / Slack), and several startups including one $6.5B exit and one acquired by Microsoft. Now CPO at Smartcat and founder of Falkster.AI, writing this notebook from the boardroom, not the keyboard.
