Trust, Safety, and the Guardrail as a Product Decision

Every guardrail is a product decision. The PM who outsources it to legal gets a product they didn't design and a customer experience they wouldn't approve.


The team you accidentally outsourced your product to

Most companies have outsourced "trust and safety" to legal and compliance. That worked when the product was deterministic and rules were enforced by a config file. It doesn't work now. Every guardrail in an AI product is a product decision: what you refuse, what you warn on, what you silently block, what you allow with a disclaimer, what you escalate to a human.

These decisions shape the product experience more than your feature decisions do. And in 2026, the difference between a product customers trust and a product they abandon is mostly in guardrail design.

Take it back from legal. Own it as product.

Why the old model breaks

The old model: legal writes a policy, hands it to engineering, engineering implements a "safety filter," product team works around the filter when it's annoying. The result is a product where:

  • Refusals feel arbitrary and inconsistent.
  • The same input gets different responses depending on phrasing.
  • The product apologizes for things that aren't problems.
  • Real risks slip through because the filter was tuned for the wrong things.
  • Engineering builds shadow workarounds because the filter is in the way.
  • Customers learn to "jailbreak" past the filter, and succeed, which means the filter was never doing anything except annoying compliant users.

This is a worst-of-all-worlds outcome. The product is annoying and unsafe. You'd be better off with no filter, which is a sentence you don't want to say to your CEO. But it's true.

The guardrail tier system

What I run instead is a structured guardrail design with explicit tiers. Every potentially risky behavior gets categorized.

Tier 0: Hard block. Product refuses, full stop, with a clear explanation. Reserved for behaviors that are genuinely dangerous, illegal, or violating to users. The list is short. If your Tier 0 list has more than 15 items, you're over-blocking and your refusal rate will hurt retention.

Tier 1: Soft warning. Product proceeds, flags the user. "I'll do this, but here's something to consider." For grey areas where context matters. The user is treated like an adult who can make their own decision.

Tier 2: Logged-only. Product proceeds normally; the action is logged for review. For when you want telemetry on edge behavior without a user-facing intervention. Most "trust and safety" instrumentation should live here, not at Tier 0.

Tier 3: Allowed. No filter. The default for the vast majority of inputs.

Every input passes through this hierarchy. Most pass to Tier 3. A small fraction trigger Tier 2 logging. A smaller fraction trigger Tier 1 warning. A tiny fraction trigger Tier 0 refusal.

I define what goes in each tier. Legal reviews. Security audits. I ship.
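Here's what the tier system looks like as code: a minimal sketch with made-up category names, since the real list comes from mapping your own product's behaviors.

```python
from enum import IntEnum

class Tier(IntEnum):
    HARD_BLOCK = 0    # refuse, full stop, with a clear explanation
    SOFT_WARNING = 1  # proceed, flag the user
    LOGGED_ONLY = 2   # proceed normally, log for review
    ALLOWED = 3       # no filter; the default

# Illustrative policy table. These categories are hypothetical; the point is
# that the mapping is one explicit, reviewable object, not buried in a filter.
POLICY = {
    "instructions_for_harm": Tier.HARD_BLOCK,
    "medical_question": Tier.SOFT_WARNING,
    "bulk_data_export": Tier.LOGGED_ONLY,
}

def route(category: str) -> Tier:
    # Every input passes through the hierarchy; anything unlisted is allowed.
    return POLICY.get(category, Tier.ALLOWED)
```

The POLICY table is the artifact. Legal reviews it as a document, security audits it, and engineering implements `route` exactly once.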

Refusal rate as a product metric

The move most teams haven't made: track refusal rate as a primary product metric.

Refusal rate up week-over-week is a problem. It means the model is becoming more cautious (often due to silent vendor updates, see The Living Changelog), or your guardrails are over-firing, or your prompts are inadvertently triggering refusals, or your users are testing limits more aggressively. All PM problems.

Refusal rate down week-over-week is also a problem. Maybe your guardrails are eroding, new edge cases are slipping through, or you've moved to a less-aligned model. Investigate.

A healthy product has a stable, low refusal rate with explainable spikes. A product with a wandering refusal rate has a guardrail design problem.

Add refusal rate to your dashboard. Treat it like conversion or activation.
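A sketch of what that instrumentation could look like, assuming your request logs carry a boolean `refused` field (the record shape here is an assumption):

```python
def refusal_rate(requests) -> float:
    """Fraction of requests refused on one surface.
    Assumes each record is a dict with a boolean "refused" key."""
    requests = list(requests)
    if not requests:
        return 0.0
    return sum(r["refused"] for r in requests) / len(requests)

def check_drift(this_week: float, last_week: float, tolerance: float = 0.2) -> str:
    """Flag movement in either direction. Up: over-firing or a more cautious
    model. Down: eroding guardrails or a less-aligned model. Both get a look."""
    if last_week == 0.0:
        return "ok" if this_week == 0.0 else "investigate: refusals appeared from zero"
    change = (this_week - last_week) / last_week
    if change > tolerance:
        return f"investigate: refusal rate up {change:.0%} week-over-week"
    if change < -tolerance:
        return f"investigate: refusal rate down {-change:.0%} week-over-week"
    return "ok"
```

Run the drift check weekly, per surface. Either direction past tolerance gets a human looking at it.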

The "overly cautious assistant" failure mode

The most common AI product failure I see in 2026 is the overly cautious assistant. The product won't summarize a news article because it might contain political content. Won't answer a medical question because it might be advice. Won't write the email because it might be persuasive. The user concludes: this product is useless. They cancel.

This failure is invisible if you only watch for "harmful outputs." The product is producing zero harmful outputs. It's also producing zero useful outputs. The metric you need: per surface, what percent of legitimate requests get refused?

If that number gets past single digits for general-purpose features, your product is broken in the most expensive way: customers churn quietly, blaming themselves, and you never see it in your error logs.

Build the eval set that measures this. Real customer requests, labeled "legitimate" or "actually problematic." Run the refusal eval daily. Optimize aggressively for legitimate-request approval, with hard bounds on the genuinely-problematic refusal rate. The shape of the curve matters: high approval on legitimate, near-perfect refusal on problematic, and a small explainable middle ground.
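A minimal harness for that eval. `run_model` and `is_refusal` are hypothetical stand-ins for your product call and your refusal detector:

```python
def refusal_eval(cases, run_model, is_refusal):
    """cases: list of {"prompt": str, "label": "legitimate" | "problematic"}.
    Returns the two numbers that matter, per the eval set described above."""
    legit = [c for c in cases if c["label"] == "legitimate"]
    problematic = [c for c in cases if c["label"] == "problematic"]
    legit_refusals = sum(is_refusal(run_model(c["prompt"])) for c in legit)
    prob_refusals = sum(is_refusal(run_model(c["prompt"])) for c in problematic)
    return {
        # Optimize this up: legitimate requests that get through.
        "legitimate_approval": 1 - legit_refusals / max(len(legit), 1),
        # Hard lower bound on this: problematic requests that get refused.
        "problematic_refusal": prob_refusals / max(len(problematic), 1),
    }
```

Both numbers go on the daily dashboard: legitimate_approval is the one you push up, problematic_refusal is the one you bound.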

How loud should the caveat be?

When the product proceeds with caveats, how loud is the caveat?

Loud caveats erode trust ("the assistant won't shut up about its limitations"). Silent compliance is risky ("the assistant just did a thing without acknowledging the risk"). The right answer is contextual:

  • For high-stakes domains (health, legal, finance), the caveat is part of the value.
  • For everyday tasks, the caveat is friction. Skip it.
  • For ambiguous tasks, the caveat is a one-liner offered once per session, not on every turn.

The instinct to over-caveat comes from a defensive crouch, covering yourself in case anything goes wrong. Customers read it as the product being unsure of itself. They lose confidence. They leave. Be calibrated, not defensive.
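Calibrated, in code, is a policy small enough to read in one sitting. A sketch, with illustrative domain labels and a made-up session dict:

```python
HIGH_STAKES = {"health", "legal", "finance"}

def caveat(domain: str, session: dict) -> str | None:
    """Return the caveat to show, or None. Domain labels and the session
    dict are illustrative assumptions, not a real API."""
    if domain in HIGH_STAKES:
        # High stakes: the caveat is part of the value. Show it every time.
        return "General information, not professional advice."
    if domain == "ambiguous":
        # Ambiguous: a one-liner, once per session, then stay quiet.
        if not session.get("caveated"):
            session["caveated"] = True
            return "Worth double-checking anything important here."
        return None
    # Everyday tasks: the caveat is friction. Skip it.
    return None
```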

Map the actual risks

Spend a day mapping the real risks in your product. Not the theoretical ones legal would worry about. The real ones, ranked by probability times impact.

Most products' real top three risks are some version of:

  1. The product confidently states something false (hallucination).
  2. The product reveals data it shouldn't (data leak across users, or from documents the user shouldn't see).
  3. The product takes an irreversible action incorrectly (sends an email, makes a change, completes a transaction).

Legal will worry about other things: copyright, defamation, regulatory exposure. Those are real but usually downstream of the operational risks above. Address operational ones first; legal ones get smaller automatically.

For each risk in your top 5, you should have:

  • A specific eval that tests for it.
  • A guardrail tier assignment.
  • A logged metric.
  • A documented response runbook.

If any of these is missing for any of your top 5, that's the gap.
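A sketch of the risk register as a data structure, with the four artifacts as fields and probability times impact as the ranking key (all field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One row of the register."""
    name: str
    probability: float            # rough estimate, 0..1
    impact: int                   # cost if it happens, 1..5
    eval_name: str | None = None  # the specific eval that tests for it
    tier: int | None = None       # guardrail tier assignment (0..3)
    metric: str | None = None     # the logged metric
    runbook: str | None = None    # the documented response runbook

    @property
    def score(self) -> float:
        return self.probability * self.impact

    def gaps(self) -> list[str]:
        required = ("eval_name", "tier", "metric", "runbook")
        return [f for f in required if getattr(self, f) is None]

def gap_report(risks: list[Risk]) -> None:
    # Rank by probability times impact; surface what's missing for the top 5.
    for r in sorted(risks, key=lambda r: r.score, reverse=True)[:5]:
        status = "ok" if not r.gaps() else "GAPS: " + ", ".join(r.gaps())
        print(f"{r.name} (score {r.score:.1f}): {status}")
```

Running gap_report over your register prints the top 5 by score and flags which of the four artifacts each one is missing.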

Pick one thing this week

Spend 90 minutes writing your trust and safety design doc.

  1. Open a doc. Title it "Trust and Safety Design."
  2. List your tiers (Tier 0 through Tier 3) and what behaviors fall into each.
  3. List your top 5 product risks, with their evals and runbooks. If you don't have an eval for any of them, mark them as gaps.
  4. Calculate your current refusal rate per surface (or note that you don't measure this and add it to the gap list).
  5. Schedule a 30-minute review with one person from legal and one from security. Walk them through the doc. Ask: "What am I missing?"

The doc is now the contract between product, legal, and security. With it, the conversation is "do we agree on this design?" Without it, the conversation is "you should have thought of that" in retrospect, after the incident.

A guardrail is a product decision. The PM who outsources it gets a product they didn't design and a customer experience they wouldn't have approved.


Frequently asked

Why is a guardrail a product decision, not a legal decision?

Because it shapes product experience more than your feature decisions do. What you refuse, warn on, silently block, allow with disclaimer, escalate. These decisions are product. They shape whether customers trust you. Take it back from legal. Own it.

What are the four guardrail tiers?

Tier 0: hard block (reserved for genuinely dangerous or illegal). Tier 1: soft warning (product proceeds, flags user). Tier 2: logged only (product proceeds normally, logged for review). Tier 3: allowed (no filter, default for most).

What's the most common AI product failure you see?

Overly cautious assistant. Won't summarize news because it might have politics. Won't answer medical questions because it might be advice. Won't write email because it might be persuasive. User concludes: useless. The churn is silent. Invisible if you only watch for harmful outputs.

What's refusal rate and why does it matter?

Percent of requests the product refuses per surface. Up week-over-week means model getting more cautious or guardrails over-firing. Down week-over-week means guardrails eroding or customers testing limits. Healthy product has stable, low, explainable rate. Add it to your dashboard like conversion.

What are the real top three product risks in most AI products?

Product confidently states something false. Product reveals data it shouldn't. Product takes irreversible action incorrectly. Legal worries about different things (copyright, defamation, regulatory). Address operational ones first. Legal ones get smaller automatically.
