Managing Risk in Production AI Systems

Why production AI needs risk measurement, not just predictions — and how to turn a model's uncertainty into enforceable risk policies.

Based on our talk at the MIT Enterprise AI Forum (April 2026).

Most AI tooling answers one question: what did the model predict? In a demo that’s enough. In production it isn’t — because the question that actually decides whether you can deploy is a different one: how much should you trust this particular prediction? A model that’s usually right but occasionally, silently wrong is hard to build on, since in the moment you can’t tell the good answers from the bad.

That gap is what strands so many projects between a promising pilot and a dependable production system. Once real decisions ride on the output, a wrong answer isn’t just a dip in an accuracy metric — it’s cost, eroded trust, and liability.

We think the missing layer is risk measurement: alongside every prediction, an estimate of the model’s own uncertainty — a sense of when it’s on familiar ground and when it’s guessing. Our framework, Capsa, wraps an existing model and adds exactly that.

But a number on its own doesn’t change anything. The leverage comes from turning uncertainty into action — risk policies that decide what the system does when confidence is high versus low. In practice they fall into three patterns:

Abstain: when a prediction (or an intermediate reasoning step) is too uncertain to trust, don’t act on it. Block the unsafe output or roll the step back.
Escalate: route unfamiliar or high-stakes inputs to a second stage, such as a human reviewer or a stricter check, instead of letting the model decide alone.
Adapt: use uncertainty to improve over time. Flag the most informative examples for labeling, and watch for distribution shift so you know when to retrain.
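
To make the three patterns concrete, here is a minimal Python sketch of how they might look as code. It is not Capsa's API: the names `RiskPolicy`, `Action`, and `select_for_labeling`, the threshold values, and the assumption that uncertainty arrives as a score between 0 and 1 are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"    # confidence is high: act on the prediction
    ESCALATE = "escalate"  # hand off to a reviewer or a stricter check
    ABSTAIN = "abstain"    # block the output or roll the step back


@dataclass(frozen=True)
class RiskPolicy:
    # Hypothetical thresholds on an uncertainty score in [0, 1];
    # real values would be tuned per application.
    escalate_above: float = 0.3
    abstain_above: float = 0.7

    def decide(self, uncertainty: float, high_stakes: bool = False) -> Action:
        # Abstain first: past this point the prediction is not trusted at all.
        if uncertainty > self.abstain_above:
            return Action.ABSTAIN
        # Escalate anything high-stakes or moderately uncertain.
        if high_stakes or uncertainty > self.escalate_above:
            return Action.ESCALATE
        return Action.PROCEED


def select_for_labeling(uncertainties: dict[str, float], k: int) -> list[str]:
    """Adapt: return the ids of the k most uncertain examples."""
    ranked = sorted(uncertainties, key=uncertainties.get, reverse=True)
    return ranked[:k]


policy = RiskPolicy()
print(policy.decide(0.1))                    # low uncertainty, routine input
print(policy.decide(0.1, high_stakes=True))  # confident, but high stakes
print(policy.decide(0.9))                    # too uncertain to act on
print(select_for_labeling({"a": 0.2, "b": 0.9, "c": 0.5}, k=2))
```

Keeping the policy separate from the model is the design choice that matters here: the thresholds can be reviewed, versioned, and tightened without retraining anything.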

A concrete example is automated support-ticket triage. When the model confidently identifies a routine issue, let it run end to end. When it’s handed an ambiguous ticket and its confidence drops, that’s the signal — surface the uncertainty, offer an alternate hypothesis, and escalate rather than guess. The payoff is a system that automates the easy majority and knows which cases to hand off.
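
The triage flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Capsa estimates uncertainty: it uses the normalized entropy of the classifier's class probabilities as a simple stand-in for an uncertainty score, and the function names, ticket categories, and the 0.5 cutoff are all hypothetical.

```python
import math


def normalized_entropy(probs: dict[str, float]) -> float:
    """Shannon entropy of a class distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy / math.log(len(probs))


def triage(probs: dict[str, float], max_uncertainty: float = 0.5) -> dict:
    """Route a ticket from its class probabilities (assumed to sum to 1)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    uncertainty = normalized_entropy(probs)
    result = {"label": ranked[0][0], "uncertainty": uncertainty}
    if uncertainty <= max_uncertainty:
        # Routine, confidently identified issue: run end to end.
        result["route"] = "auto"
    else:
        # Ambiguous ticket: surface the uncertainty and hand it off.
        result["route"] = "human_review"
        if len(ranked) > 1:
            result["alternate"] = ranked[1][0]  # second-best hypothesis
    return result


# Hypothetical classifier outputs for two tickets.
routine = {"password_reset": 0.95, "billing": 0.03, "bug_report": 0.02}
ambiguous = {"password_reset": 0.40, "billing": 0.35, "bug_report": 0.25}

print(triage(routine)["route"])    # auto
print(triage(ambiguous)["route"])  # human_review
```

Softmax entropy is only a rough proxy, since a model can be confidently wrong on inputs unlike its training data, which is why a dedicated uncertainty estimate matters in practice.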

That’s the shift we’re arguing for: stop treating a model as a black box that emits answers, and start treating it as a component whose reliability you can measure and govern. “Know what your model doesn’t know” isn’t a slogan — it’s the prerequisite for deploying AI where being wrong is expensive.

Capsa is documented here — request access to try it on your own models.