Confidence Is Not a Feeling

The Category Error Behind Every AI Trust Debate


I argued with three smart engineers about the wrong thing for forty-five minutes.

One said confidence in AI is subjective, a felt sense shaped by individual experience. Another said we should drop the fuzzy language entirely and stick to engineering terms: reliability, precision, recall, predictability. A third kept circling back to trust, because that’s the word the industry gave them.

Each was arguing for a necessary concept and treating it as sufficient. None of them could answer the only question that mattered: when an organization hands decision-making authority from a human to a system, what governs that decision?

Not feelings. Not system metrics alone. Something else. The industry has the letters. The word has yet to be constructed.

So I spelled it out.

Confidence is the measured basis for authority delegation decisions.

That’s what this piece is about. Not a rebuttal. Not a terminology preference. A definition missing from AI adoption conversations, and one whose absence keeps producing governance failures the industry should already recognize.

If you’re going to read the Confidence Engineering series, you need to understand why that word was chosen, what it actually means, and why the alternatives don’t work. This is where that starts.


Why Not Trust

Trust is relational. It belongs between humans. You extend trust based on character, history, and shared values. Trust can be betrayed. It implies vulnerability to someone else’s choices. None of that applies to a system.

But we apply it to systems anyway, and I know this because I’ve done it myself.

Early in my consulting career, I was deploying a flash storage array and a three-node Hyper-V cluster for a small firm. I had an assistant handling the storage zoning while I worked the compute side. Implementation went clean. Servers racked, powered, cabled. Storage provisioned and fiber-cabled to the SAN switches. Cluster created. VMs migrated. Everything looked right. I trusted the hardware. I trusted my assistant to zone the storage correctly as I’d instructed. I trusted the process.

That night, the building lost power during maintenance. I got a panicked call: systems were down. My first thought was “this is impossible, they must be mistaken. It’s DNS.” Back onsite, alone this time, I started running through checks. Layer 1, fine. Layer 2, fine. Then I got into the configurations. Only half the storage was zoned correctly; the second switch had been missed entirely. The cluster panicked on power restoration, VMs couldn’t recover cleanly, and everything sat in an error state.

I trusted the process when I should have tested the outcome. I felt good about the deployment when I should have verified the deployment. And when the evidence arrived that something was wrong, my first instinct was to reject it, because trust has inertia. Once you’ve decided something is solid, a lot has to happen before you’re willing to reconsider.

That’s the problem with trust. Not that it’s irrational. That it’s sticky. And sticky is dangerous when the environment requires constant re-evaluation.

Philosopher Mark Ryan demonstrated that AI fails every philosophical account of trust except the rational one, and his conclusion is precise: rational trust “is not really trust at all but reliance.” The word carries baggage the evidence-based version doesn’t earn.

We were set up to fail from the beginning. AI systems construct sentences that sound like a person constructed them. That single design decision distorted every subsequent conversation about how to evaluate, govern, and deploy them. We say the model “hallucinates” instead of saying it produced incorrect output. We say it “thinks” instead of saying it performed inference. There’s a difference between using evocative language to describe the human experience of working alongside AI, which I’ve done extensively, and using anthropomorphic language to describe the system’s behavior in engineering and governance contexts. The first is reflection. The second is misdirection.

Trust anthropomorphizes the decision. It turns an engineering question into a relationship question. And relationships don’t have SLOs.


Why Not Reliability

If the problem with “trust” is that it’s too human, the engineering instinct is to remove the human entirely. Describe the system in mechanical terms. Reliability. Predictability. Precision. Recall. Calibration.

Every one of those terms is necessary. None is sufficient.

Anyone who has spent time in a managed services operations center knows this instinctively. The customer calls in: “my application is just slow.” Operations looks at the dashboard: “but everything is green.” I’ve lived this conversation more times than I can count. CPU at 2%, system is idle waiting for work. Memory nominal. Disk latency within thresholds. Every metric says healthy. The customer says otherwise. And the customer is right, because the metrics were accurate and completely insufficient. Reliability said the system was healthy. The outcome said it was not.

That’s the gap. And it’s the same gap the AI industry is falling into.

A multi-reader study in radiology showed how dangerous this gap becomes when the stakes are human. When AI provided incorrect diagnostic results, radiologists’ false negative rates jumped from 2.7% without AI to 33% with incorrect AI. The system didn’t just fail on its own. It made the humans worse. The aggregate accuracy metrics were fine. The authority delegation decision was not.

System metrics tell you what the machine is doing. They don’t tell you what the organization should do about it. That gap is where the damage happens.


What Confidence Actually Is

I’ve been doing confidence engineering my entire career. I just didn’t have the vocabulary for it.

A disaster recovery system creates a snapshot. We can see whether the backup succeeded or failed. We’re confident in that measurement. But to advance to the next level of confidence, we test the backup by performing a restore and verifying actual functionality. The snapshot tells you the system did something. The restore tells you the something was useful. Two different confidence levels. Two different types of evidence. Two different authority decisions: “can we say we have a backup?” versus “can we say we can recover?”

Reliability would tell you the snapshot succeeded. It would not tell you whether to bet your business on it during an actual disaster.

Trust would tell you how the CTO feels about the DR plan on a Tuesday afternoon.

Confidence tells you what you’ve demonstrated, under what conditions, and what authority that evidence justifies.
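
If you want it concrete, here is the backup example as a minimal sketch in Python. Every name in it, the Evidence types, the “backup-exists” and “recovery-proven” levels, the authority claims, is something I invented for this illustration, not a real tool’s API. The point is the shape: the claim is computed from evidence, not asserted from feeling.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    SNAPSHOT_SUCCEEDED = "snapshot job reported success"
    RESTORE_VERIFIED = "restore performed and functionality checked"


@dataclass
class ConfidenceLevel:
    name: str
    requires: Evidence          # what must be demonstrated
    authority_justified: str    # what claim that evidence supports


# Two evidence types, two confidence levels, two authority claims.
# Ordered weakest to strongest.
LEVELS = [
    ConfidenceLevel("backup-exists", Evidence.SNAPSHOT_SUCCEEDED,
                    "we can say we have a backup"),
    ConfidenceLevel("recovery-proven", Evidence.RESTORE_VERIFIED,
                    "we can say we can recover"),
]


def authority_for(observed: set[Evidence]) -> str:
    """Return the strongest claim the observed evidence justifies."""
    justified = "no verified backup claim"
    for level in LEVELS:
        if level.requires in observed:
            justified = level.authority_justified
    return justified


# A snapshot alone never earns the recovery claim.
print(authority_for({Evidence.SNAPSHOT_SUCCEEDED}))
# -> we can say we have a backup
```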

That’s the distinction, and it matters operationally.

Measured means empirical. Observable criteria, defined in advance, tracked over time, producing data that informs the decision. In my time at managed service organizations, we didn’t ask customers to trust that their infrastructure was healthy. We showed them. Dashboards. Alerts. Metrics. Evidence.

Basis means it informs but doesn’t dictate. Two organizations looking at identical confidence data might make different authority delegation choices based on risk tolerance, regulatory environment, or operational maturity. The confidence data isn’t the decision. It’s the ground the decision stands on.

Authority delegation is the actual question nobody is asking. Not “is this AI good?” Not “do we trust this tool?” The question is: given what we can observe about this system’s behavior in this specific context, what level of authority should we delegate to it, and under what conditions should that change?

Decisions means this is active, not passive. Confidence increases when evidence supports expanded authority. It decreases when evidence suggests contraction. It has explicit thresholds for advancement and explicit conditions for rollback. It degrades when you stop measuring.
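
Here is a minimal sketch of that active loop, under assumptions I am making up for illustration: the three authority tiers, the success-rate measurement, the 0.99/0.95 thresholds, and the staleness cut-off are all placeholders a real organization would set for itself.

```python
from dataclasses import dataclass

# Illustrative authority tiers; real tiers are an organizational choice.
TIERS = ["suggest-only", "act-with-review", "act-autonomously"]


@dataclass
class ConfidencePolicy:
    advance_at: float   # measured success rate required to expand authority
    rollback_at: float  # measured success rate below which authority contracts

    def next_tier(self, current: str, success_rate: float,
                  sample_size: int) -> str:
        """Move one tier up or down based on measured evidence."""
        i = TIERS.index(current)
        if sample_size < 100:                 # stopped measuring: degrade
            return TIERS[max(i - 1, 0)]
        if success_rate >= self.advance_at:   # evidence supports expansion
            return TIERS[min(i + 1, len(TIERS) - 1)]
        if success_rate < self.rollback_at:   # evidence demands contraction
            return TIERS[max(i - 1, 0)]
        return current                        # hold at current authority


policy = ConfidencePolicy(advance_at=0.99, rollback_at=0.95)
print(policy.next_tier("act-with-review", success_rate=0.97, sample_size=500))
# -> act-with-review (good enough to keep, not enough to expand)
```

Notice what the sketch makes explicit: holding authority is a decision too, and it is made on evidence, not inertia.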

This isn’t new. In 1978, Sheridan and Verplank defined ten levels of automation authority. Aviation has operated on graduated authority models for decades. The question “how much authority does the autopilot have right now?” has a precise, measurable, situation-dependent answer in every cockpit. It is not binary. It is not permanent. It is not based on trust.

Yet the AI industry treats authority delegation as all-or-nothing. And the result is the same cycle I’ve watched repeat across two decades and more environments than I can count. The technology arrives. Early adopters move fast. Something breaks. Leadership overcorrects with blanket restrictions. Adoption stalls. Eventually someone builds the governance framework that should have existed from the beginning, and the cycle restarts from a lower energy state.

Cloud migration. DevOps transformation. SRE adoption. Every single one followed this arc. AI is following it now.

And every single time, it’s the practitioner left standing in the gap. Not the executive who said “we need AI.” Not the vendor who sold the platform. The architect, the engineer, the operator who has to govern a system tomorrow morning with vocabulary that was broken before they inherited it.

Google’s SRE practice understood the fix intuitively. Service Level Objectives are set by business owners based on criticality, not by engineers based on what’s technically achievable. The SLO doesn’t ask “is the system reliable?” It asks “is the system reliable enough for the authority we’ve given it?” That “enough” is a human organizational judgment informed by engineering data, not replaced by it.
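
In code, the whole principle fits in one comparison. The function name and the numbers below are illustrative, not Google’s implementation; what matters is where each input comes from.

```python
def reliable_enough(measured_availability: float, slo_target: float) -> bool:
    """Not "is it reliable?" but "is it reliable enough for the authority
    we've given it?" The target comes from the business owner's criticality
    judgment; only the measurement comes from engineering."""
    return measured_availability >= slo_target


# Same measurement, different authority decisions:
print(reliable_enough(0.9992, slo_target=0.999))   # True: authority holds
print(reliable_enough(0.9992, slo_target=0.9999))  # False: contract authority
```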

Confidence Engineering applies the same principle to AI. Not because it’s novel. Because the industry forgot to bring the discipline it already had into the one domain where it matters most.

Confidence is not a feeling. It’s not a system property. It’s the bridge between what you can observe and what you’re willing to delegate. And building that bridge is engineering work.


What Comes Next

The vocabulary was wrong. Now it isn’t.

Build from here.


This is the prequel to the Confidence Engineering series. Continue to Part 1: Why the Trust Discourse Is Sabotaging Itself, which examines the trust problem in depth. Part 2: The Practice introduces the framework. Part 3: Adoption Déjà Vu addresses the organizational preconditions that determine whether any of it sticks.

