Hallucination

Hallucination is the failure mode in which a large language model generates plausible-sounding output that is factually incorrect, confidently producing wrong answers rather than acknowledging uncertainty.
Hallucination is one of the defining limitations of foundation models and a particularly dangerous failure mode for AI SRE applications. Under uncertainty, LLMs are systematically biased toward generating fluent answers rather than declining to answer. In a chat application, a hallucinated answer is annoying but recoverable: the user notices and corrects it. In an incident response context, a confidently wrong root cause hypothesis is more dangerous than no hypothesis at all: it can trigger wrong rollbacks, wasted escalations, and 30+ minutes of investigation pursuing an explanation that turns out to be coincidental.
The mitigation is architectural, not prompt-based. No amount of "be careful" instructions to an LLM prevents hallucination at the rates required for production reliability work. What prevents it is grounding: every claim the system produces must be traceable to retrievable evidence in the customer's actual environment. A hypothesis without a supporting evidence chain isn't a hypothesis; it's a guess dressed up in confident language. Traversal's Causal Search Engine™ is built on this principle: investigations are evidence-graph traversals, not LLM completions. The model's role is to reason over evidence the system has already gathered, not to generate explanations from training data.
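As an illustration of the grounding principle, the following minimal Python sketch accepts a claim only if it cites evidence that actually resolves in an evidence store. It is not a description of how any particular product implements grounding; the names `Evidence`, `Claim`, and `partition_claims`, the evidence IDs, and the sample incident data are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Evidence:
    evidence_id: str
    source: str   # illustrative labels such as "deploys" or "metrics"
    summary: str


@dataclass
class Claim:
    text: str
    evidence_ids: List[str] = field(default_factory=list)


def partition_claims(
    claims: List[Claim], evidence_store: Dict[str, Evidence]
) -> Tuple[List[Claim], List[Claim]]:
    """Split claims into (grounded, ungrounded).

    A claim counts as grounded only if it cites at least one evidence ID
    and every cited ID resolves in the evidence store.
    """
    grounded, ungrounded = [], []
    for claim in claims:
        cited = claim.evidence_ids
        if cited and all(eid in evidence_store for eid in cited):
            grounded.append(claim)
        else:
            ungrounded.append(claim)
    return grounded, ungrounded


# Hypothetical evidence gathered by the system before the model reasons.
evidence_store = {
    "ev-1": Evidence("ev-1", "deploys", "checkout-service deployed at 14:02"),
    "ev-2": Evidence("ev-2", "metrics", "checkout-service 5xx rate rose at 14:03"),
}

# Hypothetical model output: one supported claim and two unsupported ones.
claims = [
    Claim("The 14:02 deploy preceded the 5xx spike.", ["ev-1", "ev-2"]),
    Claim("The database ran out of connections.", []),      # cites nothing
    Claim("A DNS change caused the outage.", ["ev-99"]),    # cites unknown evidence
]

grounded, ungrounded = partition_claims(claims, evidence_store)
for claim in ungrounded:
    print("Rejected as ungrounded:", claim.text)
```

The check is deliberately structural: it does not ask the model whether it is confident, only whether each claim can be followed back to evidence the system retrieved. A real system would also need to verify that the cited evidence supports the claim, which this sketch does not attempt.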
Operators should also understand three related model limitations: calibration (models are systematically overconfident, especially on novel failure patterns), drift (provider model updates can shift behavior without warning), and latency (inference adds time that compounds across multi-step investigations). Each requires its own mitigation in production AI SRE design. The cost of a confidently wrong AI SRE in production is higher than the cost of no AI SRE at all, and avoiding that cost is the difference between a system that works at enterprise scale and one that doesn't.
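Of the limitations listed above, calibration is the easiest to measure. One common measure is expected calibration error (ECE), which compares the confidence a system states with how often it turns out to be right. The sketch below is a plain-Python version under simple assumptions: confidences are numbers in [0, 1], outcomes are booleans from post-incident review, and the bin count and sample values are made up for illustration.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], outcomes: List[bool], n_bins: int = 5
) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: four hypotheses stated at 90% confidence, one was correct.
stated = [0.9, 0.9, 0.9, 0.9]
was_correct = [True, False, False, False]
overconfidence_gap = expected_calibration_error(stated, was_correct)
print(f"ECE: {overconfidence_gap:.2f}")
```

In this made-up sample the system claims 90% confidence but is right 25% of the time, so the gap is roughly 0.65. Recomputing the same measure after a provider model update is one simple way to notice drift in calibration.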