Linh Truong · LLM reliability · May 2026

AI Hallucination Solution

How I cut hallucinations in production: why models confabulate, what to measure, and the stack I actually ship (RAG, grounding, verification, structured outputs, guardrails, and evals).

Audience: Engineers · Architects · ML/AI builders
Stack: Model-agnostic (Claude · GPT · Gemini · Llama)
Revised: May 2026
Outcome: Reliable, grounded, observable AI

Hallucination is not a bug you patch once—it is what next-token prediction does when nothing forces grounding. I rely on retrieval, grounding, verification, structured generation, and eval loops together. What follows is my working playbook (2025–2026): failure taxonomy, root causes, defense layers, architecture patterns, code recipes, eval frameworks, monitoring, and a decision tree for picking the right fix per use case.

01 · Foundation

What Hallucination Really Is

A hallucination is any model output that is fluent and confident but factually wrong, unfaithful to context, or internally inconsistent. It is not a malfunction — it is what a probability-maximizing language model does when it has no way to know it should stop.

~3–27%: Hallucination rate across frontier LLMs on grounded QA (HHEM, 2025)
~50% ↓: Typical reduction with well-designed RAG + grounding
< 1%: Achievable with retrieval + verifier + structured I/O

Working definition. An output is a hallucination if at least one of the following is true:
(a) it asserts a fact contradicted by an authoritative source; (b) it asserts a fact not entailed by the provided context (when context was supposed to be the source of truth); (c) it cites, quotes, or links to something that does not exist; (d) it contradicts itself within the same response.
[Diagram: user input (often ambiguous or underspecified, e.g. "Summarize this PDF", "What's our refund policy?", "Write tests for this fn") and a source of truth (docs, DB, API, code) feed the LLM, which predicts the next token to maximize probability. The output is either a faithful answer (entailed by source, cited, verifiable) or a hallucination (fluent, confident, wrong: an invented fact, source, or API). User trust rises with faithful answers and falls with hallucinations, and the loss is asymmetric.]
Figure 1 — Two divergent paths from the same model: grounded faithfulness vs. fluent fabrication.
02 · Taxonomy

Taxonomy of Hallucinations

Different failure modes need different defenses. Misclassifying the type is the most common engineering mistake.

Type | What it is | Example | Best defense
Factual (closed-domain) | Asserts a wrong real-world fact | "Einstein won the Nobel in 1925" (was 1921) | RAG · web search · verifier
Faithfulness (intrinsic) | Output contradicts provided context | Summary says "60-day refund window" when the document says 30 days | Strong grounding prompt · NLI checker
Extrinsic | Output adds info not in context (even if true) | Inventing dates the doc never mentioned | "Only use given context" · citation enforcement
Reference | Invents citations, URLs, paper titles, IDs | Fake DOI, made-up court case, ghost npm package | Tool call to verify existence · whitelist
Logical | Internally inconsistent reasoning | "X > Y, Y > Z, therefore Z > X" | Chain-of-thought · self-consistency
Arithmetic | Calculation errors stated confidently | Tax math, unit conversion, code metrics | Code interpreter / calculator tool
Temporal | Stale facts post-cutoff or wrong "as of" dates | "The current CEO is..." (changed last month) | Date-aware retrieval · explicit cutoff
Identity | Confuses similar entities | Mixes up two people with similar names | Entity disambiguation · grounded retrieval
Code (package hallucination) | Invents libraries, APIs, function signatures | import foobar (doesn't exist) | Static check · linter · sandbox run
Schema | Returns JSON that violates required schema | Wrong types, missing fields, hallucinated keys | Constrained decoding · validation + retry

Most dangerous in production: reference hallucinations (legal, medical, scientific apps) and code hallucinations (supply-chain risk — "slopsquatting" attacks register fake packages models suggest). Defend explicitly.
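
One way to act on the package-hallucination risk is a static check that pulls every imported top-level module out of generated Python and compares it against an allowlist before anything is installed or run. The sketch below is illustrative only: `ALLOWED_PACKAGES` and `unknown_imports` are hypothetical names, and a real allowlist would more likely come from a lockfile or an internal registry.

```python
import ast

# Hypothetical allowlist; in practice derive it from a lockfile or internal registry.
ALLOWED_PACKAGES = {"json", "math", "os", "requests", "numpy"}

def unknown_imports(source: str, allowed: set[str] = ALLOWED_PACKAGES) -> set[str]:
    """Return top-level module names imported by `source` that are not allowlisted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            found.add(node.module.split(".")[0])
    return found - allowed

generated = "import numpy as np\nimport foobar\nfrom os.path import join\n"
print(sorted(unknown_imports(generated)))  # ['foobar']
```

Anything returned by the check can be blocked or routed to review instead of being passed to a package installer.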
03 · Root Causes

Root Causes — Why LLMs Hallucinate

Hallucinations have at least nine distinct causes. Each one needs a different mitigation.

CAUSE 1

Training objective

Next-token cross-entropy rewards plausibility, not truth. Saying "I don't know" was historically penalized; that's now actively being changed in modern RLHF.

CAUSE 2

Training data noise

The web contains lies, satire, outdated info, contradictions. Models memorize the distribution, not a fact-checked encyclopedia.

CAUSE 3

Knowledge cutoff & staleness

Any fact that changed after training is unknown. Without retrieval, the model fills the gap with confident guesses.

CAUSE 4

Long-tail facts

Rare entities (a niche library, a small company, a specific patient ID) are under-represented; the model interpolates from neighbors.

CAUSE 5

Prompt ambiguity

Vague or under-specified questions force the model to assume. Assumption + confidence = fabrication.

CAUSE 6

Decoding stochasticity

High temperature, top-p sampling, and exposure bias amplify drift across long generations.

CAUSE 7

Sycophancy / RLHF artifacts

Models trained on human preference can favor confident, agreeable answers over correct uncertain ones.

CAUSE 8

Context drift

In long conversations, the model forgets or compresses prior turns and may invent continuity.

CAUSE 9

Tokenization & arithmetic

Numbers, dates, and code identifiers tokenize awkwardly; subword splits cause silent errors.

"Language models are designed to be plausible. Truth is an emergent property — and only when we engineer it to be."— Engineering folklore, 2025
04 · The Stack

The Defense-in-Depth Stack

No single technique solves hallucination. Production systems layer 4–7 defenses. Below is the canonical stack.

L7 · Human Oversight & Feedback (review, thumbs, escalation)
L6 · Monitoring, Evals & Continuous Improvement
L5 · Guardrails (input filters, output filters, policy checks)
L4 · Verification (verifier model, NLI, citation check, self-consistency)
L3 · Generation (CoT · structured outputs · constrained decoding · tools)
L2 · Grounding (RAG · retrieval · citations · "use only this context")
L1 · Model & Prompt (right model · clear instructions · low temperature)
Figure 2 — The 7-layer defense-in-depth stack. Each layer cuts the residual error of the layer below.
Rule of thumb: each well-implemented layer roughly halves the hallucination rate, so 5 layers ≈ 2^5 = 32× fewer hallucinations than a bare LLM call, assuming the layers fail independently.
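
The arithmetic behind this rule of thumb is a product of residual error rates. A small sketch makes it explicit; `residual_rate` is a hypothetical helper, and the 20% base rate is an illustrative number, not a benchmark result.

```python
from math import prod

def residual_rate(base_rate: float, reductions: list[float]) -> float:
    """Hallucination rate left after stacking layers.

    Each entry in `reductions` is the fraction of remaining errors a layer
    removes (0.5 means "halves the rate"). Assumes layers act independently,
    which real layers only approximate.
    """
    return base_rate * prod(1.0 - r for r in reductions)

base = 0.20                              # illustrative bare-LLM rate
after = residual_rate(base, [0.5] * 5)   # five layers, each halving the rate
print(f"{after:.4%} residual, {base / after:.0f}x fewer than the bare call")
```

Correlated layers (for example, a verifier that shares blind spots with the generator) remove less than this product suggests, so measured rates are the ones to trust.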
05 · L2 Grounding

Retrieval-Augmented Generation (RAG)

If the answer must come from your data, the model must see your data at inference time. RAG is the single highest-leverage technique against hallucination.

A · Offline Ingestion: Source Docs → Clean & Chunk (200–800 tokens) → Embed (+ metadata) → Vector DB (pgvector · Qdrant · etc.)
B · Online Query: User Query → Rewrite/Expand (HyDE · multi-query) → Hybrid Retrieve (vector + BM25 + filter; the Vector DB is queried by embedding similarity) → Rerank (cross-encoder) → Build Context (top-k + citations) → LLM Generate (with citations)
C · Verification: Citation check → NLI / faithfulness → Self-check / Critic → Retry / fallback
Figure 3 — The full RAG pipeline: ingestion, query-time retrieval, and post-generation verification.

Ingestion best practices

  • Chunk by structure (headings, paragraphs), not fixed length
  • 200–800 tokens per chunk; overlap 10–20%
  • Store rich metadata: source, section, date, author, ACL
  • De-duplicate near-duplicates with MinHash/LSH
  • Pre-compute summaries or hypothetical questions for each chunk
  • Pick embedding model per language & domain; benchmark on your data
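
The chunking bullets can be sketched as a paragraph-packing loop. This is a minimal illustration under stated assumptions: `chunk_paragraphs` is a hypothetical helper, paragraphs are separated by blank lines, and token counts are approximated by whitespace-separated words rather than a real tokenizer.

```python
def chunk_paragraphs(text: str, max_tokens: int = 400, overlap: float = 0.15) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    The tail of each finished chunk is repeated at the start of the next one
    as overlap. A single paragraph longer than max_tokens is kept whole, and
    the carried overlap can push a chunk slightly over the budget.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            keep = int(max_tokens * overlap)           # words carried forward
            current = current[-keep:] if keep else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "\n\n".join(" ".join(f"p{i}w{j}" for j in range(10)) for i in range(3))
for chunk in chunk_paragraphs(doc, max_tokens=20, overlap=0.2):
    print(len(chunk.split()), "words")
```

Splitting on headings first and then on paragraphs follows the same pattern; only the boundary detection changes.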

Retrieval best practices

  • Hybrid search: dense (vector) + sparse (BM25) outperforms either alone
  • Rerank with a cross-encoder (e.g., Cohere Rerank, bge-reranker)
  • Query rewriting: HyDE, multi-query, step-back, decomposition
  • Recency & ACL filters applied before ranking
  • Confidence threshold: if top score < X → "I don't know"
  • Context limit: top 4–10 chunks; quality > quantity (lost-in-the-middle)
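# Minimal RAG pseudocode (model-agnostic).
# embed, vector_db, bm25, reciprocal_rank_fusion, rerank, llm,
# format_with_citations, citations_valid, SYSTEM, and THRESHOLD are placeholders.
ABSTAIN = "I don't have enough information."

def answer(question, k=8):
    q_emb   = embed(question)
    dense   = vector_db.search(q_emb, k=k)
    sparse  = bm25.search(question, k=k)
    fused   = reciprocal_rank_fusion(dense, sparse)
    top     = rerank(question, fused)[:5]

    # Abstain when retrieval returns nothing or the best hit is weak.
    if not top or top[0].score < THRESHOLD:
        return {"answer": ABSTAIN, "sources": []}

    context = format_with_citations(top)
    prompt  = SYSTEM + f"""
    Use ONLY the context below. Cite sources as [#].
    If the answer is not in the context, say so.
    Context: {context}
    Question: {question}
    """
    draft = llm(prompt, temperature=0.2)

    # L4 verification: an explicit check, because assert is stripped under python -O.
    if not citations_valid(draft, top):
        return {"answer": ABSTAIN, "sources": []}
    return {"answer": draft, "sources": top}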
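
The pseudocode calls `reciprocal_rank_fusion` without defining it. One common formulation scores each document as the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as the usual default. The sketch below assumes each ranking is a plain list of document IDs, which is a simplification of the scored objects used above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # illustrative vector-search order
sparse = ["d1", "d9", "d3"]   # illustrative BM25 order
print(reciprocal_rank_fusion(dense, sparse))  # ['d1', 'd3', 'd9', 'd7']
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale.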
06 · Faithfulness

Grounding & Citations

RAG only helps if the model actually uses the retrieved text. Grounding makes that explicit, auditable, and verifiable.

RULE

"Only use the context"

State the constraint as a hard rule in the system prompt:

# System prompt fragment
You answer ONLY from the provided context.
If the context does not contain the answer,
reply: "I don't have that information."
Never use prior knowledge. Cite sources as [n].
RULE

Inline citations

Require every claim to be tagged with the source chunk ID. Validate post-generation:

  • Every [n] resolves to a real retrieved chunk
  • The cited chunk contains text supporting the claim (NLI check)
  • Reject / regenerate if any citation fails
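
The first of these checks is cheap to implement. A minimal sketch, assuming citations appear as 1-based `[n]` markers and that `num_chunks` chunks were retrieved; `invalid_citations` is a hypothetical helper, and the NLI support check from the second bullet is deliberately left out.

```python
import re

def invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited indices like [3] that do not map to a retrieved chunk (1-based)."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)

draft = "Refunds take 5 days [1]. Shipping is free [4]."
bad = invalid_citations(draft, num_chunks=3)
print(bad)                                   # [4]
print("regenerate" if bad else "accept")     # regenerate
```

An answer with no `[n]` markers at all passes this check, so a separate rule for uncited claims is still needed.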
RULE

Provenance metadata

Return for every claim: source URL, last-updated date, author, confidence. Make the answer auditable end-to-end.

RULE

Abstain gracefully

An "I don't know" with a suggested next step does far more for user trust than a confident wrong answer. Tune your prompt and your evals to reward abstention.

Frontier feature: Modern APIs (Claude, GPT, Gemini) now offer built-in citations and grounded-generation modes. Use them where available — they reduce custom verification work.
07 · L1 Prompting

Prompt Engineering for Truthfulness

Even with RAG, prompt design controls how the model handles uncertainty, ambiguity, and edge cases.

High-leverage prompt patterns

  • Role + constraints: define expertise, scope, refusal rules
  • Explicit "I don't know" license: give the model permission to abstain
  • Step-back prompting: ask abstract question first, then specific
  • Chain-of-thought: "think step by step" reduces logic errors
  • Few-shot grounded examples: show the citation format
  • Negative examples: include what not to do
  • Temperature 0–0.3 for factual; 0.7+ for creative
  • Self-critique step: "Now check your answer against the sources."

Anti-patterns that cause hallucination

  • "You are an expert in everything" → invites overconfidence
  • Vague tasks like "tell me about X" without constraints
  • Mixing instructions with untrusted user input (prompt injection)
  • Asking for exact numbers without giving the source
  • Long lists of "always do X, never do Y" — the model forgets
  • Implicit assumptions ("write the code") with no schema
# A robust grounded-QA system prompt template
You are a careful research assistant.

Rules:
1. Answer ONLY using the CONTEXT below.
2. Each claim must end with a citation like [3].
3. If the context is insufficient, reply exactly:
   "I don't have enough information to answer that."
4. Never invent URLs, names, numbers, or dates.
5. If asked about something outside the context, decline.

Output format:
<answer>
Concise answer with [n] citations.
</answer>
<sources>
[1] short title — URL
[2] ...
</sources>

CONTEXT:
{retrieved_chunks}

QUESTION: {user_question}
08 · L3 Generation

Structured Outputs & Constrained Decoding

If the output shape is wrong, downstream code crashes. Constrained decoding makes invalid outputs impossible, not merely unlikely.

TECHNIQUE

JSON Schema / Tool Schema

Define the expected output as a JSON Schema. Several modern APIs offer strict structured-output modes that enforce the schema during decoding (token masking), so the response always parses.

TECHNIQUE

Grammar-constrained

llama.cpp, Outlines, Guidance: restrict token sampling to a formal grammar (a regular expression, or a context-free grammar written in a BNF-style notation).

TECHNIQUE

Type-safe parsers

Pydantic, Zod, Instructor: parse → validate → retry on failure. Catch schema hallucinations before they ship.

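# Example: structured output with retry-on-fail (Pydantic v2).
# llm.generate and escalate_to_human are placeholders for your own client and handler.
from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):
    answer: str
    confidence: float = Field(ge=0, le=1)
    sources: list[str]
    is_uncertain: bool

def generate_answer(prompt, max_attempts=3):
    for _ in range(max_attempts):
        raw = llm.generate(prompt)                  # raw JSON text from the model
        try:
            return Answer.model_validate_json(raw)  # parse + validate in one step
        except ValidationError as err:
            # Feed the validation error back so the next attempt can fix it.
            prompt += f"\n\nYour last reply failed validation:\n{err}\nReturn valid JSON only."
    return None                                     # retries exhausted

result = generate_answer(prompt)
if result is None or result.confidence < 0.6 or result.is_uncertain:
    escalate_to_human(result)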
Why it works: Schema constraints prevent ~all syntactic hallucinations (missing fields, wrong types, hallucinated keys). They don't prevent semantic hallucination — that's L4's job.
09 · L3 Generation

Tool Use & Function Calling

For anything the model is bad at — math, current data, exact facts, code execution — give it a tool and let the deterministic system handle it.

Tools that eliminate whole hallucination classes

  • Calculator / Python REPL → arithmetic, units, stats
  • SQL / API call → live, authoritative data
  • Web search → recency, fresh facts
  • Code interpreter → run code before answering
  • Document retrieval → grounded answers
  • Validator endpoints → check that a DOI / URL / package exists
  • Calendar / time API → eliminate "today is..." errors

Design rules for tools

  • Each tool has a single, well-described purpose
  • Inputs strictly typed via JSON Schema
  • Outputs include status + data + error
  • Idempotent; safe to retry
  • Side-effecting tools require confirmation (write, delete, pay)
  • Prefer many small tools to one mega-tool
  • Return enough context that the model can reason about results
# Tool spec (Anthropic-style "input_schema"; OpenAI and Gemini take the same JSON Schema under "parameters")
{
  "name": "lookup_customer",
  "description": "Fetch a customer by ID from the CRM.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": {"type": "string", "pattern": "^c_[a-z0-9]+$"}
    },
    "required": ["customer_id"]
  }
}
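
On the receiving side, the design rules (strictly typed inputs, a status + data + error envelope, safe to retry) might look like the sketch below. It is an assumption-laden illustration: `FAKE_CRM` stands in for a real CRM client, and the handler name simply mirrors the tool spec.

```python
import re

CUSTOMER_ID = re.compile(r"^c_[a-z0-9]+$")             # same pattern as the tool spec
FAKE_CRM = {"c_42": {"name": "Ada", "plan": "pro"}}    # stand-in for a real CRM

def lookup_customer(args: dict) -> dict:
    """Validate arguments, then return a status + data + error envelope."""
    customer_id = args.get("customer_id")
    if not isinstance(customer_id, str) or not CUSTOMER_ID.fullmatch(customer_id):
        return {"status": "error", "data": None, "error": "invalid customer_id"}
    record = FAKE_CRM.get(customer_id)
    if record is None:
        return {"status": "not_found", "data": None, "error": f"no customer {customer_id}"}
    return {"status": "ok", "data": record, "error": None}

print(lookup_customer({"customer_id": "c_42"})["status"])   # ok
print(lookup_customer({"customer_id": "42"})["status"])     # error
```

Returning an explicit `not_found` status gives the model something to reason about instead of leaving it to invent a customer record.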
10 · L4 Verification

Verifier & Critic Models

A second model — or the same model in a different role — reads the output and checks it before it reaches the user.

METHOD

NLI / Entailment check

For every claim in the answer, ask: does the source context entail this? (DeBERTa-NLI, Vectara HHEM, or a custom model.)

METHOD

LLM-as-judge

Prompt a strong model to score faithfulness, relevance, and completeness on a 0–5 scale. Use the same rubric for offline evals and as a runtime gate.

METHOD

Self-check / Reflexion

Ask the generating model: "Re-read your answer. Identify any unsupported claim." Then regenerate or strip.

METHOD

Citation existence check

Resolve each citation: does the URL respond 200? Does the DOI exist? Does the package name exist on PyPI/npm?

METHOD

Cross-model consensus

Ask 2–3 different models the same grounded question. Disagreement → escalate or abstain.

METHOD

Numeric & code sanity

Re-execute math/code in a sandbox; check totals, types, returns. Mismatch → retry.

# Faithfulness check via NLI
# (extract_claims, nli_model, log_hallucination, strip_or_regenerate are placeholders)
for claim in extract_claims(answer):
    # A claim is supported if at least one retrieved chunk entails it.
    supported = any(
        nli_model.entails(premise=chunk.text, hypothesis=claim)
        for chunk in retrieved_chunks
    )
    if not supported:
        log_hallucination(claim)
        answer = strip_or_regenerate(answer, claim)
Latency note: verifiers add 100–500ms. Run them in parallel with streaming, or asynchronously for low-risk content (then correct in a follow-up message if needed).
11 · Reasoning

Self-Consistency, CoT & Reasoning Models

Many "factual" errors are actually reasoning errors. Reasoning-time techniques are the cheapest accuracy gain you can buy.

Techniques

  • Chain-of-Thought (CoT): "Think step by step" — exposes reasoning, often surfaces its own bugs
  • Self-Consistency: sample N answers, take the majority vote
  • Tree of Thoughts: branch and evaluate multiple paths
  • Reflexion: model critiques its own answer, then revises
  • Self-RAG: model decides when to retrieve, then critiques
  • Reasoning models (Claude with extended thinking, GPT o-series, Gemini Thinking): native long-form chain-of-thought with verification — significant hallucination drop on complex tasks
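
Self-consistency is the easiest of these to sketch. The example below assumes a `sample` callable that returns one model answer per call (here replaced by a canned iterator) and adds a `min_agreement` threshold so that weak majorities lead to abstention; that threshold is my assumption, not part of the basic majority-vote recipe.

```python
from collections import Counter
from typing import Callable, Optional

def self_consistent_answer(
    sample: Callable[[], str], n: int = 5, min_agreement: float = 0.6
) -> Optional[str]:
    """Sample n answers (n >= 1) and return the majority answer, or None to abstain."""
    votes = Counter(sample().strip().lower() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer if count / n >= min_agreement else None

canned = iter(["42", "42", "41", "42", "42"])       # stand-in for five model samples
print(self_consistent_answer(lambda: next(canned)))  # 42
```

Exact string matching only works for short, closed-form answers; free-text answers need normalization or a judge to decide which samples agree.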

When to use which

  • Simple lookup: direct prompt, no CoT
  • Multi-step: CoT
  • Math / logic: CoT + tool (calculator)
  • High-stakes (legal, medical, finance): reasoning model + verifier + human review
  • Real-time UX-sensitive: non-reasoning model + strong RAG, save reasoning for retry
2026 reality: in my experience, combining a reasoning model with RAG can push hallucination on grounded QA below 1%. Treat that as a target to confirm on your own eval set, not a guarantee. The cost per query is higher, but for high-stakes flows the ROI is clear.
12 · Model

Fine-Tuning, RLHF & DPO for Truthfulness

When prompting + RAG aren't enough, change the model. Fine-tuning teaches behavior; preference optimization teaches taste.

Approach | What it does | Data needed | Best for
SFT (supervised fine-tuning) | Imitate ideal answers | 1k–50k examples | Format, tone, domain vocabulary
Instruction tuning | Generalize to instructions | 10k+ varied tasks | Base model → assistant
RLHF (reinforcement learning from human feedback) | Learn human preference | 10k+ preference pairs | Polish, helpfulness
DPO / IPO / KTO | Same goal as RLHF, no RL loop | Pairs (chosen, rejected); KTO also accepts unpaired good/bad labels | Cheaper alignment
RLAIF | AI-generated preference | Synthetic pairs + judge | Scale where humans are slow
Truthfulness fine-tuning | Reward abstention over confident-wrong | Curated I-don't-know examples | Reduce confident hallucinations
Distillation | Small model imitates a larger one | Teacher outputs | Cost reduction, latency

Trap: fine-tuning on hallucinated synthetic data amplifies hallucination. Curate carefully; include explicit "abstain" examples; evaluate on held-out factuality benchmarks before shipping.
13 · Agents

Agentic Patterns & Multi-Step Reliability

In agent loops, errors compound. At 95% per-step accuracy, a 10-step run succeeds end-to-end only about 60% of the time (0.95^10 ≈ 0.60). Reliability engineering matters more than raw intelligence.
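
The compounding is plain exponentiation, and inverting it shows how accurate each step has to be for a given end-to-end target. Both helpers below are hypothetical and assume steps fail independently.

```python
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_accuracy ** steps

def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach an end-to-end target over `steps` steps."""
    return target ** (1.0 / steps)

print(f"{end_to_end_success(0.95, 10):.1%}")       # 59.9%
print(f"{required_step_accuracy(0.95, 10):.2%}")   # 99.49%
```

In other words, a 10-step agent that should succeed 95% of the time needs each step to be right about 99.5% of the time, which is why critics, retries, and checkpoints pay off.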

Reliability patterns

  • Plan-then-execute: separate planning from action
  • ReAct loop: reason → act → observe → re-reason
  • Critic / Reviewer agent: second model audits each step
  • Bounded retries: max N attempts, then escalate
  • Checkpointing: persist state; resume on failure
  • Tool whitelisting: agents can only call approved tools
  • Cost & loop budget: hard cap on tokens, steps, time
  • Human-in-the-loop on irreversible actions

Multi-agent failure modes

  • Echo chambers: agents agree with each other's hallucinations
  • Context bleed: noise from one agent pollutes others
  • Infinite loops: tool returns trigger same call again
  • Goal drift: agent "rewrites" the goal mid-run
  • Tool spam: calling expensive APIs unnecessarily
  • Prompt injection: retrieved content overrides system prompt
PLAN → ACT (tool) → OBSERVE → CRITIQUE → ANSWER or RETRY (budget enforced; retry with corrections, bounded)
Figure 4 — A reliable ReAct loop with a critic step and a bounded retry — the minimum viable agent pattern.
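
The critic-plus-bounded-retry part of this loop can be sketched in a few lines. This is a simplified illustration: planning and tool calls are folded into a single `act` callable, and `act`, `critique`, and the stub implementations are hypothetical stand-ins for model calls.

```python
from typing import Callable, Optional

def run_with_critic(
    act: Callable[[str], str],                  # produces a draft from task + feedback
    critique: Callable[[str], Optional[str]],   # None if acceptable, else a correction
    task: str,
    max_attempts: int = 3,
) -> dict:
    """Act, critique, and retry with corrections; escalate when the budget runs out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        draft = act(task + feedback)
        problem = critique(draft)
        if problem is None:
            return {"status": "ok", "answer": draft, "attempts": attempt}
        feedback = f"\nPrevious attempt was rejected: {problem}"
    return {"status": "escalate", "answer": None, "attempts": max_attempts}

# Stubs: the "model" answers wrongly until it sees critic feedback.
stub_act = lambda prompt: "4" if "rejected" in prompt else "5"
stub_critic = lambda draft: None if draft == "4" else "2 + 2 is not 5"
print(run_with_critic(stub_act, stub_critic, "What is 2 + 2?"))
```

The hard cap on attempts is what turns a potentially infinite loop into an escalation path.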
14 · L5 Guardrails

Guardrails & Safety Layers

Hallucination, prompt injection, and policy violations share defenses. Treat the LLM as untrusted code and wrap it in filters.

Input guardrails

  • PII redaction before sending to model
  • Prompt-injection detection (Rebuff, Lakera, NeMo Guardrails)
  • Topic / scope classifier ("off-domain" → refuse)
  • Rate & cost limits per user
  • Strip / quote untrusted content (RAG chunks, web pages)

Output guardrails

  • Policy classifier (toxicity, secrets, hate, illegal)
  • Schema validation + retry
  • Faithfulness / NLI gate
  • Citation existence check
  • Numeric/code re-execution
  • PII leakage scanner
  • Confidence threshold → abstain
Open source toolkits: NVIDIA NeMo Guardrails · Guardrails AI · Llama Guard · Rebuff · Microsoft Presidio (PII) · Vectara HHEM (hallucination detection). Compose them, don't build from scratch.
15 · L6 Evals

Evaluation Frameworks

You cannot improve what you cannot measure. The biggest difference between a hobby project and a production AI system is the eval suite.

Metric | What it measures | How to compute
Faithfulness | Claims supported by source context | NLI per claim / RAGAS / LLM-judge
Answer relevance | Answer addresses the question | Embedding similarity Q ↔ A · LLM-judge
Context precision/recall | Retrieved chunks are relevant & sufficient | Per-chunk relevance labels
Groundedness | Output stays within context | HHEM · TRUE · attributable-to-source
Exact match / F1 | Closed-form QA accuracy | Compare to gold answer
Hallucination rate | % outputs with at least one unsupported claim | Human or LLM-judge
Abstention rate | % of unanswerable Qs the model declines | On adversarial test set
Calibration | Confidence matches accuracy | Reliability diagram / ECE
Latency & cost | p50/p95 ms, $ / request | Telemetry

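
Of these metrics, calibration is the one most often left uncomputed. A minimal sketch of expected calibration error (ECE) with equal-width confidence bins follows; the function name and the choice of 10 bins are assumptions, and the inputs are whatever confidence scores and correctness labels your eval set provides.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], bins: int = 10
) -> float:
    """ECE: weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * bins), bins - 1)    # conf == 1.0 lands in the top bin
        buckets[index].append((conf, ok))
    ece = 0.0
    for bucket in buckets:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece

# Four answers at 90% stated confidence, three of them right: overconfident by 0.15.
print(round(expected_calibration_error([0.9] * 4, [True, True, True, False]), 3))  # 0.15
```

A value near 0 means stated confidence tracks accuracy, which is what makes a confidence threshold usable as an abstention gate.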
FRAMEWORK

RAGAS

Open-source RAG eval: faithfulness, answer relevance, context precision/recall. Standard starting point.

FRAMEWORK

HHEM (Vectara)

Hughes Hallucination Evaluation Model, with a public leaderboard comparing top models on faithfulness.

FRAMEWORK

TruthfulQA / SimpleQA / FActScore

Open-domain factuality benchmarks. Useful for model selection and regression testing.

FRAMEWORK

DeepEval / promptfoo / OpenAI Evals

Test-driven eval frameworks. Pin behavior in CI; gate deploys on score drop.

FRAMEWORK

LangSmith / Braintrust / Helicone

Trace + eval observability. Run evals on production traffic samples.

FRAMEWORK

Custom golden set

50–500 examples specific to your domain. The single most valuable eval asset you can build.

Rule I ship by: nothing goes out without an eval suite in CI. Every prompt or model change → diff scores before merge.
16 · L6 Observability

Observability & Monitoring in Production

Even with perfect offline evals, real traffic surfaces new failure modes. Treat AI like distributed systems: trace everything, alert on drift.

What to log per request

  • Full prompt (system + user + retrieved context)
  • Model, version, parameters, seed
  • Retrieval candidates with scores
  • Tool calls, args, results
  • Final output + intermediate steps
  • Verifier scores + abstention decisions
  • User feedback (thumbs, edits, regenerate)
  • Latency & cost breakdown

What to alert on

  • Faithfulness score drops > X% week-over-week
  • Abstention rate spikes (retrieval broken?)
  • Citation-fail rate increases
  • p95 latency > SLO
  • Cost per request grows unexpectedly
  • Output schema validation failures
  • Guardrail trips (injection, policy)
  • Negative user feedback rate
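
The first alert in this list reduces to a comparison of two weekly means. A minimal sketch, assuming per-request faithfulness scores in [0, 1] are already being logged; `faithfulness_alert` and the 5% default for the "X%" threshold are hypothetical choices.

```python
from statistics import mean

def faithfulness_alert(
    last_week: list[float], this_week: list[float], max_drop: float = 0.05
) -> bool:
    """True when mean faithfulness fell by more than max_drop (relative) week over week."""
    if not last_week or not this_week:
        return False                       # not enough data to compare
    baseline, current = mean(last_week), mean(this_week)
    if baseline == 0:
        return False                       # avoid dividing by zero
    return (baseline - current) / baseline > max_drop

print(faithfulness_alert([0.92, 0.90, 0.91], [0.80, 0.82, 0.79]))  # True
print(faithfulness_alert([0.92, 0.90, 0.91], [0.90, 0.91, 0.90]))  # False
```

The same shape works for citation-fail rate or schema-failure rate with the comparison direction flipped.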
Closed loop: Sample 1–5% of production traffic, label it (LLM-judge + occasional human), feed it back into your eval set and your fine-tuning data. This is how systems get reliably better over months.
17 · Architecture

Reference Architectures

Three battle-tested architectures, ordered by use case complexity.

A · Simple Grounded QA (chatbot over docs): Question → Hybrid Retrieve → Rerank → LLM + Citations → NLI Verifier → Answer / Abstain
B · Agentic Workflow (tools + plan + critic): Goal → Planner → Executor → Tools (RAG, API) → Critic → Retry? → Result
C · High-Stakes (legal / medical / finance): Query → Auth + ACL Filter → RAG + Rerank → Reasoning Model → Multi-Verifier → Human Review
Observability (shared by all three architectures): traces · evals · feedback loop · drift alerts · cost & latency telemetry
Pattern selection follows risk × complexity. Start simple. Add layers when evals say you need them.
Figure 5 — Three reference architectures ordered by stakes: simple grounded QA · agentic · high-stakes with human-in-the-loop.
18 · Choose

Decision Tree by Use Case

When you don't know which technique to reach for, ask these questions in order.

Q1 · Does the answer require fresh / private data?
  YES → Use RAG (Section 5)
  NO → Q2 · Does it need exact math / live data?
    YES → Add tools / function calls (§9)
    NO → Prompt + low temp + eval
Q3 · Is this high stakes?
  NO → + NLI verifier + citations
  YES → + reasoning model + multi-verifier + human-in-loop
Always: structured output · eval suite in CI · production observability · feedback loop. These four are non-negotiable regardless of path.
Figure 6 — Decision tree: from question to recommended defense layers.
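
The tree can also be written as a small function, which makes it easy to unit-test or embed in a design-review template. This is one reading of the figure, with an assumption worth flagging: the high-stakes question is applied on every path. `recommend_defenses` and its parameter names are hypothetical.

```python
def recommend_defenses(
    needs_fresh_or_private_data: bool,
    needs_exact_math_or_live_data: bool,
    high_stakes: bool,
) -> list[str]:
    """Map the three decision-tree questions to a list of defense layers."""
    # The four "always" items apply regardless of path.
    layers = ["structured output", "eval suite in CI",
              "production observability", "feedback loop"]
    if needs_fresh_or_private_data:
        layers.append("RAG")
    elif needs_exact_math_or_live_data:
        layers.append("tools / function calls")
    else:
        layers.append("prompt + low temperature + eval")
    # Assumption: the stakes question is asked on every branch.
    if high_stakes:
        layers += ["reasoning model", "multi-verifier", "human-in-the-loop"]
    else:
        layers += ["NLI verifier", "citations"]
    return layers

print(recommend_defenses(True, False, True)[4:])
# ['RAG', 'reasoning model', 'multi-verifier', 'human-in-the-loop']
```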
19 · Field Manual

Developer Checklist

Print this. Run through before every AI feature ships.

Pre-launch

  • Defined "hallucination" in your domain (what counts?)
  • Have a labeled eval set (50+ examples)
  • Baseline metrics: faithfulness, abstention, latency, cost
  • RAG with hybrid retrieval + reranking
  • System prompt forces "answer only from context"
  • Inline citations validated against retrieved chunks
  • JSON Schema / structured output enforced
  • Tools for math, dates, live data, code execution
  • NLI or LLM-judge verifier on critical paths
  • Guardrails: PII, injection, policy, secrets
  • Abstain path: "I don't know" returns gracefully
  • Temperature ≤ 0.3 for factual tasks

Post-launch

  • Trace every request (prompt, retrieval, output, scores)
  • Alerts on faithfulness drop, latency spike, cost spike
  • User feedback captured (thumbs, edits, regenerates)
  • Weekly eval-set run; gate deploys on score regression
  • Monthly sample of prod traffic labeled & reviewed
  • Drift dashboard for retrieval & model versions
  • Versioned prompts (semantic version, diffable)
  • Rollback plan tested
  • Red-team prompts in regression suite
  • Documentation: model card, eval card, known failures
  • On-call runbook for AI incidents
  • Quarterly model upgrade evaluation
20 · North Star

Closing Principles — How to Build AI Developers Are Happy With

Engineering principles
  1. Treat the model as untrusted code. Wrap it.
  2. Ground every factual claim in retrievable evidence.
  3. Make abstention a first-class citizen.
  4. Constrain the output shape before constraining the content.
  5. Verify after generating — don't trust, test.
  6. Defense in depth: ≥ 3 layers for any real product.
  7. Evals in CI. Trace in prod. Feedback in the loop.
  8. Optimize for the worst 1% of inputs, not the average.
Career & happiness principles
  1. Ship measurable improvements, not vibes.
  2. Invest in evals — they compound like savings.
  3. Write boring code around exciting models.
  4. Pair every prototype with a kill-switch.
  5. Document failure modes; they become your moat.
  6. Read papers monthly; tune prompts weekly; refactor quarterly.
  7. Communicate uncertainty as clearly as the model should.
  8. Build systems you trust to leave alone over the weekend.
"A good AI system is not one that never makes mistakes. It is one that fails loudly, recovers gracefully, and improves measurably." — Linh Truong

I treat hallucination like partial failure in distributed systems: you do not wish it away—you engineer around it. RAG supplies evidence; structured outputs supply shape; tools supply precision; verifiers supply audit; evals supply feedback; observability supplies trust. Stack those layers and you get systems I am willing to leave running over a weekend.

21 · Sources

References & sources

Annotated bibliography behind the hallucination taxonomy, defense-in-depth stack, RAG recipes, verification patterns, eval metrics, and tooling tables in this note. Section tags (e.g. §5) show where each source is used. Prefer primary papers and official docs for benchmark numbers—model leaderboard scores move weekly.

Scope. Synthesis of peer-reviewed work, industry technical reports, and widely deployed open tools (May 2026). KPI bands in §1 (e.g. ~3–27% hallucination rate, ~50% reduction with RAG) are directional planning figures from the cited benchmarks and operator experience—not guaranteed production outcomes. Not legal or compliance advice.

Citations are numbered from 1 within each topic group below.

Hallucination: definition, taxonomy & root causes (§1–3)

  1. Ji et al., “Survey of Hallucination in Natural Language Generation.” ACM Computing Surveys 2023. Taxonomy of intrinsic vs. extrinsic hallucination. arxiv.org/abs/2202.03629 — §1–2.
  2. Zhang et al., “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.” 2023. LLM-era failure modes and mitigations. arxiv.org/abs/2309.01219 — §1–3.
  3. Maynez et al., “On Faithfulness and Factuality in Abstractive Summarization.” ACL 2020. Faithfulness vs. factuality distinction in §2 table. arxiv.org/abs/2005.00661 — §2.
  4. Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2023. Model-collapse risk when synthetic data loops. arxiv.org/abs/2305.17493 — §3.
  5. Kadavath et al., “Language Models (Mostly) Know What They Know.” 2022. Calibration and abstention background. arxiv.org/abs/2207.05221 — §3, §15.

Retrieval-augmented generation & hybrid search (§5, §17)

  1. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG formulation. arxiv.org/abs/2005.11401 — §4–5.
  2. Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” ICLR 2024. Retrieve-when-needed + self-critique. arxiv.org/abs/2310.11511 — §11, §13.
  3. Izacard et al., “Atlas: Few-shot Learning with Retrieval Augmentation.” 2022. Retrieval + generation at scale. arxiv.org/abs/2208.03299 — §5.
  4. Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” 2009. Sparse retrieval baseline in hybrid search. BM25 foundations (PDF) — §5.
  5. Khattab & Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020. Dense late-interaction retrieval. arxiv.org/abs/2004.12832 — §5.
  6. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Embedding retrieval standard. arxiv.org/abs/2004.04906 — §5.
  7. Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE). 2022. Query expansion pattern cited in advanced RAG. arxiv.org/abs/2212.10496 — §5.

Grounding, citations & attributable generation (§6)

  1. Menick et al., “Teaching language models to support answers with verified quotes.” 2022. Grounded QA with evidence spans. arxiv.org/abs/2203.11147 — §6.
  2. Gao et al., “RARR: Researching and Revising What Language Models Say, Using Language Models.” 2023. Post-hoc claim verification. arxiv.org/abs/2210.08726 — §6, §10.
  3. Rashkin et al., “Measuring Attribution in Natural Language Generation Models.” 2022. Attributable-to-source metrics. arxiv.org/abs/2112.12870 — §6, §15.
  4. Bohnet et al., “Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models.” 2022. Citation-quality evaluation. arxiv.org/abs/2212.08037 — §6.

Prompting, abstention & uncertainty (§7)

  1. Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality instruction data for truthful tone. arxiv.org/abs/2305.11206 — §7.
  2. Miao et al., “SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning.” 2023. Self-verification prompts. arxiv.org/abs/2308.00436 — §7, §10.
  3. Lin et al., “Teaching Models to Express Their Uncertainty in Words.” 2022. Verbalized confidence for abstention UX. arxiv.org/abs/2205.14334 — §7.

Structured outputs & constrained decoding (§8)

  1. Willard & Louf, “Efficient Guided Generation for Large Language Models” (Outlines). 2023. Grammar-constrained decoding. arxiv.org/abs/2307.09702 — §8.
  2. Microsoft, “Guidance” library. Structured generation and regex constraints. github.com/guidance-ai/guidance — §8.
  3. OpenAI, “Structured Outputs” and JSON Schema support. API-level schema enforcement. platform.openai.com — §8.
  4. Anthropic, “Tool use” documentation. Structured tool schemas. docs.anthropic.com — §8–9.

Tool use & function calling (§9)

  1. Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023. Tool-learning foundation. arxiv.org/abs/2302.04761 — §9.
  2. Qin et al., “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” 2023. API grounding at scale. arxiv.org/abs/2307.16789 — §9.
  3. OpenAI, “Function calling” guide. Shared tool-spec shape in §9 code block. platform.openai.com — §9.
  4. Google DeepMind, “Gemini API function calling.” Parallel tool-use spec. ai.google.dev — §9.

Verification, NLI & faithfulness models (§10)

  1. Williams et al., “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference” (MultiNLI). 2018. NLI entailment checks in §10 pseudocode. arxiv.org/abs/1704.05426 — §10.
  2. Honovich et al., “TRUE: Re-evaluating Factual Consistency Evaluation.” NAACL 2022. Unified faithfulness benchmark suite. arxiv.org/abs/2204.04991 — §15.
  3. Min et al., “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.” 2023. Atomic claim verification. arxiv.org/abs/2305.14251 — §10, §15.
  4. Vectara, “Hughes Hallucination Evaluation Model (HHEM)” and public leaderboard. Grounded-QA hallucination rates in §1 KPIs. github.com/vectara/hallucination-leaderboard — §1, §15.
  5. Manakul et al., “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models.” 2023. Sampling-based self-check. arxiv.org/abs/2303.08896 — §10.

Reasoning, self-consistency & agentic reliability (§11–13)

  1. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. CoT pattern in §11. arxiv.org/abs/2201.11903 — §11.
  2. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Multi-sample voting. arxiv.org/abs/2203.11171 — §4, §11.
  3. Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. Branch-and-evaluate reasoning. arxiv.org/abs/2305.10601 — §11.
  4. Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Self-critique and revise loop. arxiv.org/abs/2303.11366 — §10–11, §13.
  5. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool + reasoning agent loop. arxiv.org/abs/2210.03629 — §13.
  6. Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” 2024. Multi-step reliability in code agents. arxiv.org/abs/2405.15793 — §13.

Fine-tuning, RLHF & preference optimization (§12)

  1. Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS 2022. RLHF / InstructGPT. arxiv.org/abs/2203.02155 — §12.
  2. Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. DPO without reward model. arxiv.org/abs/2305.18290 — §12.
  3. Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF for safety alignment. arxiv.org/abs/2212.08073 — §12, §14.
  4. Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback alignment. arxiv.org/abs/2402.01306 — §12.

Guardrails, injection defense & safety (§14)

  1. NVIDIA, “NeMo Guardrails” documentation and toolkit. Programmable rails for RAG + agents. github.com/NVIDIA/NeMo-Guardrails — §14.
  2. Meta, “Llama Guard” model cards and weights. Input/output policy classification. github.com/meta-llama/PurpleLlama — §14.
  3. Guardrails AI. Validator framework referenced in §14. github.com/guardrails-ai/guardrails — §14.
  4. OWASP Foundation, “Top 10 for Large Language Model Applications.” Prompt injection and data-exfil risks. OWASP LLM Top 10 — §14, §19.
  5. Greshake et al., “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Injection via untrusted RAG content. arxiv.org/abs/2302.12173 — §14.

Evaluation frameworks & benchmarks (§15)

  1. Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Faithfulness, relevance, context metrics in §15 table. arxiv.org/abs/2309.15217 — §15.
  2. Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL 2022. Open-domain factuality benchmark. arxiv.org/abs/2109.07958 — §15.
  3. Wei et al., “Measuring Short-form Factuality in Large Language Models” (SimpleQA). OpenAI 2024. Short-answer factuality benchmark. openai.com — §15.
  4. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. LLM-judge bias patterns in §15.3. arxiv.org/abs/2306.05685 — §15.
  5. Liu et al., “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.” 2023. LLM-as-judge rubric design. arxiv.org/abs/2303.16634 — §15.
  6. Confident AI, “DeepEval” documentation. Test-driven LLM eval in CI. github.com/confident-ai/deepeval — §15.
  7. promptfoo project. Prompt/regression testing CLI. github.com/promptfoo/promptfoo — §15.
  8. OpenAI, “Evals” framework. Dataset-driven eval API. github.com/openai/evals — §15.

Observability, tracing & production loops (§16)

  1. LangChain, “LangSmith” tracing and evals. Request traces, datasets, regression runs. docs.smith.langchain.com — §16.
  2. Braintrust. Eval + observability platform cited in §15–16. braintrust.dev — §15–16.
  3. Helicone. LLM gateway logging and cost telemetry. docs.helicone.ai — §16.
  4. OpenTelemetry, “Generative AI observability” semantic conventions (draft). Standard trace fields for LLM spans. opentelemetry.io — §16.

Inference stack & distillation (§8, §17–18)

  1. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. Serving layer in architecture diagrams. arxiv.org/abs/2309.06180 — §17–18.
  2. ggml / llama.cpp project. Constrained decoding on edge. github.com/ggerganov/llama.cpp — §8.
  3. Hinton et al., “Distilling the Knowledge in a Neural Network.” 2015. Teacher–student compression for smaller deploy models. arxiv.org/abs/1503.02531 — §18 (where distillation applies).

KPI strip (§1)

~3–27% hallucination rate comes from the Vectara HHEM grounded-QA leaderboard and related faithfulness benchmarks—not a single vendor guarantee. ~50%↓ with RAG is a typical band reported in RAGAS case studies and internal A/B tests when retrieval quality is strong. <1% with retrieval + verifier + structured I/O is an aspirational target on narrow, well-scoped domains. Re-measure on your golden set before you quote any number externally.
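
One way to act on the "re-measure on your golden set" advice is to report the measured rate together with a confidence interval, so a small golden set cannot make a sub-1% claim look firmer than it is. The sketch below is a minimal illustration, not part of the stack described earlier: the `RateEstimate` type, the `hallucination_rate` helper, and the choice of a Wilson score interval are assumptions made for this example, and the boolean labels are assumed to come from whatever judge or human review you already trust.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RateEstimate:
    """Hypothetical container: point estimate plus interval bounds."""
    rate: float  # observed hallucination rate on the golden set
    low: float   # lower bound of the Wilson score interval
    high: float  # upper bound of the Wilson score interval
    n: int       # number of golden-set examples scored


def hallucination_rate(labels: list[bool], z: float = 1.96) -> RateEstimate:
    """Estimate the hallucination rate from per-example labels.

    labels[i] is True when golden-set answer i was judged a hallucination.
    z = 1.96 corresponds to an approximate 95% interval.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("golden set is empty")
    p = sum(labels) / n
    # Wilson score interval: better behaved than p +/- z*sqrt(p(1-p)/n)
    # when p is close to 0, which is the regime a "<1%" claim lives in.
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return RateEstimate(p, max(0.0, centre - half), min(1.0, centre + half), n)


if __name__ == "__main__":
    # Illustrative labels only: 2 flagged answers out of 200 golden examples.
    est = hallucination_rate([True] * 2 + [False] * 198)
    print(f"rate={est.rate:.3%} interval=[{est.low:.3%}, {est.high:.3%}] n={est.n}")
```

With 2 flagged answers out of 200, the point estimate is 1% but the upper bound sits well above it, so the defensible external statement is the interval (or a larger golden set), not the point estimate.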