Linh Truong · LLM reliability · May 2026

AI Hallucination Solution

How I cut hallucinations in production: why models confabulate, what to measure, and the stack I actually ship (RAG, grounding, verification, structured outputs, guardrails, and evals).

Audience: Engineers · Architects · ML/AI builders
Stack: Model-agnostic (Claude · GPT · Gemini · Llama)
Revised: May 2026
Outcome: Reliable, grounded, observable AI

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

Hallucination is not a bug you patch once—it is what next-token prediction does when nothing forces grounding. I rely on retrieval, grounding, verification, structured generation, and eval loops together. What follows is my working playbook (2025–2026): failure taxonomy, root causes, defense layers, architecture patterns, code recipes, eval frameworks, monitoring, and a decision tree for picking the right fix per use case.

01 · Foundation

What Hallucination Really Is

A hallucination is any model output that is fluent and confident but factually wrong, unfaithful to context, or internally inconsistent. It is not a malfunction — it is what a probability-maximizing language model does when it has no way to know it should stop.

~3–27%: hallucination rate across frontier LLMs on grounded QA (HHEM, 2025)

~50%↓: typical reduction with well-designed RAG + grounding

< 1%: achievable on narrow, well-scoped domains with retrieval + verifier + structured I/O

Working definition. An output is a hallucination if at least one of the following is true:
(a) it asserts a fact contradicted by an authoritative source; (b) it asserts a fact not entailed by the provided context (when context was supposed to be the source of truth); (c) it cites, quotes, or links to something that does not exist; (d) it contradicts itself within the same response.

Figure 1 — Two divergent paths from the same model: grounded faithfulness vs. fluent fabrication.

02 · Taxonomy

Taxonomy of Hallucinations

Different failure modes need different defenses. Misclassifying the type is the most common engineering mistake.

Type	What it is	Example	Best defense
Factual (closed-domain)	Asserts a wrong real-world fact	"Einstein won the Nobel in 1925" (was 1921)	RAG · web search · verifier
Faithfulness (intrinsic)	Output contradicts provided context	Summary says revenue fell when the document says it rose	Strong grounding prompt · NLI checker
Extrinsic	Output adds info not in context (even if true)	Inventing dates the doc never mentioned	"Only use given context" · citation enforcement
Reference	Invents citations, URLs, paper titles, IDs	Fake DOI, made-up court case, ghost npm package	Tool call to verify existence · whitelist
Logical	Internally inconsistent reasoning	"X > Y, Y > Z, therefore Z > X"	Chain-of-thought · self-consistency
Arithmetic	Calculation errors stated confidently	Tax math, unit conversion, code metrics	Code interpreter / calculator tool
Temporal	Stale facts post-cutoff or wrong "as of" dates	"The current CEO is..." (changed last month)	Date-aware retrieval · explicit cutoff
Identity	Confuses similar entities	Mixes up two people with similar names	Entity disambiguation · grounded retrieval
Code (package hallucination)	Invents libraries, APIs, function signatures	`import foobar` (doesn't exist)	Static check · linter · sandbox run
Schema	Returns JSON that violates required schema	Wrong types, missing fields, hallucinated keys	Constrained decoding · validation + retry

Most dangerous in production: reference hallucinations (legal, medical, scientific apps) and code hallucinations (supply-chain risk — "slopsquatting" attacks register fake packages models suggest). Defend explicitly.

03 · Root Causes

Root Causes — Why LLMs Hallucinate

Hallucinations have at least nine distinct causes. Each one needs a different mitigation.

CAUSE 1

Training objective

Next-token cross-entropy rewards plausibility, not truth. Saying "I don't know" was historically penalized; that's now actively being changed in modern RLHF.

CAUSE 2

Training data noise

The web contains lies, satire, outdated info, contradictions. Models memorize the distribution, not a fact-checked encyclopedia.

CAUSE 3

Knowledge cutoff & staleness

Any fact that changed after training is unknown. Without retrieval, the model fills the gap with confident guesses.

CAUSE 4

Long-tail facts

Rare entities (a niche library, a small company, a specific patient ID) are under-represented; the model interpolates from neighbors.

CAUSE 5

Prompt ambiguity

Vague or under-specified questions force the model to assume. Assumption + confidence = fabrication.

CAUSE 6

Decoding stochasticity

High temperature, top-p sampling, and exposure bias amplify drift across long generations.

CAUSE 7

Sycophancy / RLHF artifacts

Models trained on human preference can favor confident, agreeable answers over correct uncertain ones.

CAUSE 8

Context drift

In long conversations, the model forgets or compresses prior turns and may invent continuity.

CAUSE 9

Tokenization & arithmetic

Numbers, dates, and code identifiers tokenize awkwardly; subword splits cause silent errors.

"Language models are designed to be plausible. Truth is an emergent property — and only when we engineer it to be."— Engineering folklore, 2025

04 · The Stack

The Defense-in-Depth Stack

No single technique solves hallucination. Production systems layer 4–7 defenses. Below is the canonical stack.

Figure 2 — The 7-layer defense-in-depth stack. Each layer cuts the residual error of the layer below.

Rule of thumb: each well-implemented layer typically halves hallucination rate. 5 layers ≈ 32× fewer hallucinations than a bare LLM call.
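The arithmetic behind that rule of thumb is plain geometric decay. A minimal sketch, assuming each layer independently multiplies the residual rate by the same factor (real layers overlap, so treat the result as a planning figure); `residual_rate` and the 20% base rate are illustrative names and numbers, not measurements.

```python
def residual_rate(base_rate: float, layers: int, factor: float = 0.5) -> float:
    """Residual hallucination rate after `layers` defenses.

    Assumption: every layer multiplies the remaining rate by `factor`,
    independently of the other layers.
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be in [0, 1]")
    if layers < 0:
        raise ValueError("layers must be non-negative")
    return base_rate * factor ** layers


base = 0.20  # illustrative bare-LLM rate, not a benchmark result
for n in range(6):
    print(f"{n} layers -> {residual_rate(base, n):.3%}")
```

With `factor=0.5`, five layers divide the base rate by 2^5 = 32, which is where the 32× figure comes from.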

05 · L2 Grounding

Retrieval-Augmented Generation (RAG)

If the answer must come from your data, the model must see your data at inference time. RAG is the single highest-leverage technique against hallucination.

Figure 3 — The full RAG pipeline: ingestion, query-time retrieval, and post-generation verification.

Ingestion best practices

Chunk by structure (headings, paragraphs), not fixed length
200–800 tokens per chunk; overlap 10–20%
Store rich metadata: source, section, date, author, ACL
De-duplicate near-duplicates with MinHash/LSH
Pre-compute summaries or hypothetical questions for each chunk
Pick embedding model per language & domain; benchmark on your data
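The first two ingestion rules can be sketched as a paragraph-aware packer. This is a simplified illustration, assuming blank lines mark paragraph boundaries and whitespace-separated words approximate tokens; `chunk_by_paragraph` is a hypothetical helper, and a real pipeline would count tokens with the embedding model's tokenizer and also split on headings.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 800, overlap_paras: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly `max_tokens` words.

    The last `overlap_paras` paragraphs of a finished chunk are repeated at
    the start of the next one. A single paragraph longer than `max_tokens`
    is kept whole rather than cut mid-thought.
    """
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paras:
        size = sum(len(p.split()) for p in current) + len(para.split())
        if current and size > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paras:] if overlap_paras else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "a b c\n\nd e f\n\ng h i"
print(chunk_by_paragraph(doc, max_tokens=6))
```

Overlap is expressed in paragraphs here for simplicity; matching the 10–20% guideline exactly would mean measuring the carried-over text in tokens.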

Retrieval best practices

Hybrid search: dense (vector) + sparse (BM25) outperforms either alone
Rerank with a cross-encoder (e.g., Cohere Rerank, bge-reranker)
Query rewriting: HyDE, multi-query, step-back, decomposition
Recency & ACL filters applied before ranking
Confidence threshold: if top score < X → "I don't know"
Context limit: top 4–10 chunks; quality > quantity (lost-in-the-middle)

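# Minimal RAG pseudocode (model-agnostic). embed, vector_db, bm25, rerank,
# llm, and the other helpers are placeholders for your own stack.
ABSTAIN_TEXT = "I don't have enough information."

def answer(question, k=8):
    q_emb   = embed(question)
    dense   = vector_db.search(q_emb, k=k)
    sparse  = bm25.search(question, k=k)
    fused   = reciprocal_rank_fusion(dense, sparse)
    top     = rerank(question, fused)[:5]

    # Abstain when retrieval returns nothing, or nothing confident enough
    if not top or top[0].score < THRESHOLD:
        return {"answer": ABSTAIN_TEXT, "sources": []}

    context = format_with_citations(top)
    prompt  = SYSTEM + f"""
    Use ONLY the context below. Cite sources as [#].
    If the answer is not in the context, say so.
    Context: {context}
    Question: {question}
    """
    text = llm(prompt, temperature=0.2)
    # L4 verification: branch explicitly, since `assert` is stripped under `python -O`
    if not citations_valid(text, top):
        return {"answer": ABSTAIN_TEXT, "sources": []}
    return {"answer": text, "sources": [chunk.id for chunk in top]}  # chunk.id is assumed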
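The `reciprocal_rank_fusion` helper in the pseudocode is left undefined. One common way to write it is sketched below, assuming each retriever returns document IDs ordered best-first; `k=60` is the constant usually quoted for RRF, and the IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["d3", "d1", "d7"]   # hypothetical vector-search results
sparse = ["d1", "d9", "d3"]  # hypothetical BM25 results
print(reciprocal_rank_fusion(dense, sparse))  # ['d1', 'd3', 'd9', 'd7']
```

RRF uses only ranks, not raw scores, which is why it works without calibrating dense similarity against BM25 scores.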

06 · Faithfulness

Grounding & Citations

RAG only helps if the model actually uses the retrieved text. Grounding makes that explicit, auditable, and verifiable.

RULE

"Only use the context"

Hard-line the model in the system prompt:

# System prompt fragment
You answer ONLY from the provided context.
If the context does not contain the answer,
reply: "I don't have that information."
Never use prior knowledge. Cite sources as [n].

RULE

Inline citations

Require every claim to be tagged with the source chunk ID. Validate post-generation:

Every [n] resolves to a real retrieved chunk
The cited chunk contains text supporting the claim (NLI check)
Reject / regenerate if any citation fails
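The first of those checks is cheap enough to run on every response. A minimal sketch, assuming citations appear as `[n]` with 1-based indices into the retrieved chunks; both function names are hypothetical, the sentence splitter is deliberately naive, and the entailment (NLI) check is a separate step not shown here.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Cited indices that do not map to any retrieved chunk (1-based)."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


def uncited_sentences(answer: str) -> list[str]:
    """Sentences carrying no [n] marker at all (naive punctuation split)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


text = "Revenue rose 4% [1]. Margins fell [7]. Outlook is strong."
print(invalid_citations(text, num_chunks=5))  # [7]
print(uncited_sentences(text))                # ['Outlook is strong.']
```

Either list being non-empty is a reasonable trigger for the reject-or-regenerate path.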

RULE

Provenance metadata

Return for every claim: source URL, last-updated date, author, confidence. Make the answer auditable end-to-end.

RULE

Abstain gracefully

An "I don't know" with a suggested next step beats a confident wrong answer 100× in user trust. Train your prompt and your evals to reward abstention.

Frontier feature: Modern APIs (Claude, GPT, Gemini) now offer built-in citations and grounded-generation modes. Use them where available — they reduce custom verification work.

07 · L1 Prompting

Prompt Engineering for Truthfulness

Even with RAG, prompt design controls how the model handles uncertainty, ambiguity, and edge cases.

High-leverage prompt patterns

Role + constraints: define expertise, scope, refusal rules
Explicit "I don't know" license: give the model permission to abstain
Step-back prompting: ask abstract question first, then specific
Chain-of-thought: "think step by step" reduces logic errors
Few-shot grounded examples: show the citation format
Negative examples: include what not to do
Temperature 0–0.3 for factual; 0.7+ for creative
Self-critique step: "Now check your answer against the sources."

Anti-patterns that cause hallucination

"You are an expert in everything" → invites overconfidence
Vague tasks like "tell me about X" without constraints
Mixing instructions with untrusted user input (prompt injection)
Asking for exact numbers without giving the source
Long lists of "always do X, never do Y" — the model forgets
Implicit assumptions ("write the code") with no schema

# A robust grounded-QA system prompt template
You are a careful research assistant.

Rules:
1. Answer ONLY using the CONTEXT below.
2. Each claim must end with a citation like [3].
3. If the context is insufficient, reply exactly:
   "I don't have enough information to answer that."
4. Never invent URLs, names, numbers, or dates.
5. If asked about something outside the context, decline.

Output format:
<answer>
Concise answer with [n] citations.
</answer>
<sources>
[1] short title — URL
[2] ...
</sources>

CONTEXT:
{retrieved_chunks}

QUESTION: {user_question}

08 · L3 Generation

Structured Outputs & Constrained Decoding

If the output shape is wrong, downstream code crashes. Constrained decoding makes invalid outputs impossible, not merely unlikely.

TECHNIQUE

JSON Schema / Tool Schema

Define the expected output as a JSON Schema. Several modern APIs can guarantee schema-valid output through constrained decoding (token masking).

TECHNIQUE

Grammar-constrained

llama.cpp, Outlines, Guidance: restrict token sampling to a regular expression or a context-free grammar (e.g., llama.cpp's GBNF).

TECHNIQUE

Type-safe parsers

Pydantic, Zod, Instructor: parse → validate → retry on failure. Catch schema hallucinations before they ship.

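# Example: structured output with retry-on-fail
# (`llm.generate` is a placeholder client, assumed here to return raw JSON text)
from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):
    answer: str
    confidence: float = Field(ge=0, le=1)
    sources: list[str]
    is_uncertain: bool

def generate_answer(prompt: str, max_attempts: int = 3) -> Answer:
    for _ in range(max_attempts):
        raw = llm.generate(prompt, response_format=Answer)  # schema-enforced where supported
        try:
            return Answer.model_validate_json(raw)  # Pydantic v2
        except ValidationError as err:
            prompt += f"\n\nYour last output was invalid: {err}. Return valid JSON only."
    raise RuntimeError("no schema-valid answer after retries")

result = generate_answer(prompt)
if result.confidence < 0.6 or result.is_uncertain:
    escalate_to_human(result)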

Why it works: Schema constraints prevent ~all syntactic hallucinations (missing fields, wrong types, hallucinated keys). They don't prevent semantic hallucination — that's L4's job.

09 · L3 Generation

Tool Use & Function Calling

For anything the model is bad at — math, current data, exact facts, code execution — give it a tool and let the deterministic system handle it.

Tools that eliminate whole hallucination classes

Calculator / Python REPL → arithmetic, units, stats
SQL / API call → live, authoritative data
Web search → recency, fresh facts
Code interpreter → run code before answering
Document retrieval → grounded answers
Validator endpoints → check that a DOI / URL / package exists
Calendar / time API → eliminate "today is..." errors

Design rules for tools

Each tool has a single, well-described purpose
Inputs strictly typed via JSON Schema
Outputs include status + data + error
Idempotent; safe to retry
Side-effecting tools require confirmation (write, delete, pay)
Prefer many small tools to one mega-tool
Return enough context that the model can reason about results

# Tool spec (Anthropic-style; OpenAI and Gemini take the same JSON Schema body under a `parameters` key)
{
  "name": "lookup_customer",
  "description": "Fetch a customer by ID from the CRM.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": {"type": "string", "pattern": "^c_[a-z0-9]+$"}
    },
    "required": ["customer_id"]
  }
}
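Applying the design rules to the `lookup_customer` spec, the handler side might look like the sketch below. The in-memory `FAKE_CRM`, the `ToolResult` envelope, and the status strings are assumptions for illustration; the point is that the input is validated against the same pattern the schema declares, and that every outcome comes back as status + data + error rather than as an exception the model never sees.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional

CUSTOMER_ID = re.compile(r"^c_[a-z0-9]+$")  # same pattern as the tool schema
FAKE_CRM = {"c_42ab": {"name": "Ada Example", "tier": "gold"}}  # stand-in for a real CRM


@dataclass
class ToolResult:
    status: str                              # "ok", "not_found", or "error"
    data: Optional[dict[str, Any]] = None
    error: Optional[str] = None


def lookup_customer(customer_id: str) -> ToolResult:
    """Read-only and idempotent, so it is safe for the agent to retry."""
    if not isinstance(customer_id, str) or not CUSTOMER_ID.fullmatch(customer_id):
        return ToolResult("error", error="customer_id must match ^c_[a-z0-9]+$")
    record = FAKE_CRM.get(customer_id)
    if record is None:
        return ToolResult("not_found", error=f"no customer with id {customer_id}")
    return ToolResult("ok", data=record)


print(lookup_customer("c_42ab"))
print(lookup_customer("42"))
```

A distinct `not_found` status matters for hallucination: it gives the model something concrete to report instead of inventing a customer record.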

10 · L4 Verification

Verifier & Critic Models

A second model — or the same model in a different role — reads the output and checks it before it reaches the user.

METHOD

NLI / Entailment check

For every claim in the answer, ask: does the source context entail this? (DeBERTa-NLI, Vectara HHEM, custom).

METHOD

LLM-as-judge

Prompt a strong model to score faithfulness, relevance, and completeness on a 0–5 scale. Use the same rubric for offline evals and as the runtime gate.

METHOD

Self-check / Reflexion

Ask the generating model: "Re-read your answer. Identify any unsupported claim." Then regenerate or strip.

METHOD

Citation existence check

Resolve each citation: does the URL respond 200? Does the DOI exist? Does the package name exist on PyPI/npm?

METHOD

Cross-model consensus

Ask 2–3 different models the same grounded question. Disagreement → escalate or abstain.

METHOD

Numeric & code sanity

Re-execute math/code in a sandbox; check totals, types, returns. Mismatch → retry.
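For plain arithmetic, re-execution does not need a full sandbox. A minimal sketch, assuming the model is asked to return the expression alongside the number it claims; `safe_eval` and `matches_claim` are hypothetical names, and only literals with `+ - * /` and unary minus are accepted, so names and function calls are rejected rather than executed.

```python
import ast
import math
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate a pure arithmetic expression without calling eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


def matches_claim(expr: str, claimed: float, rel_tol: float = 1e-6) -> bool:
    """True when the recomputed value agrees with the number the model stated."""
    return math.isclose(safe_eval(expr), claimed, rel_tol=rel_tol)


print(matches_claim("1200 * 0.0825", 99.0))   # True
print(matches_claim("1200 * 0.0825", 109.0))  # False
```

A mismatch feeds the retry path; generated code, as opposed to arithmetic, still belongs in a real sandbox.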

# Faithfulness check via NLI (extract_claims, nli_model, and the other helpers are placeholders)
for claim in extract_claims(answer):
    # A claim is supported if at least one retrieved chunk entails it
    supported = any(
        nli_model.entails(premise=chunk.text, hypothesis=claim)
        for chunk in retrieved_chunks
    )
    if not supported:
        log_hallucination(claim)
        answer = strip_or_regenerate(answer, claim)

Latency note: verifiers add 100–500ms. Run them in parallel with streaming, or asynchronously for low-risk content (then correct in a follow-up message if needed).

11 · Reasoning

Self-Consistency, CoT & Reasoning Models

Many "factual" errors are actually reasoning errors. Reasoning-time techniques are the cheapest accuracy gain you can buy.

Techniques

Chain-of-Thought (CoT): "Think step by step" — exposes reasoning, often surfaces its own bugs
Self-Consistency: sample N answers, take the majority vote
Tree of Thoughts: branch and evaluate multiple paths
Reflexion: model critiques its own answer, then revises
Self-RAG: model decides when to retrieve, then critiques
Reasoning models (Claude with extended thinking, OpenAI o-series, Gemini Thinking): native long-form chain-of-thought with verification, which tends to reduce hallucination noticeably on complex tasks
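Of the techniques above, self-consistency is the easiest to show in code. A minimal sketch, assuming `sample` is any zero-argument callable that returns one answer per call (in practice an LLM call at non-zero temperature) and that answers can be compared after simple normalization; the `min_agreement` threshold is an added assumption that turns weak majorities into abstentions.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample: Callable[[], str], n: int = 5, min_agreement: float = 0.6
) -> Optional[str]:
    """Sample `n` answers and return the majority answer.

    Returns None (abstain or escalate) when the winning answer holds
    less than `min_agreement` of the votes.
    """
    votes = Counter(sample().strip().lower() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer if count / n >= min_agreement else None


# Canned samples standing in for five model calls.
canned = iter(["42", "42", "41", "42", "17"])
print(self_consistent_answer(lambda: next(canned)))  # 42
```

Exact-match voting suits short answers (numbers, labels, entity names); free-text answers need a similarity or judge step before voting.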

When to use which

Simple lookup: direct prompt, no CoT
Multi-step: CoT
Math / logic: CoT + tool (calculator)
High-stakes (legal, medical, finance): reasoning model + verifier + human review
Real-time UX-sensitive: non-reasoning model + strong RAG, save reasoning for retry

2026 reality: Combining a reasoning model with RAG can approach < 1% hallucination on narrow, well-scoped grounded QA. The cost-per-query is higher, but for high-stakes flows the ROI is clear.

12 · Model

Fine-Tuning, RLHF & DPO for Truthfulness

When prompting + RAG aren't enough, change the model. Fine-tuning teaches behavior; preference optimization teaches taste.

Approach	What it does	Data needed	Best for
SFT — Supervised fine-tuning	Imitate ideal answers	1k–50k examples	Format, tone, domain vocabulary
Instruction tuning	Generalize to instructions	10k+ varied tasks	Base model → assistant
RLHF — Reinforcement from human feedback	Learn human preference	10k+ preference pairs	Polish, helpfulness
DPO / IPO / KTO	Same as RLHF, no RL loop	Pairs (chosen, rejected)	Cheaper alignment
RLAIF	AI-generated preference	Synthetic pairs + judge	Scale where humans are slow
Truthfulness fine-tuning	Reward abstention over confident-wrong	Curated I-don't-know examples	Reduce confident hallucinations
Distillation	Small model imitates a larger one	Teacher outputs	Cost reduction, latency

Trap: fine-tuning on hallucinated synthetic data amplifies hallucination. Curate carefully; include explicit "abstain" examples; evaluate on held-out factuality benchmarks before shipping.

13 · Agents

Agentic Patterns & Multi-Step Reliability

In agent loops, errors compound. At 95% per-step accuracy, a 10-step run succeeds end-to-end only about 60% of the time (0.95^10 ≈ 0.60, assuming independent steps). Reliability engineering matters more than raw intelligence.
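The compounding is worth computing for your own step counts. A small sketch, assuming steps succeed or fail independently (correlated failures will shift the numbers); both function names are illustrative.

```python
def end_to_end_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return step_accuracy ** steps


def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` end-to-end accuracy."""
    return target ** (1.0 / steps)


print(f"{end_to_end_accuracy(0.95, 10):.1%}")     # 59.9%
print(f"{required_step_accuracy(0.90, 10):.2%}")  # 98.95%
```

Read the second number as the real requirement: a 10-step agent that should succeed 90% of the time needs each step to be right about 99% of the time.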

Reliability patterns

Plan-then-execute: separate planning from action
ReAct loop: reason → act → observe → re-reason
Critic / Reviewer agent: second model audits each step
Bounded retries: max N attempts, then escalate
Checkpointing: persist state; resume on failure
Tool whitelisting: agents can only call approved tools
Cost & loop budget: hard cap on tokens, steps, time
Human-in-the-loop on irreversible actions
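Bounded retries with a critic and an escalation path fit in a few lines. A minimal sketch, assuming `step` produces a candidate result and `accept` is the critic's verdict; `EscalationRequired` and `run_with_bounded_retries` are hypothetical names, and a production version would also enforce token, time, and cost budgets.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalationRequired(Exception):
    """Raised when the retry budget is spent; hand off to a human or fallback."""


def run_with_bounded_retries(
    step: Callable[[], T], accept: Callable[[T], bool], max_attempts: int = 3
) -> T:
    """Run `step` until `accept` approves the result, at most `max_attempts` times."""
    last: Optional[T] = None
    for _ in range(max_attempts):
        last = step()
        if accept(last):
            return last
    raise EscalationRequired(
        f"no accepted result after {max_attempts} attempts; last={last!r}"
    )


# Canned outputs standing in for three agent attempts.
outputs = iter(["draft with bad citation", "draft with bad citation", "grounded draft"])
result = run_with_bounded_retries(lambda: next(outputs), lambda r: r == "grounded draft")
print(result)
```

Raising instead of returning the last rejected draft keeps unverified output from flowing downstream by accident.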

Multi-agent failure modes

Echo chambers: agents agree with each other's hallucinations
Context bleed: noise from one agent pollutes others
Infinite loops: tool returns trigger same call again
Goal drift: agent "rewrites" the goal mid-run
Tool spam: calling expensive APIs unnecessarily
Prompt injection: retrieved content overrides system prompt

Figure 4 — A reliable ReAct loop with a critic step and a bounded retry — the minimum viable agent pattern.

14 · L5 Guardrails

Guardrails & Safety Layers

Hallucination, prompt injection, and policy violations share defenses. Treat the LLM as untrusted code and wrap it in filters.

Input guardrails

PII redaction before sending to model
Prompt-injection detection (Rebuff, Lakera, NeMo Guardrails)
Topic / scope classifier ("off-domain" → refuse)
Rate & cost limits per user
Strip / quote untrusted content (RAG chunks, web pages)

Output guardrails

Policy classifier (toxicity, secrets, hate, illegal)
Schema validation + retry
Faithfulness / NLI gate
Citation existence check
Numeric/code re-execution
PII leakage scanner
Confidence threshold → abstain

Open source toolkits: NVIDIA NeMo Guardrails · Guardrails AI · Llama Guard · Rebuff · Microsoft Presidio (PII) · Vectara HHEM (hallucination detection). Compose them, don't build from scratch.

15 · L6 Evals

Evaluation Frameworks

You cannot improve what you cannot measure. The biggest difference between a hobby project and a production AI system is the eval suite.

Metric	What it measures	How to compute
Faithfulness	Claims supported by source context	NLI per claim / RAGAS / LLM-judge
Answer relevance	Answer addresses the question	Embedding similarity Q ↔ A · LLM-judge
Context precision/recall	Retrieved chunks are relevant & sufficient	Per-chunk relevance labels
Groundedness	Output stays within context	HHEM · TRUE · attributable-to-source
Exact match / F1	Closed-form QA accuracy	Compare to gold answer
Hallucination rate	% outputs with at least one unsupported claim	Human or LLM-judge
Abstention rate	% of unanswerable Qs the model declines	On adversarial test set
Calibration	Confidence matches accuracy	Reliability diagram / ECE
Latency & cost	p50/p95 ms, $ / request	Telemetry
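Calibration is the metric in this table that people most often leave unmeasured, so here is a minimal sketch of expected calibration error (ECE) with equal-width bins. It assumes you log one confidence in [0, 1] and one correctness label per answer; the function name and bin count are illustrative choices.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers at 90% stated confidence, three of them correct.
print(round(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]), 3))  # 0.15
```

An ECE near zero means a confidence threshold for abstention can be trusted; a large ECE means the threshold is gating on noise.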

FRAMEWORK

RAGAS

Open-source RAG eval: faithfulness, answer relevance, context precision/recall. Standard starting point.

FRAMEWORK

HHEM (Vectara)

Hughes Hallucination Evaluation Model, with a public leaderboard comparing top models on faithfulness.

FRAMEWORK

TruthfulQA / SimpleQA / FActScore

Open-domain factuality benchmarks. Useful for model selection and regression testing.

FRAMEWORK

DeepEval / promptfoo / OpenAI Evals

Test-driven eval frameworks. Pin behavior in CI; gate deploys on score drop.

FRAMEWORK

LangSmith / Braintrust / Helicone

Trace + eval observability. Run evals on production traffic samples.

FRAMEWORK

Custom golden set

50–500 examples specific to your domain. The single most valuable eval asset you can build.

Rule I ship by: nothing goes out without an eval suite in CI. Every prompt or model change → diff scores before merge.
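The diff-before-merge step can be a very small script. A sketch, assuming both eval runs produce a flat mapping of metric name to score where higher is better; the metric names, scores, and 0.01 tolerance are made up for illustration.

```python
def regressions(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.01
) -> dict[str, float]:
    """Metrics whose score dropped by more than `tolerance`, mapped to the drop.

    Assumes higher is better for every metric and that `candidate`
    reports every metric present in `baseline`.
    """
    return {
        metric: round(baseline[metric] - candidate[metric], 4)
        for metric in baseline
        if baseline[metric] - candidate[metric] > tolerance
    }


baseline = {"faithfulness": 0.92, "abstention": 0.80}   # scores from main
candidate = {"faithfulness": 0.88, "abstention": 0.81}  # scores from the PR
print(regressions(baseline, candidate))  # {'faithfulness': 0.04}
```

In CI, a non-empty result would fail the job (for example by exiting non-zero), which is what makes the rule enforceable rather than aspirational.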

16 · L6 Observability

Observability & Monitoring in Production

Even with perfect offline evals, real traffic surfaces new failure modes. Treat AI like distributed systems: trace everything, alert on drift.

What to log per request

Full prompt (system + user + retrieved context)
Model, version, parameters, seed
Retrieval candidates with scores
Tool calls, args, results
Final output + intermediate steps
Verifier scores + abstention decisions
User feedback (thumbs, edits, regenerate)
Latency & cost breakdown

What to alert on

Faithfulness score drops > X% week-over-week
Abstention rate spikes (retrieval broken?)
Citation-fail rate increases
p95 latency > SLO
Cost per request grows unexpectedly
Output schema validation failures
Guardrail trips (injection, policy)
Negative user feedback rate

Closed loop: Sample 1–5% of production traffic, label it (LLM-judge + occasional human), feed it back into your eval set and your fine-tuning data. This is how systems get reliably better over months.
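For the sampling step, hashing the request ID gives a stable sample without storing state. A minimal sketch, assuming every request carries a unique string ID; `in_sample` and the 2% default are illustrative.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of requests for labeling.

    The same request ID always gets the same decision, so traces, judge
    labels, and human reviews can be joined later.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


picked = sum(in_sample(f"req-{i}") for i in range(10_000))
print(f"sampled {picked} of 10000 requests")
```

Hash-based selection also avoids the bias of sampling only flagged or thumbs-down traffic, which would overstate the hallucination rate in the eval set.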

17 · Architecture

Reference Architectures

Three battle-tested architectures, ordered by use case complexity.

Figure 5 — Three reference architectures ordered by stakes: simple grounded QA · agentic · high-stakes with human-in-the-loop.

18 · Choose

Decision Tree by Use Case

When you don't know which technique to reach for, ask these questions in order.

Figure 6 — Decision tree: from question to recommended defense layers.

19 · Field Manual

Developer Checklist

Print this. Run through before every AI feature ships.

Pre-launch

Defined "hallucination" in your domain (what counts?)
Have a labeled eval set (50+ examples)
Baseline metrics: faithfulness, abstention, latency, cost
RAG with hybrid retrieval + reranking
System prompt forces "answer only from context"
Inline citations validated against retrieved chunks
JSON Schema / structured output enforced
Tools for math, dates, live data, code execution
NLI or LLM-judge verifier on critical paths
Guardrails: PII, injection, policy, secrets
Abstain path: "I don't know" returns gracefully
Temperature ≤ 0.3 for factual tasks

Post-launch

Trace every request (prompt, retrieval, output, scores)
Alerts on faithfulness drop, latency spike, cost spike
User feedback captured (thumbs, edits, regenerates)
Weekly eval-set run; gate deploys on score regression
Monthly sample of prod traffic labeled & reviewed
Drift dashboard for retrieval & model versions
Versioned prompts (semantic version, diffable)
Rollback plan tested
Red-team prompts in regression suite
Documentation: model card, eval card, known failures
On-call runbook for AI incidents
Quarterly model upgrade evaluation

20 · North Star

Closing Principles — How to Build AI Developers Are Happy With

Engineering principles

Treat the model as untrusted code. Wrap it.
Ground every factual claim in retrievable evidence.
Make abstention a first-class citizen.
Constrain the output shape before constraining the content.
Verify after generating — don't trust, test.
Defense in depth: ≥ 3 layers for any real product.
Evals in CI. Trace in prod. Feedback in the loop.
Optimize for the worst 1% of inputs, not the average.

Career & happiness principles

Ship measurable improvements, not vibes.
Invest in evals — they compound like savings.
Write boring code around exciting models.
Pair every prototype with a kill-switch.
Document failure modes; they become your moat.
Read papers monthly; tune prompts weekly; refactor quarterly.
Communicate uncertainty as clearly as the model should.
Build systems you trust to leave alone over the weekend.

"A good AI system is not one that never makes mistakes. It is one that fails loudly, recovers gracefully, and improves measurably." — Linh Truong

I treat hallucination like partial failure in distributed systems: you do not wish it away—you engineer around it. RAG supplies evidence; structured outputs supply shape; tools supply precision; verifiers supply audit; evals supply feedback; observability supplies trust. Stack those layers and you get systems I am willing to leave running over a weekend.

21 · Sources

References & sources

Annotated bibliography behind the hallucination taxonomy, defense-in-depth stack, RAG recipes, verification patterns, eval metrics, and tooling tables in this note. Section tags (e.g. §5) show where each source is used. Prefer primary papers and official docs for benchmark numbers—model leaderboard scores move weekly.

Scope. Synthesis of peer-reviewed work, industry technical reports, and widely deployed open tools (May 2026). KPI bands in §1 (e.g. ~3–27% hallucination rate, ~50% reduction with RAG) are directional planning figures from the cited benchmarks and operator experience—not guaranteed production outcomes. Not legal or compliance advice.

Citations are numbered continuously [1]–[n] within this section.

Hallucination: definition, taxonomy & root causes (§1–3)

Ji et al., “Survey of Hallucination in Natural Language Generation.” ACM Computing Surveys 2023. Taxonomy of intrinsic vs. extrinsic hallucination. arxiv.org/abs/2202.03629 — §1–2.
Zhang et al., “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.” 2023. LLM-era failure modes and mitigations. arxiv.org/abs/2309.01219 — §1–3.
Maynez et al., “On Faithfulness and Factuality in Abstractive Summarization.” ACL 2020. Faithfulness vs. factuality distinction in §2 table. arxiv.org/abs/2005.00661 — §2.
Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2024. Model-collapse risk when synthetic data loops. arxiv.org/abs/2305.17493 — §3.
Kadavath et al., “Language Models (Mostly) Know What They Know.” 2022. Calibration and abstention background. arxiv.org/abs/2207.05221 — §3, §15.

Retrieval-augmented generation & hybrid search (§5, §17)

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG formulation. arxiv.org/abs/2005.11401 — §4–5.
Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” ICLR 2024. Retrieve-when-needed + self-critique. arxiv.org/abs/2310.11511 — §11, §13.
Izacard et al., “Atlas: Few-shot Learning with Retrieval Augmentation.” 2022. Retrieval + generation at scale. arxiv.org/abs/2208.03299 — §5.
Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” 2009. Sparse retrieval baseline in hybrid search. BM25 foundations (PDF) — §5.
Khattab & Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR 2020. Dense late-interaction retrieval. arxiv.org/abs/2004.12832 — §5.
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Embedding retrieval standard. arxiv.org/abs/2004.04906 — §5.
Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE). 2022. Query expansion pattern cited in advanced RAG. arxiv.org/abs/2212.10496 — §5.

Grounding, citations & attributable generation (§6)

Menick et al., “Teaching language models to support answers with verified quotes.” 2022. Grounded QA with evidence spans. arxiv.org/abs/2203.11147 — §6.
Gao et al., “RARR: Researching and Revising What Language Models Say, Using Language Models.” 2023. Post-hoc claim verification. arxiv.org/abs/2210.08726 — §6, §10.
Rashkin et al., “Measuring Attribution in Natural Language Generation Models.” 2022. Attributable-to-source metrics. arxiv.org/abs/2112.12870 — §6, §15.
Bohnet et al., “Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models.” 2022. Citation-quality evaluation. arxiv.org/abs/2212.08037 — §6.

Prompting, abstention & uncertainty (§7)

Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality instruction data for truthful tone. arxiv.org/abs/2305.11206 — §7.
Miao et al., “SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning.” 2023. Self-verification prompts. arxiv.org/abs/2308.00436 — §7, §10.
Lin et al., “Teaching Models to Express Their Uncertainty in Words.” 2022. Verbalized confidence for abstention UX. arxiv.org/abs/2205.14334 — §7.

Structured outputs & constrained decoding (§8)

Willard & Louf, “Efficient Guided Generation for Large Language Models” (Outlines). 2023. Grammar-constrained decoding. arxiv.org/abs/2307.09702 — §8.
Microsoft, “Guidance” library. Structured generation and regex constraints. github.com/guidance-ai/guidance — §8.
OpenAI, “Structured Outputs” and JSON Schema support. API-level schema enforcement. platform.openai.com — §8.
Anthropic, “Tool use” documentation. Structured tool schemas. docs.anthropic.com — §8–9.

Tool use & function calling (§9)

Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023. Tool-learning foundation. arxiv.org/abs/2302.04761 — §9.
Qin et al., “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” 2023. API grounding at scale. arxiv.org/abs/2307.16789 — §9.
OpenAI, “Function calling” guide. Shared tool-spec shape in §9 code block. platform.openai.com — §9.
Google DeepMind, “Gemini API function calling.” Parallel tool-use spec. ai.google.dev — §9.

Verification, NLI & faithfulness models (§10)

Williams et al., “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference” (MultiNLI). 2018. NLI entailment checks in §10 pseudocode. arxiv.org/abs/1704.05426 — §10.
Honovich et al., “TRUE: Re-evaluating Factual Consistency Evaluation.” NAACL 2022. Unified faithfulness benchmark suite. arxiv.org/abs/2204.04991 — §15.
Min et al., “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.” 2023. Atomic claim verification. arxiv.org/abs/2305.14251 — §10, §15.
Vectara, “Hughes Hallucination Evaluation Model (HHEM)” and public leaderboard. Grounded-QA hallucination rates in §1 KPIs. github.com/vectara/hallucination-leaderboard — §1, §15.
Manakul et al., “SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection.” 2023. Sampling-based self-check. arxiv.org/abs/2303.08896 — §10.

Reasoning, self-consistency & agentic reliability (§11–13)

Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. CoT pattern in §11. arxiv.org/abs/2201.11903 — §11.
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Multi-sample voting. arxiv.org/abs/2203.11171 — §4, §11.
Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. Branch-and-evaluate reasoning. arxiv.org/abs/2305.10601 — §11.
Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Self-critique and revise loop. arxiv.org/abs/2303.11366 — §10–11, §13.
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool + reasoning agent loop. arxiv.org/abs/2210.03629 — §13.
Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” 2024. Multi-step reliability in code agents. arxiv.org/abs/2405.15793 — §13.

Fine-tuning, RLHF & preference optimization (§12)

Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS 2022. RLHF / InstructGPT. arxiv.org/abs/2203.02155 — §12.
Rafailov et al., “Direct Preference Optimization.” NeurIPS 2023. DPO without reward model. arxiv.org/abs/2305.18290 — §12.
Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF for safety alignment. arxiv.org/abs/2212.08073 — §12, §14.
Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback alignment. arxiv.org/abs/2402.01306 — §12.

Guardrails, injection defense & safety (§14)

NVIDIA, “NeMo Guardrails” documentation and toolkit. Programmable rails for RAG + agents. github.com/NVIDIA/NeMo-Guardrails — §14.
Meta, “Llama Guard” model cards and weights. Input/output policy classification. github.com/meta-llama/PurpleLlama — §14.
Guardrails AI. Validator framework referenced in §14. github.com/guardrails-ai/guardrails — §14.
OWASP Foundation, “Top 10 for Large Language Model Applications.” Prompt injection and data-exfil risks. OWASP LLM Top 10 — §14, §19.
Greshake et al., “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Injection via untrusted RAG content. arxiv.org/abs/2302.12173 — §14.

Evaluation frameworks & benchmarks (§15)

Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Faithfulness, relevance, context metrics in §15 table. arxiv.org/abs/2309.15217 — §15.
Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL 2022. Open-domain factuality benchmark. arxiv.org/abs/2109.07958 — §15.
Wei et al., “Measuring Short-form Factuality in Large Language Models” (SimpleQA). OpenAI 2024. Short-answer factuality benchmark. openai.com — §15.
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. LLM-judge bias patterns relevant to the §15 metrics. arxiv.org/abs/2306.05685 — §15.
Liu et al., “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.” 2023. LLM-as-judge rubric design. arxiv.org/abs/2303.16634 — §15.
Confident AI, “DeepEval” documentation. Test-driven LLM eval in CI. github.com/confident-ai/deepeval — §15.
promptfoo project. Prompt/regression testing CLI. github.com/promptfoo/promptfoo — §15.
OpenAI, “Evals” framework. Dataset-driven eval API. github.com/openai/evals — §15.

Observability, tracing & production loops (§16)

LangChain, “LangSmith” tracing and evals. Request traces, datasets, regression runs. docs.smith.langchain.com — §16.
Braintrust. Eval + observability platform cited in §15–16. braintrust.dev — §15–16.
Helicone. LLM gateway logging and cost telemetry. docs.helicone.ai — §16.
OpenTelemetry, “Generative AI observability” semantic conventions (draft). Standard trace fields for LLM spans. opentelemetry.io — §16.

Inference stack & distillation (§8, §17–18)

Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. Serving layer in architecture diagrams. arxiv.org/abs/2309.06180 — §17–18.
ggml / llama.cpp project. Constrained decoding on edge. github.com/ggerganov/llama.cpp — §8.
Hinton et al., “Distilling the Knowledge in a Neural Network.” 2015. Teacher–student compression for smaller deploy models. arxiv.org/abs/1503.02531 — §18 (where distillation applies).

KPI strip (§1)

~3–27% hallucination rate comes from the Vectara HHEM grounded-QA leaderboard and related faithfulness benchmarks—not a single vendor guarantee. ~50%↓ with RAG is a typical band reported in RAGAS case studies and internal A/B tests when retrieval quality is strong. <1% with retrieval + verifier + structured I/O is an aspirational target on narrow, well-scoped domains. Re-measure on your golden set before you quote any number externally.