AI Hallucination Solution
How I cut hallucinations in production: why models confabulate, what to measure, and the stack I actually ship—RAG, grounding, verification, structured outputs, guardrails, and evals.
Hallucination is not a bug you patch once—it is what next-token prediction does when nothing forces grounding. I rely on retrieval, grounding, verification, structured generation, and eval loops together. What follows is my working playbook (2025–2026): failure taxonomy, root causes, defense layers, architecture patterns, code recipes, eval frameworks, monitoring, and a decision tree for picking the right fix per use case.
01 · Foundation
What Hallucination Really Is
A hallucination is any model output that is fluent and confident but factually wrong, unfaithful to context, or internally inconsistent. It is not a malfunction — it is what a probability-maximizing language model does when it has no way to know it should stop.
~3–27% · Hallucination rate across frontier LLMs on grounded QA (HHEM, 2025)
~50%↓ · Typical reduction with well-designed RAG + grounding
< 1% · Achievable with retrieval + verifier + structured I/O
Working definition. An output is a hallucination if at least one of the following is true:
(a) it asserts a fact contradicted by an authoritative source; (b) it asserts a fact not entailed by the provided context (when context was supposed to be the source of truth); (c) it cites, quotes, or links to something that does not exist; (d) it contradicts itself within the same response.
Figure 1 — Two divergent paths from the same model: grounded faithfulness vs. fluent fabrication.
02 · Taxonomy
Taxonomy of Hallucinations
Different failure modes need different defenses. Misclassifying the type is the most common engineering mistake.
Type | What it is | Example | Best defense
Factual (closed-domain) | Asserts a wrong real-world fact | "Einstein won the Nobel in 1925" (was 1921) | RAG · web search · verifier
Faithfulness (intrinsic) | Output contradicts provided context | Summary reverses or misstates a claim the document makes | Strong grounding prompt · NLI checker
Extrinsic | Output adds info not in context (even if true) | Inventing dates the doc never mentioned | "Only use given context" · citation enforcement
Reference | Invents citations, URLs, paper titles, IDs | Fake DOI, made-up court case, ghost npm package | Tool call to verify existence · whitelist
Logical | Internally inconsistent reasoning | "X > Y, Y > Z, therefore Z > X" | Chain-of-thought · self-consistency
Arithmetic | Calculation errors stated confidently | Tax math, unit conversion, code metrics | Code interpreter / calculator tool
Temporal | Stale facts post-cutoff or wrong "as of" dates | "The current CEO is..." (changed last month) | Date-aware retrieval · explicit cutoff
Identity | Confuses similar entities | Mixes up two people with similar names | Entity disambiguation · grounded retrieval
Code (package hallucination) | Invents libraries, APIs, function signatures | import foobar (doesn't exist) | Static check · linter · sandbox run
Schema | Returns JSON that violates required schema | Wrong types, missing fields, hallucinated keys | Constrained decoding · validation + retry
Most dangerous in production: reference hallucinations (legal, medical, scientific apps) and code hallucinations (supply-chain risk — "slopsquatting" attacks register fake packages models suggest). Defend explicitly.
03 · Root Causes
Root Causes — Why LLMs Hallucinate
Hallucinations have at least nine distinct causes. Each one needs a different mitigation.
CAUSE 1
Training objective
Next-token cross-entropy rewards plausibility, not truth. Historically, training and evaluation scored "I don't know" no better than a wrong answer, so guessing paid off; newer post-training recipes increasingly reward calibrated abstention.
CAUSE 2
Training data noise
The web contains lies, satire, outdated info, contradictions. Models memorize the distribution, not a fact-checked encyclopedia.
CAUSE 3
Knowledge cutoff & staleness
Any fact that changed after training is unknown. Without retrieval, the model fills the gap with confident guesses.
CAUSE 4
Long-tail facts
Rare entities (a niche library, a small company, a specific patient ID) are under-represented; the model interpolates from neighbors.
CAUSE 5
Prompt ambiguity
Vague or under-specified questions force the model to assume. Assumption + confidence = fabrication.
CAUSE 6
Decoding stochasticity
High temperature, top-p sampling, and exposure bias amplify drift across long generations.
CAUSE 7
Sycophancy / RLHF artifacts
Models trained on human preference can favor confident, agreeable answers over correct uncertain ones.
CAUSE 8
Context drift
In long conversations, the model forgets or compresses prior turns and may invent continuity.
CAUSE 9
Tokenization & arithmetic
Numbers, dates, and code identifiers tokenize awkwardly; subword splits cause silent errors.
"Language models are designed to be plausible. Truth is an emergent property — and only when we engineer it to be."— Engineering folklore, 2025
04 · The Stack
The Defense-in-Depth Stack
No single technique solves hallucination. Production systems layer 4–7 defenses. Below is the canonical stack.
Figure 2 — The 7-layer defense-in-depth stack. Each layer cuts the residual error of the layer below.
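Rule of thumb: each well-implemented layer roughly halves the residual hallucination rate. If that held for five independent layers, you would see 2^5 = 32× fewer hallucinations than a bare LLM call. In practice layers catch overlapping errors, so treat 32× as an upper bound.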
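As a back-of-the-envelope check on that rule of thumb, the sketch below multiplies per-layer reductions together. It assumes each layer removes an independent fraction of the remaining errors, which real stacks only approximate; the function name and the 20% baseline are illustrative, not measured values.

```python
def residual_rate(baseline: float, reductions: list[float]) -> float:
    """Hallucination rate left after applying each layer's fractional reduction.

    Assumes the layers act independently, which is an idealization.
    """
    rate = baseline
    for r in reductions:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reduction must be between 0 and 1")
        rate *= 1.0 - r
    return rate


# Five layers that each halve the error leave 0.5 ** 5 = 1/32 of the baseline.
five_layers = residual_rate(0.20, [0.5] * 5)
improvement = 0.20 / five_layers
```

Plugging in measured per-layer reductions from your own evals gives a more honest estimate than assuming every layer halves the rate.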
05 · L2 Grounding
Retrieval-Augmented Generation (RAG)
If the answer must come from your data, the model must see your data at inference time. RAG is the single highest-leverage technique against hallucination.
Figure 3 — The full RAG pipeline: ingestion, query-time retrieval, and post-generation verification.
Ingestion best practices
Chunk by structure (headings, paragraphs), not fixed length
200–800 tokens per chunk; overlap 10–20%
Store rich metadata: source, section, date, author, ACL
De-duplicate near-duplicates with MinHash/LSH
Pre-compute summaries or hypothetical questions for each chunk
Pick embedding model per language & domain; benchmark on your data
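A minimal sketch of structure-aware chunking with overlap, under a few stated assumptions: paragraphs are separated by blank lines, whitespace-separated words stand in for tokens, and `chunk_by_structure` is a hypothetical helper rather than a library function. A real pipeline would also split on headings and use the embedding model's tokenizer.

```python
import re


def chunk_by_structure(text: str, max_words: int = 300, overlap_words: int = 40) -> list[str]:
    """Pack whole paragraphs into chunks and carry a tail of words forward as overlap.

    A single paragraph longer than max_words is kept intact, so it can
    produce an oversized chunk; split such paragraphs further if needed.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap_words:] if overlap_words else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The defaults give roughly 13% overlap, inside the 10–20% band suggested above.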
Retrieval best practices
Hybrid search: dense (vector) + sparse (BM25) outperforms either alone
Rerank with a cross-encoder (e.g., Cohere Rerank, bge-reranker)
Confidence threshold: if top score < X → "I don't know"
Context limit: top 4–10 chunks; quality > quantity (lost-in-the-middle)
# Minimal RAG pseudocode (model-agnostic); helpers such as embed, rerank and llm are placeholders
def answer(question, k=8):
    q_emb = embed(question)
    dense = vector_db.search(q_emb, k=k)
    sparse = bm25.search(question, k=k)
    fused = reciprocal_rank_fusion(dense, sparse)
    top = rerank(question, fused)[:5]
    # Abstain when retrieval returns nothing, or nothing relevant enough.
    if not top or top[0].score < THRESHOLD:
        return {"answer": "I don't have enough information."}
    context = format_with_citations(top)
    prompt = SYSTEM + f"""
Use ONLY the context below. Cite sources as [#].
If the answer is not in the context, say so.
Context: {context}
Question: {question}
"""
    text = llm(prompt, temperature=0.2)
    # L4 verification: an explicit check, since assert is stripped under python -O.
    if not citations_valid(text, top):
        return {"answer": "I don't have enough information."}
    return {"answer": text}
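The `reciprocal_rank_fusion` step in the pseudocode can be sketched as follows. This version fuses plain lists of document IDs rather than the result objects the pseudocode passes around, which is a simplification; `k=60` is the commonly used default constant, and the sample IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]    # hypothetical vector-search ranking
sparse = ["b", "d", "a"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion(dense, sparse)
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main reason this fusion is popular for hybrid search.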
06 · Faithfulness
Grounding & Citations
RAG only helps if the model actually uses the retrieved text. Grounding makes that explicit, auditable, and verifiable.
RULE
"Only use the context"
Hard-line the model in the system prompt:
# System prompt fragment
You answer ONLY from the provided context.
If the context does not contain the answer,
reply: "I don't have that information."
Never use prior knowledge. Cite sources as [n].
RULE
Inline citations
Require every claim to be tagged with the source chunk ID. Validate post-generation:
Every [n] resolves to a real retrieved chunk
The cited chunk contains text supporting the claim (NLI check)
Reject / regenerate if any citation fails
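The first of those checks is cheap to automate. The sketch below assumes citations appear as `[n]` markers numbered from 1 in the order the chunks were passed to the model; `check_citations` is a hypothetical helper, and it covers only marker resolution, not the NLI support check.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, num_chunks: int) -> dict:
    """Check that every [n] marker points at a retrieved chunk (1..num_chunks)
    and that the answer cites at least one source."""
    cited = [int(n) for n in CITATION.findall(answer)]
    dangling = sorted({n for n in cited if not 1 <= n <= num_chunks})
    return {
        "cited": sorted(set(cited)),
        "dangling": dangling,
        "ok": bool(cited) and not dangling,
    }


report = check_citations("Refunds take 5 days [1]. Fees are waived [4].", num_chunks=3)
```

Here `report["dangling"]` lists the markers that resolve to nothing, which is the signal to reject or regenerate.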
RULE
Provenance metadata
Return for every claim: source URL, last-updated date, author, confidence. Make the answer auditable end-to-end.
RULE
Abstain gracefully
An "I don't know" with a suggested next step costs far less user trust than a confident wrong answer. Tune your prompts and your evals to reward abstention.
Frontier feature: Modern APIs (Claude, GPT, Gemini) now offer built-in citations and grounded-generation modes. Use them where available — they reduce custom verification work.
07 · L1 Prompting
Prompt Engineering for Truthfulness
Even with RAG, prompt design controls how the model handles uncertainty, ambiguity, and edge cases.
High-leverage prompt patterns
Role + constraints: define expertise, scope, refusal rules
Explicit "I don't know" license: give the model permission to abstain
Step-back prompting: ask abstract question first, then specific
Chain-of-thought: "think step by step" reduces logic errors
Few-shot grounded examples: show the citation format
Negative examples: include what not to do
Temperature 0–0.3 for factual; 0.7+ for creative
Self-critique step: "Now check your answer against the sources."
Anti-patterns that cause hallucination
"You are an expert in everything" → invites overconfidence
Vague tasks like "tell me about X" without constraints
Mixing instructions with untrusted user input (prompt injection)
Asking for exact numbers without giving the source
Long lists of "always do X, never do Y" — the model forgets
Implicit assumptions ("write the code") with no schema
# A robust grounded-QA system prompt template
You are a careful research assistant.
Rules:
1. Answer ONLY using the CONTEXT below.
2. Each claim must end with a citation like [3].
3. If the context is insufficient, reply exactly:
"I don't have enough information to answer that."
4. Never invent URLs, names, numbers, or dates.
5. If asked about something outside the context, decline.
Output format:
<answer>
Concise answer with [n] citations.
</answer>
<sources>
[1] short title — URL
[2] ...
</sources>
CONTEXT:
{retrieved_chunks}
QUESTION: {user_question}
08 · L3 Generation
Structured Outputs & Constrained Decoding
If the output shape is wrong, downstream code crashes. Constrained decoding makes structurally invalid outputs impossible, not merely unlikely. It guarantees the shape, not that the values inside it are true.
TECHNIQUE
JSON Schema / Tool Schema
Define the expected output as a JSON Schema. Several current APIs offer structured-output modes that enforce it by masking any token that would break the schema.
TECHNIQUE
Grammar-constrained
llama.cpp, Outlines, Guidance: restrict token sampling to a formal grammar (a regex, or a context-free grammar in BNF-style notation).
TECHNIQUE
Type-safe parsers
Pydantic, Zod, Instructor: parse → validate → retry on failure. Catch schema hallucinations before they ship.
# Example: schema-enforced output with escalation on low confidence.
# `llm`, `prompt` and `escalate_to_human` are placeholders for your own client and handlers.
from pydantic import BaseModel, Field

class Answer(BaseModel):
    answer: str
    confidence: float = Field(ge=0, le=1)  # self-reported, in [0, 1]
    sources: list[str]
    is_uncertain: bool

result = llm.generate(prompt, response_format=Answer)  # schema-enforced
if result.confidence < 0.6 or result.is_uncertain:
    escalate_to_human(result)
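The parse → validate → retry loop mentioned under type-safe parsers can be sketched without any dependency, as below. This is an illustration under assumptions: `validate_answer` is a hand-written stand-in for what a Pydantic or Zod model would do, and `call_model` is a hypothetical callable that takes a prompt and returns the raw model text.

```python
import json
from typing import Callable


def validate_answer(raw: str) -> dict:
    """Parse JSON and enforce the Answer shape; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    expected = {"answer": str, "confidence": (int, float), "sources": list, "is_uncertain": bool}
    if set(data) != set(expected):  # catches missing fields and hallucinated keys
        raise ValueError(f"keys must be exactly {sorted(expected)}")
    for key, typ in expected.items():
        if not isinstance(data[key], typ):
            raise ValueError(f"{key} has the wrong type")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(data["confidence"], bool) or not 0 <= data["confidence"] <= 1:
        raise ValueError("confidence must be a number in [0, 1]")
    return data


def generate_validated(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-prompt with the error until valid or out of attempts."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        try:
            return validate_answer(raw)
        except ValueError as err:
            attempt_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {err}\n"
                "Return only valid JSON."
            )
    raise RuntimeError(f"no schema-valid output after {max_attempts} attempts")
```

Feeding the validation error back into the retry prompt tends to work better than repeating the same prompt, and the hard cap keeps a stubborn failure from looping forever.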
Return enough context that the model can reason about results
# Tool spec (Anthropic-style; OpenAI and Gemini take a similar JSON Schema under "parameters" instead of "input_schema")
{
"name": "lookup_customer",
"description": "Fetch a customer by ID from the CRM.",
"input_schema": {
"type": "object",
"properties": {
"customer_id": {"type": "string", "pattern": "^c_[a-z0-9]+$"}
},
"required": ["customer_id"]
}
}
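Even with a schema in the tool spec, it is worth re-checking arguments on the server before touching the CRM, since the model's tool call is untrusted input. The sketch below mirrors the `customer_id` pattern from the spec; `validate_lookup_args` is a hypothetical helper, and rejecting extra keys is a stricter choice than the schema itself requires.

```python
import re

CUSTOMER_ID = re.compile(r"^c_[a-z0-9]+$")  # same pattern as in the tool schema


def validate_lookup_args(args: dict) -> str:
    """Return the customer_id if the arguments satisfy the schema, else raise ValueError."""
    unexpected = set(args) - {"customer_id"}
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    customer_id = args.get("customer_id")
    # fullmatch avoids the trailing-newline leniency of "$" with match().
    if not isinstance(customer_id, str) or not CUSTOMER_ID.fullmatch(customer_id):
        raise ValueError("customer_id must match ^c_[a-z0-9]+$")
    return customer_id
```

A failed check can be returned to the model as a tool error so it can correct the call instead of inventing a result.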
10 · L4 Verification
Verifier & Critic Models
A second model — or the same model in a different role — reads the output and checks it before it reaches the user.
METHOD
NLI / Entailment check
For every claim in the answer, ask: does the source context entail this? (DeBERTa-NLI, Vectara HHEM, or a custom model.)
METHOD
LLM-as-judge
Prompt a strong model to score faithfulness, relevance, and completeness on a 0–5 scale. Use the same rubric for offline evals and as a runtime gate.
METHOD
Self-check / Reflexion
Ask the generating model: "Re-read your answer. Identify any unsupported claim." Then regenerate or strip.
METHOD
Citation existence check
Resolve each citation: does the URL respond 200? Does the DOI exist? Does the package name exist on PyPI/npm?
METHOD
Cross-model consensus
Ask 2–3 different models the same grounded question. Disagreement → escalate or abstain.
METHOD
Numeric & code sanity
Re-execute math/code in a sandbox; check totals, types, returns. Mismatch → retry.
# Faithfulness check via NLI
for claim in extract_claims(answer):
    # A claim is supported if at least one retrieved chunk entails it.
    supported = any(
        nli_model.entails(premise=chunk.text, hypothesis=claim)
        for chunk in retrieved_chunks
    )
    if not supported:
        log_hallucination(claim)
        answer = strip_or_regenerate(answer, claim)
Latency note: verifiers add 100–500ms. Run them in parallel with streaming, or asynchronously for low-risk content (then correct in a follow-up message if needed).
11 · Reasoning
Self-Consistency, CoT & Reasoning Models
Many "factual" errors are actually reasoning errors. Reasoning-time techniques are the cheapest accuracy gain you can buy.
Techniques
Chain-of-Thought (CoT): "Think step by step" — exposes reasoning, often surfaces its own bugs
Self-Consistency: sample N answers, take the majority vote
Tree of Thoughts: branch and evaluate multiple paths
Reflexion: model critiques its own answer, then revises
Self-RAG: model decides when to retrieve, then critiques
Reasoning models (Claude with extended thinking, OpenAI o-series, Gemini thinking models): native long-form chain-of-thought with self-verification, which can significantly reduce hallucination on complex tasks
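Self-consistency from the list above is simple to sketch: sample several answers and keep the majority, abstaining when agreement is weak. The `sample` callable, the normalization by lowercasing, and the 0.6 agreement threshold are illustrative assumptions; free-form answers usually need a stronger normalization step before voting.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample: Callable[[], str], n: int = 5, min_agreement: float = 0.6
) -> Optional[str]:
    """Sample n answers and return the majority answer, or None (abstain)
    when the top answer's share of the votes is below min_agreement."""
    if n < 1:
        raise ValueError("n must be at least 1")
    votes = Counter(sample().strip().lower() for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer if count / n >= min_agreement else None
```

Returning `None` on weak agreement turns disagreement between samples into an abstention signal rather than a coin flip.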
When to use which
Simple lookup: direct prompt, no CoT
Multi-step: CoT
Math / logic: CoT + tool (calculator)
High-stakes (legal, medical, finance): reasoning model + verifier + human review
Real-time UX-sensitive: non-reasoning model + strong RAG, save reasoning for retry
2026 reality: combining a reasoning model with RAG can push hallucination below 1% on narrow, well-scoped grounded QA. The cost per query is higher, but for high-stakes flows the ROI is clear. Confirm the number on your own golden set before relying on it.
12 · Model
Fine-Tuning, RLHF & DPO for Truthfulness
When prompting + RAG aren't enough, change the model. Fine-tuning teaches behavior; preference optimization teaches taste.
Approach | What it does | Data needed | Best for
SFT (supervised fine-tuning) | Imitate ideal answers | 1k–50k examples | Format, tone, domain vocabulary
Instruction tuning | Generalize to instructions | 10k+ varied tasks | Base model → assistant
RLHF (reinforcement learning from human feedback) | Learn human preference | 10k+ preference pairs | Polish, helpfulness
DPO / IPO / KTO | Same as RLHF, no RL loop | Pairs (chosen, rejected) | Cheaper alignment
RLAIF | AI-generated preference | Synthetic pairs + judge | Scale where humans are slow
Truthfulness fine-tuning | Reward abstention over confident-wrong | Curated I-don't-know examples | Reduce confident hallucinations
Distillation | Small model imitates a larger one | Teacher outputs | Cost reduction, latency
Trap: fine-tuning on hallucinated synthetic data amplifies hallucination. Curate carefully; include explicit "abstain" examples; evaluate on held-out factuality benchmarks before shipping.
13 · Agents
Agentic Patterns & Multi-Step Reliability
In agent loops, errors compound. At 95% per-step accuracy, a 10-step run succeeds end to end only about 60% of the time (0.95^10 ≈ 0.599, assuming independent steps). Reliability engineering matters more than raw intelligence.
Reliability patterns
Plan-then-execute: separate planning from action
ReAct loop: reason → act → observe → re-reason
Critic / Reviewer agent: second model audits each step
Bounded retries: max N attempts, then escalate
Checkpointing: persist state; resume on failure
Tool whitelisting: agents can only call approved tools
Cost & loop budget: hard cap on tokens, steps, time
Human-in-the-loop on irreversible actions
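Bounded retries with escalation, one of the patterns above, might look like the sketch below. `step` and `is_valid` are hypothetical callables standing in for a tool or model call and its critic check, and `EscalationRequired` is an illustrative exception name.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class EscalationRequired(Exception):
    """Raised when a step keeps failing and a human or fallback path should take over."""


def run_with_bounded_retries(
    step: Callable[[], T], is_valid: Callable[[T], bool], max_attempts: int = 3
) -> T:
    """Run a step until its result passes the check, at most max_attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            result = step()
        except Exception as exc:  # the tool or model call itself failed
            last_error = exc
            continue
        if is_valid(result):
            return result
    # Out of attempts: stop looping and hand off instead of guessing.
    raise EscalationRequired(f"gave up after {max_attempts} attempts") from last_error
```

Raising a dedicated exception keeps the escalation decision in one place, so the caller can route it to a human queue or a safe default.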
Multi-agent failure modes
Echo chambers: agents agree with each other's hallucinations
Context bleed: noise from one agent pollutes others
Infinite loops: tool returns trigger same call again
Goal drift: agent "rewrites" the goal mid-run
Tool spam: calling expensive APIs unnecessarily
Prompt injection: retrieved content overrides system prompt
Figure 4 — A reliable ReAct loop with a critic step and a bounded retry — the minimum viable agent pattern.
14 · L5 Guardrails
Guardrails & Safety Layers
Hallucination, prompt injection, and policy violations share defenses. Treat the LLM as untrusted code and wrap it in filters.
FRAMEWORK
Vectara HHEM
Hughes Hallucination Evaluation Model: public leaderboard comparing top models on faithfulness.
FRAMEWORK
TruthfulQA / SimpleQA / FActScore
Open-domain factuality benchmarks. Useful for model selection and regression testing.
FRAMEWORK
DeepEval / promptfoo / OpenAI Evals
Test-driven eval frameworks. Pin behavior in CI; gate deploys on score drop.
FRAMEWORK
LangSmith / Braintrust / Helicone
Trace + eval observability. Run evals on production traffic samples.
FRAMEWORK
Custom golden set
50–500 examples specific to your domain. The single most valuable eval asset you can build.
Rule I ship by: nothing goes out without an eval suite in CI. Every prompt or model change → diff scores before merge.
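A deploy gate of that kind can be as small as the sketch below. It assumes you already have per-example faithfulness scores from a golden-set run and a stored baseline from the main branch; `faithfulness_gate` and the 0.02 tolerance are illustrative choices, not part of any eval framework.

```python
def faithfulness_gate(
    scores: list[float], baseline: float, max_drop: float = 0.02
) -> tuple[bool, float]:
    """Return (passed, mean_score); fail when the mean falls more than
    max_drop below the stored baseline."""
    if not scores:
        raise ValueError("golden set produced no scores")
    mean_score = sum(scores) / len(scores)
    return mean_score >= baseline - max_drop, mean_score


passed, mean_score = faithfulness_gate([0.90, 0.95, 1.00], baseline=0.93)
```

In a test runner this reduces to asserting `passed`, with `mean_score` printed in the failure message so the diff against the baseline is visible in the CI log.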
16 · L6 Observability
Observability & Monitoring in Production
Even with perfect offline evals, real traffic surfaces new failure modes. Treat AI like distributed systems: trace everything, alert on drift.
What to log per request
Full prompt (system + user + retrieved context)
Model, version, parameters, seed
Retrieval candidates with scores
Tool calls, args, results
Final output + intermediate steps
Verifier scores + abstention decisions
User feedback (thumbs, edits, regenerate)
Latency & cost breakdown
What to alert on
Faithfulness score drops > X% week-over-week
Abstention rate spikes (retrieval broken?)
Citation-fail rate increases
p95 latency > SLO
Cost per request grows unexpectedly
Output schema validation failures
Guardrail trips (injection, policy)
Negative user feedback rate
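The first alert in that list, a week-over-week drop in faithfulness, can be computed from daily averages as in the sketch below. The 5% threshold and the function names are assumptions for illustration; the input is taken to be one mean score per day, oldest first.

```python
from statistics import mean


def week_over_week_drop(daily_scores: list[float]) -> float:
    """Relative drop of the latest 7-day mean versus the 7 days before it.

    Positive values mean quality fell. Expects at least 14 daily values, oldest first.
    """
    if len(daily_scores) < 14:
        raise ValueError("need at least 14 daily scores")
    previous = mean(daily_scores[-14:-7])
    current = mean(daily_scores[-7:])
    if previous == 0:
        raise ValueError("previous-week mean is zero; relative drop is undefined")
    return (previous - current) / previous


def should_alert(daily_scores: list[float], threshold: float = 0.05) -> bool:
    """Fire when the relative week-over-week drop exceeds the threshold."""
    return week_over_week_drop(daily_scores) > threshold
```

Comparing weekly means rather than single days smooths out traffic-mix noise, at the cost of reacting a few days later to a sudden regression.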
Closed loop: Sample 1–5% of production traffic, label it (LLM-judge + occasional human), feed it back into your eval set and your fine-tuning data. This is how systems get reliably better over months.
17 · Architecture
Reference Architectures
Three battle-tested architectures, ordered by use case complexity.
Figure 5 — Three reference architectures ordered by stakes: simple grounded QA · agentic · high-stakes with human-in-the-loop.
18 · Choose
Decision Tree by Use Case
When you don't know which technique to reach for, ask these questions in order.
Figure 6 — Decision tree: from question to recommended defense layers.
19 · Field Manual
Developer Checklist
Print this. Run through before every AI feature ships.
Pre-launch
Defined "hallucination" in your domain (what counts?)
Communicate uncertainty as clearly as the model should.
Build systems you trust to leave alone over the weekend.
"A good AI system is not one that never makes mistakes. It is one that fails loudly, recovers gracefully, and improves measurably."
— Linh Truong
I treat hallucination like partial failure in distributed systems: you do not wish it away—you engineer around it. RAG supplies evidence; structured outputs supply shape; tools supply precision; verifiers supply audit; evals supply feedback; observability supplies trust. Stack those layers and you get systems I am willing to leave running over a weekend.
21 · Sources
References & sources
Annotated bibliography behind the hallucination taxonomy, defense-in-depth stack, RAG recipes, verification patterns, eval metrics, and tooling tables in this note. Section tags (e.g. §5) show where each source is used. Prefer primary papers and official docs for benchmark numbers—model leaderboard scores move weekly.
Scope. Synthesis of peer-reviewed work, industry technical reports, and widely deployed open tools (May 2026). KPI bands in §1 (e.g. ~3–27% hallucination rate, ~50% reduction with RAG) are directional planning figures from the cited benchmarks and operator experience—not guaranteed production outcomes. Not legal or compliance advice.
Citations are numbered continuously [1]–[n] within this section.
Ji et al., “Survey of Hallucination in Natural Language Generation.” ACM Computing Surveys 2023. Taxonomy of intrinsic vs. extrinsic hallucination. arxiv.org/abs/2202.03629 — §1–2.
Zhang et al., “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.” 2023. LLM-era failure modes and mitigations. arxiv.org/abs/2309.01219 — §1–3.
Maynez et al., “On Faithfulness and Factuality in Abstractive Summarization.” ACL 2020. Faithfulness vs. factuality distinction in §2 table. arxiv.org/abs/2005.00661 — §2.
Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2023. Model-collapse risk when synthetic data loops. arxiv.org/abs/2305.17493 — §3.
Kadavath et al., “Language Models (Mostly) Know What They Know.” 2022. Calibration and abstention background. arxiv.org/abs/2207.05221 — §3, §15.
Menick et al., “Teaching language models to support answers with verified quotes.” 2022. Grounded QA with evidence spans. arxiv.org/abs/2203.11147 — §6.
Gao et al., “RARR: Researching and Revising What Language Models Say, Using Language Models.” 2023. Post-hoc claim verification. arxiv.org/abs/2210.08726 — §6, §10.
Rashkin et al., “Measuring Attribution in Natural Language Generation Models.” 2021. Attributable-to-source metrics. arxiv.org/abs/2112.12870 — §6, §15.
Bohnet et al., “Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models.” 2022. Citation-quality evaluation. arxiv.org/abs/2212.08037 — §6.
Prompting, abstention & uncertainty (§7)
Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality instruction data for truthful tone. arxiv.org/abs/2305.11206 — §7.
Miao et al., “SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning.” 2023. Self-verification prompts. arxiv.org/abs/2308.00436 — §7, §10.
Lin et al., “Teaching Models to Express Their Uncertainty in Words.” 2022. Verbalized confidence for abstention UX. arxiv.org/abs/2205.14334 — §7.
Structured outputs & constrained decoding (§8)
Willard & Louf, “Efficient Guided Generation for Large Language Models” (Outlines). 2023. Grammar-constrained decoding. arxiv.org/abs/2307.09702 — §8.
Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023. Tool-learning foundation. arxiv.org/abs/2302.04761 — §9.
Qin et al., “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” 2023. API grounding at scale. arxiv.org/abs/2307.16789 — §9.
Google DeepMind, “Gemini API function calling.” Parallel tool-use spec. ai.google.dev — §9.
Verification, NLI & faithfulness models (§10)
Williams et al., “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference” (MultiNLI). 2018. NLI entailment checks in §10 pseudocode. arxiv.org/abs/1704.05426 — §10.
Min et al., “FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation.” 2023. Atomic claim verification. arxiv.org/abs/2305.14251 — §10, §15.
Vectara, “Hughes Hallucination Evaluation Model (HHEM)” and public leaderboard. Grounded-QA hallucination rates in §1 KPIs. github.com/vectara/hallucination-leaderboard — §1, §15.
Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. CoT pattern in §11. arxiv.org/abs/2201.11903 — §11.
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Multi-sample voting. arxiv.org/abs/2203.11171 — §4, §11.
Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. Branch-and-evaluate reasoning. arxiv.org/abs/2305.10601 — §11.
Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Self-critique and revise loop. arxiv.org/abs/2303.11366 — §10–11, §13.
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool + reasoning agent loop. arxiv.org/abs/2210.03629 — §13.
Yang et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.” 2024. Multi-step reliability in code agents. arxiv.org/abs/2405.15793 — §13.
Fine-tuning, RLHF & preference optimization (§12)
Ouyang et al., “Training language models to follow instructions with human feedback.” NeurIPS 2022. RLHF / InstructGPT. arxiv.org/abs/2203.02155 — §12.
Rafailov et al., “Direct Preference Optimization.” NeurIPS 2023. DPO without reward model. arxiv.org/abs/2305.18290 — §12.
Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF for safety alignment. arxiv.org/abs/2212.08073 — §12, §14.
Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback alignment. arxiv.org/abs/2402.01306 — §12.
OWASP Foundation, “Top 10 for Large Language Model Applications.” Prompt injection and data-exfil risks. OWASP LLM Top 10 — §14, §19.
Greshake et al., “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Injection via untrusted RAG content. arxiv.org/abs/2302.12173 — §14.
Evaluation frameworks & benchmarks (§15)
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Faithfulness, relevance, context metrics in §15 table. arxiv.org/abs/2309.15217 — §15.
Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” ACL 2022. Open-domain factuality benchmark. arxiv.org/abs/2109.07958 — §15.
Wei et al., “Measuring Short-form Factuality in Large Language Models” (SimpleQA). OpenAI 2024. Short-answer factuality benchmark. openai.com — §15.
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. LLM-judge bias patterns in §15.3. arxiv.org/abs/2306.05685 — §15.
Liu et al., “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.” 2023. LLM-as-judge rubric design. arxiv.org/abs/2303.16634 — §15.
Helicone. LLM gateway logging and cost telemetry. docs.helicone.ai — §16.
OpenTelemetry, “Generative AI observability” semantic conventions (draft). Standard trace fields for LLM spans. opentelemetry.io — §16.
Inference stack & distillation (§8, §17–18)
Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. Serving layer in architecture diagrams. arxiv.org/abs/2309.06180 — §17–18.
Hinton et al., “Distilling the Knowledge in a Neural Network.” 2015. Teacher–student compression for smaller deploy models. arxiv.org/abs/1503.02531 — §18 (where distillation applies).
KPI strip (§1)
~3–27% hallucination rate comes from the Vectara HHEM grounded-QA leaderboard and related faithfulness benchmarks—not a single vendor guarantee. ~50%↓ with RAG is a typical band reported in RAGAS case studies and internal A/B tests when retrieval quality is strong. <1% with retrieval + verifier + structured I/O is an aspirational target on narrow, well-scoped domains. Re-measure on your golden set before you quote any number externally.