AI engineering · field notes · May 2026

The AI Engineer Notes and diagrams for building real products on foundation models—where I put the effort, and what usually breaks in production

Source / canonical copy: LinhTruong.com. If you share this file, link there so attribution stays with the document.

Audience: Software engineers, tech leads, architects  ·  Scope: LLMs, RAG, agents, evals, LLM ops, production rollout  ·  Revised: May 2026

Contents

  1. What is an AI Engineer?
  2. Why this role exists now
  3. SWE vs ML vs AI Engineer
  4. The modern AI engineering stack
  5. Mental models & fundamentals
  6. Prompting & context engineering
  7. Retrieval-Augmented Generation (RAG)
  8. Agents, tools & multi-agent systems
  9. Evals: the new unit test
  10. Observability & LLMOps
  11. Latency, cost & reliability
  12. Safety, security & governance
  13. Fine-tuning, distillation, post-training
  14. Reference architectures
  15. Day-to-day workflow
  16. 12-month learning roadmap
  17. Anti-patterns & failure modes
  18. 2026 trends & what's next
  19. Closing
  20. References & sources

1 · What is an AI Engineer?

An AI Engineer is a software engineer whose primary craft is building applications on top of foundation models. They sit at the intersection of product engineering, applied ML, and systems design — but, unlike traditional ML engineers, they rarely train models from scratch. Instead, they compose pre-trained models, retrieval, tools, and evaluation harnesses into reliable products.

Core mandate

Make probabilistic systems behave like products

Turn a non-deterministic model API into a system that meets latency, cost, accuracy, and safety SLOs — every single deploy.

Primary outputs

Pipelines, prompts, evals, guardrails

Not model weights. The deliverable is a versioned system: prompts + retrieval + tools + policies + tests + monitors.

Posture

Empiricist, not theorist

Hypothesize → run an eval → measure → ship. Reading benchmarks ≠ knowing how a model behaves on your data.

Software Engineering APIs · systems · testing · DevOps Applied ML / Data embeddings · evals · stats Product & UX latency · trust · UX of uncertainty AI Engineer composes foundation models into reliable products
Figure 1. The AI Engineer sits at the convergence of three disciplines.

2 · Why this role exists now

Three structural shifts created a permanent role between SWE and ML:

  1. Capability is rented, not trained. Frontier models (Claude, GPT, Gemini, Llama) are accessible through APIs. The bottleneck moved from training to integrating.
  2. The interface is natural language. The hardest part is no longer math — it is specifying intent, retrieving the right context, and evaluating fuzzy output.
  3. Production is probabilistic. Classical SWE assumes deterministic functions. LLM systems require new disciplines: evals, guardrails, fallbacks, and observability designed for distributions of behaviour.

Talent gap

Teams I work with rarely lack someone who can train a model; they lack someone who will own prompts, retrieval, eval harnesses, and guardrails end to end. Most shipped AI is integration and measurement, not fresh pre-training.

Foundation Models Tool Use RAG Evals Guardrails

3 · Software Engineer vs ML Engineer vs AI Engineer

Dimension Software Engineer ML Engineer AI Engineer
Primary artifactDeterministic codeTrained models & pipelinesPrompted/orchestrated systems on top of pre-trained models
Math floorAlgorithms, DSLinear algebra, probability, optimizationProbability intuition, vector geometry, basic stats for evals
Core toolingGit, CI/CD, REST/RPC, DBsPyTorch/JAX, MLflow, GPUs, feature storesLLM SDKs, vector DBs, eval frameworks, tracing, orchestration
"Test" meansUnit/integration testsTrain/val/test split, AUC, F1Eval sets, LLM-as-judge, rubrics, regression suites
Failure modeCrash / wrong outputDrift / poor generalizationHallucination, prompt injection, silent quality drop
Iteration speedMinutesHours–days (training)Minutes (prompts) → days (eval-driven)
Owns in prodLatency, correctnessModel performance, retrainingEnd-to-end behaviour: quality + cost + safety + UX

4 · The modern AI engineering stack

Think of the stack as eight layers. Master one at a time, top-down, then deepen as needed.

① Product & UX surface chat, copilots, autocomplete, agents in IDEs / docs / dashboards · streaming UX · trust signals · citations ② Orchestration & agent runtime prompt graphs · tool calling · planners · memory · sub-agents · Claude Agent SDK, LangGraph, in-house ③ Context layer — retrieval, memory, structured tools vector DB · BM25 · re-rankers · graph/SQL retrieval · function/tool schemas · MCP servers ④ Model layer — foundation + specialised Claude / GPT / Gemini / Llama · routing · prompt caching · embedding models · re-rankers · OCR/ASR ⑤ Evaluation & quality offline eval sets · LLM-as-judge · rubrics · A/B · regression suites · human review queues ⑥ Safety, security & guardrails prompt-injection defence · PII redaction · output filtering · policy / RBAC · audit logs ⑦ LLMOps — observability, cost, deploy traces · token spend · latency SLOs · prompt versioning · canary releases · feature flags ⑧ Foundations — data, infra, compliance data contracts · privacy (SOC2/GDPR/HIPAA) · GPUs/inference infra (only if self-hosting) · IAM
Figure 2. The 8-layer AI engineering stack — read top-down for product impact, bottom-up for reliability.

Reference tools per layer (representative, not exhaustive)

LayerRepresentative tools / services (2026)
Model APIsAnthropic Claude (Opus 4.7 / Sonnet 4.6 / Haiku 4.5), OpenAI GPT, Google Gemini, Meta Llama, Mistral, Cohere
Orchestration / AgentsClaude Agent SDK, LangGraph, LlamaIndex, DSPy, Inngest, Temporal (durable execution)
Retrievalpgvector, Qdrant, Pinecone, Weaviate, Turbopuffer, Vespa, Elasticsearch, BM25
Embeddings & rerankersVoyage, Cohere Rerank, OpenAI text-embedding-3, BGE, ColBERT
EvalsBraintrust, LangSmith, Promptfoo, Inspect, Ragas, OpenAI Evals, in-house harnesses
ObservabilityHelicone, LangSmith, Langfuse, Arize, Datadog LLM Observability, Honeycomb
GuardrailsNVIDIA NeMo Guardrails, Guardrails AI, Lakera, Protect AI, custom classifiers
Self-hosted inferencevLLM, TGI, TensorRT-LLM, SGLang, Triton, Ollama (dev)
Fine-tuningLoRA/QLoRA, Unsloth, Axolotl, provider-hosted fine-tuning, RFT/DPO
ProtocolsMCP (Model Context Protocol) for tool/context interop · OpenAPI for HTTP tools

5 · Mental models & fundamentals every AI engineer needs

The model is a function over context

Conceptually: output = f(context, decoding). You don't get to change f at runtime — you only control context (system + user + retrieved + tools) and decoding (temperature, top-p, max-tokens, stop-sequences, JSON-mode).

Context is a budget, not a bucket

Bigger contexts ≠ better answers. Models attend unevenly, and cost/latency scale with tokens. Curate context like you curate a small slide deck for an executive: ruthless relevance over completeness.

Tokens, not characters

Pricing, limits, and quality are all token-denominated. Internalize "1 token ≈ 4 chars ≈ ¾ of an English word." Multi-lingual and code tokens are denser — measure, don't guess.

Embeddings are geometry

Similar meaning ≈ small angle. Cosine similarity is a noisy proxy for relevance. Always pair vector search with a re-ranker or BM25 for grounding-critical tasks.

Determinism is a choice you opt into

Set temperature=0, fix seed, freeze the model version, and pin prompts. Treat the model version like a dependency in package.json.

Failures are distributions, not exceptions

Your code doesn't "crash" — it degrades silently on 3% of inputs. Build evals that surface the tail, and dashboards that track quality, not just uptime.

6 · Prompting & context engineering

Prompt engineering has matured into context engineering: the discipline of assembling the smallest sufficient context for a model to act correctly. A useful taxonomy:

System / Role identity, capabilities, refusals, format rules "You are…" "Output JSON …" Retrieved Context documents, snippets, prior turns, memory chunked & ranked cite-able sources Tools / Schemas function signatures, JSON Schema args when to call how to recover User Turn + Hints the actual question, few-shot examples, structured input, "think step by step" Together these form the context window — every token is paid for in latency, cost, and attention.
Figure 3. The four ingredients of a well-engineered context window.

Techniques worth knowing cold

Anti-pattern: stuffing the entire knowledge base into the prompt because the context window is large. Bigger context degrades attention, balloons cost, and hides where the answer actually came from. Retrieve, don't dump.

7 · Retrieval-Augmented Generation (RAG)

RAG is the default architecture for grounding LLMs in private or fresh knowledge. The deceptively-simple promise — "search then generate" — hides a dozen design decisions.

OFFLINE — INDEXING Sources docs · DBs · APIs Loaders + clean parse · OCR · dedupe Chunk + enrich semantic chunks · metadata Embed embedding model Vector + lexical index pgvector · Qdrant · BM25 ONLINE — QUERY TIME User query raw question / state Query rewrite decompose · HyDE Hybrid retrieve vector + BM25 + filters Rerank cross-encoder · top-k Assemble context dedupe · cite · token budget LLM generation prompt + context Verify + cite grounded? quote source? Respond + log stream · trace · eval-tag
Figure 4. A production RAG pipeline — offline indexing (top) and online retrieval + generation (bottom).

RAG design decisions, and their usual answers in 2026

DecisionTypical answerWhen to deviate
Chunk size300–800 tokens, with 10–20% overlapLong-form prose: larger; code/structured: smaller, by symbol
Chunking strategySemantic / heading-aware over fixed-sizeTables, code, schemas — chunk by structure
RetrievalHybrid (vector + BM25) + metadata filtersHighly entity-driven domains may need graph or SQL retrieval
Re-rankerYes — cross-encoder on top 50 → keep 5–10Latency-critical paths can skip if recall is high
Cite sourcesAlways for user-facing factual answersInternal pipelines can skip; logs should still record provenance
FreshnessIncremental indexing + soft TTL on chunksCompliance domains: hard expiration + audit trail
EvalRecall@k, faithfulness, answer correctness — judged offlineHigh-volume products: add online click/feedback signals

Beyond vanilla RAG

8 · Agents, tools & multi-agent systems

An agent is an LLM that runs in a loop, calling tools and observing results until a goal is satisfied. Agents are the natural extension of RAG: instead of retrieving once, they retrieve, act, verify, and replan.

Plan / Think LLM decides next step Act — call tool function · API · code · search Observe tool result → context Reflect / Verify progress? stop? replan? Agent loop bounded by step / token / cost budget
Figure 5. The canonical Plan → Act → Observe → Reflect agent loop.

Tool design — the single biggest determinant of agent quality

Make tools small, sharp, and named like a senior API

  • Each tool does one thing well. Compose, don't overload.
  • Names are the documentation: search_invoices_by_customer > db_query.
  • Use strict JSON schemas. Reject under-specified calls early.
  • Return small, structured results — not whole pages.

Make failures legible

  • Errors should explain how to fix the call: "missing field: customer_id (UUID)".
  • Truncate large outputs with a marker the agent understands.
  • Idempotency keys on mutating tools.
  • Always log tool args + outputs for replay.

Agent topologies

TopologyShapeUse it whenCaution
Single-agent loopOne LLM + toolsDefault. Covers ~80% of real-world cases.None — start here, always.
Planner + executorOne agent plans, another executesLong tasks where planning ≠ tool calling cleanly.Adds latency & failure surface.
Sub-agent delegationParent spawns specialised sub-agentsIndependent subtasks that can run in parallel.Don't fan out for the sake of it — costs multiply.
Swarm / multi-agent debateSeveral agents argue / voteOpen-ended creative or research tasks.Often less reliable than a single careful agent + verifier.
Human-in-the-loopAgent pauses for approval at risky stepsAnything irreversible (payments, deletes, sends).Design the UX of waiting — silence is a failure.
Heuristic: if a deterministic workflow would do the job, use one. Agents are appropriate when the path itself is unknown until you start. Most "agent" projects that fail were workflow problems in disguise.

The Model Context Protocol (MCP)

MCP is the emerging open protocol for exposing tools, resources, and prompts to any LLM client. Treat it the way you treat HTTP: a contract between agent runtimes and the systems they act on. Wrap your internal APIs in an MCP server once, and they become available to every model and IDE that speaks MCP.

9 · Evals: the new unit test

Evals are the deliverable. Without a pinned suite, you cannot tell whether yesterday's prompt tweak helped, regressed a corner case, or just moved noise around.

Eval set

A curated, version-controlled collection of inputs + expected behaviours. Grows from real production traces, bug reports, and red-team examples.

Scorer

How "good" is measured: exact match, regex, embedding similarity, code-executed test, rubric-graded LLM-as-judge, or human label.

Harness

The runner that pairs eval set × scorer × system version, produces metrics, and compares against baseline. Runs in CI on every prompt or code change.

The eval pyramid

Human review · red team A/B tests in production live traffic, real users LLM-as-judge rubrics graded scenarios in CI Deterministic unit evals exact-match · schema · code-run · regex
Figure 6. Stack cheap deterministic checks at the base; reserve human judgement for the top.

LLM-as-judge — the workhorse, with caveats

What to measure, by feature type

FeaturePrimary metricsGuardrail metrics
RAG Q&AFaithfulness, answer correctness, citation accuracyRefusal rate, latency p95, cost / query
SummarizationCoverage, factuality, conciseness (LLM-judge)Hallucinated-entity rate, length distribution
Code generationpass@k on hidden test suite, lint cleanBuild success, security-rule violations
Agents / tool useTask success rate, steps-to-successTool error rate, cost per task, runaway loops
Classification / extractionPrecision, recall, F1 on labelled setConfidence calibration, abstention rate

10 · Observability & LLMOps

Classical metrics miss too much in LLM stacks. I watch three extra primitives alongside normal service telemetry:

Traces

Tree of every prompt, retrieval, tool call, and sub-agent for a single user request. The flamegraph of LLM apps.

Token + cost ledger

Per-tenant, per-feature spend. Cost is a first-class SLO — track it on the same dashboard as latency.

Quality signals

Thumbs, retries, regenerate clicks, abandonment, judge scores on a sample of live traffic.

Production loop

Production traffic streaming + traced Logs & traces inputs · outputs · tools Triage + label bad cases → eval set Iterate offline prompt · retrieval · model Canary deploy % rollout + alerts
Figure 7. The LLMOps flywheel — every production failure becomes tomorrow's regression test.

11 · Latency, cost & reliability

Latency levers

  • Stream tokens to the UI; perceived latency is mostly time-to-first-token.
  • Cache stable prompt prefixes (system + tool defs + style guide).
  • Route easy queries to a smaller/faster model (Haiku-tier), escalate when uncertain.
  • Parallelize independent tool calls in agents.
  • Speculative decoding & structured output reduce tail latency.

Cost levers

  • Smaller models for classification, extraction, routing, summarization.
  • Aggressive prompt caching — often 50–90% input-token savings.
  • Trim retrieved context: re-rank then keep top-k, not top-50.
  • Batch async workloads through provider batch APIs (often ~50% off).
  • Pre-compute embeddings; don't re-embed unchanged content.

Reliability levers

  • Fallback model on provider failures — health-check both.
  • Retries with jitter, capped at sane budgets — never infinite.
  • Circuit breakers per tool / per provider.
  • Timeouts everywhere, including streaming response stalls.
  • Idempotency on mutating tool calls.

SLO suggestions

  • Time-to-first-token: p95 < 1.5s for chat.
  • End-to-end completion: p95 < 8s for chat, < 45s for agents.
  • Quality regression: alert on judge-score drop > 3 points week-over-week.
  • Cost per session: alert on > 2× rolling 7-day median.

12 · Safety, security & governance

LLM apps carry a different threat model than typical CRUD. When I review features, I walk the OWASP LLM Top-10 list explicitly—here are the rows that show up most in incident-style reviews:

ThreatWhat it looks likeMitigation
Prompt injectionUntrusted content tells the model to ignore instructions, exfiltrate data, or call tools.Treat all retrieved content as untrusted; separate roles; sanitize; limit tool permissions; least-privilege agents.
Data exfiltrationModel includes secrets from context in its output, or via a tool call to an attacker-controlled URL.Output filters, URL allowlists, secret scanners, no secrets in prompts, sandboxed tool execution.
Insecure tool useAgent calls a destructive tool on adversarial input.Human approval for irreversible ops, idempotency, strict schema validation, RBAC at tool level.
Supply chainCompromised model weights, embeddings, or third-party MCP server.Pin versions, verify hashes, vet providers, isolate untrusted MCP servers.
PII leakageLogs/traces contain user PII sent to third-party providers.PII detection + redaction at the boundary; data-processing agreements; regional routing.
JailbreaksUsers coax the model past its safety policies.Layered defenses: system prompt, classifier guardrails, content filters, abuse-rate-limits.
Hallucination as riskConfident wrong answers in regulated domains (medical, legal, financial).Citations, abstention, "I don't know" prompting, human-in-the-loop for high-stakes decisions.
Rule: the agent's effective permissions are the union of every tool it can call. Design tool surfaces as if they will be invoked with adversarial inputs — because they will be.

Governance checklist (enterprise)

13 · Fine-tuning, distillation & post-training

Most product problems are solved by better prompts, better retrieval, or better evals — not by training. But there is a real role for post-training when:

Spectrum of post-training

TechniqueWhat it changesEffortRight when
Prompting + few-shotInputs onlyMinutesAlways try first.
RAGInputs (with knowledge)DaysKnowledge gap, not skill gap.
SFT (LoRA/QLoRA)A small adapter on a base modelDays–weeksStyle/format/narrow skill; you have ≥ a few thousand high-quality examples.
Preference tuning (DPO / RFT)Model preferences via pairwise dataWeeksYou can grade outputs better than you can write them.
DistillationSmaller model mimics bigger one's outputsWeeksLatency/cost critical, frontier-quality optional.
Continued pre-trainingThe model's core weights, broadlyMonths, $$$Rarely justified outside foundation labs.
Order of operations: prompt → few-shot → retrieval → tools → eval-driven prompt rewrite → small-model routing → LoRA / DPO. Skipping ahead burns months.

14 · Reference architectures

14.1 The "Copilot for X" reference architecture

Client UI streaming · citations · undo API gateway auth · rate limit · tenants Orchestrator / Agent runtime policy · tool routing prompt + context assembly retries · fallbacks Retrieval service hybrid · rerank · filter tenant-scoped Tools / MCP servers CRUD APIs · search · code RBAC + audit Model router opus · sonnet · haiku · OSS prompt cache · fallback Embedding + rerank batched · cached multi-region Guardrails layer PII · injection · output filter Eval & trace pipeline spans · scores · regressions Cost & latency monitor per-tenant ledger · alerts Feedback store labels → eval set
Figure 8. A production copilot — a thin client, an orchestrator, three "shoulder" services (retrieval, tools, models), and a cross-cutting LLMOps plane.

14.2 Pattern catalogue

Single-shot LLM call

One prompt, one response, structured output. Use for extraction, classification, summarisation. The 80% solution to most "AI features".

RAG + cite

Retrieve → generate → cite. Default for any Q&A over private knowledge.

Tool-using agent

Loop with bounded steps and a verifier tool. Use when the path of action is not known in advance.

Router + specialist

A small classifier model routes to one of N specialist prompts/models. Cheap, fast, and easy to govern.

Workflow + LLM steps

A deterministic DAG with LLM nodes for specific subtasks. The right answer when "agent" feels like overkill.

Async / batch

Background jobs (summarise nightly, classify backlogs). Use batch APIs for ~50% cost; observability through job IDs.

15 · Day-to-day workflow of an effective AI engineer

  1. Define the task with a sample of 20 real examples. Not synthetic — real. If you can't find 20, you don't have a product yet, you have a hypothesis.
  2. Label the desired output for each. This is your initial eval set.
  3. Build the simplest baseline that could work. One prompt, one model, no retrieval. Measure.
  4. Add one capability at a time: retrieval → tools → verification → routing. Re-run evals after every change.
  5. Instrument from day one. Traces, token counts, latency, judge scores. You will need them next week.
  6. Ship behind a feature flag, canary at 1% → 10% → 50% → 100%, with quality + cost guardrail metrics.
  7. Mine production traces weekly. Every failure becomes an eval example. The eval set is alive.
  8. Re-evaluate model choice quarterly. Models get cheaper and better — your routing and prompt should follow.

16 · 12-month learning roadmap for software engineers

MONTH 1–2 Foundations tokens · embeddings prompting · JSON tools first chatbot w/ streaming read OWASP LLM Top-10 ship: a small internal Q&A bot MONTH 3–5 RAG + Evals chunking · hybrid search rerank · citations build first eval harness LLM-as-judge calibration ship: RAG with measurable quality MONTH 6–8 Agents + Tools tool design · MCP planner-executor patterns human-in-the-loop UX cost & latency tuning ship: an agentic workflow MONTH 9–12 Scale + Specialise LLMOps · observability guardrails & red teaming SFT / DPO / distillation pick a vertical specialty lead: a production AI feature
Figure 9. A pragmatic 12-month curriculum — each phase ends with a shipped artifact, not a certificate.

Per-phase recommended exercises

Phase 1 — Foundations

  • Write a tokenizer-aware "trim context" function for your stack.
  • Build a multi-turn chat with streaming, cancellation, and undo.
  • Force a model into strict JSON for a real internal task. Measure schema-violation rate.

Phase 2 — RAG + Evals

  • Index a real corpus you care about; compare chunking strategies on a 50-question eval set.
  • Add a re-ranker; show recall@k improvement.
  • Build CI that fails a PR if eval-mean drops >2 points.

Phase 3 — Agents + Tools

  • Take an existing internal workflow with ≤5 steps and replace it with an agent. Compare.
  • Wrap one internal API as an MCP server.
  • Add a verifier sub-agent; measure task-success uplift.

Phase 4 — Scale + Specialise

  • Route 30% of traffic to a smaller model via a learned router; track quality and cost delta.
  • Run a quarterly red-team exercise with documented findings.
  • Distil a frontier prompt into a small fine-tuned model for one high-volume task.

17 · Anti-patterns & failure modes

Anti-patternWhy it happensFix
Demoware in prodDemo prompt promoted without evals or guardrails.No ship without an eval set + SLOs + canary.
"Big context" everything"Models have 1M context, just dump it all."Retrieve, rank, trim. Token budget is a discipline.
Single-string mega-promptYears of patches accumulate in one file.Compose modular prompts; version them; cache the stable parts.
No-evals fast iteration"It seemed better." Vibes-driven engineering.Every change runs the eval suite. The number is the decision.
Vendor-locked codeSDK-specific abstractions leak everywhere.Thin internal interface for model calls; swap providers in a day.
Trusting LLM-as-judge blindlyIt looks like rigour; it isn't, unless calibrated.Calibrate to humans on 100+ examples; use pairwise; rotate judges.
Agents where workflows sufficeHype. Loops feel powerful.Start with a workflow; promote to an agent only when the path is data-dependent.
Unbounded tool permissions"It's just a prototype."RBAC at the tool layer; human approval on irreversible ops; audit logs.
Ignoring tail latencyp50 looks fine; p99 is 40 seconds.Stream, cache, parallelize, timeout, and alert on p95/p99.
No feedback loopUsers complain in Slack, not in the product.In-product 👍/👎, retry-as-signal, weekly trace review with PM.

18 · 2026 trends & what's coming next

Long-horizon, durable agents

Agents that run for hours / days, with explicit memory, checkpoints, and resumability — closer to Temporal / workflow engines than chatbots.

Computer / browser use

Models that operate UIs directly. Powerful for legacy systems with no APIs — but the highest-blast-radius surface to govern.

Open standards win

MCP for tools/context and OpenAPI for HTTP are becoming the lingua franca. Build to the protocol, not the SDK.

Reasoning models for everything hard

Math, code, planning, multi-step extraction. The cost gap closes; the quality gap on hard tasks does not.

Per-tenant fine-tuning

LoRA adapters per customer for style and domain. Cheap, fast, and a real moat for vertical SaaS.

Eval-as-a-product

Evals are sold, traded, and audited like test suites. Expect SOC2-style attestations of eval coverage in enterprise procurement.

On-device + edge inference

Small capable models (1–8B) on phones and laptops handle private/latency-critical paths; cloud reserved for hard cases.

Regulation hardens

EU AI Act enforcement, US sector-specific rules. The AI engineer's responsibility extends to documentation, audit, and lifecycle controls.

19 · Closing

How I tie a bow on reviews. Treat the model as a probabilistic dependency—same as any flaky upstream—and engineer the rest: context in, structured I/O, evals that regress bad behavior, and traces that show you when quality drifts. Nail that loop and you can swap model versions without scrambling.
8
stack layers to know
1
eval set: your single biggest asset
production traces to learn from

20 · References & sources

The diagrams and opinions in this note are mine; the factual and technical anchors below are where I send people who want primary material on transformers, RAG, agents, evals, security, and production practice. Entries are grouped so you can jump to what a section leaned on—papers for mechanisms, standards for shared vocabulary, regulators for compliance context.

Scope note: arXiv and conference versions suffice for citations; always check the published venue if you need a DOI for academic writing. Vendor APIs and model names change frequently—treat provider documentation as the source of truth for quotas, pricing, and safety policies.

Foundation models, transformers & prompting

  1. Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer architecture underlying essentially all frontier LLMs discussed in §4–§6. https://arxiv.org/abs/1706.03762
  2. Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Grounds the “capability is rented / in-context learning” framing in §2. https://arxiv.org/abs/2005.14165
  3. OpenAI, “GPT-4 Technical Report.” 2023. Baseline for capability envelopes, mitigations, and evaluation limits—always cross-check with your own evals (§9). https://arxiv.org/abs/2303.08774
  4. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” Meta, 2023. Open-weights and on-prem options in §4, §13. https://arxiv.org/abs/2302.13971
  5. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. Reasoning-style prompting in §6. https://arxiv.org/abs/2201.11903
  6. Kojima et al., “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. “Let’s think step by step” line of work in §6. https://arxiv.org/abs/2205.11916
  7. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Sampling / voting ideas in §6. https://arxiv.org/abs/2203.11171
  8. Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT). 2022. Instruction tuning, refusal behavior, and alignment-shaped APIs—§6–§8. https://arxiv.org/abs/2203.02155

Retrieval-augmented generation & information retrieval

  1. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG formulation in §7. https://arxiv.org/abs/2005.11401
  2. Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval, 2009. Lexical / BM25 retrieval in hybrid search (§7). https://dl.acm.org/doi/10.1561/1500000019
  3. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval adjacent to embedding-based RAG (§7). https://arxiv.org/abs/2004.04906
  4. Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. One-paper overview of chunking, indexing, and query reformulation beyond this note. https://arxiv.org/abs/2312.10997
  5. Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” Microsoft Research, 2024. Background for GraphRAG in §7. https://arxiv.org/abs/2404.16130

Tool use, agents & planning loops

  1. Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” 2023. Systematic LLM + tools framing related to §8. https://arxiv.org/abs/2302.04761
  2. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Plan / act / observe loop behind the agent diagram in §8. https://arxiv.org/abs/2210.03629
  3. Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” 2023. Reflection / verification patterns in §8–§9. https://arxiv.org/abs/2303.11366
  4. Model Context Protocol (MCP). Anthropic-initiated open specification for tools, resources, and prompts across clients—§4, §8. https://modelcontextprotocol.io/ Schemas and revisions are published on the official site and GitHub org.

Evaluation, benchmarking & LLM-as-judge

  1. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging methodology relevant to §9. https://arxiv.org/abs/2306.05685
  2. Dubois et al., “AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.” 2023. Cheap evaluation and simulator bias—counterweight when building harnesses (§9). https://arxiv.org/abs/2305.14387
  3. Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Metrics behind the RAGAS tooling named in §4. https://arxiv.org/abs/2309.15217
  4. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL Workshop 2004. Classical overlap metrics still used in summarization evals (§9). https://aclanthology.org/W04-1013/

Alignment, preference optimization & post-training

  1. Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. DPO in §13. https://arxiv.org/abs/2305.18290
  2. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in §13. https://arxiv.org/abs/2106.09685
  3. Hinton, Vinyals & Dean, “Distilling the Knowledge in a Neural Network.” NIPS Deep Learning Workshop 2015. Distillation vocabulary in §13. https://arxiv.org/abs/1503.02531

Security, abuse & privacy baselines

  1. OWASP Top 10 for Large Language Model Applications. Community standard for LLM-specific threats; §12 follows this taxonomy. https://owasp.org/www-project-top-10-for-large-language-model-applications/
  2. Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Prompt injection via retrieved or external content—pairs with §12. https://arxiv.org/abs/2302.12173
  3. NIST AI Risk Management Framework (AI RMF 1.0). U.S. voluntary framework; language for governance checklists in §12. https://www.nist.gov/itl/ai-risk-management-framework
  4. European Union, Artificial Intelligence Act (Regulation (EU) 2024/1689). Primary legal text for the regulation bullet in §18—prefer EUR-Lex over secondary summaries. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689

Reliability, SLOs & production discipline

  1. Beyer et al., Site Reliability Engineering: How Google Runs Production Systems. O’Reilly, 2016 (free online). SLO / error-budget thinking in §5, §10–§11. https://sre.google/sre-book/table-of-contents/
  2. Google, The Site Reliability Workbook. 2018. Incident response and monitoring adjacent to §10–§11. https://sre.google/workbook/table-of-contents/
  3. Huyen, Designing Machine Learning Systems. O’Reilly, 2022. ML features that share a fleet with classical services (§10–§14). ISBN 978-1098107969.
  4. Kleppmann, Designing Data-Intensive Applications. O’Reilly, 2017. Data consistency, streaming, and system design behind serious RAG / agent stacks (§6–§7, §14). ISBN 978-1449373320.

Observability, tracing & interoperability

  1. OpenTelemetry Project. Vendor-neutral tracing, metrics, logs; GenAI-related semantic conventions evolve here—§10’s trace-tree model. https://opentelemetry.io/
  2. W3C Trace Context. Trace propagation across services when the LLM gateway fans out to retrieval and tools (§10). https://www.w3.org/TR/trace-context/

Protocols & API design

  1. OpenAPI Specification. Machine-readable HTTP contracts—REST tool surfaces in §4, §8. https://spec.openapis.org/oas/latest.html

Inference serving & systems (optional depth)

  1. Agrawal et al., “Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.” OSDI 2024. Throughput / scheduling if you self-host (§4, §11). https://arxiv.org/abs/2308.16369
  2. Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device serving background. https://arxiv.org/abs/2211.05102
Attribution. This HTML is authored by Linh Truong. The list above does not imply endorsement by any author, institution, or regulator. When redistributing the file, link the canonical note at LinhTruong.com.