AI engineering · field notes · May 2026

The AI Engineer Notes and diagrams for building real products on foundation models—where I put the effort, and what usually breaks in production

Linh Truong, MA (Harvard), MBA · LinhTruong.com · Linh@Alumni.Harvard.edu

Source / canonical copy: LinhTruong.com. If you share this file, link there so attribution stays with the document.

Audience: Software engineers, tech leads, architects · Scope: LLMs, RAG, agents, evals, LLM ops, production rollout · Revised: May 2026

1 · What is an AI Engineer?

An AI Engineer is a software engineer whose primary craft is building applications on top of foundation models. They sit at the intersection of product engineering, applied ML, and systems design — but, unlike traditional ML engineers, they rarely train models from scratch. Instead, they compose pre-trained models, retrieval, tools, and evaluation harnesses into reliable products.

Core mandate

Make probabilistic systems behave like products

Turn a non-deterministic model API into a system that meets latency, cost, accuracy, and safety SLOs — every single deploy.

Primary outputs

Pipelines, prompts, evals, guardrails

Not model weights. The deliverable is a versioned system: prompts + retrieval + tools + policies + tests + monitors.

Posture

Empiricist, not theorist

Hypothesize → run an eval → measure → ship. Reading benchmarks ≠ knowing how a model behaves on your data.

Figure 1. The AI Engineer sits at the convergence of three disciplines.

2 · Why this role exists now

Three structural shifts created a permanent role between SWE and ML:

Capability is rented, not trained. Frontier models (Claude, GPT, Gemini, Llama) are accessible through APIs. The bottleneck moved from training to integrating.
The interface is natural language. The hardest part is no longer math — it is specifying intent, retrieving the right context, and evaluating fuzzy output.
Production is probabilistic. Classical SWE assumes deterministic functions. LLM systems require new disciplines: evals, guardrails, fallbacks, and observability designed for distributions of behaviour.

Talent gap

Teams I work with rarely lack someone who can train a model; they lack someone who will own prompts, retrieval, eval harnesses, and guardrails end to end. Most shipped AI is integration and measurement, not fresh pre-training.

Foundation Models Tool Use RAG Evals Guardrails

3 · Software Engineer vs ML Engineer vs AI Engineer

Dimension	Software Engineer	ML Engineer	AI Engineer
Primary artifact	Deterministic code	Trained models & pipelines	Prompted/orchestrated systems on top of pre-trained models
Math floor	Algorithms, DS	Linear algebra, probability, optimization	Probability intuition, vector geometry, basic stats for evals
Core tooling	Git, CI/CD, REST/RPC, DBs	PyTorch/JAX, MLflow, GPUs, feature stores	LLM SDKs, vector DBs, eval frameworks, tracing, orchestration
"Test" means	Unit/integration tests	Train/val/test split, AUC, F1	Eval sets, LLM-as-judge, rubrics, regression suites
Failure mode	Crash / wrong output	Drift / poor generalization	Hallucination, prompt injection, silent quality drop
Iteration speed	Minutes	Hours–days (training)	Minutes (prompts) → days (eval-driven)
Owns in prod	Latency, correctness	Model performance, retraining	End-to-end behaviour: quality + cost + safety + UX

4 · The modern AI engineering stack

Think of the stack as eight layers. Master one at a time, top-down, then deepen as needed.

Figure 2. The 8-layer AI engineering stack — read top-down for product impact, bottom-up for reliability.

Reference tools per layer (representative, not exhaustive)

Layer	Representative tools / services (2026)
Model APIs	Anthropic Claude (Opus 4.7 / Sonnet 4.6 / Haiku 4.5), OpenAI GPT, Google Gemini, Meta Llama, Mistral, Cohere
Orchestration / Agents	Claude Agent SDK, LangGraph, LlamaIndex, DSPy, Inngest, Temporal (durable execution)
Retrieval	pgvector, Qdrant, Pinecone, Weaviate, Turbopuffer, Vespa, Elasticsearch, BM25
Embeddings & rerankers	Voyage, Cohere Rerank, OpenAI text-embedding-3, BGE, ColBERT
Evals	Braintrust, LangSmith, Promptfoo, Inspect, Ragas, OpenAI Evals, in-house harnesses
Observability	Helicone, LangSmith, Langfuse, Arize, Datadog LLM Observability, Honeycomb
Guardrails	NVIDIA NeMo Guardrails, Guardrails AI, Lakera, Protect AI, custom classifiers
Self-hosted inference	vLLM, TGI, TensorRT-LLM, SGLang, Triton, Ollama (dev)
Fine-tuning	LoRA/QLoRA, Unsloth, Axolotl, provider-hosted fine-tuning, RFT/DPO
Protocols	MCP (Model Context Protocol) for tool/context interop · OpenAPI for HTTP tools

5 · Mental models & fundamentals every AI engineer needs

The model is a function over context

Conceptually: output = f(context, decoding). You don't get to change f at runtime — you only control context (system + user + retrieved + tools) and decoding (temperature, top-p, max-tokens, stop-sequences, JSON-mode).

Context is a budget, not a bucket

Bigger contexts ≠ better answers. Models attend unevenly, and cost/latency scale with tokens. Curate context like you curate a small slide deck for an executive: ruthless relevance over completeness.

Tokens, not characters

Pricing, limits, and quality are all token-denominated. Internalize "1 token ≈ 4 chars ≈ ¾ of an English word." Multi-lingual and code tokens are denser — measure, don't guess.

Embeddings are geometry

Similar meaning ≈ small angle. Cosine similarity is a noisy proxy for relevance. Always pair vector search with a re-ranker or BM25 for grounding-critical tasks.

Determinism is a choice you opt into

Set temperature=0, fix seed, freeze the model version, and pin prompts. Treat the model version like a dependency in package.json.

Failures are distributions, not exceptions

Your code doesn't "crash" — it degrades silently on 3% of inputs. Build evals that surface the tail, and dashboards that track quality, not just uptime.

6 · Prompting & context engineering

Prompt engineering has matured into context engineering: the discipline of assembling the smallest sufficient context for a model to act correctly. A useful taxonomy:

Figure 3. The four ingredients of a well-engineered context window.

Techniques worth knowing cold

Few-shot examples — 2–5 diverse, edge-case-aware examples beat instructions almost every time.
Chain-of-thought / reasoning models — for math, code, planning, prefer reasoning models or explicit step-by-step prompting.
Structured output — JSON schema / tool calling produces parseable output and is more reliable than ad-hoc regex.
Self-consistency & sampling — for hard answers, sample N times and vote / verify.

Role / persona constraints — narrow the function to reduce drift ("You are a senior tax accountant…").
Negative instructions sparingly — models follow "do X" better than "don't do Y."
Prompt caching — keep large stable preludes (style guide, tools, docs) at the top to hit provider caches.
Decomposition — split big asks into a planner + executor + verifier. Smaller subtasks are easier to eval.

Anti-pattern: stuffing the entire knowledge base into the prompt because the context window is large. Bigger context degrades attention, balloons cost, and hides where the answer actually came from. Retrieve, don't dump.

7 · Retrieval-Augmented Generation (RAG)

RAG is the default architecture for grounding LLMs in private or fresh knowledge. The deceptively-simple promise — "search then generate" — hides a dozen design decisions.

Figure 4. A production RAG pipeline — offline indexing (top) and online retrieval + generation (bottom).

RAG design decisions, and their usual answers in 2026

Decision	Typical answer	When to deviate
Chunk size	300–800 tokens, with 10–20% overlap	Long-form prose: larger; code/structured: smaller, by symbol
Chunking strategy	Semantic / heading-aware over fixed-size	Tables, code, schemas — chunk by structure
Retrieval	Hybrid (vector + BM25) + metadata filters	Highly entity-driven domains may need graph or SQL retrieval
Re-ranker	Yes — cross-encoder on top 50 → keep 5–10	Latency-critical paths can skip if recall is high
Cite sources	Always for user-facing factual answers	Internal pipelines can skip; logs should still record provenance
Freshness	Incremental indexing + soft TTL on chunks	Compliance domains: hard expiration + audit trail
Eval	Recall@k, faithfulness, answer correctness — judged offline	High-volume products: add online click/feedback signals

Beyond vanilla RAG

Agentic retrieval — the model decides what to search for, runs multi-hop queries, and stops when satisfied. Stronger on open-ended research questions.
GraphRAG — retrieval over knowledge graphs / entity-relationship structures. Helps when answers depend on connections, not just text similarity.
Structured RAG — text-to-SQL / text-to-API retrieval over operational data. Useful for analytics-style products once you harden query safety and tenancy.
Long-context as cache — for small corpora (<1M tokens), put everything in-context with prompt caching and skip retrieval entirely.

8 · Agents, tools & multi-agent systems

An agent is an LLM that runs in a loop, calling tools and observing results until a goal is satisfied. Agents are the natural extension of RAG: instead of retrieving once, they retrieve, act, verify, and replan.

Figure 5. The canonical Plan → Act → Observe → Reflect agent loop.

Tool design — the single biggest determinant of agent quality

Make tools small, sharp, and named like a senior API

Each tool does one thing well. Compose, don't overload.
Names are the documentation: search_invoices_by_customer > db_query.
Use strict JSON schemas. Reject under-specified calls early.
Return small, structured results — not whole pages.

Make failures legible

Errors should explain how to fix the call: "missing field: customer_id (UUID)".
Truncate large outputs with a marker the agent understands.
Idempotency keys on mutating tools.
Always log tool args + outputs for replay.

Agent topologies

Topology	Shape	Use it when	Caution
Single-agent loop	One LLM + tools	Default. Covers ~80% of real-world cases.	None — start here, always.
Planner + executor	One agent plans, another executes	Long tasks where planning ≠ tool calling cleanly.	Adds latency & failure surface.
Sub-agent delegation	Parent spawns specialised sub-agents	Independent subtasks that can run in parallel.	Don't fan out for the sake of it — costs multiply.
Swarm / multi-agent debate	Several agents argue / vote	Open-ended creative or research tasks.	Often less reliable than a single careful agent + verifier.
Human-in-the-loop	Agent pauses for approval at risky steps	Anything irreversible (payments, deletes, sends).	Design the UX of waiting — silence is a failure.

Heuristic: if a deterministic workflow would do the job, use one. Agents are appropriate when the path itself is unknown until you start. Most "agent" projects that fail were workflow problems in disguise.

The Model Context Protocol (MCP)

MCP is the emerging open protocol for exposing tools, resources, and prompts to any LLM client. Treat it the way you treat HTTP: a contract between agent runtimes and the systems they act on. Wrap your internal APIs in an MCP server once, and they become available to every model and IDE that speaks MCP.

9 · Evals: the new unit test

Evals are the deliverable. Without a pinned suite, you cannot tell whether yesterday's prompt tweak helped, regressed a corner case, or just moved noise around.

Eval set

A curated, version-controlled collection of inputs + expected behaviours. Grows from real production traces, bug reports, and red-team examples.

Scorer

How "good" is measured: exact match, regex, embedding similarity, code-executed test, rubric-graded LLM-as-judge, or human label.

Harness

The runner that pairs eval set × scorer × system version, produces metrics, and compares against baseline. Runs in CI on every prompt or code change.

The eval pyramid

Figure 6. Stack cheap deterministic checks at the base; reserve human judgement for the top.

LLM-as-judge — the workhorse, with caveats

Use pairwise comparison (A vs B) — judges are more reliable comparing than scoring absolutely.
Always include a rubric the judge must follow, with negative examples.
Calibrate the judge against ~100 human-labelled examples. If judge ≠ human, fix the judge before trusting it.
Beware self-preference bias — judges tend to prefer outputs from the same model family.

What to measure, by feature type

Feature	Primary metrics	Guardrail metrics
RAG Q&A	Faithfulness, answer correctness, citation accuracy	Refusal rate, latency p95, cost / query
Summarization	Coverage, factuality, conciseness (LLM-judge)	Hallucinated-entity rate, length distribution
Code generation	pass@k on hidden test suite, lint clean	Build success, security-rule violations
Agents / tool use	Task success rate, steps-to-success	Tool error rate, cost per task, runaway loops
Classification / extraction	Precision, recall, F1 on labelled set	Confidence calibration, abstention rate

10 · Observability & LLMOps

Classical metrics miss too much in LLM stacks. I watch three extra primitives alongside normal service telemetry:

Traces

Tree of every prompt, retrieval, tool call, and sub-agent for a single user request. The flamegraph of LLM apps.

Token + cost ledger

Per-tenant, per-feature spend. Cost is a first-class SLO — track it on the same dashboard as latency.

Quality signals

Thumbs, retries, regenerate clicks, abandonment, judge scores on a sample of live traffic.

Production loop

Figure 7. The LLMOps flywheel — every production failure becomes tomorrow's regression test.

11 · Latency, cost & reliability

Latency levers

Stream tokens to the UI; perceived latency is mostly time-to-first-token.
Cache stable prompt prefixes (system + tool defs + style guide).
Route easy queries to a smaller/faster model (Haiku-tier), escalate when uncertain.
Parallelize independent tool calls in agents.
Speculative decoding & structured output reduce tail latency.

Cost levers

Smaller models for classification, extraction, routing, summarization.
Aggressive prompt caching — often 50–90% input-token savings.
Trim retrieved context: re-rank then keep top-k, not top-50.
Batch async workloads through provider batch APIs (often ~50% off).
Pre-compute embeddings; don't re-embed unchanged content.

Reliability levers

Fallback model on provider failures — health-check both.
Retries with jitter, capped at sane budgets — never infinite.
Circuit breakers per tool / per provider.
Timeouts everywhere, including streaming response stalls.
Idempotency on mutating tool calls.

SLO suggestions

Time-to-first-token: p95 < 1.5s for chat.
End-to-end completion: p95 < 8s for chat, < 45s for agents.
Quality regression: alert on judge-score drop > 3 points week-over-week.
Cost per session: alert on > 2× rolling 7-day median.

12 · Safety, security & governance

LLM apps carry a different threat model than typical CRUD. When I review features, I walk the OWASP LLM Top-10 list explicitly—here are the rows that show up most in incident-style reviews:

Threat	What it looks like	Mitigation
Prompt injection	Untrusted content tells the model to ignore instructions, exfiltrate data, or call tools.	Treat all retrieved content as untrusted; separate roles; sanitize; limit tool permissions; least-privilege agents.
Data exfiltration	Model includes secrets from context in its output, or via a tool call to an attacker-controlled URL.	Output filters, URL allowlists, secret scanners, no secrets in prompts, sandboxed tool execution.
Insecure tool use	Agent calls a destructive tool on adversarial input.	Human approval for irreversible ops, idempotency, strict schema validation, RBAC at tool level.
Supply chain	Compromised model weights, embeddings, or third-party MCP server.	Pin versions, verify hashes, vet providers, isolate untrusted MCP servers.
PII leakage	Logs/traces contain user PII sent to third-party providers.	PII detection + redaction at the boundary; data-processing agreements; regional routing.
Jailbreaks	Users coax the model past its safety policies.	Layered defenses: system prompt, classifier guardrails, content filters, abuse-rate-limits.
Hallucination as risk	Confident wrong answers in regulated domains (medical, legal, financial).	Citations, abstention, "I don't know" prompting, human-in-the-loop for high-stakes decisions.

Rule: the agent's effective permissions are the union of every tool it can call. Design tool surfaces as if they will be invoked with adversarial inputs — because they will be.

Governance checklist (enterprise)

Data-flow diagram per feature: what data leaves the trust boundary, to which provider, under what DPA.
Model card / system card kept in version control: model, version, intended use, known limits, eval scores.
Audit log for any agent action affecting users, money, or records (immutable, queryable).
Red-team cadence (quarterly minimum) — prompt-injection, jailbreak, exfiltration.
Region / residency routing for regulated tenants (EU-only, on-prem option, etc.).

13 · Fine-tuning, distillation & post-training

Most product problems are solved by better prompts, better retrieval, or better evals — not by training. But there is a real role for post-training when:

You have a narrow, high-volume task where latency / cost dominate (intent classification, structured extraction, routing).
You need to teach a stable style or format that's hard to convey via examples.
You're distilling a frontier model's behaviour onto a smaller open-weights model for offline / on-prem deployment.

Spectrum of post-training

Technique	What it changes	Effort	Right when
Prompting + few-shot	Inputs only	Minutes	Always try first.
RAG	Inputs (with knowledge)	Days	Knowledge gap, not skill gap.
SFT (LoRA/QLoRA)	A small adapter on a base model	Days–weeks	Style/format/narrow skill; you have ≥ a few thousand high-quality examples.
Preference tuning (DPO / RFT)	Model preferences via pairwise data	Weeks	You can grade outputs better than you can write them.
Distillation	Smaller model mimics bigger one's outputs	Weeks	Latency/cost critical, frontier-quality optional.
Continued pre-training	The model's core weights, broadly	Months, $$$	Rarely justified outside foundation labs.

Order of operations: prompt → few-shot → retrieval → tools → eval-driven prompt rewrite → small-model routing → LoRA / DPO. Skipping ahead burns months.

14 · Reference architectures

14.1 The "Copilot for X" reference architecture

Figure 8. A production copilot — a thin client, an orchestrator, three "shoulder" services (retrieval, tools, models), and a cross-cutting LLMOps plane.

14.2 Pattern catalogue

Single-shot LLM call

One prompt, one response, structured output. Use for extraction, classification, summarisation. The 80% solution to most "AI features".

RAG + cite

Retrieve → generate → cite. Default for any Q&A over private knowledge.

Tool-using agent

Loop with bounded steps and a verifier tool. Use when the path of action is not known in advance.

Router + specialist

A small classifier model routes to one of N specialist prompts/models. Cheap, fast, and easy to govern.

Workflow + LLM steps

A deterministic DAG with LLM nodes for specific subtasks. The right answer when "agent" feels like overkill.

Async / batch

Background jobs (summarise nightly, classify backlogs). Use batch APIs for ~50% cost; observability through job IDs.

15 · Day-to-day workflow of an effective AI engineer

Define the task with a sample of 20 real examples. Not synthetic — real. If you can't find 20, you don't have a product yet, you have a hypothesis.
Label the desired output for each. This is your initial eval set.
Build the simplest baseline that could work. One prompt, one model, no retrieval. Measure.
Add one capability at a time: retrieval → tools → verification → routing. Re-run evals after every change.
Instrument from day one. Traces, token counts, latency, judge scores. You will need them next week.
Ship behind a feature flag, canary at 1% → 10% → 50% → 100%, with quality + cost guardrail metrics.
Mine production traces weekly. Every failure becomes an eval example. The eval set is alive.
Re-evaluate model choice quarterly. Models get cheaper and better — your routing and prompt should follow.

16 · 12-month learning roadmap for software engineers

Figure 9. A pragmatic 12-month curriculum — each phase ends with a shipped artifact, not a certificate.

Per-phase recommended exercises

Phase 1 — Foundations

Write a tokenizer-aware "trim context" function for your stack.
Build a multi-turn chat with streaming, cancellation, and undo.
Force a model into strict JSON for a real internal task. Measure schema-violation rate.

Phase 2 — RAG + Evals

Index a real corpus you care about; compare chunking strategies on a 50-question eval set.
Add a re-ranker; show recall@k improvement.
Build CI that fails a PR if eval-mean drops >2 points.

Phase 3 — Agents + Tools

Take an existing internal workflow with ≤5 steps and replace it with an agent. Compare.
Wrap one internal API as an MCP server.
Add a verifier sub-agent; measure task-success uplift.

Phase 4 — Scale + Specialise

Route 30% of traffic to a smaller model via a learned router; track quality and cost delta.
Run a quarterly red-team exercise with documented findings.
Distil a frontier prompt into a small fine-tuned model for one high-volume task.

17 · Anti-patterns & failure modes

Anti-pattern	Why it happens	Fix
Demoware in prod	Demo prompt promoted without evals or guardrails.	No ship without an eval set + SLOs + canary.
"Big context" everything	"Models have 1M context, just dump it all."	Retrieve, rank, trim. Token budget is a discipline.
Single-string mega-prompt	Years of patches accumulate in one file.	Compose modular prompts; version them; cache the stable parts.
No-evals fast iteration	"It seemed better." Vibes-driven engineering.	Every change runs the eval suite. The number is the decision.
Vendor-locked code	SDK-specific abstractions leak everywhere.	Thin internal interface for model calls; swap providers in a day.
Trusting LLM-as-judge blindly	It looks like rigour; it isn't, unless calibrated.	Calibrate to humans on 100+ examples; use pairwise; rotate judges.
Agents where workflows suffice	Hype. Loops feel powerful.	Start with a workflow; promote to an agent only when the path is data-dependent.
Unbounded tool permissions	"It's just a prototype."	RBAC at the tool layer; human approval on irreversible ops; audit logs.
Ignoring tail latency	p50 looks fine; p99 is 40 seconds.	Stream, cache, parallelize, timeout, and alert on p95/p99.
No feedback loop	Users complain in Slack, not in the product.	In-product 👍/👎, retry-as-signal, weekly trace review with PM.

18 · 2026 trends & what's coming next

Long-horizon, durable agents

Agents that run for hours / days, with explicit memory, checkpoints, and resumability — closer to Temporal / workflow engines than chatbots.

Computer / browser use

Models that operate UIs directly. Powerful for legacy systems with no APIs — but the highest-blast-radius surface to govern.

Open standards win

MCP for tools/context and OpenAPI for HTTP are becoming the lingua franca. Build to the protocol, not the SDK.

Reasoning models for everything hard

Math, code, planning, multi-step extraction. The cost gap closes; the quality gap on hard tasks does not.

Per-tenant fine-tuning

LoRA adapters per customer for style and domain. Cheap, fast, and a real moat for vertical SaaS.

Eval-as-a-product

Evals are sold, traded, and audited like test suites. Expect SOC2-style attestations of eval coverage in enterprise procurement.

On-device + edge inference

Small capable models (1–8B) on phones and laptops handle private/latency-critical paths; cloud reserved for hard cases.

Regulation hardens

EU AI Act enforcement, US sector-specific rules. The AI engineer's responsibility extends to documentation, audit, and lifecycle controls.

19 · Closing

How I tie a bow on reviews. Treat the model as a probabilistic dependency—same as any flaky upstream—and engineer the rest: context in, structured I/O, evals that regress bad behavior, and traces that show you when quality drifts. Nail that loop and you can swap model versions without scrambling.

stack layers to know

eval set: your single biggest asset

∞

production traces to learn from

20 · References & sources

The diagrams and opinions in this note are mine; the factual and technical anchors below are where I send people who want primary material on transformers, RAG, agents, evals, security, and production practice. Entries are grouped so you can jump to what a section leaned on—papers for mechanisms, standards for shared vocabulary, regulators for compliance context.

Scope note: arXiv and conference versions suffice for citations; always check the published venue if you need a DOI for academic writing. Vendor APIs and model names change frequently—treat provider documentation as the source of truth for quotas, pricing, and safety policies.

Foundation models, transformers & prompting

Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer architecture underlying essentially all frontier LLMs discussed in §4–§6. https://arxiv.org/abs/1706.03762
Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Grounds the “capability is rented / in-context learning” framing in §2. https://arxiv.org/abs/2005.14165
OpenAI, “GPT-4 Technical Report.” 2023. Baseline for capability envelopes, mitigations, and evaluation limits—always cross-check with your own evals (§9). https://arxiv.org/abs/2303.08774
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” Meta, 2023. Open-weights and on-prem options in §4, §13. https://arxiv.org/abs/2302.13971
Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. Reasoning-style prompting in §6. https://arxiv.org/abs/2201.11903
Kojima et al., “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. “Let’s think step by step” line of work in §6. https://arxiv.org/abs/2205.11916
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Sampling / voting ideas in §6. https://arxiv.org/abs/2203.11171
Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT). 2022. Instruction tuning, refusal behavior, and alignment-shaped APIs—§6–§8. https://arxiv.org/abs/2203.02155

Retrieval-augmented generation & information retrieval

Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG formulation in §7. https://arxiv.org/abs/2005.11401
Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval, 2009. Lexical / BM25 retrieval in hybrid search (§7). https://dl.acm.org/doi/10.1561/1500000019
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval adjacent to embedding-based RAG (§7). https://arxiv.org/abs/2004.04906
Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. One-paper overview of chunking, indexing, and query reformulation beyond this note. https://arxiv.org/abs/2312.10997
Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” Microsoft Research, 2024. Background for GraphRAG in §7. https://arxiv.org/abs/2404.16130

Tool use, agents & planning loops

Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” 2023. Systematic LLM + tools framing related to §8. https://arxiv.org/abs/2302.04761
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Plan / act / observe loop behind the agent diagram in §8. https://arxiv.org/abs/2210.03629
Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” 2023. Reflection / verification patterns in §8–§9. https://arxiv.org/abs/2303.11366
Model Context Protocol (MCP). Anthropic-initiated open specification for tools, resources, and prompts across clients—§4, §8. https://modelcontextprotocol.io/ Schemas and revisions are published on the official site and GitHub org.

Evaluation, benchmarking & LLM-as-judge

Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging methodology relevant to §9. https://arxiv.org/abs/2306.05685
Dubois et al., “AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.” 2023. Cheap evaluation and simulator bias—counterweight when building harnesses (§9). https://arxiv.org/abs/2305.14387
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Metrics behind the RAGAS tooling named in §4. https://arxiv.org/abs/2309.15217
Lin, “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL Workshop 2004. Classical overlap metrics still used in summarization evals (§9). https://aclanthology.org/W04-1013/

Alignment, preference optimization & post-training

Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. DPO in §13. https://arxiv.org/abs/2305.18290
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in §13. https://arxiv.org/abs/2106.09685
Hinton, Vinyals & Dean, “Distilling the Knowledge in a Neural Network.” NIPS Deep Learning Workshop 2015. Distillation vocabulary in §13. https://arxiv.org/abs/1503.02531

Security, abuse & privacy baselines

OWASP Top 10 for Large Language Model Applications. Community standard for LLM-specific threats; §12 follows this taxonomy. https://owasp.org/www-project-top-10-for-large-language-model-applications/
Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Prompt injection via retrieved or external content—pairs with §12. https://arxiv.org/abs/2302.12173
NIST AI Risk Management Framework (AI RMF 1.0). U.S. voluntary framework; language for governance checklists in §12. https://www.nist.gov/itl/ai-risk-management-framework
European Union, Artificial Intelligence Act (Regulation (EU) 2024/1689). Primary legal text for the regulation bullet in §18—prefer EUR-Lex over secondary summaries. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32024R1689

Reliability, SLOs & production discipline

Beyer et al., Site Reliability Engineering: How Google Runs Production Systems. O’Reilly, 2016 (free online). SLO / error-budget thinking in §5, §10–§11. https://sre.google/sre-book/table-of-contents/
Google, The Site Reliability Workbook. 2018. Incident response and monitoring adjacent to §10–§11. https://sre.google/workbook/table-of-contents/
Huyen, Designing Machine Learning Systems. O’Reilly, 2022. ML features that share a fleet with classical services (§10–§14). ISBN 978-1098107969.
Kleppmann, Designing Data-Intensive Applications. O’Reilly, 2017. Data consistency, streaming, and system design behind serious RAG / agent stacks (§6–§7, §14). ISBN 978-1449373320.

Observability, tracing & interoperability

OpenTelemetry Project. Vendor-neutral tracing, metrics, logs; GenAI-related semantic conventions evolve here—§10’s trace-tree model. https://opentelemetry.io/
W3C Trace Context. Trace propagation across services when the LLM gateway fans out to retrieval and tools (§10). https://www.w3.org/TR/trace-context/

Protocols & API design

OpenAPI Specification. Machine-readable HTTP contracts—REST tool surfaces in §4, §8. https://spec.openapis.org/oas/latest.html

Inference serving & systems (optional depth)

Agrawal et al., “Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.” OSDI 2024. Throughput / scheduling if you self-host (§4, §11). https://arxiv.org/abs/2308.16369
Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device serving background. https://arxiv.org/abs/2211.05102

Attribution. This HTML is authored by Linh Truong. The list above does not imply endorsement by any author, institution, or regulator. When redistributing the file, link the canonical note at LinhTruong.com.

Contents

1 · What is an AI Engineer?

Make probabilistic systems behave like products

Pipelines, prompts, evals, guardrails

Empiricist, not theorist

2 · Why this role exists now

Talent gap

3 · Software Engineer vs ML Engineer vs AI Engineer

4 · The modern AI engineering stack

Reference tools per layer (representative, not exhaustive)

5 · Mental models & fundamentals every AI engineer needs

The model is a function over context

Context is a budget, not a bucket

Tokens, not characters

Embeddings are geometry

Determinism is a choice you opt into

Failures are distributions, not exceptions

6 · Prompting & context engineering

Techniques worth knowing cold

7 · Retrieval-Augmented Generation (RAG)

RAG design decisions, and their usual answers in 2026

Beyond vanilla RAG

8 · Agents, tools & multi-agent systems

Tool design — the single biggest determinant of agent quality

Make tools small, sharp, and named like a senior API

Make failures legible

Agent topologies

The Model Context Protocol (MCP)

9 · Evals: the new unit test

Eval set

Scorer

Harness

The eval pyramid

LLM-as-judge — the workhorse, with caveats

What to measure, by feature type

10 · Observability & LLMOps

Traces

Token + cost ledger

Quality signals

Production loop

11 · Latency, cost & reliability

Latency levers

Cost levers

Reliability levers

SLO suggestions

12 · Safety, security & governance

Governance checklist (enterprise)

13 · Fine-tuning, distillation & post-training

Spectrum of post-training

14 · Reference architectures

14.1 The "Copilot for X" reference architecture

14.2 Pattern catalogue

Single-shot LLM call

RAG + cite

Tool-using agent

Router + specialist

Workflow + LLM steps

Async / batch

15 · Day-to-day workflow of an effective AI engineer

16 · 12-month learning roadmap for software engineers

Per-phase recommended exercises

Phase 1 — Foundations

Phase 2 — RAG + Evals

Phase 3 — Agents + Tools

Phase 4 — Scale + Specialise

17 · Anti-patterns & failure modes

18 · 2026 trends & what's coming next

Long-horizon, durable agents

Computer / browser use

Open standards win

Reasoning models for everything hard

Per-tenant fine-tuning

Eval-as-a-product

On-device + edge inference

Regulation hardens

19 · Closing

20 · References & sources

Foundation models, transformers & prompting

Retrieval-augmented generation & information retrieval

Tool use, agents & planning loops