An AI Engineer is a software engineer whose primary craft is building applications on top of foundation models. They sit at the intersection of product engineering, applied ML, and systems design — but, unlike traditional ML engineers, they rarely train models from scratch. Instead, they compose pre-trained models, retrieval, tools, and evaluation harnesses into reliable products.
Core mandate
Make probabilistic systems behave like products
Turn a non-deterministic model API into a system that meets latency, cost, accuracy, and safety SLOs — every single deploy.
Primary outputs
Pipelines, prompts, evals, guardrails
Not model weights. The deliverable is a versioned system: prompts + retrieval + tools + policies + tests + monitors.
Posture
Empiricist, not theorist
Hypothesize → run an eval → measure → ship. Reading benchmarks ≠ knowing how a model behaves on your data.
Figure 1. The AI Engineer sits at the convergence of three disciplines.
2 · Why this role exists now
Three structural shifts created a permanent role between SWE and ML:
Capability is rented, not trained. Frontier models (Claude, GPT, Gemini, Llama) are accessible through APIs. The bottleneck moved from training to integrating.
The interface is natural language. The hardest part is no longer math — it is specifying intent, retrieving the right context, and evaluating fuzzy output.
Production is probabilistic. Classical SWE assumes deterministic functions. LLM systems require new disciplines: evals, guardrails, fallbacks, and observability designed for distributions of behaviour.
Talent gap
Teams I work with rarely lack someone who can train a model; they lack someone who will own prompts, retrieval, eval harnesses, and guardrails end to end. Most shipped AI is integration and measurement, not fresh pre-training.
Figure 2. Foundation models · tool use · RAG · evals · guardrails.
3 · Software Engineer vs ML Engineer vs AI Engineer
Dimension | Software Engineer | ML Engineer | AI Engineer
Primary artifact | Deterministic code | Trained models & pipelines | Prompted/orchestrated systems on top of pre-trained models
Math floor | Algorithms, data structures | Linear algebra, probability, optimization | Probability intuition, vector geometry, basic stats for evals
5 · Mental models & fundamentals every AI engineer needs
The model is a function over context
Conceptually: output = f(context, decoding). You don't get to change f at runtime — you only control context (system + user + retrieved + tools) and decoding (temperature, top-p, max-tokens, stop-sequences, JSON-mode).
Context is a budget, not a bucket
Bigger contexts ≠ better answers. Models attend unevenly, and cost/latency scale with tokens. Curate context like you curate a small slide deck for an executive: ruthless relevance over completeness.
Tokens, not characters
Pricing, limits, and quality are all token-denominated. Internalize "1 token ≈ 4 chars ≈ ¾ of an English word", but remember the rule of thumb breaks down for non-English text and code, which often cost more tokens per character. Measure, don't guess.
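As a rough sketch of what token-denominated means in practice, assuming the open-source tiktoken tokenizer (other providers ship their own counters), you can count and trim context against a budget instead of guessing from characters:

```python
# Minimal sketch, assuming the `tiktoken` tokenizer (pip install tiktoken).
# Other providers expose their own token counters; the shape is the same.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def trim_to_budget(text: str, max_tokens: int) -> str:
    """Hard-trim a string to a token budget, keeping the beginning."""
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens]) if len(tokens) > max_tokens else text

print(count_tokens("Pricing, limits, and quality are all token-denominated."))
```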
Embeddings are geometry
Similar meaning ≈ small angle. Cosine similarity is a noisy proxy for relevance. Always pair vector search with a re-ranker or BM25 for grounding-critical tasks.
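The geometric claim is literal: cosine similarity is just the angle between embedding vectors. A tiny illustration with made-up vectors standing in for real embedding-model output:

```python
# Illustrative only: these vectors stand in for real embedding-model output.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.12, 0.87, 0.05])
doc_a = np.array([0.10, 0.90, 0.07])   # similar meaning -> small angle, high similarity
doc_b = np.array([0.85, 0.05, 0.40])   # different meaning -> large angle, low similarity

print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```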
Determinism is a choice you opt into
Set temperature=0, fix the seed where the provider supports it, freeze the model version, and pin prompts; even then, hosted APIs typically promise only best-effort reproducibility. Treat the model version like a dependency in package.json.
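A sketch of what opting in looks like against an OpenAI-style chat API; the parameter names and model snapshot are illustrative and vary by provider:

```python
# Hypothetical pinned configuration -- treat it like a locked dependency.
# Parameter names follow the OpenAI-style chat API; adjust for your provider.
from openai import OpenAI

client = OpenAI()

PINNED = {
    "model": "gpt-4o-2024-08-06",   # exact dated snapshot, never "latest"
    "temperature": 0,
    "seed": 42,                      # best-effort reproducibility only
    "max_tokens": 512,
}

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Summarise this ticket in one line."}],
    **PINNED,
)
print(response.choices[0].message.content)
```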
Failures are distributions, not exceptions
Your code doesn't "crash" — it degrades silently on 3% of inputs. Build evals that surface the tail, and dashboards that track quality, not just uptime.
6 · Prompting & context engineering
Prompt engineering has matured into context engineering: the discipline of assembling the smallest sufficient context for a model to act correctly. A useful taxonomy:
Figure 3. The four ingredients of a well-engineered context window.
Techniques worth knowing cold
Few-shot examples — 2–5 diverse, edge-case-aware examples usually outperform instructions alone.
Chain-of-thought / reasoning models — for math, code, planning, prefer reasoning models or explicit step-by-step prompting.
Structured output — JSON schema / tool calling produces parseable output and is more reliable than ad-hoc regex (see the sketch after this list).
Self-consistency & sampling — for hard answers, sample N times and vote / verify.
Role / persona constraints — narrow the function to reduce drift ("You are a senior tax accountant…").
Negative instructions sparingly — models follow "do X" better than "don't do Y."
Prompt caching — keep large stable preludes (style guide, tools, docs) at the top to hit provider caches.
Decomposition — split big asks into a planner + executor + verifier. Smaller subtasks are easier to eval.
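To make the structured-output point concrete, here is a minimal sketch that forces parseable JSON through an OpenAI-style tool definition; the schema and field names are invented for illustration:

```python
# Minimal sketch: force JSON output via a tool/function schema instead of regex.
# The schema and field names are hypothetical; adapt to your task.
from openai import OpenAI

client = OpenAI()

extract_invoice = {
    "type": "function",
    "function": {
        "name": "extract_invoice",
        "description": "Extract structured fields from an invoice email.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "amount": {"type": "number"},
                "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
            },
            "required": ["customer_id", "amount", "currency"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract invoice fields. Always use the tool."},
        # 2-5 few-shot examples (including edge cases) would go here.
        {"role": "user", "content": "Invoice for ACME-42: 1,200.00 EUR due Friday."},
    ],
    tools=[extract_invoice],
    tool_choice={"type": "function", "function": {"name": "extract_invoice"}},
)
print(response.choices[0].message.tool_calls[0].function.arguments)
```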
Anti-pattern: stuffing the entire knowledge base into the prompt because the context window is large. Bigger context degrades attention, balloons cost, and hides where the answer actually came from. Retrieve, don't dump.
7 · Retrieval-Augmented Generation (RAG)
RAG is the default architecture for grounding LLMs in private or fresh knowledge. The deceptively simple promise — "search then generate" — hides a dozen design decisions.
Figure 4. A production RAG pipeline — offline indexing (top) and online retrieval + generation (bottom).
RAG design decisions, and their usual answers in 2026
Decision | Typical answer | When to deviate
Chunk size | 300–800 tokens, with 10–20% overlap | Long-form prose: larger; code/structured: smaller, by symbol
Chunking strategy | Semantic / heading-aware over fixed-size | Tables, code, schemas — chunk by structure
Retrieval | Hybrid (vector + BM25) + metadata filters | Highly entity-driven domains may need graph or SQL retrieval
Re-ranker | Yes — cross-encoder on top 50 → keep 5–10 | Latency-critical paths can skip if recall is high
Cite sources | Always for user-facing factual answers | Internal pipelines can skip; logs should still record provenance
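Reading the Retrieval and Re-ranker rows together: run lexical and vector search separately, fuse the rankings, then re-rank a small candidate set. A dependency-free sketch of reciprocal rank fusion, where the constant k=60 is a common convention rather than a rule:

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# `bm25_ranking` and `vector_ranking` are assumed to be doc-id lists already
# sorted by each retriever; a cross-encoder re-ranker would run on the fused top-k.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking   = ["doc_7", "doc_2", "doc_9", "doc_4"]   # lexical (BM25) order
vector_ranking = ["doc_2", "doc_4", "doc_7", "doc_1"]   # embedding-similarity order

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
candidates = fused[:50]   # hand the top candidates to a cross-encoder re-ranker
print(candidates[:5])     # keep 5-10 after re-ranking, per the table above
```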
Agentic retrieval — the model decides what to search for, runs multi-hop queries, and stops when satisfied. Stronger on open-ended research questions.
GraphRAG — retrieval over knowledge graphs / entity-relationship structures. Helps when answers depend on connections, not just text similarity.
Structured RAG — text-to-SQL / text-to-API retrieval over operational data. Useful for analytics-style products once you harden query safety and tenancy.
Long-context as cache — for small corpora (<1M tokens), put everything in-context with prompt caching and skip retrieval entirely.
8 · Agents, tools & multi-agent systems
An agent is an LLM that runs in a loop, calling tools and observing results until a goal is satisfied. Agents are the natural extension of RAG: instead of retrieving once, they retrieve, act, verify, and replan.
Figure 5. The canonical Plan → Act → Observe → Reflect agent loop.
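A skeletal version of that loop; the tool registry, step cap, and call_model helper are placeholders for whatever client and tools you actually use:

```python
# Skeletal plan -> act -> observe loop. `call_model` and the tools are placeholders.
MAX_STEPS = 10   # always bound the loop

def run_agent(goal: str, tools: dict, call_model) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(MAX_STEPS):
        decision = call_model(history, tools)          # plan: model chooses the next action
        if decision["type"] == "final_answer":
            return decision["content"]
        tool = tools[decision["tool_name"]]            # act: invoke the chosen tool
        observation = tool(**decision["arguments"])
        history.append({"role": "tool",                # observe: feed the result back
                        "name": decision["tool_name"],
                        "content": str(observation)})
    return "Stopped: step budget exhausted."           # reflect/escalate rather than loop forever
```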
Tool design — the single biggest determinant of agent quality
Make tools small, sharp, and named like a senior API
Each tool does one thing well. Compose, don't overload.
Names are the documentation: search_invoices_by_customer > db_query.
Use strict JSON schemas; reject under-specified calls early (see the sketch after these bullets).
Return small, structured results — not whole pages.
Make failures legible
Errors should explain how to fix the call: "missing field: customer_id (UUID)".
Truncate large outputs with a marker the agent understands.
Idempotency keys on mutating tools.
Always log tool args + outputs for replay.
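A sketch of what small, sharp, and strictly-schemad looks like in practice; the tool name, fields, and error format are invented for illustration:

```python
# Hypothetical tool definition: narrow name, strict schema, legible errors.
import json
import uuid

SEARCH_INVOICES_SCHEMA = {
    "name": "search_invoices_by_customer",
    "description": "Return up to `limit` invoices for one customer.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string", "description": "UUID of the customer"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 20},
        },
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}

def search_invoices_by_customer(customer_id: str, limit: int = 10) -> str:
    try:
        uuid.UUID(customer_id)
    except ValueError:
        # Legible failure: tell the model how to fix the call.
        return json.dumps({"error": "missing or invalid field: customer_id (UUID)"})
    rows = [{"id": "inv_001", "total": 120.0}]        # stand-in for a real query
    return json.dumps({"invoices": rows[:limit]})     # small, structured result
```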
Agent topologies
Topology | Shape | Use it when | Caution
Single-agent loop | One LLM + tools | Default. Covers ~80% of real-world cases. | None — start here, always.
Planner + executor | One agent plans, another executes | Long tasks where planning ≠ tool calling cleanly. | Adds latency & failure surface.
Sub-agent delegation | Parent spawns specialised sub-agents | Independent subtasks that can run in parallel. | Don't fan out for the sake of it — costs multiply.
Swarm / multi-agent debate | Several agents argue / vote | Open-ended creative or research tasks. | Often less reliable than a single careful agent + verifier.
Human-in-the-loop | Agent pauses for approval at risky steps | Anything irreversible (payments, deletes, sends). | Design the UX of waiting — silence is a failure.
Heuristic: if a deterministic workflow would do the job, use one. Agents are appropriate when the path itself is unknown until you start. Most "agent" projects that fail were workflow problems in disguise.
The Model Context Protocol (MCP)
MCP is the emerging open protocol for exposing tools, resources, and prompts to any LLM client. Treat it the way you treat HTTP: a contract between agent runtimes and the systems they act on. Wrap your internal APIs in an MCP server once, and they become available to every model and IDE that speaks MCP.
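As a sketch of wrapping an internal API once, assuming the official Python MCP SDK's FastMCP helper (verify current names and signatures against the spec site):

```python
# Minimal sketch, assuming the Python MCP SDK's FastMCP helper (pip install mcp).
# Check modelcontextprotocol.io for current SDK names and signatures.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("invoices")

@mcp.tool()
def search_invoices_by_customer(customer_id: str, limit: int = 10) -> list[dict]:
    """Return up to `limit` invoices for one customer (stubbed data here)."""
    return [{"id": "inv_001", "customer_id": customer_id, "total": 120.0}][:limit]

if __name__ == "__main__":
    mcp.run()   # exposes the tool to any MCP-speaking client (IDE, agent runtime, ...)
```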
9 · Evals: the new unit test
Evals are the deliverable. Without a pinned suite, you cannot tell whether yesterday's prompt tweak helped, regressed a corner case, or just moved noise around.
Eval set
A curated, version-controlled collection of inputs + expected behaviours. Grows from real production traces, bug reports, and red-team examples.
Scorer
How "good" is measured: exact match, regex, embedding similarity, code-executed test, rubric-graded LLM-as-judge, or human label.
Harness
The runner that pairs eval set × scorer × system version, produces metrics, and compares against baseline. Runs in CI on every prompt or code change.
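The harness can start as a loop that pairs the eval set with a scorer and compares the mean against a stored baseline. This sketch assumes an exact-match scorer and a run_system callable standing in for your pipeline:

```python
# Minimal eval harness sketch: eval set x scorer x system version vs. baseline.
# `run_system` stands in for your prompt/RAG/agent pipeline.
import json

def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_eval(eval_path: str, run_system, baseline_score: float) -> float:
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]     # {"input": ..., "expected": ...} per line
    scores = [exact_match(c["expected"], run_system(c["input"])) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"eval mean: {mean:.3f} (baseline {baseline_score:.3f})")
    if mean < baseline_score - 0.02:                 # fail CI on a >2-point regression
        raise SystemExit("Eval regression: blocking this change.")
    return mean
```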
The eval pyramid
Figure 6. Stack cheap deterministic checks at the base; reserve human judgement for the top.
LLM-as-judge — the workhorse, with caveats
Use pairwise comparison (A vs B) — judges are more reliable comparing than scoring absolutely.
Always include a rubric the judge must follow, with negative examples.
Calibrate the judge against ~100 human-labelled examples. If judge ≠ human, fix the judge before trusting it.
Beware self-preference bias — judges tend to prefer outputs from the same model family.
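A sketch of a pairwise judge call; the rubric text and model choice are illustrative, and the judge still needs calibration against human labels as noted above:

```python
# Pairwise LLM-as-judge sketch. The rubric and output format are illustrative;
# calibrate against ~100 human labels before trusting the scores.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are judging two answers to the same question.
Prefer the answer that is grounded in the provided sources, complete, and concise.
Penalise fabricated citations. Respond with exactly one token: A or B."""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    # Swap the A/B order across repeated runs to control for position bias.
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # ideally a different family than the system under test
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()   # "A" or "B"
```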
Fallback model on provider failures — health-check both.
Retries with jitter, capped at sane budgets — never infinite.
Circuit breakers per tool / per provider.
Timeouts everywhere, including streaming response stalls.
Idempotency on mutating tool calls.
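A sketch of bounded retries with jitter; the exception types and budget numbers are placeholders for your provider SDK's real error classes:

```python
# Bounded retry with exponential backoff and full jitter. The exception types and
# budgets are placeholders; use your provider SDK's real error classes.
import random
import time

def call_with_retries(fn, *, max_attempts: int = 4, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise                                    # never retry forever
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay)) # full jitter

# Usage: call_with_retries(lambda: client.chat.completions.create(**PINNED, messages=msgs))
```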
SLO suggestions
Time-to-first-token: p95 < 1.5s for chat.
End-to-end completion: p95 < 8s for chat, < 45s for agents.
Quality regression: alert on judge-score drop > 3 points week-over-week.
Cost per session: alert on > 2× rolling 7-day median.
12 · Safety, security & governance
LLM apps carry a different threat model than typical CRUD. When I review features, I walk the OWASP LLM Top-10 list explicitly—here are the rows that show up most in incident-style reviews:
Threat | What it looks like | Mitigation
Prompt injection | Untrusted content tells the model to ignore instructions, exfiltrate data, or call tools. | Treat all retrieved content as untrusted; separate roles; sanitize; limit tool permissions; least-privilege agents.
Data exfiltration | Model includes secrets from context in its output, or via a tool call to an attacker-controlled URL. | Output filters, URL allowlists, secret scanners, no secrets in prompts, sandboxed tool execution.
Insecure tool use | Agent calls a destructive tool on adversarial input. | Human approval for irreversible ops, idempotency, strict schema validation, RBAC at tool level.
Supply chain | Compromised model weights, embeddings, or third-party MCP server. | Pin versions, verify hashes, vet providers, isolate untrusted MCP servers.
PII leakage | Logs/traces contain user PII sent to third-party providers. | PII detection + redaction at the boundary; data-processing agreements; regional routing.
Jailbreaks | Users coax the model past its safety policies. | Layered defenses: system prompt, classifier guardrails, content filters, abuse rate limits.
Hallucination as risk | Confident wrong answers in regulated domains (medical, legal, financial). | Citations, abstention, "I don't know" prompting, human-in-the-loop for high-stakes decisions.
Rule: the agent's effective permissions are the union of the permissions of every tool it can call. Design tool surfaces as if they will be invoked with adversarial inputs — because they will be.
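One concrete example of designing for adversarial inputs: gate any URL a tool is asked to fetch through a host allowlist before the request is made. The allowlist entries are invented for illustration:

```python
# Hypothetical guard for a URL-fetching tool: allowlist hosts before any request.
import urllib.request
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.internal.example.com", "api.internal.example.com"}

def fetch_url_guarded(url: str) -> str:
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        # Refuse and explain -- retrieved content may have injected this URL.
        return f"refused: host '{host}' is not on the allowlist"
    with urllib.request.urlopen(url, timeout=10) as resp:   # fetch only after the check
        return resp.read().decode("utf-8", errors="replace")
```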
Governance checklist (enterprise)
Data-flow diagram per feature: what data leaves the trust boundary, to which provider, under what DPA.
Model card / system card kept in version control: model, version, intended use, known limits, eval scores.
Audit log for any agent action affecting users, money, or records (immutable, queryable).
Region / residency routing for regulated tenants (EU-only, on-prem option, etc.).
13 · Fine-tuning, distillation & post-training
Most product problems are solved by better prompts, better retrieval, or better evals — not by training. But there is a real role for post-training when:
You have a narrow, high-volume task where latency / cost dominate (intent classification, structured extraction, routing).
You need to teach a stable style or format that's hard to convey via examples.
You're distilling a frontier model's behaviour onto a smaller open-weights model for offline / on-prem deployment.
Spectrum of post-training
Technique | What it changes | Effort | Right when
Prompting + few-shot | Inputs only | Minutes | Always try first.
RAG | Inputs (with knowledge) | Days | Knowledge gap, not skill gap.
SFT (LoRA/QLoRA) | A small adapter on a base model | Days–weeks | Style/format/narrow skill; you have ≥ a few thousand high-quality examples.
Preference tuning (DPO / RFT) | Model preferences via pairwise data | Weeks | You can grade outputs better than you can write them.
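To make the SFT (LoRA) row concrete, a minimal sketch assuming Hugging Face transformers and peft; the base model, rank, and target modules are illustrative choices, not recommendations:

```python
# Minimal LoRA setup sketch, assuming Hugging Face `transformers` + `peft`.
# Base model, rank, and target modules are illustrative; tune for your task.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"       # any open-weights causal LM
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],        # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()              # typically <1% of base weights
# ...then train on your SFT examples (e.g. with a trainer library) and ship only the adapter.
```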
Figure 8. A production copilot — a thin client, an orchestrator, three "shoulder" services (retrieval, tools, models), and a cross-cutting LLMOps plane.
14.2 Pattern catalogue
Single-shot LLM call
One prompt, one response, structured output. Use for extraction, classification, summarisation. The 80% solution to most "AI features".
RAG + cite
Retrieve → generate → cite. Default for any Q&A over private knowledge.
Tool-using agent
Loop with bounded steps and a verifier tool. Use when the path of action is not known in advance.
Router + specialist
A small classifier model routes to one of N specialist prompts/models. Cheap, fast, and easy to govern.
Workflow + LLM steps
A deterministic DAG with LLM nodes for specific subtasks. The right answer when "agent" feels like overkill.
Async / batch
Background jobs (summarise nightly, classify backlogs). Use provider batch APIs at roughly half the cost of synchronous calls; observability through job IDs.
15 · Day-to-day workflow of an effective AI engineer
Define the task with a sample of 20 real examples. Not synthetic — real. If you can't find 20, you don't have a product yet, you have a hypothesis.
Label the desired output for each. This is your initial eval set.
Build the simplest baseline that could work. One prompt, one model, no retrieval. Measure.
Add one capability at a time: retrieval → tools → verification → routing. Re-run evals after every change.
Instrument from day one. Traces, token counts, latency, judge scores. You will need them next week.
Ship behind a feature flag, canary at 1% → 10% → 50% → 100%, with quality + cost guardrail metrics.
Mine production traces weekly. Every failure becomes an eval example. The eval set is alive.
Re-evaluate model choice quarterly. Models get cheaper and better — your routing and prompt should follow.
16 · 12-month learning roadmap for software engineers
Figure 9. A pragmatic 12-month curriculum — each phase ends with a shipped artifact, not a certificate.
Per-phase recommended exercises
Phase 1 — Foundations
Write a tokenizer-aware "trim context" function for your stack.
Build a multi-turn chat with streaming, cancellation, and undo.
Force a model into strict JSON for a real internal task. Measure schema-violation rate.
Phase 2 — RAG + Evals
Index a real corpus you care about; compare chunking strategies on a 50-question eval set.
Add a re-ranker; show recall@k improvement.
Build CI that fails a PR if eval-mean drops >2 points.
Phase 3 — Agents + Tools
Take an existing internal workflow with ≤5 steps and replace it with an agent. Compare.
Wrap one internal API as an MCP server.
Add a verifier sub-agent; measure task-success uplift.
Phase 4 — Scale + Specialise
Route 30% of traffic to a smaller model via a learned router; track quality and cost delta.
Run a quarterly red-team exercise with documented findings.
Distil a frontier prompt into a small fine-tuned model for one high-volume task.
17 · Anti-patterns & failure modes
Anti-pattern | Why it happens | Fix
Demoware in prod | Demo prompt promoted without evals or guardrails. | No ship without an eval set + SLOs + canary.
"Big context" everything | "Models have 1M context, just dump it all." | Retrieve, rank, trim. Token budget is a discipline.
Single-string mega-prompt | Years of patches accumulate in one file. | Compose modular prompts; version them; cache the stable parts.
No-evals fast iteration | "It seemed better." Vibes-driven engineering. | Every change runs the eval suite. The number is the decision.
Vendor-locked code | SDK-specific abstractions leak everywhere. | Thin internal interface for model calls; swap providers in a day.
Trusting LLM-as-judge blindly | It looks like rigour; it isn't, unless calibrated. | Calibrate to humans on 100+ examples; use pairwise; rotate judges.
Agents where workflows suffice | Hype. Loops feel powerful. | Start with a workflow; promote to an agent only when the path is data-dependent.
Unbounded tool permissions | "It's just a prototype." | RBAC at the tool layer; human approval on irreversible ops; audit logs.
Ignoring tail latency | p50 looks fine; p99 is 40 seconds. | Stream, cache, parallelize, timeout, and alert on p95/p99.
No feedback loop | Users complain in Slack, not in the product. | In-product 👍/👎, retry-as-signal, weekly trace review with PM.
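The fix for vendor-locked code is mechanical: route every model call through one thin internal interface so that a provider swap touches one module. A sketch using a typing.Protocol, with a hypothetical OpenAI adapter:

```python
# Thin internal interface for model calls; the adapter below is hypothetical.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system: str, user: str) -> str: ...

class OpenAIChat:
    def __init__(self, model: str):
        from openai import OpenAI
        self._client, self._model = OpenAI(), model

    def complete(self, system: str, user: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

def summarise(ticket: str, llm: ChatModel) -> str:
    return llm.complete("Summarise support tickets in one line.", ticket)
# Swapping providers means adding another adapter class; call sites never change.
```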
18 · 2026 trends & what's coming next
Long-horizon, durable agents
Agents that run for hours / days, with explicit memory, checkpoints, and resumability — closer to Temporal / workflow engines than chatbots.
Computer / browser use
Models that operate UIs directly. Powerful for legacy systems with no APIs — but the highest-blast-radius surface to govern.
Open standards win
MCP for tools/context and OpenAPI for HTTP are becoming the lingua franca. Build to the protocol, not the SDK.
Reasoning models for everything hard
Math, code, planning, multi-step extraction. The cost gap closes; the quality gap on hard tasks does not.
Per-tenant fine-tuning
LoRA adapters per customer for style and domain. Cheap, fast, and a real moat for vertical SaaS.
Eval-as-a-product
Evals are sold, traded, and audited like test suites. Expect SOC2-style attestations of eval coverage in enterprise procurement.
On-device + edge inference
Small capable models (1–8B) on phones and laptops handle private/latency-critical paths; cloud reserved for hard cases.
Regulation hardens
EU AI Act enforcement, US sector-specific rules. The AI engineer's responsibility extends to documentation, audit, and lifecycle controls.
19 · Closing
This is how I close out reviews. Treat the model as a probabilistic dependency, the same as any flaky upstream, and engineer the rest: context in, structured I/O, evals that turn bad behaviour into regression tests, and traces that show you when quality drifts. Nail that loop and you can swap model versions without scrambling.
Eight stack layers to know · one eval set, your single biggest asset · an endless supply of production traces to learn from.
20 · References & sources
The diagrams and opinions in this note are mine; the factual and technical anchors below are where I send people who want primary material on transformers, RAG, agents, evals, security, and production practice. Entries are grouped so you can jump to what a section leaned on—papers for mechanisms, standards for shared vocabulary, regulators for compliance context.
Scope note: arXiv and conference versions suffice for citations; always check the published venue if you need a DOI for academic writing. Vendor APIs and model names change frequently—treat provider documentation as the source of truth for quotas, pricing, and safety policies.
Foundation models, transformers & prompting
Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer architecture underlying essentially all frontier LLMs discussed in §4–§6. https://arxiv.org/abs/1706.03762
Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Grounds the “capability is rented / in-context learning” framing in §2. https://arxiv.org/abs/2005.14165
OpenAI, “GPT-4 Technical Report.” 2023. Baseline for capability envelopes, mitigations, and evaluation limits—always cross-check with your own evals (§9). https://arxiv.org/abs/2303.08774
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” Meta, 2023. Open-weights and on-prem options in §4, §13. https://arxiv.org/abs/2302.13971
Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022. Reasoning-style prompting in §6. https://arxiv.org/abs/2201.11903
Kojima et al., “Large Language Models are Zero-Shot Reasoners.” NeurIPS 2022. “Let’s think step by step” line of work in §6. https://arxiv.org/abs/2205.11916
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” ICLR 2023. Sampling / voting ideas in §6. https://arxiv.org/abs/2203.11171
Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT). 2022. Instruction tuning, refusal behavior, and alignment-shaped APIs—§6–§8. https://arxiv.org/abs/2203.02155
Retrieval-augmented generation & information retrieval
Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG formulation in §7. https://arxiv.org/abs/2005.11401
Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval, 2009. Lexical / BM25 retrieval in hybrid search (§7). https://dl.acm.org/doi/10.1561/1500000019
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval adjacent to embedding-based RAG (§7). https://arxiv.org/abs/2004.04906
Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. One-paper overview of chunking, indexing, and query reformulation beyond this note. https://arxiv.org/abs/2312.10997
Edge et al., “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” Microsoft Research, 2024. Background for GraphRAG in §7. https://arxiv.org/abs/2404.16130
Tool use, agents & planning loops
Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools.” 2023. Systematic LLM + tools framing related to §8. https://arxiv.org/abs/2302.04761
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Plan / act / observe loop behind the agent diagram in §8. https://arxiv.org/abs/2210.03629
Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” 2023. Reflection / verification patterns in §8–§9. https://arxiv.org/abs/2303.11366
Model Context Protocol (MCP). Anthropic-initiated open specification for tools, resources, and prompts across clients—§4, §8. https://modelcontextprotocol.io/ (schemas and revisions are published on the official site and GitHub org).
Evaluation, benchmarking & LLM-as-judge
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging methodology relevant to §9. https://arxiv.org/abs/2306.05685
Dubois et al., “AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.” 2023. Cheap evaluation and simulator bias—counterweight when building harnesses (§9). https://arxiv.org/abs/2305.14387
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. Metrics behind the RAGAS tooling named in §4. https://arxiv.org/abs/2309.15217
Lin, “ROUGE: A Package for Automatic Evaluation of Summaries.” ACL Workshop 2004. Classical overlap metrics still used in summarization evals (§9). https://aclanthology.org/W04-1013/
Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. DPO in §13. https://arxiv.org/abs/2305.18290
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in §13. https://arxiv.org/abs/2106.09685
Hinton, Vinyals & Dean, “Distilling the Knowledge in a Neural Network.” NIPS Deep Learning Workshop 2015. Distillation vocabulary in §13. https://arxiv.org/abs/1503.02531
Greshake et al., “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” 2023. Prompt injection via retrieved or external content—pairs with §12. https://arxiv.org/abs/2302.12173
Beyer et al., Site Reliability Engineering: How Google Runs Production Systems. O’Reilly, 2016 (free online). SLO / error-budget thinking in §5, §10–§11. https://sre.google/sre-book/table-of-contents/
Agrawal et al., “SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.” 2023. Throughput / scheduling if you self-host (§4, §11). https://arxiv.org/abs/2308.16369
Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device serving background. https://arxiv.org/abs/2211.05102
Attribution. This HTML is authored by Linh Truong. The list above does not imply endorsement by any author, institution, or regulator. When redistributing the file, link the canonical note at LinhTruong.com.