AI platform engineering · field notes · May 2026

The AI Platform Engineer

How I explain the role, draw the platform for execs and ICs, and sequence the work—gateway, retrieval, serving, evals, and governance—when teams are tired of one-off LLM demos.

Source / canonical copy: LinhTruong.com. If you forward this HTML, link there so the attribution stays put.

Audience: Platform, MLOps, and infra engineers shipping AI systems
Revised: May 2026
Scope: Foundation models, LLM ops, GenAI infra, agentic platforms

Contents

  1. The Role & Why It Exists
  2. Mental Model of the Discipline
  3. Reference Architecture
  4. The Modern AI Platform Stack
  5. Model & Application Lifecycle
  6. Inference & Serving Strategy
  7. RAG & Agentic Systems
  8. Evaluation & Observability
  9. Security, Governance & Safety
  10. Cost & Performance Optimization
  11. Skill Matrix & Competencies
  12. 12-Month Success Strategy
  13. Anti-Patterns & Pitfalls
  14. Platform KPIs & Metrics
  15. 2026+ Outlook
  16. Appendix — Cheat Sheet
  17. References & sources

Executive Summary

The AI Platform Engineer is what you get when classical MLOps runs head-first into foundation models, agents, and retrieval. In practice, you’re building a self-service path so product teams can ship AI features without each squad re-solving serving, eval harnesses, guardrails, and observability from scratch.

At a glance: 7 platform layers to own · 5 core feedback loops · typical cost reduction vs. ad-hoc builds (see §10) · target streaming TTFT < 1.5 s P95.

Below are the diagrams I reuse in reviews, the sequence I use to stand up a platform, and the checklists I expect on-call to actually run. I wrote this against a 2026 stack reality: strong hosted APIs, credible open weights, hybrid retrieval, real agent loops—and finance asking pointed questions about tokens and latency.

1. The Role & Why It Exists

You are hired to multiply a whole org, not to babysit one notebook. The job is the shared path—templates, gateways, registries—that every model, agent, and AI feature follows. Get that right and you stop rebuilding the same plumbing for every launch.

Mission

Make it safe, fast, and cheap for product teams to deploy and operate AI features — from prompts and RAG to fine-tunes and agents.

Outcomes

High developer velocity, predictable latency & cost, compliant data handling, and measurable model quality across the org.

Interfaces

Gateway APIs, SDKs, eval harnesses, dashboards, vector stores, feature stores, prompt registries, model registries.

How the AI Platform Engineer differs from neighbouring roles

Role | Primary Output | Time Horizon | Owns Infra? | Owns Models?
ML Engineer | A trained model for a use case | Sprint–Quarter | Partial | Yes
Data Scientist | Insight, prototype, experiment | Sprint | No | Sometimes
MLOps / LLMOps Engineer | Pipelines for a model or team | Sprint–Quarter | Yes | Operate only
AI Platform Engineer | Reusable platform & shared path | Quarter–Year | Yes (deeply) | Host, evaluate, govern
AI Application / Product Eng | End-user features built on platform | Sprint | No | Consume
Research / Applied Scientist | Novel methods, evaluations | Quarter–Year | No | Yes

2. Mental Model of the Discipline

I keep coming back to seven layers—each with its own SLAs, contracts, and owners. Platform work is as much about the contracts between those layers as it is about any single box.

1 · Product & Experience: Chat UIs · Copilots · Agentic apps · IDE assistants · Workflows
2 · AI Gateway & Orchestration: Routing · Rate limits · Guardrails · Caching · Auth · Cost attribution
3 · Reasoning / Agent Runtime: Tool use · Planning · Memory · Multi-step traces · Workflow engines
4 · Retrieval & Knowledge: Vector + hybrid search · Re-rankers · Chunking · Doc & semantic indexing
5 · Model Serving & Inference: vLLM · TGI · Triton · Bedrock/Anthropic/OpenAI APIs · KV-cache · Batching
6 · Training, Fine-Tuning & Registry: SFT · LoRA · DPO · RLAIF · Experiment tracking · Model registry · Lineage
7 · Foundation (Compute, Data, Identity, Observability): GPUs · K8s · Object stores · IAM · Secrets · Logging · Tracing · Metrics
Cross-cutting: Security · Privacy · Cost & FinOps · Evaluation · Governance · Reliability (SLO/SLA) · Developer Experience
Figure 1 — Seven-layer AI Platform mental model with cross-cutting concerns.

3. Reference Architecture

The reference sketch I draw most often: request in through a gateway, into a reasoning / orchestration layer that fans out to retrieval, tools, and one or more models, then back out through guardrails and telemetry. Training and the registry sit off the hot path and feed serving on a slower cadence.

CLIENTS: Web / Mobile Apps · IDE Plugins · CLIs · Internal APIs / Agents
AI GATEWAY: AuthN / AuthZ (OIDC, SCIM) · Rate limits · Quotas · Tenanting · PII redaction · Prompt firewall · Semantic + prompt cache · Cost attribution · Spend caps
ORCHESTRATION / AGENT RUNTIME: Router · Planner · Reflection · Tool use (MCP, functions) · Short / long-term memory · Workflow engine · DAG · Retries · Guardrails (input + output) · Trace span emitter (OTel)
RETRIEVAL & KNOWLEDGE: Ingestion pipelines · Chunking + embeddings · Vector + BM25 hybrid · Re-ranker (cross-encoder) · Document ACLs / row security
MODEL SERVING: Hosted (Claude, GPT, Gemini) · Self-host (vLLM / TGI / SGLang, Triton / TensorRT-LLM) · Embedding service · Re-rank service · Speculative / continuous batching · Autoscaling · GPU pools
EVALUATION & OBSERVABILITY: Offline evals · Golden sets · Online evals · LLM-as-judge · User feedback & thumbs · Traces · Spans · Token meters · Drift & hallucination detection · SLO dashboards · Alerts
TRAINING · FINE-TUNING · REGISTRY (slow loop): Data lake / lakehouse · Feature / prompt store · Labeling & curation · Experiment tracking · SFT / LoRA / QLoRA · DPO / RLAIF · Distillation · Eval gates → promote · Model Registry (versions, lineage, cards, scorecards) · Prompt Registry (versions, A/B, eval history)
CI/CD: Pipeline-as-code · IaC · Canary & shadow deploy · Rollback
FOUNDATION: GPU clusters · Spot & reserved · Kubernetes · KubeRay · Karpenter · Object stores · Parquet · Delta / Iceberg · Identity · Secrets · KMS · Networking · Private link · Egress control · OpenTelemetry · Prometheus · Loki · FinOps tagging · Showback / Chargeback · DR / Multi-region / Backups · Compliance (SOC2 · HIPAA · ISO)
Figure 2 — End-to-end reference architecture for a production AI platform.

4. The Modern AI Platform Stack (2026)

Tooling churns fast. What matters is the capability slot each tool fills. Pick one per slot, standardize, and write the migration plan before you commit.

Slot | Open-source / Self-host | Managed | Notes
Foundation models (chat) | Llama 3.x/4, Mistral, Qwen 2.5, DeepSeek-V3 | Anthropic Claude 4.7, OpenAI GPT, Google Gemini, AWS Bedrock | Default to hosted for frontier quality; self-host for cost/sovereignty.
Embeddings | BGE-M3, E5, Nomic, Jina | Voyage, OpenAI, Cohere, Vertex | Pick one default + one fallback. Standardize dimensionality.
Re-rankers | BGE-reranker, Cohere Rerank OSS | Cohere, Voyage | Cross-encoder rerank lifts top-k quality 10–30%.
Inference servers | vLLM, TGI, SGLang, TensorRT-LLM, Triton | Bedrock, Together, Fireworks, Anyscale | vLLM is the default for most teams; TRT-LLM for latency-critical.
Vector store | Qdrant, Weaviate, Milvus, pgvector | Pinecone, Turbopuffer, Vertex Vector Search | Use pgvector under 10M vectors; dedicated above.
Orchestration / Agents | LangGraph, LlamaIndex Workflows, Haystack, DSPy | Anthropic / OpenAI Agents APIs, Vertex Agent Builder | Prefer typed, deterministic graphs over freeform agents.
Eval & tracing | Langfuse, Arize Phoenix, Ragas, DeepEval, Promptfoo | Braintrust, Helicone, LangSmith, Datadog LLM Obs | Treat eval as CI — block deploys on regressions.
Guardrails | NeMo Guardrails, Guardrails AI, LlamaGuard, ShieldGemma | Anthropic safety classifiers, Azure Content Safety | Layer input + output checks; never trust one model alone.
Fine-tuning | HF TRL, Axolotl, Unsloth, Llama Factory, LMFlow | Bedrock FT, OpenAI FT, Vertex Tuning | Start with LoRA + DPO. Full SFT only with clear ROI.
Experiment tracking | MLflow, W&B (OSS), Aim | Weights & Biases, Comet | Required for any training workload.
Feature / prompt store | Feast, PromptLayer, Latitude | Tecton, Hopsworks | Versioned prompts behave like code — review & CI them.
Compute orchestration | Kubernetes, KubeRay, Slurm, Volcano, Karpenter | SageMaker, Vertex, Azure ML, RunPod, CoreWeave | Standardize on K8s + Ray for portability.
Data lakehouse | Delta Lake, Apache Iceberg, Hudi | Databricks, Snowflake, BigQuery | Iceberg + Parquet is the open default in 2026.
Observability | OpenTelemetry, Prometheus, Grafana, Loki, Tempo | Datadog, Honeycomb, New Relic | Emit OTel everywhere. Token + latency are first-class metrics.
Tooling protocol | MCP (Model Context Protocol) | Native vendor function calling | MCP is becoming the cross-vendor standard for tools.
Heuristic: Standardize at most two options per slot — one default and one fallback. Three options is a sprawl tax that the platform team will pay for years.
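One way I make that rule enforceable is to keep slot assignments in a small, code-reviewed registry that the gateway and SDK both read. A minimal sketch; the slot names and picks below are illustrative placeholders, not recommendations:

```python
# Illustrative slot registry: exactly one default and one fallback per capability slot.
# Slot names and choices are examples only; keep this file under code review.
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    default: str
    fallback: str

SLOTS: dict[str, Slot] = {
    "chat_model":   Slot(default="hosted-frontier", fallback="self-hosted-oss"),
    "embeddings":   Slot(default="bge-m3",          fallback="hosted-embeddings"),
    "reranker":     Slot(default="bge-reranker",    fallback="hosted-rerank"),
    "vector_store": Slot(default="pgvector",        fallback="qdrant"),
    "eval_harness": Slot(default="promptfoo",       fallback="braintrust"),
}

def resolve(slot: str, prefer_fallback: bool = False) -> str:
    """Return the standardized choice for a slot; anything else needs an exception review."""
    s = SLOTS[slot]
    return s.fallback if prefer_fallback else s.default
```

Anything not in the registry becomes an explicit exception request instead of a quiet third option.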

5. Model & Application Lifecycle

AI features have two clocks: a fast loop (prompts, retrieval tweaks, configs) measured in hours; and a slow loop (fine-tunes, base-model swaps) measured in weeks. Your platform must serve both without making either painful.

FAST LOOP (hours to days · prompt / config / retrieval): Idea / ticket (PRD, success metric) → Prompt + RAG draft (notebook, playground) → Eval suite, offline (golden + adversarial) → Canary / shadow (5% → 25% → 100%) → Production + online eval (telemetry, feedback)
SLOW LOOP (weeks · fine-tune / base-model swap / data refresh): Collect prod data (traces, failures, ratings) → Curate & label (PII strip, golden set) → Train / fine-tune (LoRA, DPO, distill) → Eval gates (block regression) → Registry → serving (versioned, cardable). Production signals feed back into the slow loop.
Figure 3 — Two-clock lifecycle: a fast iteration loop on prompts/RAG and a slow loop on models/data.

Lifecycle artifacts the platform owns

6. Inference & Serving Strategy

Latency budget

For a streaming chat UX, target TTFT < 1.5 s P95 and inter-token < 80 ms. For agentic tool-using flows, budget end-to-end P95 < 8 s per step.

  • Cache: prompt prefix cache (KV) and semantic response cache.
  • Decode: speculative decoding can cut latency 1.5–3×.
  • Batching: continuous (in-flight) batching is non-negotiable.
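These budgets only matter if you measure them the same way everywhere. A sketch of measuring TTFT and inter-token latency from the client side of a streaming call; stream_tokens is a stand-in for whatever async SDK or gateway iterator you actually use:

```python
# Measure time-to-first-token (TTFT) and inter-token latency (ITL) around any streaming client.
# `stream_tokens` is a hypothetical async iterator yielding text chunks from your gateway/SDK.
import time
from typing import AsyncIterator

async def timed_stream(stream_tokens: AsyncIterator[str]) -> dict:
    start = time.perf_counter()
    ttft = None
    gaps, last = [], None
    async for _chunk in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start            # first chunk observed -> TTFT
        if last is not None:
            gaps.append(now - last)       # gap between consecutive chunks -> ITL samples
        last = now
    return {
        "ttft_s": ttft,
        "itl_p50_ms": 1000 * sorted(gaps)[len(gaps) // 2] if gaps else None,
        "e2e_s": (last - start) if last else None,
    }
```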

Throughput

Token throughput is the unit of capacity. Plan in tokens/sec/GPU, not requests.

  • KV cache memory is usually the bottleneck before FLOPs (rough sizing sketched below).
  • Quantize to FP8 / INT4 (AWQ, GPTQ) for 2–4× speed when quality allows.
  • Use paged attention & chunked prefill for long contexts.
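A back-of-envelope sizing makes the "KV cache before FLOPs" point concrete. The model dimensions below are roughly those of an 8B GQA model in fp16, and the GPU memory figures are assumptions for illustration, not measurements:

```python
# Rough KV-cache sizing: how many full-context sequences fit in leftover GPU memory.
# Model dimensions and memory figures below are assumptions for illustration.
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # 2 = one K and one V tensor per layer

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)  # fp16
ctx_len = 8192                                   # tokens kept in cache per sequence
per_seq_gib = per_token * ctx_len / 2**30        # ~1.0 GiB per 8k-token sequence here

gpu_gib, weights_gib, overhead_gib = 80, 16, 8   # assumed: 80 GB card, 8B weights in fp16, runtime overhead
concurrent = int((gpu_gib - weights_gib - overhead_gib) * 2**30 // (per_token * ctx_len))
print(f"{per_token / 1024:.0f} KiB/token, {per_seq_gib:.2f} GiB/seq, ~{concurrent} full-context seqs")
```

With these assumed numbers the cache costs about 128 KiB per token, so roughly 56 full 8k-token sequences fit before the FLOPs ever become the constraint.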

Routing

A model router in the gateway picks the cheapest model that meets quality + latency:

  • Tier 1 — frontier (Claude 4.7, GPT) for hard reasoning.
  • Tier 2 — mid (Haiku, Sonnet small, Gemini Flash) for default traffic.
  • Tier 3 — local (Llama, Qwen) for cheap, cacheable, or PII-sensitive paths.
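To make the tiering concrete, a minimal routing sketch; the tier names, prices, eval scores, and latency numbers are placeholders you would replace with your own golden-set results and meters:

```python
# Pick the cheapest tier that clears the quality floor and latency budget.
# Tier names, prices, quality scores, and latencies are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_mtok: float    # blended $/1M tokens (assumed)
    eval_score: float      # pass rate on your golden set (assumed)
    p95_ms: int            # end-to-end P95 for typical prompts (assumed)
    local: bool            # runs inside your boundary (for PII-sensitive paths)

TIERS = [
    Tier("local-oss",  0.40, 0.74,  900, local=True),
    Tier("mid-hosted", 3.00, 0.85, 1200, local=False),
    Tier("frontier",  20.00, 0.95, 2500, local=False),
]

def route(min_score: float, latency_budget_ms: int, pii_sensitive: bool) -> Tier:
    ok = [t for t in TIERS
          if t.eval_score >= min_score
          and t.p95_ms <= latency_budget_ms
          and (t.local or not pii_sensitive)]
    if not ok:
        raise RuntimeError("no tier meets the constraints; degrade gracefully or queue")
    return min(ok, key=lambda t: t.usd_per_mtok)
```

Under these made-up numbers, a PII-sensitive classification call with min_score=0.7 lands on the local tier, while a hard reasoning request with min_score=0.9 escalates to the frontier tier.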

Failure modes

  • Upstream provider 5xx / throttling — automatic cross-vendor failover.
  • Long-context OOM — preflight estimate + truncate strategy.
  • Hot prompts — semantic cache + dedup on identical concurrent requests.
  • Tool calls that hang — per-step timeouts and partial-result returns.
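For the provider 5xx and hanging-call cases above, the shape I reach for is an ordered failover list with a per-attempt timeout. A sketch; the provider callables are stand-ins for your own SDK wrappers, and the timeout is an assumption:

```python
# Ordered cross-vendor failover with a per-attempt timeout.
# `providers` is a list of async callables (your SDK wrappers); names and timeout are placeholders.
import asyncio
from typing import Awaitable, Callable

async def call_with_failover(
    providers: list[Callable[[str], Awaitable[str]]],
    prompt: str,
    per_attempt_timeout_s: float = 8.0,
) -> str:
    last_err: Exception | None = None
    for call in providers:                 # ordered: primary first, fallbacks after
        try:
            return await asyncio.wait_for(call(prompt), timeout=per_attempt_timeout_s)
        except Exception as err:           # timeouts, 5xx, throttles surfaced as exceptions
            last_err = err                 # record and try the next provider
    raise RuntimeError(f"all providers failed: {last_err}")
```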
Client (streams SSE) → Gateway (auth · cache · route) → semantic cache hit? return cached → Router decision (tier · region · cost) → Guardrails, input (PII · jailbreak) → Hosted (Claude/GPT via vendor SDK) or vLLM cluster (paged KV · batching), alongside Embed + Rerank services (CPU/GPU pool) → Output guardrails (PII · toxicity · schema)
Figure 4 — Inference path with caching, routing, guardrails, and multi-target serving.

7. RAG & Agentic Systems

The honest RAG pipeline

RAG quality is bottlenecked by chunking, retrieval recall, and re-ranking — in that order. Most "RAG isn't working" investigations end in those three.

Stage | What it does | Levers | Common failure
Parse | PDF, HTML, slides → text + structure | Layout-aware parsers, OCR fallback | Tables flattened, losing rows
Chunk | Split into retrievable units | Semantic chunking, parent-doc, sliding window | Chunks too small / too big
Embed | Vector representation | Domain-tuned embeddings, dim choice | Mismatched embedding model on query vs. doc
Index | Vector + BM25 | Hybrid search, metadata filters, ACLs | Permission leakage, missing filters
Retrieve | Top-k candidates | k, MMR, multi-query, HyDE | k too small; over-confidence on single retrieval
Re-rank | Cross-encoder reorder | BGE / Cohere rerank, listwise rerank | Skipping rerank — biggest quality miss
Compose | Prompt assembly | Citations, structured templates | Context bloat, lost-in-the-middle
Generate | Final answer | Grounding instructions, "refuse if unknown" | Hallucination, citation drift
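A sketch of the Retrieve → Re-rank core of that table: dense and lexical candidates merged with reciprocal rank fusion (RRF), then reordered by a cross-encoder. vector_search, bm25_search, and cross_encoder_score are placeholders for whatever store and reranker you standardized on:

```python
# Hybrid retrieval: merge dense + lexical candidates with reciprocal rank fusion, then rerank.
# vector_search / bm25_search / cross_encoder_score are placeholders for your actual stack.
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over ranked lists of doc ids."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, vector_search, bm25_search, cross_encoder_score, top_k: int = 8):
    dense = vector_search(query, limit=50)        # ranked doc ids from the vector index
    sparse = bm25_search(query, limit=50)         # ranked doc ids from the lexical index
    candidates = rrf_fuse([dense, sparse])[:50]
    # Cross-encoder rerank: score (query, doc) pairs jointly, keep only the best few.
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_k]
```

The point of the sketch is the shape, not the libraries: two cheap candidate generators, one fusion step, one expensive reranker at the end.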

Agent topology

In 2026 the dominant pattern is typed workflows with bounded autonomy: an LLM drives a directed graph of tools, with planning + reflection at well-defined nodes. Pure freeform "ReAct" loops are reserved for exploratory tasks; production prefers LangGraph-style deterministic graphs with checkpointing.

Intent classify (small model) → Planner (decomposes into steps) → Retriever (RAG · web · DB) → Tool dispatcher (MCP / functions) → Critique / reflect (self-check) → Reducer / aggregator (merges step outputs) → Responder (cites · formats · streams)
Figure 5 — Typed agent graph: classify → plan → retrieve → tools → reflect → aggregate → respond.
Watch out: Token cost and latency scale with steps × context, not with model price alone. A "cheap" model in a 12-step loop is often more expensive than one frontier call.
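What budget enforcement looks like in practice: a loop with a hard step cap, a total token budget, and per-step timeouts, so a runaway agent degrades into a partial answer instead of a surprise invoice. The step interfaces and numbers below are illustrative assumptions, not a prescribed runtime:

```python
# Bounded agent loop: hard caps on steps and total tokens, plus per-step timeouts.
# plan_next_step / execute_step are placeholders; the budget numbers are illustrative.
import time
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class Budget:
    max_steps: int = 8
    max_total_tokens: int = 60_000
    per_step_timeout_s: float = 8.0
    used_tokens: int = 0
    steps: int = 0

    def charge(self, tokens: int) -> None:
        self.steps += 1
        self.used_tokens += tokens
        if self.steps > self.max_steps or self.used_tokens > self.max_total_tokens:
            raise BudgetExceeded()

def run_agent(plan_next_step, execute_step, budget: Budget) -> dict:
    """plan_next_step / execute_step stand in for your planner and tool dispatcher."""
    results = []
    try:
        while (step := plan_next_step(results)) is not None:
            started = time.monotonic()
            output, tokens = execute_step(step, timeout_s=budget.per_step_timeout_s)
            budget.charge(tokens)
            results.append({"step": step, "output": output, "secs": time.monotonic() - started})
    except BudgetExceeded:
        results.append({"step": "halt", "output": "budget exceeded; returning partial results"})
    return {"results": results, "tokens": budget.used_tokens, "steps": budget.steps}
```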

8. Evaluation & Observability

If you remember one rule: no eval, no deploy. Evals are the unit tests of probabilistic systems. The platform should make them as easy to write as a Python test and as automatic to run as CI.

Offline Evals

Run on fixed datasets. Used in CI to block regressions.

  • Golden Q&A sets
  • Adversarial / red-team sets
  • Tool-use trajectories
  • RAG faithfulness (Ragas)
Online Evals

Run on live traffic, sampled. Detect drift and emerging failure modes.

  • LLM-as-judge on sampled responses
  • Heuristic detectors (refusal, length, format)
  • User feedback (thumbs, edits)
Telemetry

The substrate every other eval reads from.

  • Per-call: prompt, tools, tokens, cost, latency
  • Per-session: trace tree (OTel)
  • Per-user: redacted, with consent
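The "no eval, no deploy" rule only sticks when the gate runs as an ordinary test. A minimal sketch of a golden-set gate as a pytest test; call_feature, the golden-set path, and the pass-rate floor are placeholders, and a real harness (Promptfoo, Braintrust, and the like) replaces the naive grading:

```python
# Golden-set gate as a plain pytest test: CI fails the deploy when the pass rate regresses.
# `call_feature`, the golden-set path, and the floor are placeholders for your own setup.
import json
import pathlib

from myapp import call_feature   # hypothetical import: your feature entrypoint under test

PASS_RATE_FLOOR = 0.90           # assumed threshold; tune per feature and version it with the prompt

def grade(answer: str, expected: str) -> int:
    return int(expected.lower() in answer.lower())   # naive check; swap in a rubric or LLM judge

def test_golden_set_pass_rate():
    lines = pathlib.Path("evals/golden.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines]
    passed = sum(grade(call_feature(c["input"]), c["expected"]) for c in cases)
    pass_rate = passed / len(cases)
    assert pass_rate >= PASS_RATE_FLOOR, f"golden-set pass rate {pass_rate:.2%} below floor"
```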

Metric taxonomy

Category | Metric | Why it matters
Quality | Task success / pass@1 | Did it do the job?
Quality | Faithfulness / groundedness | Hallucination signal for RAG
Quality | Citation precision / recall | Trust signal for source-backed answers
Performance | TTFT, ITL, E2E P50/P95/P99 | User-perceived latency
Performance | Tokens / sec / GPU | Capacity planning
Performance | Tool-call success rate | Agentic reliability
Cost | $ / request, $ / resolved task | Unit economics
Cost | Cache hit rate | Spend efficiency
Cost | Tokens in / out per feature | Drives the bill
Safety | Refusal rate (correct & incorrect) | Over/under-refusal
Safety | Jailbreak success rate (red-team) | Attack surface
Safety | PII leak rate | Compliance

9. Security, Governance & Safety

Top Threats (LLM Top 10)
  1. Prompt injection (direct + indirect)
  2. Sensitive data leakage via prompts & logs
  3. Insecure tool / plugin design (overscoped tokens)
  4. Training data poisoning
  5. Model DoS via long-context / recursive agent loops
  6. Supply chain (weights, deps, MCP servers)
  7. Excessive agency (unbounded tool actions)
  8. Overreliance — humans trusting wrong answers
  9. Model theft / weight exfiltration
  10. Output handling (XSS, SQLi from generated content)
Controls
  • Gateway-enforced auth, quotas, content policies
  • Defense-in-depth guardrails: input + tool + output
  • Least-privilege tools — scoped tokens, dry-run modes
  • Sandboxed code execution for agentic actions
  • Redaction before logging; encrypted at rest
  • Data residency & regional routing
  • Model cards + evaluation scorecards for every release
  • Audit log of every prompt, tool call, and decision
  • Kill switch per model, per feature, per tenant
Indirect prompt injection (malicious instructions embedded in retrieved documents or web pages) is the dominant attack vector for agentic systems. Treat every retrieved string as untrusted user input, even from "internal" sources.
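One way to make "least-privilege tools" and "excessive agency" enforceable rather than aspirational: every tool the planner can call declares its scopes and whether it is reversible, and the dispatcher checks both before execution. A sketch; the scope names, tool fields, and human-approval hook are illustrative assumptions:

```python
# Tool dispatch with declared scopes, dry-run support, and a human gate on irreversible actions.
# Scope names, tool fields, and the approval hook are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    scopes: frozenset[str]        # e.g. {"tickets:read"}; never a blanket "admin:*"
    reversible: bool
    run: Callable[..., str]

def dispatch(tool: Tool, granted_scopes: set[str], dry_run: bool,
             approved_by_human: bool, **args) -> str:
    if not tool.scopes <= granted_scopes:
        raise PermissionError(f"{tool.name} needs {tool.scopes - granted_scopes}")
    if not tool.reversible and not approved_by_human:
        raise PermissionError(f"{tool.name} is irreversible; requires human approval")
    if dry_run:
        return f"[dry-run] would call {tool.name} with {args}"
    return tool.run(**args)
```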

Compliance map (typical enterprise)

Regime | Implication for the platform
SOC 2 / ISO 27001 | Access logging, change management, vendor risk on model providers.
GDPR / CCPA | DSR support, deletion across vector indexes, lawful basis for training data.
HIPAA | BAA-backed providers only; PHI tokens never leave the boundary.
EU AI Act | Risk classification per use case; transparency & conformity for high-risk deployments.
Sector (PCI, FedRAMP, etc.) | Air-gapped or self-hosted inference; key management; audit trails.

10. Cost & Performance Optimization

Cost is a platform feature, not an afterthought. Bake the controls in once; reuse them everywhere.

$1.00 naive baseline → −25% right-size model → −15% prompt compression → −12% semantic cache → −9% prompt-prefix cache → −6% batching / speculation → −5% quantization (FP8 / INT4) → $0.28 optimized total
Figure 6 — Illustrative cost reduction waterfall stacking standard platform levers (representative, not benchmarked).

Cost levers I default to

  1. Measure first. Token meters by feature, team, and tenant — at the gateway.
  2. Right-size. Most "needs frontier" requests are actually solvable by a mid-tier model with better prompts.
  3. Cache aggressively. Prompt-prefix cache (vendor or vLLM), semantic response cache, embedding cache.
  4. Compress. Summarize past turns; strip system prompts to essentials; use structured tool I/O instead of prose.
  5. Batch and stream. Continuous batching + streaming improves both UX and $/token.
  6. Quantize where safe. Re-run quality evals after every quantization swap.
  7. Spot & reserved. Train on spot; serve on reserved + on-demand burst.
  8. Spend caps & alerts. Per-tenant, per-feature; auto-degrade rather than fail.
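A sketch of lever 8: per-tenant caps that degrade the experience (cheaper tier, then cache-only) before they block it. The thresholds and degradation ladder are assumptions to adapt per tenant:

```python
# Per-tenant spend caps with graceful degradation instead of hard failures.
# Budgets, thresholds, and the degradation ladder are illustrative assumptions.
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"           # route as usual
    DOWNGRADE = "downgrade"     # force mid/local tiers, shrink context
    CACHE_ONLY = "cache_only"   # serve cached/canned answers, alert the feature owner

def spend_mode(spent_usd: float, monthly_cap_usd: float) -> Mode:
    ratio = spent_usd / monthly_cap_usd
    if ratio < 0.80:
        return Mode.NORMAL
    if ratio < 1.00:
        return Mode.DOWNGRADE
    return Mode.CACHE_ONLY

# Example: a tenant at $940 of a $1,000 cap gets downgraded rather than blocked.
assert spend_mode(940, 1_000) is Mode.DOWNGRADE
```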

11. Skill Matrix & Competencies

Below is the T-shape of an AI Platform Engineer: a broad horizontal across the stack, with depth in 2–3 vertical columns that match your team's bottleneck.

Domain | Foundational | Practitioner | Expert
Software engineering | Python, Git, basic SQL, REST | Typed Python or Go, async, testing, design patterns, gRPC | Distributed systems, API design, library authoring
Cloud & infra | Linux, Docker, one cloud | Kubernetes, Terraform, IAM, networking, multi-region | GPU clusters, networking for ML, multi-cloud, capacity planning
ML fundamentals | Train a basic model, metrics | Transformers, attention, tokenization, training loops, eval design | Scaling laws, distillation, RLHF/DPO, custom kernels
LLM systems | Use an API, write prompts | RAG, tool use, agents, structured output, function calling | Inference engine tuning, custom routers, multi-agent design
Data engineering | Parquet, basic ETL | Lakehouse, streaming, vector + hybrid search | Petabyte data infra, lineage, real-time features
Observability | Logs, metrics | OTel, traces, SLOs, anomaly detection | End-to-end AI-specific observability platforms
Security & governance | Auth basics, secrets | OWASP LLM top 10, IAM, key mgmt, compliance basics | Threat modeling for agents, red-team, regulatory programs
Product & UX | Understand user journeys | Translate UX needs to latency / quality budgets | Drive product strategy with AI capability roadmaps

Tooling fluency checklist

Build & ship

Python · Go · TypeScript · Docker · Kubernetes · Helm · Terraform · Pulumi · GitHub Actions · Argo

AI/ML

PyTorch · HF Transformers · vLLM · TGI · TensorRT-LLM · LangGraph · LlamaIndex · DSPy · Ray · Axolotl · Unsloth · MCP

Data & ops

Postgres + pgvector · Qdrant · Weaviate · Iceberg · Delta · MLflow · W&B · Langfuse · Phoenix · Braintrust · OpenTelemetry · Prometheus

12. 12-Month Success Strategy

A practical sequencing that works whether you are starting a new platform team or joining one mid-flight. Each quarter ends with a visible win that demonstrates value beyond infrastructure.

Q1 · LISTEN & LAND: Map every AI use case · Audit current stack & spend · Stand up an AI Gateway · OTel + token meters everywhere · One eval harness for golden sets. Win: a visible spend & quality dashboard the org has never had before.
Q2 · STANDARDIZE: Default RAG service + reranker · Prompt + model registries · Guardrail framework (in/out) · Self-host an OSS model for cheap paths · Eval-in-CI for every team. Win: 2 teams migrated to the paved road, with measurable cost / quality lift.
Q3 · SCALE & OPTIMIZE: Multi-region serving + failover · Semantic + prefix caches in gateway · Agent runtime (typed graphs) · Fine-tune pipeline (LoRA + DPO) · Red-team program. Win: 30–50% cost cut on the top 3 use cases at equal or better quality.
Q4 · COMPOUND: Production agents in 2+ products · Continuous evals on live traffic · Monthly FinOps reviews · Public model cards & risk scorecards · Capacity plan for next year. Win: AI features ship in a sprint, not a quarter; the platform is invisible.
Figure 7 — A 12-month sequence that compounds: listen, standardize, scale, compound.

Day-0 → Day-90 checklist (for joining a team)

Days 1–30 — Learn

  • Inventory every model, prompt, RAG index, agent in production.
  • Read 6 months of incidents & postmortems.
  • Interview 5 product engineers about friction.
  • Reproduce the top use case end-to-end locally.
  • Map current cost by feature / team / model.

Days 31–60 — Stabilize

  • Centralize all model traffic through one gateway.
  • Wire OpenTelemetry into every call path.
  • Adopt one eval framework org-wide.
  • Define SLOs for TTFT, E2E, and error rate.
  • Fix the worst hallucination / safety bug.

Days 61–90 — Standardize

  • Publish the paved-road tutorial & SDK.
  • Migrate one flagship product to the platform.
  • Stand up the prompt + model registries.
  • Run a tabletop red-team on agents.
  • Quarterly cost & quality report for leadership.

Beyond — Compound

  • Self-host for the top 1–2 cost lines.
  • Fine-tune where it changes unit economics.
  • Introduce automated routing across model tiers.
  • Bring agent workflows under typed graph contracts.
  • Run continuous online evals + drift alerts.

13. Anti-Patterns & Pitfalls

Architecture smells
  • No gateway. Every team holds its own API keys; cost is invisible.
  • Three vector DBs. Pick one. Re-evaluate yearly, not weekly.
  • Prompts in product code. No versioning, no eval history.
  • Agents in freeform ReAct. Untyped, untestable, expensive.
  • Re-embedding only "new" docs after changing the embedding model — silent recall collapse.
Process smells
  • Eyeballing outputs in a Slack thread instead of an eval set.
  • Shipping to 100% behind a feature flag with no shadow phase.
  • No model card, no risk classification.
  • "It works on the demo" — never tested under concurrency.
  • Vendor lock-in by hard-coding one SDK throughout the codebase.
Cost smells
  • Frontier model on every call, including classification.
  • Massive system prompts duplicated per request (no prefix cache).
  • Re-embedding the entire corpus daily.
  • Long-running agents with no per-step or total token budget.
  • Logging full prompts & responses unredacted in hot paths.
Safety smells
  • Trusting retrieved content as "safe" — indirect prompt injection.
  • Tools with broad scopes (admin, write-DB) handed to a planner LLM.
  • No human-in-the-loop on irreversible actions.
  • One vendor for both generation and judging in evals.
  • No kill switch per feature/tenant.

14. Platform KPIs & Metrics

Five chart buckets I want live by day 30, and that I can explain to a non-technical exec by day 90.

Velocity

  • Time-to-first-prototype
  • Time-to-production
  • # teams on the shared platform path

Quality

  • Eval pass rate per feature
  • Hallucination rate (sampled)
  • User satisfaction (thumbs / CSAT)

Reliability

  • Error budget burn
  • TTFT / E2E P95
  • Incidents per quarter

Cost

  • $ per resolved task
  • Cache hit rate
  • Spend per feature / tenant

Safety

  • Red-team success rate (lower is better)
  • PII leak rate
  • Refusal correctness

DX

  • NPS from product engineers
  • Docs / SDK satisfaction
  • # of "rebuild from scratch" requests (lower is better)

15. 2026+ Outlook

What's stabilizing

  • MCP as the cross-vendor tool protocol.
  • Typed agent graphs over freeform agents.
  • Hybrid retrieval + cross-encoder reranking as the RAG default.
  • Eval-in-CI as a non-negotiable for production.
  • Token / cost as first-class observability dimensions.

What's still volatile

  • Long-running, persistent agents — memory + identity remain unsolved.
  • Multi-modal RAG (video, audio, 3D) — tooling is early.
  • On-device + edge inference of capable models.
  • Verifiable safety claims for autonomous agents.
  • The regulatory surface (EU AI Act, sectoral rules).
My bias: the teams that win aren't chasing the cleverest one-off diagram—they're the ones where the default path is so boringly good that a product engineer can ship something safe, measured, and observable in days without hand-holding from infra.

16. Appendix — One-page Cheat Sheet

Daily diagnostic questions

  1. What is our top spend feature, and what fraction of its tokens hit a cache?
  2. What is the P95 TTFT, and where is the long tail?
  3. Which feature regressed quality this week? Did CI catch it?
  4. What's the refusal rate, and are refusals correct?
  5. Any tool / MCP server added this week — was it threat-modeled?

Default architectural choices (start here)

  • Gateway: one place, one auth, one cost meter.
  • Models: 2 hosted tiers + 1 self-hosted OSS.
  • Retrieval: pgvector → Qdrant when >10M vectors. Always rerank.
  • Orchestration: typed graphs (LangGraph / native SDK agents).
  • Evals: Promptfoo or Braintrust in CI; Langfuse / Phoenix in prod.
  • Observability: OTel everything; tokens as a first-class metric.
  • Guardrails: input + tool + output, with at least one non-LLM check.
  • Registry: prompts, models, datasets — all versioned, all reviewable.
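To make "tokens as a first-class metric" concrete, a sketch of wrapping a model call in an OpenTelemetry span and attaching token counts as attributes. The gen_ai.* attribute names follow my reading of the still-evolving GenAI semantic conventions, so verify them against the current spec; call_model is a placeholder for your SDK wrapper:

```python
# Wrap a model call in an OpenTelemetry span and record tokens as span attributes.
# Attribute names follow the (still-evolving) GenAI semantic conventions; verify before standardizing.
from opentelemetry import trace

tracer = trace.get_tracer("ai-platform.gateway")

def traced_completion(call_model, prompt: str, model_name: str) -> str:
    """`call_model` is a placeholder returning (text, input_tokens, output_tokens)."""
    with tracer.start_as_current_span("gen_ai.completion") as span:
        span.set_attribute("gen_ai.request.model", model_name)
        text, tokens_in, tokens_out = call_model(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", tokens_in)
        span.set_attribute("gen_ai.usage.output_tokens", tokens_out)
        return text
```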

17. References & sources

The body of this note is opinionated engineering text; the list below is for verifiable sources on production ML platforms—technical debt, serving, lifecycle tooling, retrieval, agents, and the compliance language execs will use. I use it as a reading list, not as a substitute for your vendor’s contract or your counsel.

Note: Use arXiv / in-press versions for speed; verify the published venue when you need a DOI. Vendor runbooks (Bedrock, Vertex, OpenAI, Anthropic, inference runtimes) change often—treat them as the authority for limits, regions, and pricing.

ML systems, platform practice & lifecycle

  1. Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NeurIPS / NIPS 2015. Foundational essay on why “the model is the easy part”—informs §1–§3 platform boundaries and §13 anti-patterns. NeurIPS proceedings
  2. Breck et al., “What’s your ML Test Score? A rubric for ML production readiness.” Google, 2017 (expanded rubric). Checklist DNA behind production gates in §5–§8, §14. Google Research
  3. Amershi et al., “Software Engineering for Machine Learning: A Case Study.” ICSE 2019. How mature teams operationalize ML—maps to the shared platform path and role split in §1, §11. https://arxiv.org/abs/1811.04900
  4. Polyzotis et al., “Data Management Challenges in Production Machine Learning.” SIGMOD 2017. Data/lineage/registries—underpins §5–§7 and the foundation layer in Figure 1. ACM Digital Library
  5. Zaharia et al., “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng. Bull. 2018. Experiment tracking, packaging, registry concepts in §5–§6. IEEE Xplore
  6. Huyen, Designing Machine Learning Systems. O’Reilly, 2022. End-to-end lifecycle and deployment patterns echoed across §3–§11. ISBN 978-1098107969.
  7. Kleppmann, Designing Data-Intensive Applications. O’Reilly, 2017. Storage, streams, and reliability vocabulary for retrieval, logging, and registries (§3–§7). ISBN 978-1449373320.

Kubernetes, scheduling & “cloud-native” serving context

  1. Burns et al., “Borg, Omega, and Kubernetes.” ACM Queue / Communications of the ACM 2016. Conceptual background for running inference and control planes on orchestrators (§3, §6, §7 foundation). ACM Queue
  2. Verma et al., “Large-scale cluster management at Google with Borg.” EuroSys 2015. Scheduling, quotas, SLO-aware placement—relevant to GPU/tenant fairness in §6, §10. ACM Digital Library

Inference, memory management & throughput

  1. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). Core citation for KV-cache/paging discussion in §6. https://arxiv.org/abs/2309.06180
  2. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Iteration-level scheduling and batching—adjacent to §6 latency/cost tradeoffs. USENIX OSDI 22
  3. Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device inference design space. https://arxiv.org/abs/2211.05102
  4. Aminabadi et al., “DeepSpeed-Inference: enabling efficient inference of transformer models at unprecedented scale.” SC 2022. High-scale serving stack context. https://arxiv.org/abs/2207.00032

Foundation models, adaptation & training interfaces

  1. Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer substrate for everything in §5–§6. https://arxiv.org/abs/1706.03762
  2. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights / self-hosting storyline in exec summary, §5–§6. https://arxiv.org/abs/2302.13971
  3. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in Figure 1 layer 6 and §5–§10. https://arxiv.org/abs/2106.09685
  4. Rafailov et al., “Direct Preference Optimization (DPO).” NeurIPS 2023. Preference optimization named in §5–§6. https://arxiv.org/abs/2305.18290

RAG, retrieval & hybrid search

  1. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG—§7. https://arxiv.org/abs/2005.11401
  2. Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” FnTIR 2009. Lexical side of hybrid retrieval in §7. https://dl.acm.org/doi/10.1561/1500000019
  3. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval practice. https://arxiv.org/abs/2004.04906
  4. Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. Single entrypoint for chunking/index/query research. https://arxiv.org/abs/2312.10997

Agents, tools & orchestration protocols

  1. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Agent loop mental model in §2–§3, §7. https://arxiv.org/abs/2210.03629
  2. Schick et al., “Toolformer.” 2023. Tool-learning framing. https://arxiv.org/abs/2302.04761
  3. Model Context Protocol (MCP). Cross-vendor tool/context standard referenced in §3, §7, §15. https://modelcontextprotocol.io/
  4. OpenAPI Specification. HTTP tool contracts at the gateway boundary (§3). https://spec.openapis.org/oas/latest.html

Evaluation, observability & SLO discipline

  1. Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG quality metrics named in the appendix cheat sheet / §8. https://arxiv.org/abs/2309.15217
  2. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging—§8. https://arxiv.org/abs/2306.05685
  3. Beyer et al. (eds.), Site Reliability Engineering. O’Reilly / Google, 2016. SLOs, error budgets, incident practice—§2 cross-cutting, §8, §14. https://sre.google/sre-book/table-of-contents/
  4. OpenTelemetry. Vendor-neutral traces/metrics/logs—appendix “OTel everything” and §8. https://opentelemetry.io/
  5. W3C Trace Context. Propagation standard for distributed traces through gateways and tool calls. https://www.w3.org/TR/trace-context/

Security, governance & compliance

  1. OWASP Top 10 for Large Language Model Applications. Threat taxonomy for §9. OWASP project page
  2. Greshake et al., “Indirect Prompt Injection” (real-world LLM-integrated apps). 2023. Pairs with gateway + retrieved-content trust models in §3, §9. https://arxiv.org/abs/2302.12173
  3. NIST AI Risk Management Framework (AI RMF 1.0). Risk language for enterprise governance in §9. https://www.nist.gov/itl/ai-risk-management-framework
  4. EU Artificial Intelligence Act (Regulation (EU) 2024/1689). Legal primary source for §15 regulatory bullet. EUR-Lex

Identity, multi-tenancy & API security (platform gateway)

  1. Hardt et al., “The OAuth 2.0 Authorization Framework.” RFC 6749. Baseline for delegated access patterns at the AI gateway. RFC 6749
  2. OpenID Connect Core 1.0. OIDC for AuthN in browser/API flows—referenced conceptually in §3 gateway. OpenID Foundation
Disclaimer. References are for education; they don’t list every vendor in the diagrams, and links aren’t endorsements. Canonical page for this file: LinhTruong.com (Linh Truong).