How I explain the role, draw the platform for execs and ICs, and sequence the work—gateway, retrieval, serving, evals, and governance—when teams are tired of one-off LLM demos.
The AI Platform Engineer is what you get when classical MLOps runs head-first into foundation models, agents, and retrieval. In practice, you’re building a self-service path so product teams can ship AI features without each squad re-solving serving, eval harnesses, guardrails, and observability from scratch.
At a glance: 7 platform layers to own · 5 core feedback loops · 3× typical cost reduction vs. ad-hoc · target streaming TTFT P95 < 1.5 s.
Below are the diagrams I reuse in reviews, the sequence I use to stand up a platform, and the checklists I expect on-call to actually run. I wrote this against a 2026 stack reality: strong hosted APIs, credible open weights, hybrid retrieval, real agent loops—and finance asking pointed questions about tokens and latency.
1. The Role & Why It Exists
You are hired to multiply a whole org, not to babysit one notebook. The job is the shared path—templates, gateways, registries—that every model, agent, and AI feature follows. Get that right and you stop rebuilding the same plumbing for every launch.
Mission
Make it safe, fast, and cheap for product teams to deploy and operate AI features — from prompts and RAG to fine-tunes and agents.
Outcomes
High developer velocity, predictable latency & cost, compliant data handling, and measurable model quality across the org.
How the AI Platform Engineer differs from neighbouring roles
Role | Primary Output | Time Horizon | Owns Infra? | Owns Models?
ML Engineer | A trained model for a use case | Sprint–Quarter | Partial | Yes
Data Scientist | Insight, prototype, experiment | Sprint | No | Sometimes
MLOps / LLMOps Engineer | Pipelines for a model or team | Sprint–Quarter | Yes | Operate only
AI Platform Engineer | Reusable platform & shared path | Quarter–Year | Yes (deeply) | Host, evaluate, govern
AI Application / Product Eng | End-user features built on platform | Sprint | No | Consume
Research / Applied Scientist | Novel methods, evaluations | Quarter–Year | No | Yes
2. Mental Model of the Discipline
I keep coming back to seven layers—each with its own SLAs, contracts, and owners. Platform work is as much about the contracts between those layers as it is about any single box.
Figure 1 — Seven-layer AI Platform mental model with cross-cutting concerns.
3. Reference Architecture
The reference sketch I draw most often: request in through a gateway, into a reasoning / orchestration layer that fans out to retrieval, tools, and one or more models, then back out through guardrails and telemetry. Training and the registry sit off the hot path and feed serving on a slower cadence.
Figure 2 — End-to-end reference architecture for a production AI platform.
4. The Modern AI Platform Stack (2026)
Tooling churns fast. What matters is the capability slot each tool fills. Pick one per slot, standardize, and write the migration plan before you commit.
Slot | Open-source / Self-host | Managed | Notes
Foundation models (chat) | Llama 3.x/4, Mistral, Qwen 2.5, DeepSeek-V3 | Anthropic Claude 4.7, OpenAI GPT, Google Gemini, AWS Bedrock | Default to hosted for frontier quality; self-host for cost/sovereignty.
Embeddings | BGE-M3, E5, Nomic, Jina | Voyage, OpenAI, Cohere, Vertex | Pick one default + one fallback. Standardize dimensionality.
Re-rankers | BGE-reranker, Cohere Rerank OSS | Cohere, Voyage | Cross-encoder rerank lifts top-k quality 10–30%.
Inference servers | vLLM, TGI, SGLang, TensorRT-LLM, Triton | Bedrock, Together, Fireworks, Anyscale | vLLM is the default for most teams; TRT-LLM for latency-critical.
Guardrails | — | — | Layer input + output checks; never trust one model alone.
Fine-tuning | HF TRL, Axolotl, Unsloth, Llama Factory, LMFlow | Bedrock FT, OpenAI FT, Vertex Tuning | Start with LoRA + DPO. Full SFT only with clear ROI.
Experiment tracking | MLflow, W&B (OSS), Aim | Weights & Biases, Comet | Required for any training workload.
Feature / prompt store | Feast, PromptLayer, Latitude | Tecton, Hopsworks | Versioned prompts behave like code — review & CI them.
Compute orchestration | Kubernetes, KubeRay, Slurm, Volcano, Karpenter | SageMaker, Vertex, Azure ML, RunPod, CoreWeave | Standardize on K8s + Ray for portability.
Data lakehouse | Delta Lake, Apache Iceberg, Hudi | Databricks, Snowflake, BigQuery | Iceberg + Parquet is the open default in 2026.
Observability | OpenTelemetry, Prometheus, Grafana, Loki, Tempo | Datadog, Honeycomb, New Relic | Emit OTel everywhere. Token + latency are first-class metrics.
Tooling protocol | MCP (Model Context Protocol) | Native vendor function calling | MCP is becoming the cross-vendor standard for tools.
Heuristic: Standardize at most two options per slot — one default and one fallback. Three options is a sprawl tax that the platform team will pay for years.
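To make the heuristic concrete, here is a minimal sketch of a slot registry the gateway could load at startup; the slot names, model labels, and the resolve() helper are illustrative assumptions, not a recommended lineup.

```python
# Illustrative slot registry: exactly one default and one fallback per slot.
# Slot names and model labels are examples, not recommendations.
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    name: str
    default: str
    fallback: str

SLOTS = {
    "chat_model": Slot("chat_model", default="hosted-frontier", fallback="self-hosted-oss"),
    "embeddings": Slot("embeddings", default="bge-m3", fallback="hosted-embeddings"),
    "reranker":   Slot("reranker",   default="bge-reranker", fallback="hosted-rerank"),
    "inference":  Slot("inference",  default="vllm", fallback="tensorrt-llm"),
}

def resolve(slot_name: str, prefer_fallback: bool = False) -> str:
    """Return the standardized choice for a capability slot."""
    slot = SLOTS[slot_name]
    return slot.fallback if prefer_fallback else slot.default
```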
5. Model & Application Lifecycle
AI features have two clocks: a fast loop (prompts, retrieval tweaks, configs) measured in hours; and a slow loop (fine-tunes, base-model swaps) measured in weeks. Your platform must serve both without making either painful.
Figure 3 — Two-clock lifecycle: a fast iteration loop on prompts/RAG and a slow loop on models/data.
Lifecycle artifacts the platform owns
Prompts — versioned, reviewed, eval-gated, deployed via the same pipeline as code (a minimal artifact sketch follows this list).
Datasets — golden sets, adversarial sets, eval sets, training/validation/test splits — all with lineage.
Models — base, fine-tuned, distilled, quantized — each a registry artifact with a model card.
Retrieval indexes — embedded with a specific embedding model + version; immutable, rebuilt on swap.
Evals — code under version control; the platform runs them on PRs and on schedule.
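As referenced above, a minimal sketch of a prompt as a registry artifact; the field names, the threshold, and the deployable() rule are assumptions about what a registry could track, not an existing schema.

```python
# Illustrative prompt artifact: versioned, reviewed, and eval-gated.
# Field names and the 0.9 threshold are assumptions, not a real schema.
from dataclasses import dataclass

@dataclass
class PromptArtifact:
    name: str
    version: str           # bumped on every reviewed change
    template: str          # the prompt text itself
    eval_suite: str        # golden / adversarial set this version was gated on
    eval_pass_rate: float  # result of the most recent CI run
    approved: bool = False

def deployable(p: PromptArtifact, threshold: float = 0.9) -> bool:
    """Same rule as code: no eval, no deploy."""
    return p.approved and p.eval_pass_rate >= threshold
```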
6. Inference & Serving Strategy
Latency budget
For a streaming chat UX, target TTFT < 1.5 s P95 and inter-token < 80 ms. For agentic tool-using flows, budget end-to-end P95 < 8 s per step.
Cache: prompt prefix cache (KV) and semantic response cache.
Decode: speculative decoding can cut latency 1.5–3×.
Batching: continuous (in-flight) batching is non-negotiable.
Throughput
Token throughput is the unit of capacity. Plan in tokens/sec/GPU, not requests.
KV cache memory is usually the bottleneck before FLOPs.
Quantize to FP8 / INT4 (AWQ, GPTQ) for 2–4× speed when quality allows.
Use paged attention & chunked prefill for long contexts.
Routing
A model router in the gateway picks the cheapest model that meets the quality + latency targets, reserving the frontier tier (Claude 4.7, GPT) for hard reasoning; a minimal router sketch follows Figure 4. Two traps to design for early:
Hot prompts — semantic cache + dedup on identical concurrent requests.
Tool calls that hang — per-step timeouts and partial-result returns.
Figure 4 — Inference path with caching, routing, guardrails, and multi-target serving.
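The promised router sketch, assuming three tiers with made-up quality scores, latencies, and prices; the selection rule is the point, the numbers are not.

```python
# Illustrative model router: pick the cheapest tier that satisfies the
# request's quality and latency requirements. All numbers are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    quality: float           # offline eval score, 0..1
    p95_latency_s: float     # measured end-to-end P95
    usd_per_1k_tokens: float

TIERS = [
    Tier("small-oss",  quality=0.72, p95_latency_s=1.2, usd_per_1k_tokens=0.0004),
    Tier("mid-hosted", quality=0.85, p95_latency_s=2.0, usd_per_1k_tokens=0.003),
    Tier("frontier",   quality=0.95, p95_latency_s=4.0, usd_per_1k_tokens=0.02),
]

def route(min_quality: float, max_latency_s: float) -> Tier:
    """Cheapest tier meeting the constraints; the caller decides how to degrade."""
    candidates = [t for t in TIERS
                  if t.quality >= min_quality and t.p95_latency_s <= max_latency_s]
    if not candidates:
        raise RuntimeError("no tier meets the constraints; degrade or queue")
    return min(candidates, key=lambda t: t.usd_per_1k_tokens)

# Classification traffic, for example, resolves to the small tier:
# route(min_quality=0.7, max_latency_s=2.0).name == "small-oss"
```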
7. RAG & Agentic Systems
The honest RAG pipeline
RAG quality is bottlenecked by chunking, retrieval recall, and re-ranking — in that order. Most "RAG isn't working" investigations end in those three.
Stage | What it does | Levers | Common failure
Parse | PDF, HTML, slides → text + structure | Layout-aware parsers, OCR fallback | Tables flattened, losing rows
Chunk | Split into retrievable units | Semantic chunking, parent-doc, sliding window | Chunks too small / too big
Embed | Vector representation | Domain-tuned embeddings, dim choice | Mismatched embedding model on query vs. doc
Index | Vector + BM25 | Hybrid search, metadata filters, ACLs | Permission leakage, missing filters
Retrieve | Top-k candidates | k, MMR, multi-query, HyDE | k too small; over-confidence on single retrieval
Re-rank | Cross-encoder reorder | BGE / Cohere rerank, listwise rerank | Skipping rerank — biggest quality miss
Compose | Prompt assembly | Citations, structured templates | Context bloat, lost-in-the-middle
Generate | Final answer | Grounding instructions, "refuse if unknown" | Hallucination, citation drift
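A minimal sketch of the retrieve and re-rank stages under assumptions: vector_search, bm25_search, and cross_encoder_score stand in for whatever index and re-ranker you standardized on.

```python
# Illustrative hybrid retrieval with cross-encoder re-ranking.
# The three callables are placeholders for your index and re-ranker clients.
from typing import Callable

def hybrid_retrieve(
    query: str,
    vector_search: Callable[[str, int], list[str]],
    bm25_search: Callable[[str, int], list[str]],
    cross_encoder_score: Callable[[str, str], float],
    k: int = 50,
    final_k: int = 8,
) -> list[str]:
    # 1) Union of dense and lexical candidates (hybrid search).
    candidates = set(vector_search(query, k)) | set(bm25_search(query, k))
    # 2) Cross-encoder re-rank; skipping this step is the biggest quality miss.
    ranked = sorted(candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True)
    return ranked[:final_k]
```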
Agent topology
In 2026 the dominant pattern is typed workflows with bounded autonomy: an LLM drives a directed graph of tools, with planning + reflection at well-defined nodes. Pure freeform "ReAct" loops are reserved for exploratory tasks; production prefers LangGraph-style deterministic graphs with checkpointing.
Watch out: Token cost and latency scale with steps × context, not with model price alone. A "cheap" model in a 12-step loop is often more expensive than one frontier call.
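A minimal sketch of bounded autonomy with explicit budgets; call_model and run_tool are placeholders, and the caps are example values to tune per workflow.

```python
# Illustrative bounded agent loop: hard caps on steps, total tokens, and
# per-step wall-clock time. call_model and run_tool are placeholders.
import time

MAX_STEPS = 6
MAX_TOTAL_TOKENS = 20_000
STEP_TIMEOUT_S = 8.0

def run_agent(task: str, call_model, run_tool) -> str:
    context, tokens_used = task, 0
    for _ in range(MAX_STEPS):
        started = time.monotonic()
        # Assumed contract: call_model returns a dict with "type", "tool",
        # "args", "text", and "tokens" keys.
        action = call_model(context)
        tokens_used += action["tokens"]
        if action["type"] == "final":
            return action["text"]
        if tokens_used > MAX_TOTAL_TOKENS:
            return "token budget exhausted; returning partial result"
        observation = run_tool(action["tool"], action["args"], timeout=STEP_TIMEOUT_S)
        if time.monotonic() - started > STEP_TIMEOUT_S:
            return "step timeout; returning partial result"
        context += f"\nObservation: {observation}"
    return "step limit reached; returning partial result"
```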
8. Evaluation & Observability
If you remember one rule: no eval, no deploy. Evals are the unit tests of probabilistic systems. The platform should make them as easy to write as a Python test and as automatic to run as CI.
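What that looks like in practice, as a sketch: a plain pytest-style check over a golden set. The file path, JSONL fields, and generate() client are assumptions you would wire to your own gateway and datasets.

```python
# Illustrative CI eval: fails the build when golden-set pass rate regresses.
# Path, JSONL fields, and the generate() client are assumptions.
import json

GOLDEN_SET = "evals/golden_qa.jsonl"   # one {"question": ..., "must_contain": ...} per line
PASS_THRESHOLD = 0.90

def generate(question: str) -> str:
    raise NotImplementedError("wire this to your gateway client")

def test_golden_set_pass_rate():
    with open(GOLDEN_SET) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(
        1 for c in cases
        if c["must_contain"].lower() in generate(c["question"]).lower()
    )
    assert passed / len(cases) >= PASS_THRESHOLD
```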
Offline Evals
Run on fixed datasets. Used in CI to block regressions.
Golden Q&A sets
Adversarial / red-team sets
Tool-use trajectories
RAG faithfulness (Ragas)
Online Evals
Run on live traffic, sampled. Detect drift and emerging failure modes.
LLM-as-judge on sampled responses
Heuristic detectors (refusal, length, format)
User feedback (thumbs, edits)
Telemetry
The substrate every other eval reads from; a minimal telemetry sketch follows this list.
Per-call: prompt, tools, tokens, cost, latency
Per-session: trace tree (OTel)
Per-user: redacted, with consent
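As referenced above, a minimal per-call telemetry sketch using the OpenTelemetry Python API; the span and attribute names are assumptions (pick one convention and keep it stable).

```python
# Illustrative per-call telemetry: one span per model call with tokens and
# cost as attributes. Attribute names are assumptions, not a fixed standard.
from opentelemetry import trace

tracer = trace.get_tracer("ai-platform.gateway")

def traced_call(model: str, prompt: str, call_fn):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", model)
        response = call_fn(model, prompt)   # assumed to return a dict with usage info
        span.set_attribute("llm.tokens.input", response["input_tokens"])
        span.set_attribute("llm.tokens.output", response["output_tokens"])
        span.set_attribute("llm.cost.usd", response["cost_usd"])
        return response
```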
Metric taxonomy
Category | Metric | Why it matters
Quality | Task success / pass@1 | Did it do the job?
Quality | Faithfulness / groundedness | Hallucination signal for RAG
Quality | Citation precision / recall | Trust signal for source-backed answers
Performance | TTFT, ITL, E2E P50/P95/P99 | User-perceived latency
Performance | Tokens / sec / GPU | Capacity planning
Performance | Tool-call success rate | Agentic reliability
Cost | $ / request, $ / resolved task | Unit economics
Cost | Cache hit rate | Spend efficiency
Cost | Tokens in / out per feature | Drives the bill
Safety | Refusal rate (correct & incorrect) | Over/under-refusal
Safety | Jailbreak success rate (red-team) | Attack surface
Safety | PII leak rate | Compliance
9. Security, Governance & Safety
Top Threats (OWASP LLM Top 10)
Prompt injection (direct + indirect)
Sensitive data leakage via prompts & logs
Insecure tool / plugin design (overscoped tokens)
Training data poisoning
Model DoS via long-context / recursive agent loops
Supply chain (weights, deps, MCP servers)
Excessive agency (unbounded tool actions)
Overreliance — humans trusting wrong answers
Model theft / weight exfiltration
Output handling (XSS, SQLi from generated content)
Baseline controls
Model cards + evaluation scorecards for every release
Audit log of every prompt, tool call, and decision
Kill switch per model, per feature, per tenant
Indirect prompt injection (malicious instructions embedded in retrieved documents or web pages) is the dominant attack vector for agentic systems. Treat every retrieved string as untrusted user input, even from "internal" sources.
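One way to operationalize that, sketched under assumptions: delimit retrieved text explicitly and flag instruction-like content with a cheap heuristic before it reaches the prompt. The regex is illustrative and is a mitigation, not a guarantee.

```python
# Illustrative handling of retrieved text as untrusted input: wrap it in
# explicit delimiters and flag instruction-like content for review or drop.
# The pattern list is an example; this reduces risk, it does not remove it.
import re

SUSPICIOUS = re.compile(
    r"(ignore (all|previous) instructions|disregard the system prompt|you are now)",
    re.IGNORECASE,
)

def compose_context(chunks: list[str]) -> tuple[str, list[str]]:
    flagged = [c for c in chunks if SUSPICIOUS.search(c)]
    body = "\n".join(
        f"<retrieved untrusted=\"true\">\n{c}\n</retrieved>" for c in chunks
    )
    return body, flagged   # route flagged chunks per policy: drop, review, or quarantine
```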
Compliance map (typical enterprise)
Regime | Implication for the platform
SOC 2 / ISO 27001 | Access logging, change management, vendor risk on model providers.
GDPR / CCPA | DSR support, deletion across vector indexes, lawful basis for training data.
HIPAA | BAA-backed providers only; PHI tokens never leave the boundary.
EU AI Act | Risk classification per use case; transparency & conformity for high-risk deployments.
Sector (PCI, FedRAMP, etc.) | Air-gapped or self-hosted inference; key management; audit trails.
10. Cost & Performance Optimization
Cost is a platform feature, not an afterthought. Bake the controls in once; reuse them everywhere.
Figure 6 — Illustrative cost reduction waterfall stacking standard platform levers (representative, not benchmarked).
Cost levers I default to
Measure first. Token meters by feature, team, and tenant — at the gateway (a minimal metering sketch follows this list).
Right-size. Most "needs frontier" requests are actually solvable by a mid-tier model with better prompts.
Compress. Summarize past turns; strip system prompts to essentials; use structured tool I/O instead of prose.
Batch and stream. Continuous batching + streaming improves both UX and $/token.
Quantize where safe. Re-run quality evals after every quantization swap.
Spot & reserved. Train on spot; serve on reserved + on-demand burst.
Spend caps & alerts. Per-tenant, per-feature; auto-degrade rather than fail.
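As referenced in the first lever, a minimal metering-and-cap sketch at the gateway; tenant names, budgets, and the degrade tier are assumptions.

```python
# Illustrative per-tenant metering with auto-degrade: when a tenant exhausts
# its budget, route to a cheaper tier instead of failing. Values are examples.
from collections import defaultdict

MONTHLY_BUDGET_USD = {"tenant-a": 500.0, "tenant-b": 50.0}
spend_usd: dict[str, float] = defaultdict(float)

def record_usage(tenant: str, input_tokens: int, output_tokens: int, usd_per_1k: float) -> None:
    spend_usd[tenant] += (input_tokens + output_tokens) / 1000 * usd_per_1k

def pick_tier(tenant: str, requested_tier: str) -> str:
    if spend_usd[tenant] >= MONTHLY_BUDGET_USD.get(tenant, 0.0):
        return "small-oss"   # auto-degrade rather than fail
    return requested_tier
```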
11. Skill Matrix & Competencies
Below is the T-shape of an AI Platform Engineer: a broad horizontal across the stack, with depth in 2–3 vertical columns that match your team's bottleneck.
Domain | Foundational | Practitioner | Expert
Software engineering | Python, Git, basic SQL, REST | Typed Python or Go, async, testing, design patterns, gRPC | Distributed systems, API design, library authoring
12. Roadmap
A practical sequencing that works whether you are starting a new platform team or joining one mid-flight. Each quarter ends with a visible win that demonstrates value beyond infrastructure.
Figure 7 — A 12-month sequence that compounds: listen, standardize, scale, compound.
Day-0 → Day-90 checklist (for joining a team)
Days 1–30 — Learn
Inventory every model, prompt, RAG index, agent in production.
Read 6 months of incidents & postmortems.
Interview 5 product engineers about friction.
Reproduce the top use case end-to-end locally.
Map current cost by feature / team / model.
Days 31–60 — Stabilize
Centralize all model traffic through one gateway.
Wire OpenTelemetry into every call path.
Adopt one eval framework org-wide.
Define SLOs for TTFT, E2E, and error rate.
Fix the worst hallucination / safety bug.
Days 61–90 — Standardize
Publish the paved-road tutorial & SDK.
Migrate one flagship product to the platform.
Stand up the prompt + model registries.
Run a tabletop red-team on agents.
Quarterly cost & quality report for leadership.
Beyond — Compound
Self-host for the top 1–2 cost lines.
Fine-tune where it changes unit economics.
Introduce automated routing across model tiers.
Bring agent workflows under typed graph contracts.
Run continuous online evals + drift alerts.
13. Anti-Patterns & Pitfalls
Architecture smells
No gateway. Every team holds its own API keys; cost is invisible.
Three vector DBs. Pick one. Re-evaluate yearly, not weekly.
Prompts in product code. No versioning, no eval history.
Agents in freeform ReAct. Untyped, untestable, expensive.
Re-embedding only the "new" docs after changing the embedding model — silent recall collapse.
Process smells
Eyeballing outputs in a Slack thread instead of an eval set.
Shipping to 100% behind a feature flag with no shadow phase.
No model card, no risk classification.
"It works on the demo" — never tested under concurrency.
Vendor lock-in by hard-coding one SDK throughout the codebase.
Cost smells
Frontier model on every call, including classification.
Massive system prompts duplicated per request (no prefix cache).
Re-embedding the entire corpus daily.
Long-running agents with no per-step or total token budget.
Logging full prompts & responses unredacted in hot paths.
Safety smells
Trusting retrieved content as "safe" — indirect prompt injection.
Tools with broad scopes (admin, write-DB) handed to a planner LLM.
No human-in-the-loop on irreversible actions.
One vendor for both generation and judging in evals.
No kill switch per feature/tenant.
14. Platform KPIs & Metrics
Six chart buckets I want live by day 30, and that I can explain to a non-technical exec by day 90.
Velocity
Time-to-first-prototype
Time-to-production
# teams on the shared platform path
Quality
Eval pass rate per feature
Hallucination rate (sampled)
User satisfaction (thumbs / CSAT)
Reliability
Error budget burn
TTFT / E2E P95
Incidents per quarter
Cost
$ per resolved task
Cache hit rate
Spend per feature / tenant
Safety
Red-team success rate (lower is better)
PII leak rate
Refusal correctness
DX
NPS from product engineers
Docs / SDK satisfaction
# of "rebuild from scratch" requests (lower is better)
15. 2026+ Outlook
What's stabilizing
MCP as the cross-vendor tool protocol.
Typed agent graphs over freeform agents.
Hybrid retrieval + cross-encoder reranking as the RAG default.
Eval-in-CI as a non-negotiable for production.
Token / cost as first-class observability dimensions.
What's still moving
Multi-modal RAG (video, audio, 3D) — tooling is early.
On-device + edge inference of capable models.
Verifiable safety claims for autonomous agents.
The regulatory surface (EU AI Act, sectoral rules).
My bias: the teams that win aren't chasing the cleverest one-off diagram—they're the ones where the default path is so boringly good that a product engineer can ship something safe, measured, and observable in days without hand-holding from infra.
Appendix — One-page Cheat Sheet
Daily diagnostic questions
What is our top spend feature, and what fraction of its tokens hit a cache?
What is the P95 TTFT, and where is the long tail?
Which feature regressed quality this week? Did CI catch it?
What's the refusal rate, and are refusals correct?
Any tool / MCP server added this week — was it threat-modeled?
Default architectural choices (start here)
Gateway: one place, one auth, one cost meter.
Models: 2 hosted tiers + 1 self-hosted OSS.
Retrieval: pgvector → Qdrant when >10M vectors. Always rerank.
Evals: Promptfoo or Braintrust in CI; Langfuse / Phoenix in prod.
Observability: OTel everything; tokens as a first-class metric.
Guardrails: input + tool + output, with at least one non-LLM check (a minimal sketch follows this list).
Registry: prompts, models, datasets — all versioned, all reviewable.
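For the guardrails line above, a minimal sketch of a non-LLM output check: structural field validation plus a crude PII regex. The patterns and required fields are assumptions; tune them to your data and policy.

```python
# Illustrative non-LLM output checks: structural validation plus a PII regex.
# Patterns and required fields are examples only.
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_output(raw: str, required_fields: set[str]) -> list[str]:
    problems: list[str] = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = required_fields - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    text = json.dumps(data)
    if EMAIL.search(text) or US_SSN.search(text):
        problems.append("possible PII in output")
    return problems
```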
16. References & Sources
The body of this note is opinionated engineering text; the list below is for verifiable sources on production ML platforms—technical debt, serving, lifecycle tooling, retrieval, agents, and the compliance language execs will use. I use it as a reading list, not as a substitute for your vendor’s contract or your counsel.
Note: Use arXiv / in-press versions for speed; verify the published venue when you need a DOI. Vendor runbooks (Bedrock, Vertex, OpenAI, Anthropic, inference runtimes) change often—treat them as the authority for limits, regions, and pricing.
ML systems, platform practice & lifecycle
Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NeurIPS / NIPS 2015. Foundational essay on why “the model is the easy part”—informs §1–§3 platform boundaries and §13 anti-patterns. NeurIPS proceedings
Breck et al., “What’s your ML Test Score? A rubric for ML production readiness.” Google, 2017 (expanded rubric). Checklist DNA behind production gates in §5–§8, §14. Google Research
Amershi et al., “Software Engineering for Machine Learning: A Case Study.” ICSE 2019. How mature teams operationalize ML—maps to the shared platform path and role split in §1, §11. https://arxiv.org/abs/1811.04900
Polyzotis et al., “Data Management Challenges in Production Machine Learning.” SIGMOD 2017. Data/lineage/registries—underpins §5–§7 and the foundation layer in Figure 1. ACM Digital Library
Zaharia et al., “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng. Bull. 2018. Experiment tracking, packaging, registry concepts in §5–§6. IEEE Xplore
Huyen, Designing Machine Learning Systems. O’Reilly, 2022. End-to-end lifecycle and deployment patterns echoed across §3–§11. ISBN 978-1098107969.
Kleppmann, Designing Data-Intensive Applications. O’Reilly, 2017. Storage, streams, and reliability vocabulary for retrieval, logging, and registries (§3–§7). ISBN 978-1449373320.
Burns et al., “Borg, Omega, and Kubernetes.” ACM Queue / Communications of the ACM 2016. Conceptual background for running inference and control planes on orchestrators (§3, §6, §7 foundation). ACM Queue
Verma et al., “Large-scale cluster management at Google with Borg.” EuroSys 2015. Scheduling, quotas, SLO-aware placement—relevant to GPU/tenant fairness in §6, §10. ACM Digital Library
Inference, memory management & throughput
Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). Core citation for KV-cache/paging discussion in §6. https://arxiv.org/abs/2309.06180
Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Iteration-level scheduling and batching—adjacent to §6 latency/cost tradeoffs. USENIX OSDI 22
Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device inference design space. https://arxiv.org/abs/2211.05102
Aminabadi et al., “DeepSpeed-Inference: enabling efficient inference of transformer models at unprecedented scale.” SC 2022. High-scale serving stack context. https://arxiv.org/abs/2207.00032
Foundation models, adaptation & training interfaces
Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer substrate for everything in §5–§6. https://arxiv.org/abs/1706.03762
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights / self-hosting storyline in exec summary, §5–§6. https://arxiv.org/abs/2302.13971
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in Figure 1 layer 6 and §5–§10. https://arxiv.org/abs/2106.09685
Rafailov et al., “Direct Preference Optimization (DPO).” NeurIPS 2023. Preference optimization named in §5–§6. https://arxiv.org/abs/2305.18290
RAG, retrieval & hybrid search
Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG—§7. https://arxiv.org/abs/2005.11401
Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” FnTIR 2009. Lexical side of hybrid retrieval in §7. https://dl.acm.org/doi/10.1561/1500000019
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval practice. https://arxiv.org/abs/2004.04906
Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. Single entrypoint for chunking/index/query research. https://arxiv.org/abs/2312.10997
Agents, tools & orchestration protocols
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Agent loop mental model in §2–§3, §7. https://arxiv.org/abs/2210.03629
Evaluation, observability & governance
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG quality metrics named in the appendix cheat sheet / §8. https://arxiv.org/abs/2309.15217
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging—§8. https://arxiv.org/abs/2306.05685
Beyer et al. (eds.), Site Reliability Engineering. O’Reilly / Google, 2016. SLOs, error budgets, incident practice—§2 cross-cutting, §8, §14. https://sre.google/sre-book/table-of-contents/
OpenTelemetry. Vendor-neutral traces/metrics/logs—appendix “OTel everything” and §8. https://opentelemetry.io/
EU Artificial Intelligence Act (Regulation (EU) 2024/1689). Legal primary source for §15 regulatory bullet. EUR-Lex
Identity, multi-tenancy & API security (platform gateway)
Hardt et al., “The OAuth 2.0 Authorization Framework.” RFC 6749. Baseline for delegated access patterns at the AI gateway. RFC 6749
OpenID Connect Core 1.0. OIDC for AuthN in browser/API flows—referenced conceptually in §3 gateway. OpenID Foundation
Disclaimer. References are for education; they don’t list every vendor in the diagrams, and links aren’t endorsements. Canonical page for this file: LinhTruong.com — Linh Truong.