AI platform engineering · field notes · May 2026

The AI Platform Engineer

How I explain the role, draw the platform for execs and ICs, and sequence the work—gateway, retrieval, serving, evals, and governance—when teams are tired of one-off LLM demos.

Source / canonical copy: LinhTruong.com. If you forward this HTML, link there so the attribution stays put.

Audience: Platform, MLOps, and infra engineers shipping AI systems
Revised: May 2026
Scope: Foundation models, LLM ops, GenAI infra, agentic platforms

Contents

  1. The Role & Why It Exists
  2. Mental Model of the Discipline
  3. Reference Architecture
  4. The Modern AI Platform Stack
  5. Model & Application Lifecycle
  6. Inference & Serving Strategy
  7. RAG & Agentic Systems
  8. Evaluation & Observability
  9. Security, Governance & Safety
  10. Cost & Performance Optimization
  11. Skill Matrix & Competencies
  12. 12-Month Success Strategy
  13. Anti-Patterns & Pitfalls
  14. Platform KPIs & Metrics
  15. 2026+ Outlook
  16. Appendix — Cheat Sheet
  17. References & sources

Executive Summary

The AI Platform Engineer is what you get when classical MLOps runs head-first into foundation models, agents, and retrieval. In practice, you’re building a self-service path so product teams can ship AI features without each squad re-solving serving, eval harnesses, guardrails, and observability from scratch.

At a glance: 7 platform layers to own · 5 core feedback loops · typical cost reduction vs. ad-hoc builds (see §10) · target streaming TTFT < 1.5 s P95.

Below are the diagrams I reuse in reviews, the sequence I use to stand up a platform, and the checklists I expect on-call to actually run. I wrote this against a 2026 stack reality: strong hosted APIs, credible open weights, hybrid retrieval, real agent loops—and finance asking pointed questions about tokens and latency.

1. The Role & Why It Exists

You are hired to multiply a whole org, not to babysit one notebook. The job is the shared path—templates, gateways, registries—that every model, agent, and AI feature follows. Get that right and you stop rebuilding the same plumbing for every launch.

Mission

Make it safe, fast, and cheap for product teams to deploy and operate AI features — from prompts and RAG to fine-tunes and agents.

Outcomes

High developer velocity, predictable latency & cost, compliant data handling, and measurable model quality across the org.

Interfaces

Gateway APIs, SDKs, eval harnesses, dashboards, vector stores, feature stores, prompt registries, model registries.

How the AI Platform Engineer differs from neighbouring roles

Role | Primary Output | Time Horizon | Owns Infra? | Owns Models?
ML Engineer | A trained model for a use case | Sprint–Quarter | Partial | Yes
Data Scientist | Insight, prototype, experiment | Sprint | No | Sometimes
MLOps / LLMOps Engineer | Pipelines for a model or team | Sprint–Quarter | Yes | Operate only
AI Platform Engineer | Reusable platform & shared path | Quarter–Year | Yes (deeply) | Host, evaluate, govern
AI Application / Product Eng | End-user features built on platform | Sprint | No | Consume
Research / Applied Scientist | Novel methods, evaluations | Quarter–Year | No | Yes

2. Mental Model of the Discipline

I keep coming back to seven layers—each with its own SLAs, contracts, and owners. Platform work is as much about the contracts between those layers as it is about any single box.

1 · Product & Experience: Chat UIs · Copilots · Agentic apps · IDE assistants · Workflows
2 · AI Gateway & Orchestration: Routing · Rate limits · Guardrails · Caching · Auth · Cost attribution
3 · Reasoning / Agent Runtime: Tool use · Planning · Memory · Multi-step traces · Workflow engines
4 · Retrieval & Knowledge: Vector + hybrid search · Re-rankers · Chunking · Doc & semantic indexing
5 · Model Serving & Inference: vLLM · TGI · Triton · Bedrock/Anthropic/OpenAI APIs · KV-cache · Batching
6 · Training, Fine-Tuning & Registry: SFT · LoRA · DPO · RLAIF · Experiment tracking · Model registry · Lineage
7 · Foundation (Compute, Data, Identity, Observability): GPUs · K8s · Object stores · IAM · Secrets · Logging · Tracing · Metrics
Cross-cutting: Security · Privacy · Cost & FinOps · Evaluation · Governance · Reliability (SLO/SLA) · Developer Experience
Figure 1 — Seven-layer AI Platform mental model with cross-cutting concerns.

3. Reference Architecture

The reference sketch I draw most often: request in through a gateway, into a reasoning / orchestration layer that fans out to retrieval, tools, and one or more models, then back out through guardrails and telemetry. Training and the registry sit off the hot path and feed serving on a slower cadence.

CLIENTS: Web / Mobile Apps · IDE Plugins · CLIs · Internal APIs / Agents
AI GATEWAY: AuthN / AuthZ (OIDC, SCIM) · Rate limits · Quotas · Tenanting · PII redaction · Prompt firewall · Semantic + prompt cache · Cost attribution · Spend caps
ORCHESTRATION / AGENT RUNTIME: Router · Planner · Reflection · Tool use (MCP, functions) · Short / long-term memory · Workflow engine · DAG · Retries · Guardrails (input + output) · Trace span emitter (OTel)
RETRIEVAL & KNOWLEDGE: Ingestion pipelines · Chunking + embeddings · Vector + BM25 hybrid · Re-ranker (cross-encoder) · Document ACLs / row security
MODEL SERVING: Hosted (Claude, GPT, Gemini) · Self-host (vLLM / TGI / SGLang, Triton / TensorRT-LLM) · Embedding service · Re-rank service · Speculative / continuous batching · Autoscaling · GPU pools
EVALUATION & OBSERVABILITY: Offline evals · Golden sets · Online evals · LLM-as-judge · User feedback & thumbs · Traces · Spans · Token meters · Drift & hallucination detection · SLO dashboards · Alerts
TRAINING · FINE-TUNING · REGISTRY (slow loop): Data lake / lakehouse · Feature / prompt store · Labeling & curation · Experiment tracking · SFT / LoRA / QLoRA · DPO / RLAIF · Distillation · Eval gates → promote · Model Registry (versions, lineage, cards, scorecards) · Prompt Registry (versions, A/B, eval history)
CI/CD: Pipeline-as-code · IaC · Canary & shadow deploy · Rollback
FOUNDATION: GPU clusters · Spot & reserved · Kubernetes · KubeRay · Karpenter · Object stores · Parquet · Delta / Iceberg · Identity · Secrets · KMS · Networking · Private link · Egress control · OpenTelemetry · Prometheus · Loki · FinOps tagging · Showback / Chargeback · DR / Multi-region / Backups · Compliance (SOC2 · HIPAA · ISO)
Figure 2 — End-to-end reference architecture for a production AI platform.

4. The Modern AI Platform Stack (2026)

Tooling churns fast. What matters is the capability slot each tool fills. Pick one per slot, standardize, and write the migration plan before you commit.

Slot | Open-source / Self-host | Managed | Notes
Foundation models (chat) | Llama 3.x/4, Mistral, Qwen 2.5, DeepSeek-V3 | Anthropic Claude 4.7, OpenAI GPT, Google Gemini, AWS Bedrock | Default to hosted for frontier quality; self-host for cost/sovereignty.
Embeddings | BGE-M3, E5, Nomic, Jina | Voyage, OpenAI, Cohere, Vertex | Pick one default + one fallback. Standardize dimensionality.
Re-rankers | BGE-reranker, Cohere Rerank OSS | Cohere, Voyage | Cross-encoder rerank lifts top-k quality 10–30%.
Inference servers | vLLM, TGI, SGLang, TensorRT-LLM, Triton | Bedrock, Together, Fireworks, Anyscale | vLLM is the default for most teams; TRT-LLM for latency-critical.
Vector store | Qdrant, Weaviate, Milvus, pgvector | Pinecone, Turbopuffer, Vertex Vector Search | Use pgvector under 10M vectors; dedicated above.
Orchestration / Agents | LangGraph, LlamaIndex Workflows, Haystack, DSPy | Anthropic / OpenAI Agents APIs, Vertex Agent Builder | Prefer typed, deterministic graphs over freeform agents.
Eval & tracing | Langfuse, Arize Phoenix, Ragas, DeepEval, Promptfoo | Braintrust, Helicone, LangSmith, Datadog LLM Obs | Treat eval as CI — block deploys on regressions.
Guardrails | NeMo Guardrails, Guardrails AI, LlamaGuard, ShieldGemma | Anthropic safety classifiers, Azure Content Safety | Layer input + output checks; never trust one model alone.
Fine-tuning | HF TRL, Axolotl, Unsloth, Llama Factory, LMFlow | Bedrock FT, OpenAI FT, Vertex Tuning | Start with LoRA + DPO. Full SFT only with clear ROI.
Experiment tracking | MLflow, W&B (OSS), Aim | Weights & Biases, Comet | Required for any training workload.
Feature / prompt store | Feast, PromptLayer, Latitude | Tecton, Hopsworks | Versioned prompts behave like code — review & CI them.
Compute orchestration | Kubernetes, KubeRay, Slurm, Volcano, Karpenter | SageMaker, Vertex, Azure ML, RunPod, CoreWeave | Standardize on K8s + Ray for portability.
Data lakehouse | Delta Lake, Apache Iceberg, Hudi | Databricks, Snowflake, BigQuery | Iceberg + Parquet is the open default in 2026.
Observability | OpenTelemetry, Prometheus, Grafana, Loki, Tempo | Datadog, Honeycomb, New Relic | Emit OTel everywhere. Token + latency are first-class metrics.
Tooling protocol | MCP (Model Context Protocol) | Native vendor function calling | MCP is becoming the cross-vendor standard for tools.
Heuristic: Standardize at most two options per slot — one default and one fallback. Three options is a sprawl tax that the platform team will pay for years.
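One way I make that rule enforceable is to keep slot assignments in a small, code-reviewed registry that the gateway and SDK both read. A minimal sketch; the slot names and picks below are illustrative placeholders, not recommendations:

```python
# Illustrative slot registry: exactly one default and one fallback per capability slot.
# Slot names and choices are examples only; keep this file under code review.
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    default: str
    fallback: str

SLOTS: dict[str, Slot] = {
    "chat_model":   Slot(default="hosted-frontier", fallback="self-hosted-oss"),
    "embeddings":   Slot(default="bge-m3",          fallback="hosted-embeddings"),
    "reranker":     Slot(default="bge-reranker",    fallback="hosted-rerank"),
    "vector_store": Slot(default="pgvector",        fallback="qdrant"),
    "eval_harness": Slot(default="promptfoo",       fallback="braintrust"),
}

def resolve(slot: str, prefer_fallback: bool = False) -> str:
    """Return the standardized choice for a slot; anything else needs an exception review."""
    s = SLOTS[slot]
    return s.fallback if prefer_fallback else s.default
```

Anything not in the registry becomes an explicit exception request instead of a quiet third option.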

5. Model & Application Lifecycle

AI features have two clocks: a fast loop (prompts, retrieval tweaks, configs) measured in hours; and a slow loop (fine-tunes, base-model swaps) measured in weeks. Your platform must serve both without making either painful.

FAST LOOP (hours to days · prompt / config / retrieval): Idea / ticket (PRD, success metric) → Prompt + RAG draft (notebook, playground) → Eval suite, offline (golden + adversarial) → Canary / shadow (5% → 25% → 100%) → Production + online eval (telemetry, feedback)
SLOW LOOP (weeks · fine-tune / base-model swap / data refresh): Collect prod data (traces, failures, ratings) → Curate & label (PII strip, golden set) → Train / fine-tune (LoRA, DPO, distill) → Eval gates (block regression) → Registry → serving (versioned, cardable). Production signals feed back into the slow loop.
Figure 3 — Two-clock lifecycle: a fast iteration loop on prompts/RAG and a slow loop on models/data.

Lifecycle artifacts the platform owns

6. Inference & Serving Strategy

Latency budget

For a streaming chat UX, target TTFT < 1.5 s P95 and inter-token < 80 ms. For agentic tool-using flows, budget end-to-end P95 < 8 s per step.

  • Cache: prompt prefix cache (KV) and semantic response cache.
  • Decode: speculative decoding can cut latency 1.5–3×.
  • Batching: continuous (in-flight) batching is non-negotiable.
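These budgets only matter if you measure them the same way everywhere. A sketch of measuring TTFT and inter-token latency from the client side of a streaming call; stream_tokens is a stand-in for whatever async SDK or gateway iterator you actually use:

```python
# Measure time-to-first-token (TTFT) and inter-token latency (ITL) around any streaming client.
# `stream_tokens` is a hypothetical async iterator yielding text chunks from your gateway/SDK.
import time
from typing import AsyncIterator

async def timed_stream(stream_tokens: AsyncIterator[str]) -> dict:
    start = time.perf_counter()
    ttft = None
    gaps, last = [], None
    async for _chunk in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start            # first chunk observed -> TTFT
        if last is not None:
            gaps.append(now - last)       # gap between consecutive chunks -> ITL samples
        last = now
    return {
        "ttft_s": ttft,
        "itl_p50_ms": 1000 * sorted(gaps)[len(gaps) // 2] if gaps else None,
        "e2e_s": (last - start) if last else None,
    }
```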

Throughput

Token throughput is the unit of capacity. Plan in tokens/sec/GPU, not requests.

  • KV cache memory is usually the bottleneck before FLOPs (rough sizing sketched below).
  • Quantize to FP8 / INT4 (AWQ, GPTQ) for 2–4× speed when quality allows.
  • Use paged attention & chunked prefill for long contexts.
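A back-of-envelope sizing makes the "KV cache before FLOPs" point concrete. The model dimensions below are roughly those of an 8B GQA model in fp16, and the GPU memory figures are assumptions for illustration, not measurements:

```python
# Rough KV-cache sizing: how many full-context sequences fit in leftover GPU memory.
# Model dimensions and memory figures below are assumptions for illustration.
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # 2 = one K and one V tensor per layer

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)  # fp16
ctx_len = 8192                                   # tokens kept in cache per sequence
per_seq_gib = per_token * ctx_len / 2**30        # ~1.0 GiB per 8k-token sequence here

gpu_gib, weights_gib, overhead_gib = 80, 16, 8   # assumed: 80 GB card, 8B weights in fp16, runtime overhead
concurrent = int((gpu_gib - weights_gib - overhead_gib) * 2**30 // (per_token * ctx_len))
print(f"{per_token / 1024:.0f} KiB/token, {per_seq_gib:.2f} GiB/seq, ~{concurrent} full-context seqs")
```

With these assumed numbers the cache costs about 128 KiB per token, so roughly 56 full 8k-token sequences fit before the FLOPs ever become the constraint.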

Routing

A model router in the gateway picks the cheapest model that meets quality + latency:

  • Tier 1 — frontier (Claude 4.7, GPT) for hard reasoning.
  • Tier 2 — mid (Haiku, Sonnet small, Gemini Flash) for default traffic.
  • Tier 3 — local (Llama, Qwen) for cheap, cacheable, or PII-sensitive paths.
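To make the tiering concrete, a minimal routing sketch; the tier names, prices, eval scores, and latency numbers are placeholders you would replace with your own golden-set results and meters:

```python
# Pick the cheapest tier that clears the quality floor and latency budget.
# Tier names, prices, quality scores, and latencies are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_mtok: float    # blended $/1M tokens (assumed)
    eval_score: float      # pass rate on your golden set (assumed)
    p95_ms: int            # end-to-end P95 for typical prompts (assumed)
    local: bool            # runs inside your boundary (for PII-sensitive paths)

TIERS = [
    Tier("local-oss",  0.40, 0.74,  900, local=True),
    Tier("mid-hosted", 3.00, 0.85, 1200, local=False),
    Tier("frontier",  20.00, 0.95, 2500, local=False),
]

def route(min_score: float, latency_budget_ms: int, pii_sensitive: bool) -> Tier:
    ok = [t for t in TIERS
          if t.eval_score >= min_score
          and t.p95_ms <= latency_budget_ms
          and (t.local or not pii_sensitive)]
    if not ok:
        raise RuntimeError("no tier meets the constraints; degrade gracefully or queue")
    return min(ok, key=lambda t: t.usd_per_mtok)
```

Under these made-up numbers, a PII-sensitive classification call with min_score=0.7 lands on the local tier, while a hard reasoning request with min_score=0.9 escalates to the frontier tier.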

Failure modes

  • Upstream provider 5xx / throttling — automatic cross-vendor failover.
  • Long-context OOM — preflight estimate + truncate strategy.
  • Hot prompts — semantic cache + dedup on identical concurrent requests.
  • Tool calls that hang — per-step timeouts and partial-result returns.
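For the provider 5xx and hanging-call cases above, the shape I reach for is an ordered failover list with a per-attempt timeout. A sketch; the provider callables are stand-ins for your own SDK wrappers, and the timeout is an assumption:

```python
# Ordered cross-vendor failover with a per-attempt timeout.
# `providers` is a list of async callables (your SDK wrappers); names and timeout are placeholders.
import asyncio
from typing import Awaitable, Callable

async def call_with_failover(
    providers: list[Callable[[str], Awaitable[str]]],
    prompt: str,
    per_attempt_timeout_s: float = 8.0,
) -> str:
    last_err: Exception | None = None
    for call in providers:                 # ordered: primary first, fallbacks after
        try:
            return await asyncio.wait_for(call(prompt), timeout=per_attempt_timeout_s)
        except Exception as err:           # timeouts, 5xx, throttles surfaced as exceptions
            last_err = err                 # record and try the next provider
    raise RuntimeError(f"all providers failed: {last_err}")
```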
Client (streams SSE) → Gateway (auth · cache · route) → semantic cache hit? return cached → Router decision (tier · region · cost) → Guardrails, input (PII · jailbreak) → Hosted (Claude/GPT via vendor SDK) or vLLM cluster (paged KV · batching), alongside Embed + Rerank services (CPU/GPU pool) → Output guardrails (PII · toxicity · schema)
Figure 4 — Inference path with caching, routing, guardrails, and multi-target serving.

7. RAG & Agentic Systems

The honest RAG pipeline

RAG quality is bottlenecked by chunking, retrieval recall, and re-ranking — in that order. Most "RAG isn't working" investigations end in those three.

Stage | What it does | Levers | Common failure
Parse | PDF, HTML, slides → text + structure | Layout-aware parsers, OCR fallback | Tables flattened, losing rows
Chunk | Split into retrievable units | Semantic chunking, parent-doc, sliding window | Chunks too small / too big
Embed | Vector representation | Domain-tuned embeddings, dim choice | Mismatched embedding model on query vs. doc
Index | Vector + BM25 | Hybrid search, metadata filters, ACLs | Permission leakage, missing filters
Retrieve | Top-k candidates | k, MMR, multi-query, HyDE | k too small; over-confidence on single retrieval
Re-rank | Cross-encoder reorder | BGE / Cohere rerank, listwise rerank | Skipping rerank — biggest quality miss
Compose | Prompt assembly | Citations, structured templates | Context bloat, lost-in-the-middle
Generate | Final answer | Grounding instructions, "refuse if unknown" | Hallucination, citation drift
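A sketch of the Retrieve → Re-rank core of that table: dense and lexical candidates merged with reciprocal rank fusion (RRF), then reordered by a cross-encoder. vector_search, bm25_search, and cross_encoder_score are placeholders for whatever store and reranker you standardized on:

```python
# Hybrid retrieval: merge dense + lexical candidates with reciprocal rank fusion, then rerank.
# vector_search / bm25_search / cross_encoder_score are placeholders for your actual stack.
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion over ranked lists of doc ids."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, vector_search, bm25_search, cross_encoder_score, top_k: int = 8):
    dense = vector_search(query, limit=50)        # ranked doc ids from the vector index
    sparse = bm25_search(query, limit=50)         # ranked doc ids from the lexical index
    candidates = rrf_fuse([dense, sparse])[:50]
    # Cross-encoder rerank: score (query, doc) pairs jointly, keep only the best few.
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)
    return scored[:top_k]
```

The point of the sketch is the shape, not the libraries: two cheap candidate generators, one fusion step, one expensive reranker at the end.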

Agent topology

In 2026 the dominant pattern is typed workflows with bounded autonomy: an LLM drives a directed graph of tools, with planning + reflection at well-defined nodes. Pure freeform "ReAct" loops are reserved for exploratory tasks; production prefers LangGraph-style deterministic graphs with checkpointing.

Intent classify (small model) → Planner (decomposes into steps) → Retriever (RAG · web · DB) → Tool dispatcher (MCP / functions) → Critique / reflect (self-check) → Reducer / aggregator (merges step outputs) → Responder (cites · formats · streams)
Figure 5 — Typed agent graph: classify → plan → retrieve → tools → reflect → aggregate → respond.
Watch out: Token cost and latency scale with steps × context, not with model price alone. A "cheap" model in a 12-step loop is often more expensive than one frontier call.
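What budget enforcement looks like in practice: a loop with a hard step cap, a total token budget, and per-step timeouts, so a runaway agent degrades into a partial answer instead of a surprise invoice. The step interfaces and numbers below are illustrative assumptions, not a prescribed runtime:

```python
# Bounded agent loop: hard caps on steps and total tokens, plus per-step timeouts.
# plan_next_step / execute_step are placeholders; the budget numbers are illustrative.
import time
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class Budget:
    max_steps: int = 8
    max_total_tokens: int = 60_000
    per_step_timeout_s: float = 8.0
    used_tokens: int = 0
    steps: int = 0

    def charge(self, tokens: int) -> None:
        self.steps += 1
        self.used_tokens += tokens
        if self.steps > self.max_steps or self.used_tokens > self.max_total_tokens:
            raise BudgetExceeded()

def run_agent(plan_next_step, execute_step, budget: Budget) -> dict:
    """plan_next_step / execute_step stand in for your planner and tool dispatcher."""
    results = []
    try:
        while (step := plan_next_step(results)) is not None:
            started = time.monotonic()
            output, tokens = execute_step(step, timeout_s=budget.per_step_timeout_s)
            budget.charge(tokens)
            results.append({"step": step, "output": output, "secs": time.monotonic() - started})
    except BudgetExceeded:
        results.append({"step": "halt", "output": "budget exceeded; returning partial results"})
    return {"results": results, "tokens": budget.used_tokens, "steps": budget.steps}
```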

8. Evaluation & Observability

If you remember one rule: no eval, no deploy. Evals are the unit tests of probabilistic systems. The platform should make them as easy to write as a Python test and as automatic to run as CI.

Offline Evals

Run on fixed datasets. Used in CI to block regressions.

  • Golden Q&A sets
  • Adversarial / red-team sets
  • Tool-use trajectories
  • RAG faithfulness (Ragas)
Online Evals

Run on live traffic, sampled. Detect drift and emerging failure modes.

  • LLM-as-judge on sampled responses
  • Heuristic detectors (refusal, length, format)
  • User feedback (thumbs, edits)
Telemetry

The substrate every other eval reads from.

  • Per-call: prompt, tools, tokens, cost, latency
  • Per-session: trace tree (OTel)
  • Per-user: redacted, with consent
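The "no eval, no deploy" rule only sticks when the gate runs as an ordinary test. A minimal sketch of a golden-set gate as a pytest test; call_feature, the golden-set path, and the pass-rate floor are placeholders, and a real harness (Promptfoo, Braintrust, and the like) replaces the naive grading:

```python
# Golden-set gate as a plain pytest test: CI fails the deploy when the pass rate regresses.
# `call_feature`, the golden-set path, and the floor are placeholders for your own setup.
import json
import pathlib

from myapp import call_feature   # hypothetical import: your feature entrypoint under test

PASS_RATE_FLOOR = 0.90           # assumed threshold; tune per feature and version it with the prompt

def grade(answer: str, expected: str) -> int:
    return int(expected.lower() in answer.lower())   # naive check; swap in a rubric or LLM judge

def test_golden_set_pass_rate():
    lines = pathlib.Path("evals/golden.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines]
    passed = sum(grade(call_feature(c["input"]), c["expected"]) for c in cases)
    pass_rate = passed / len(cases)
    assert pass_rate >= PASS_RATE_FLOOR, f"golden-set pass rate {pass_rate:.2%} below floor"
```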

Metric taxonomy

Category | Metric | Why it matters
Quality | Task success / pass@1 | Did it do the job?
Quality | Faithfulness / groundedness | Hallucination signal for RAG
Quality | Citation precision / recall | Trust signal for source-backed answers
Performance | TTFT, ITL, E2E P50/P95/P99 | User-perceived latency
Performance | Tokens / sec / GPU | Capacity planning
Performance | Tool-call success rate | Agentic reliability
Cost | $ / request, $ / resolved task | Unit economics
Cost | Cache hit rate | Spend efficiency
Cost | Tokens in / out per feature | Drives the bill
Safety | Refusal rate (correct & incorrect) | Over/under-refusal
Safety | Jailbreak success rate (red-team) | Attack surface
Safety | PII leak rate | Compliance

9. Security, Governance & Safety

Top Threats (LLM Top 10)
  1. Prompt injection (direct + indirect)
  2. Sensitive data leakage via prompts & logs
  3. Insecure tool / plugin design (overscoped tokens)
  4. Training data poisoning
  5. Model DoS via long-context / recursive agent loops
  6. Supply chain (weights, deps, MCP servers)
  7. Excessive agency (unbounded tool actions)
  8. Overreliance — humans trusting wrong answers
  9. Model theft / weight exfiltration
  10. Output handling (XSS, SQLi from generated content)
Controls
  • Gateway-enforced auth, quotas, content policies
  • Defense-in-depth guardrails: input + tool + output
  • Least-privilege tools — scoped tokens, dry-run modes
  • Sandboxed code execution for agentic actions
  • Redaction before logging; encrypted at rest
  • Data residency & regional routing
  • Model cards + evaluation scorecards for every release
  • Audit log of every prompt, tool call, and decision
  • Kill switch per model, per feature, per tenant
Indirect prompt injection (malicious instructions embedded in retrieved documents or web pages) is the dominant attack vector for agentic systems. Treat every retrieved string as untrusted user input, even from "internal" sources.
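One way to make "least-privilege tools" and "excessive agency" enforceable rather than aspirational: every tool the planner can call declares its scopes and whether it is reversible, and the dispatcher checks both before execution. A sketch; the scope names, tool fields, and human-approval hook are illustrative assumptions:

```python
# Tool dispatch with declared scopes, dry-run support, and a human gate on irreversible actions.
# Scope names, tool fields, and the approval hook are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    scopes: frozenset[str]        # e.g. {"tickets:read"}; never a blanket "admin:*"
    reversible: bool
    run: Callable[..., str]

def dispatch(tool: Tool, granted_scopes: set[str], dry_run: bool,
             approved_by_human: bool, **args) -> str:
    if not tool.scopes <= granted_scopes:
        raise PermissionError(f"{tool.name} needs {tool.scopes - granted_scopes}")
    if not tool.reversible and not approved_by_human:
        raise PermissionError(f"{tool.name} is irreversible; requires human approval")
    if dry_run:
        return f"[dry-run] would call {tool.name} with {args}"
    return tool.run(**args)
```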

Compliance map (typical enterprise)

Regime | Implication for the platform
SOC 2 / ISO 27001 | Access logging, change management, vendor risk on model providers.
GDPR / CCPA | DSR support, deletion across vector indexes, lawful basis for training data.
HIPAA | BAA-backed providers only; PHI tokens never leave the boundary.
EU AI Act | Risk classification per use case; transparency & conformity for high-risk deployments.
Sector (PCI, FedRAMP, etc.) | Air-gapped or self-hosted inference; key management; audit trails.

10. Cost & Performance Optimization

Cost is a platform feature, not an afterthought. Bake the controls in once; reuse them everywhere.

$1.00 naive baseline → −25% right-size model → −15% prompt compression → −12% semantic cache → −9% prompt-prefix cache → −6% batching / speculation → −5% quantization (FP8 / INT4) → $0.28 optimized total
Figure 6 — Illustrative cost reduction waterfall stacking standard platform levers (representative, not benchmarked).

Cost levers I default to

  1. Measure first. Token meters by feature, team, and tenant — at the gateway.
  2. Right-size. Most "needs frontier" requests are actually solvable by a mid-tier model with better prompts.
  3. Cache aggressively. Prompt-prefix cache (vendor or vLLM), semantic response cache, embedding cache.
  4. Compress. Summarize past turns; strip system prompts to essentials; use structured tool I/O instead of prose.
  5. Batch and stream. Continuous batching + streaming improves both UX and $/token.
  6. Quantize where safe. Re-run quality evals after every quantization swap.
  7. Spot & reserved. Train on spot; serve on reserved + on-demand burst.
  8. Spend caps & alerts. Per-tenant, per-feature; auto-degrade rather than fail.
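A sketch of lever 8: per-tenant caps that degrade the experience (cheaper tier, then cache-only) before they block it. The thresholds and degradation ladder are assumptions to adapt per tenant:

```python
# Per-tenant spend caps with graceful degradation instead of hard failures.
# Budgets, thresholds, and the degradation ladder are illustrative assumptions.
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"           # route as usual
    DOWNGRADE = "downgrade"     # force mid/local tiers, shrink context
    CACHE_ONLY = "cache_only"   # serve cached/canned answers, alert the feature owner

def spend_mode(spent_usd: float, monthly_cap_usd: float) -> Mode:
    ratio = spent_usd / monthly_cap_usd
    if ratio < 0.80:
        return Mode.NORMAL
    if ratio < 1.00:
        return Mode.DOWNGRADE
    return Mode.CACHE_ONLY

# Example: a tenant at $940 of a $1,000 cap gets downgraded rather than blocked.
assert spend_mode(940, 1_000) is Mode.DOWNGRADE
```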

11. Skill Matrix & Competencies

Below is the T-shape of an AI Platform Engineer: a broad horizontal across the stack, with depth in 2–3 vertical columns that match your team's bottleneck.

Domain | Foundational | Practitioner | Expert
Software engineering | Python, Git, basic SQL, REST | Typed Python or Go, async, testing, design patterns, gRPC | Distributed systems, API design, library authoring
Cloud & infra | Linux, Docker, one cloud | Kubernetes, Terraform, IAM, networking, multi-region | GPU clusters, networking for ML, multi-cloud, capacity planning
ML fundamentals | Train a basic model, metrics | Transformers, attention, tokenization, training loops, eval design | Scaling laws, distillation, RLHF/DPO, custom kernels
LLM systems | Use an API, write prompts | RAG, tool use, agents, structured output, function calling | Inference engine tuning, custom routers, multi-agent design
Data engineering | Parquet, basic ETL | Lakehouse, streaming, vector + hybrid search | Petabyte data infra, lineage, real-time features
Observability | Logs, metrics | OTel, traces, SLOs, anomaly detection | End-to-end AI-specific observability platforms
Security & governance | Auth basics, secrets | OWASP LLM top 10, IAM, key mgmt, compliance basics | Threat modeling for agents, red-team, regulatory programs
Product & UX | Understand user journeys | Translate UX needs to latency / quality budgets | Drive product strategy with AI capability roadmaps

Tooling fluency checklist

Build & ship

Python · Go · TypeScript · Docker · Kubernetes · Helm · Terraform · Pulumi · GitHub Actions · Argo

AI/ML

PyTorch · HF Transformers · vLLM · TGI · TensorRT-LLM · LangGraph · LlamaIndex · DSPy · Ray · Axolotl · Unsloth · MCP

Data & ops

Postgres + pgvector · Qdrant · Weaviate · Iceberg · Delta · MLflow · W&B · Langfuse · Phoenix · Braintrust · OpenTelemetry · Prometheus

12. 12-Month Success Strategy

A practical sequencing that works whether you are starting a new platform team or joining one mid-flight. Each quarter ends with a visible win that demonstrates value beyond infrastructure.

Q1 · LISTEN & LAND: Map every AI use case · Audit current stack & spend · Stand up an AI Gateway · OTel + token meters everywhere · One eval harness for golden sets. Win: a visible spend & quality dashboard the org has never had before.
Q2 · STANDARDIZE: Default RAG service + reranker · Prompt + model registries · Guardrail framework (in/out) · Self-host an OSS model for cheap paths · Eval-in-CI for every team. Win: 2 teams migrated to the paved road, with measurable cost / quality lift.
Q3 · SCALE & OPTIMIZE: Multi-region serving + failover · Semantic + prefix caches in gateway · Agent runtime (typed graphs) · Fine-tune pipeline (LoRA + DPO) · Red-team program. Win: 30–50% cost cut on the top 3 use cases at equal or better quality.
Q4 · COMPOUND: Production agents in 2+ products · Continuous evals on live traffic · Monthly FinOps reviews · Public model cards & risk scorecards · Capacity plan for next year. Win: AI features ship in a sprint, not a quarter; the platform is invisible.
Figure 7 — A 12-month sequence that compounds: listen, standardize, scale, compound.

Day-0 → Day-90 checklist (for joining a team)

Days 1–30 — Learn

  • Inventory every model, prompt, RAG index, agent in production.
  • Read 6 months of incidents & postmortems.
  • Interview 5 product engineers about friction.
  • Reproduce the top use case end-to-end locally.
  • Map current cost by feature / team / model.

Days 31–60 — Stabilize

  • Centralize all model traffic through one gateway.
  • Wire OpenTelemetry into every call path.
  • Adopt one eval framework org-wide.
  • Define SLOs for TTFT, E2E, and error rate.
  • Fix the worst hallucination / safety bug.

Days 61–90 — Standardize

  • Publish the paved-road tutorial & SDK.
  • Migrate one flagship product to the platform.
  • Stand up the prompt + model registries.
  • Run a tabletop red-team on agents.
  • Quarterly cost & quality report for leadership.

Beyond — Compound

  • Self-host for the top 1–2 cost lines.
  • Fine-tune where it changes unit economics.
  • Introduce automated routing across model tiers.
  • Bring agent workflows under typed graph contracts.
  • Run continuous online evals + drift alerts.

13. Anti-Patterns & Pitfalls

Architecture smells
  • No gateway. Every team holds its own API keys; cost is invisible.
  • Three vector DBs. Pick one. Re-evaluate yearly, not weekly.
  • Prompts in product code. No versioning, no eval history.
  • Agents in freeform ReAct. Untyped, untestable, expensive.
  • Re-embedding only "new" docs after changing the embedding model — silent recall collapse.
Process smells
  • Eyeballing outputs in a Slack thread instead of an eval set.
  • Shipping to 100% behind a feature flag with no shadow phase.
  • No model card, no risk classification.
  • "It works on the demo" — never tested under concurrency.
  • Vendor lock-in by hard-coding one SDK throughout the codebase.
Cost smells
  • Frontier model on every call, including classification.
  • Massive system prompts duplicated per request (no prefix cache).
  • Re-embedding the entire corpus daily.
  • Long-running agents with no per-step or total token budget.
  • Logging full prompts & responses unredacted in hot paths.
Safety smells
  • Trusting retrieved content as "safe" — indirect prompt injection.
  • Tools with broad scopes (admin, write-DB) handed to a planner LLM.
  • No human-in-the-loop on irreversible actions.
  • One vendor for both generation and judging in evals.
  • No kill switch per feature/tenant.

14. Platform KPIs & Metrics

Five chart buckets I want live by day 30, and that I can explain to a non-technical exec by day 90.

Velocity

  • Time-to-first-prototype
  • Time-to-production
  • # teams on the shared platform path

Quality

  • Eval pass rate per feature
  • Hallucination rate (sampled)
  • User satisfaction (thumbs / CSAT)

Reliability

  • Error budget burn
  • TTFT / E2E P95
  • Incidents per quarter

Cost

  • $ per resolved task
  • Cache hit rate
  • Spend per feature / tenant

Safety

  • Red-team success rate (lower is better)
  • PII leak rate
  • Refusal correctness

DX

  • NPS from product engineers
  • Docs / SDK satisfaction
  • # of "rebuild from scratch" requests (lower is better)

15. 2026+ Outlook

What's stabilizing

  • MCP as the cross-vendor tool protocol.
  • Typed agent graphs over freeform agents.
  • Hybrid retrieval + cross-encoder reranking as the RAG default.
  • Eval-in-CI as a non-negotiable for production.
  • Token / cost as first-class observability dimensions.

What's still volatile

  • Long-running, persistent agents — memory + identity remain unsolved.
  • Multi-modal RAG (video, audio, 3D) — tooling is early.
  • On-device + edge inference of capable models.
  • Verifiable safety claims for autonomous agents.
  • The regulatory surface (EU AI Act, sectoral rules).
My bias: the teams that win aren't chasing the cleverest one-off diagram—they're the ones where the default path is so boringly good that a product engineer can ship something safe, measured, and observable in days without hand-holding from infra.

16. Appendix — One-page Cheat Sheet

Daily diagnostic questions

  1. What is our top spend feature, and what fraction of its tokens hit a cache?
  2. What is the P95 TTFT, and where is the long tail?
  3. Which feature regressed quality this week? Did CI catch it?
  4. What's the refusal rate, and are refusals correct?
  5. Any tool / MCP server added this week — was it threat-modeled?

Default architectural choices (start here)

  • Gateway: one place, one auth, one cost meter.
  • Models: 2 hosted tiers + 1 self-hosted OSS.
  • Retrieval: pgvector → Qdrant when >10M vectors. Always rerank.
  • Orchestration: typed graphs (LangGraph / native SDK agents).
  • Evals: Promptfoo or Braintrust in CI; Langfuse / Phoenix in prod.
  • Observability: OTel everything; tokens as a first-class metric.
  • Guardrails: input + tool + output, with at least one non-LLM check.
  • Registry: prompts, models, datasets — all versioned, all reviewable.
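To make "tokens as a first-class metric" concrete, a sketch of wrapping a model call in an OpenTelemetry span and attaching token counts as attributes. The gen_ai.* attribute names follow my reading of the still-evolving GenAI semantic conventions, so verify them against the current spec; call_model is a placeholder for your SDK wrapper:

```python
# Wrap a model call in an OpenTelemetry span and record tokens as span attributes.
# Attribute names follow the (still-evolving) GenAI semantic conventions; verify before standardizing.
from opentelemetry import trace

tracer = trace.get_tracer("ai-platform.gateway")

def traced_completion(call_model, prompt: str, model_name: str) -> str:
    """`call_model` is a placeholder returning (text, input_tokens, output_tokens)."""
    with tracer.start_as_current_span("gen_ai.completion") as span:
        span.set_attribute("gen_ai.request.model", model_name)
        text, tokens_in, tokens_out = call_model(prompt)
        span.set_attribute("gen_ai.usage.input_tokens", tokens_in)
        span.set_attribute("gen_ai.usage.output_tokens", tokens_out)
        return text
```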

17. References & sources

The body of this note is opinionated engineering text; the list below is for verifiable sources on production ML platforms—technical debt, serving, lifecycle tooling, retrieval, agents, and the compliance language execs will use. I use it as a reading list, not as a substitute for your vendor’s contract or your counsel.

Note: Use arXiv / in-press versions for speed; verify the published venue when you need a DOI. Vendor runbooks (Bedrock, Vertex, OpenAI, Anthropic, inference runtimes) change often—treat them as the authority for limits, regions, and pricing.

ML systems, platform practice & lifecycle

  1. Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NeurIPS / NIPS 2015. Foundational essay on why “the model is the easy part”—informs §1–§3 platform boundaries and §13 anti-patterns. NeurIPS proceedings
  2. Breck et al., “What’s your ML Test Score? A rubric for ML production readiness.” Google, 2017 (expanded rubric). Checklist DNA behind production gates in §5–§8, §14. Google Research
  3. Amershi et al., “Software Engineering for Machine Learning: A Case Study.” ICSE 2019. How mature teams operationalize ML—maps to the shared platform path and role split in §1, §11. https://arxiv.org/abs/1811.04900
  4. Polyzotis et al., “Data Management Challenges in Production Machine Learning.” SIGMOD 2017. Data/lineage/registries—underpins §5–§7 and the foundation layer in Figure 1. ACM Digital Library
  5. Zaharia et al., “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng. Bull. 2018. Experiment tracking, packaging, registry concepts in §5–§6. IEEE Xplore
  6. Huyen, Designing Machine Learning Systems. O’Reilly, 2022. End-to-end lifecycle and deployment patterns echoed across §3–§11. ISBN 978-1098107969.
  7. Kleppmann, Designing Data-Intensive Applications. O’Reilly, 2017. Storage, streams, and reliability vocabulary for retrieval, logging, and registries (§3–§7). ISBN 978-1449373320.

Kubernetes, scheduling & “cloud-native” serving context

  1. Burns et al., “Borg, Omega, and Kubernetes.” ACM Queue / Communications of the ACM 2016. Conceptual background for running inference and control planes on orchestrators (§3, §6, §7 foundation). ACM Queue
  2. Verma et al., “Large-scale cluster management at Google with Borg.” EuroSys 2015. Scheduling, quotas, SLO-aware placement—relevant to GPU/tenant fairness in §6, §10. ACM Digital Library

Inference, memory management & throughput

  1. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). Core citation for KV-cache/paging discussion in §6. https://arxiv.org/abs/2309.06180
  2. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Iteration-level scheduling and batching—adjacent to §6 latency/cost tradeoffs. USENIX OSDI 22
  3. Pope et al., “Efficiently Scaling Transformer Inference.” MLSys 2023. Multi-device inference design space. https://arxiv.org/abs/2211.05102
  4. Aminabadi et al., “DeepSpeed-Inference: enabling efficient inference of transformer models at unprecedented scale.” SC 2022. High-scale serving stack context. https://arxiv.org/abs/2207.00032

Foundation models, adaptation & training interfaces

  1. Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer substrate for everything in §5–§6. https://arxiv.org/abs/1706.03762
  2. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights / self-hosting storyline in exec summary, §5–§6. https://arxiv.org/abs/2302.13971
  3. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Adapter fine-tuning in Figure 1 layer 6 and §5–§10. https://arxiv.org/abs/2106.09685
  4. Rafailov et al., “Direct Preference Optimization (DPO).” NeurIPS 2023. Preference optimization named in §5–§6. https://arxiv.org/abs/2305.18290

RAG, retrieval & hybrid search

  1. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Canonical RAG—§7. https://arxiv.org/abs/2005.11401
  2. Robertson & Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond.” FnTIR 2009. Lexical side of hybrid retrieval in §7. https://dl.acm.org/doi/10.1561/1500000019
  3. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020. Dense retrieval practice. https://arxiv.org/abs/2004.04906
  4. Gao et al., “Retrieval-Augmented Generation: A Survey.” 2024. Single entrypoint for chunking/index/query research. https://arxiv.org/abs/2312.10997

Agents, tools & orchestration protocols

  1. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Agent loop mental model in §2–§3, §7. https://arxiv.org/abs/2210.03629
  2. Schick et al., “Toolformer.” 2023. Tool-learning framing. https://arxiv.org/abs/2302.04761
  3. Model Context Protocol (MCP). Cross-vendor tool/context standard referenced in §3, §7, §15. https://modelcontextprotocol.io/
  4. OpenAPI Specification. HTTP tool contracts at the gateway boundary (§3). https://spec.openapis.org/oas/latest.html

Evaluation, observability & SLO discipline

  1. Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG quality metrics named in the appendix cheat sheet / §8. https://arxiv.org/abs/2309.15217
  2. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. Pairwise judging—§8. https://arxiv.org/abs/2306.05685
  3. Beyer et al. (eds.), Site Reliability Engineering. O’Reilly / Google, 2016. SLOs, error budgets, incident practice—§2 cross-cutting, §8, §14. https://sre.google/sre-book/table-of-contents/
  4. OpenTelemetry. Vendor-neutral traces/metrics/logs—appendix “OTel everything” and §8. https://opentelemetry.io/
  5. W3C Trace Context. Propagation standard for distributed traces through gateways and tool calls. https://www.w3.org/TR/trace-context/

Security, governance & compliance

  1. OWASP Top 10 for Large Language Model Applications. Threat taxonomy for §9. OWASP project page
  2. Greshake et al., “Indirect Prompt Injection” (real-world LLM-integrated apps). 2023. Pairs with gateway + retrieved-content trust models in §3, §9. https://arxiv.org/abs/2302.12173
  3. NIST AI Risk Management Framework (AI RMF 1.0). Risk language for enterprise governance in §9. https://www.nist.gov/itl/ai-risk-management-framework
  4. EU Artificial Intelligence Act (Regulation (EU) 2024/1689). Legal primary source for §15 regulatory bullet. EUR-Lex

Identity, multi-tenancy & API security (platform gateway)

  1. Hardt et al., “The OAuth 2.0 Authorization Framework.” RFC 6749. Baseline for delegated access patterns at the AI gateway. RFC 6749
  2. OpenID Connect Core 1.0. OIDC for AuthN in browser/API flows—referenced conceptually in §3 gateway. OpenID Foundation
Disclaimer. References are for education; they don’t list every vendor in the diagrams, and links aren’t endorsements. Canonical page for this file: LinhTruong.com (Linh Truong).