Enterprise AI Architect Trade-Offs
Generative AI broke an assumption I relied on for years: that systems are deterministic. The core component is now probabilistic, cost is per-token, latency is measured in seconds, and the same input can yield different outputs. I wrote this note to map the trade-off space I use when building AI systems an enterprise can trust — across model strategy, retrieval, agents, inference economics, governance, and operations.
Foundations
The Enterprise AI Architect Mandate — How I Frame It
You're still an architect — but the material has changed. I focus on delivering business value from AI reliably, safely, and economically, while managing a component that is non-deterministic, rapidly evolving, and prone to confident error.
Everything in AI architecture is a trade-off — and a moving one. The model I benchmark today may be superseded next quarter, so I design for replaceability, not for any single model.
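One way to make replaceability concrete is a thin interface that application code depends on instead of any vendor SDK. The sketch below is illustrative only: `ChatModel`, `EchoModel`, and `ModelRegistry` are hypothetical names, and `EchoModel` stands in for a real provider adapter.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """Minimal interface the application codes against (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend; a real adapter would wrap one vendor's SDK."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class ModelRegistry:
    """Maps a logical role to a concrete backend, so a swap is a config change."""
    backends: dict[str, ChatModel]

    def for_role(self, role: str) -> ChatModel:
        return self.backends[role]


registry = ModelRegistry({"summarize": EchoModel("model-a")})
print(registry.for_role("summarize").complete("hello"))

# Swapping the model touches the registry, not the call sites.
registry.backends["summarize"] = EchoModel("model-b")
print(registry.for_role("summarize").complete("hello"))
```

Call sites ask for a role ("summarize") rather than a model name, which is what keeps a benchmark-driven swap from turning into a rewrite.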
Where AI belongs (and doesn't)
The first question is which problems warrant a probabilistic model rather than deterministic code or classic ML. Before anything else, decide whether to use a generative model at all: a rules engine or a fine-tuned classifier is often cheaper, faster, and easier to audit.
The model & adaptation strategy
Closed API vs open weights, frontier vs small, prompt vs RAG vs fine-tune. These set the cost, latency, quality, and governance envelope of everything downstream.
The trust architecture
Evaluation, guardrails, observability, human-in-the-loop, and data governance. In the enterprise, trustworthiness is the product — an impressive demo that can't be governed never ships.
The Fundamental Shift: Deterministic → Probabilistic
Classical software is deterministic: same input, same output, testable to exact equality. LLM-based systems are stochastic, opaque, and fallible. Every architectural instinct has to be re-examined through that lens.
| Dimension | Traditional software | LLM-based system | Architectural consequence |
|---|---|---|---|
| Output | Deterministic, exact | Probabilistic, varies per call | Test distributions & quality, not equality; set temperature deliberately |
| Correctness | Right or wrong (verifiable) | A spectrum; can be confidently wrong | Evals with scores; guardrails; human review on high stakes |
| Latency | Milliseconds | Hundreds of ms to many seconds (streaming) | Stream tokens; design async UX; cache aggressively |
| Cost | Fixed compute, ~free per call | Per-token, scales with usage | Cost is a runtime variable to engineer, not a fixed line item |
| Failure mode | Crash / exception (loud) | Plausible hallucination (silent) | Grounding, citations, confidence signals, fallbacks |
| Behavior change | Only when you deploy | Can shift when the provider updates the model | Pin versions; regression-eval before adopting new models |
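The "test distributions, not equality" consequence in the table can be sketched as a pass-rate gate. This is a minimal illustration under stated assumptions: `fake_model` is a seeded stand-in for a sampled LLM call, and the 0.8 bar is an arbitrary example threshold, not a recommendation.

```python
import random


def fake_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for a sampled LLM call: usually right, sometimes not."""
    return "Paris" if rng.random() < 0.9 else "Lyon"


def pass_rate(prompt: str, expected: str, samples: int, seed: int = 0) -> float:
    """Fraction of sampled outputs that contain the expected answer."""
    rng = random.Random(seed)
    hits = sum(expected in fake_model(prompt, rng) for _ in range(samples))
    return hits / samples


rate = pass_rate("Capital of France?", "Paris", samples=200)
# Gate on a threshold over many samples instead of asserting exact equality.
assert rate >= 0.8, f"pass rate {rate:.2f} below bar"
print(f"pass rate: {rate:.2f}")
```

A single `assert output == "Paris"` would fail intermittently on a sampled model; the gate above fails only when quality actually drops below the agreed bar.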
The AI Constraint Set — Pick Your Corner
Just as distributed systems have CAP, enterprise AI systems live inside a tension between quality, latency, cost, and governance. I can't maximize all four at once — the architecture is the act of choosing which corner to occupy.
Quality
Accuracy, reasoning depth, instruction-following, factual grounding. Bought with bigger models, retrieval, and eval-driven iteration.
Latency
Time-to-first-token + tokens/sec. Bought with smaller models, streaming, caching, and shorter prompts.
Cost
Per-token API spend or GPU amortization. Bought down with smaller/cheaper models, caching, batching, routing.
Governance
Data privacy, residency, auditability, safety, compliance (e.g. EU AI Act). Often pushes toward private/self-hosted deployments.
Model Strategy
The Adaptation Ladder — How to Make a Model "Yours"
The single most important strategy decision: how do you specialize a general model for your domain? There is a ladder of techniques from cheap-and-fast to expensive-and-powerful. Climb only as high as the requirement forces you.
| Technique | Best for | Cost / effort | Data freshness | Key risk |
|---|---|---|---|---|
| Prompt engineering | General tasks, fast prototypes, orchestration | Lowest | Live (via context) | Brittle prompts; context-window limits |
| RAG | Q&A over private/changing knowledge | Medium | Always fresh | Retrieval quality is the ceiling; chunking pain |
| Fine-tuning | Consistent style/format, narrow tasks, smaller-model substitution | Medium–high | Frozen at train time | Stale knowledge; re-tune treadmill; data drift |
| Continued pre-training | Deep domain language (legal, biomed, code) | High | Frozen | Large data + compute; catastrophic forgetting |
| From scratch | AI-native products, unique modality/scale | Extreme | Frozen | $millions; rarely beats adapting an open model |
RAG vs Fine-tuning vs Long Context
The most common architecture debate. They are not mutually exclusive — but they solve different problems, and confusing them is the most expensive AI mistake teams make.
RAG — change what it KNOWS
- Inject fresh, private, citable facts at query time
- Update knowledge by updating the index, no retraining
- Reduces hallucination via grounding; auditable sources
- Adds retrieval latency + a pipeline to maintain
Fine-tuning — change how it BEHAVES
- Teach tone, format, structure, a narrow skill
- Can let a small cheap model match a big one on a task
- Knowledge is frozen at training time → goes stale
- Needs labeled data + an ongoing re-tune cycle
...and the third option: just use a bigger context window
Frontier models now accept very large contexts, so for small/static corpora you can sometimes skip retrieval and stuff the documents straight into the prompt. The trade-off:
| | RAG (retrieve top-k) | Long context (stuff it all) |
|---|---|---|
| Cost per query | Low (only relevant chunks) | High (pay for all tokens every call) |
| Latency | Lower | Higher (more tokens to process) |
| Scales to large corpora | Yes (millions of docs) | No (window-bounded) |
| Accuracy on big inputs | Depends on retrieval quality | "Lost in the middle" degradation |
| Simplicity | More moving parts | Dead simple |
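The cost row above is simple arithmetic, and a back-of-envelope version can help size the gap. All numbers below are hypothetical placeholders (prices, corpus size, chunk size); the sketch also ignores prompt-caching discounts and the cost of running the retrieval pipeline itself.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one call, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical prices and sizes; substitute your provider's real numbers.
PRICE_IN, PRICE_OUT = 3.0, 15.0          # USD per million tokens
CORPUS_TOKENS = 200_000                  # whole corpus stuffed into the prompt
TOP_K, CHUNK_TOKENS = 5, 500             # retrieved chunks per query
QUESTION_TOKENS, ANSWER_TOKENS = 100, 300

rag = cost_per_query(QUESTION_TOKENS + TOP_K * CHUNK_TOKENS, ANSWER_TOKENS,
                     PRICE_IN, PRICE_OUT)
stuffed = cost_per_query(QUESTION_TOKENS + CORPUS_TOKENS, ANSWER_TOKENS,
                         PRICE_IN, PRICE_OUT)
print(f"RAG: ${rag:.4f}  long context: ${stuffed:.4f}  ratio: {stuffed / rag:.0f}x")
```

With these placeholder inputs the stuffed prompt costs roughly fifty times more per query; the ratio shrinks as the corpus gets smaller, which is why long context stays attractive for small, static document sets.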
Closed API vs Open-Weight Self-Hosted
A defining build-vs-buy decision with major cost, control, and governance implications. The gap between frontier closed models and the best open-weight models has narrowed sharply — making this a live choice for most enterprises.
Closed model via API
- Best-in-class quality, zero infra to run
- Instant access, scales for you, fast to ship
- Data leaves your boundary (mitigated by enterprise/no-train terms)
- Per-token cost scales with usage; vendor lock-in & model deprecation risk
Open-weight, self-hosted
- Full data control & residency; runs in your VPC/on-prem
- Fixed GPU cost (cheap at high steady volume); no lock-in
- Customizable & fine-tunable freely; version stability
- You own the MLOps: GPUs, scaling, optimization, on-call
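The "cheap at high steady volume" claim implies a break-even point between per-token API spend and fixed GPU spend. The sketch below estimates it under simplifying assumptions: the dollar figures are invented for illustration, the GPU cost is treated as all-in (including the people who run it), and the self-hosted capacity is assumed to cover the volume.

```python
def breakeven_tokens_per_month(gpu_usd_per_month: float,
                               api_usd_per_m_tokens: float) -> float:
    """Monthly token volume at which fixed GPU spend equals API spend."""
    return gpu_usd_per_month / api_usd_per_m_tokens * 1_000_000


# Hypothetical inputs: a blended API price and an all-in monthly GPU cost.
GPU_USD_PER_MONTH = 20_000.0
API_USD_PER_M_TOKENS = 5.0

threshold = breakeven_tokens_per_month(GPU_USD_PER_MONTH, API_USD_PER_M_TOKENS)
print(f"Break-even at about {threshold / 1e9:.1f}B tokens/month")

for monthly_tokens in (500e6, 4e9, 20e9):
    api_cost = monthly_tokens / 1e6 * API_USD_PER_M_TOKENS
    cheaper = "API" if api_cost < GPU_USD_PER_MONTH else "self-hosted"
    print(f"{monthly_tokens / 1e9:>5.1f}B tokens: API ${api_cost:>9,.0f} -> {cheaper}")
```

Below the threshold the API is cheaper and carries no ops burden; well above it, the fixed cost wins, provided utilization stays high enough to justify the hardware.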
Frontier vs Small Models — Right-Sizing
Using a frontier model for everything is the most common source of runaway AI cost and latency. Match the model to the task; route deliberately.
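Deliberate routing can be sketched as a cheapest-first cascade that escalates only when the cheaper tier is not confident. Everything here is hypothetical: the `Tier` type, the flat per-call prices, and especially the confidence score, which real models do not return in calibrated form and which would need to come from a classifier, a verifier, or an eval-tuned heuristic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tier:
    name: str
    usd_per_call: float                          # hypothetical flat cost per call
    answer: Callable[[str], tuple[str, float]]   # returns (text, confidence 0..1)


def cascade(query: str, tiers: list[Tier], min_confidence: float = 0.7):
    """Try tiers cheapest-first; stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    for tier in tiers:
        text, confidence = tier.answer(query)
        spent += tier.usd_per_call
        if confidence >= min_confidence:
            break
    # If no tier was confident, the last (strongest) tier's answer is returned.
    return text, tier.name, spent


# Stand-in models: the small one is only confident on short queries.
small = Tier("small", 0.001, lambda q: ("small answer", 0.9 if len(q) < 40 else 0.3))
large = Tier("frontier", 0.03, lambda q: ("frontier answer", 0.95))

print(cascade("Reset my password", [small, large]))
print(cascade("Compare these two contracts clause by clause and flag conflicts",
              [small, large]))
```

Note that an escalated query pays for both tiers, so the cascade only saves money when most traffic is resolved at the cheap tier.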
Model-Strategy Scorecard
A comparative read of the main strategy options against what enterprises care about (★ relative, 1–5). Use it to shortlist; validate against your own data and evals.
| Strategy | Output quality | Speed to ship | Run cost @ scale | Data control | Customizability | Ops burden |
|---|---|---|---|---|---|---|
| Frontier API + prompting | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Frontier API + RAG | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Fine-tuned mid model | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Self-hosted open model | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Small/distilled + routing | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
System Architecture
RAG Architecture & Its Trade-Offs
RAG is the workhorse of enterprise GenAI — but "RAG" hides a dozen tuning decisions, each a trade-off. The quality of your retrieval is the hard ceiling on the quality of your answers.
| RAG decision | Trade-off |
|---|---|
| Chunk size | Small chunks → precise retrieval but lost context. Large chunks → more context but noisy, costly. Tune per corpus; consider overlap. |
| Retrieval method | Pure vector (semantic) misses exact keywords; pure keyword (BM25) misses meaning. Hybrid search + re-ranking usually wins, at more complexity. |
| top-k (how many chunks) | More k → higher recall but more tokens (cost/latency) & "lost in the middle." Fewer k → cheaper but may miss the answer. |
| Re-ranking | A re-ranker boosts precision markedly but adds latency & a second model to run. |
| Freshness / indexing | Real-time indexing keeps answers current but costs pipeline complexity; batch reindex is simpler but stale. |
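The chunk-size and overlap decision from the table can be illustrated with a small word-window splitter. This is a simplified sketch: it counts whitespace-separated words rather than model tokens, and `chunk_words` is a hypothetical helper, not a function from any RAG library.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


doc = "one two three four five six seven eight nine ten"
for chunk in chunk_words(doc, chunk_size=4, overlap=1):
    print(chunk)
# one two three four
# four five six seven
# seven eight nine ten
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the price of indexing and paying for some text twice.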
Vector store selection
| Option | Lean toward it when… | Trade-off |
|---|---|---|
| pgvector / existing DB extension | You already run Postgres; modest scale; want one system | Simplest ops; may not scale to billions of vectors or peak ANN throughput |
| Dedicated vector DB (managed or OSS) | Large scale, advanced filtering, high QPS | Best performance & features; another system to run/pay for |
| Search engine w/ vectors (Elastic/OpenSearch) | You need strong hybrid (keyword+vector) & already use it | Great hybrid; heavier to operate, tuning required |
Agents & Orchestration — Autonomy vs Control
"Agentic" is the dominant 2025–2026 pattern, and the most over-applied. Every step of autonomy you grant buys capability at the price of predictability, debuggability, and cost. Use the least autonomy that solves the problem.
Deterministic workflow
- Predictable, testable, cheaper
- You own the control flow
- Can't handle truly novel paths
Autonomous agent
- Handles open-ended, multi-step tasks
- Adapts its own plan
- Costly, slow, hard to bound & debug
Single agent
- One context, simpler to reason about
- Lower coordination cost
- Context bloat on big tasks
Multi-agent
- Specialized roles, parallelism, separation
- Scales to complex problems
- Coordination overhead, error propagation, cost explosion
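When an agent is warranted, the "hard to bound" problem can be reduced with a step cap and a tool allow-list. The loop below is a minimal sketch: `run_agent` and `scripted_planner` are hypothetical, and the planner is a scripted stand-in for what would be an LLM call in a real system.

```python
from typing import Callable

Action = tuple[str, str]  # (tool_name, argument); ("finish", answer) ends the run


def run_agent(plan_next: Callable[[list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Run a plan-act-observe loop with an allow-list of tools and a step cap."""
    observations: list[str] = []
    for _ in range(max_steps):
        tool, arg = plan_next(observations)
        if tool == "finish":
            return arg
        if tool not in tools:
            observations.append(f"error: tool '{tool}' is not allowed")
            continue
        observations.append(tools[tool](arg))
    return "stopped: step budget exhausted"  # hand off to a human or a fallback


# Stand-in planner (a real one would be an LLM call): look up once, then finish.
def scripted_planner(observations: list[str]) -> Action:
    if not observations:
        return ("lookup", "order-123")
    return ("finish", f"Status: {observations[-1]}")


print(run_agent(scripted_planner, {"lookup": lambda order_id: "shipped"}))
```

The cap turns a runaway loop into a bounded, observable failure, and the allow-list means a confused or manipulated planner cannot invoke a tool it was never granted.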
Inference Economics — Latency, Throughput & Cost
In production, the model is the easy part; serving it economically is the hard part. These levers trade against each other, and getting them wrong is what turns a successful pilot into an unaffordable rollout.
Streaming & TTFT
Stream tokens so users see output immediately; time-to-first-token matters more to perceived speed than total time. Trades a little complexity for big UX gains.
Caching
Prompt caching (reuse repeated context) and semantic caching (serve answers to similar past queries) can cut cost & latency dramatically. Trade: cache staleness & invalidation.
Quantization & distillation
Smaller/quantized models cut GPU cost & latency with modest quality loss. Distill a big model into a small one for hot paths. Trade: a quality dip you must eval.
Batching
Continuous/dynamic batching raises GPU utilization and tokens/sec. Trade: individual request latency vs aggregate throughput — tune for your SLA.
Model cascades
Send easy traffic to cheap models, escalate the hard. Big cost savings; trade is routing logic + the risk of mis-routing a hard query down-tier.
Token budgets
Cap output length, trim context, compress history. Every token is money & latency. Trade: terseness vs completeness.
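The caching lever and its staleness trade-off can be sketched with an exact-match response cache that expires entries after a time-to-live. This is deliberately simpler than the semantic caching described above, which would match on embedding similarity; `TTLCache` is a hypothetical class and the model call is a stand-in.

```python
import time
from typing import Callable


class TTLCache:
    """Exact-match response cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return " ".join(prompt.lower().split())  # normalize case and whitespace

    def get_or_call(self, prompt: str, call_model: Callable[[str], str]) -> str:
        key = self._key(prompt)
        entry = self.store.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        answer = call_model(prompt)          # the expensive call
        self.store[key] = (self.clock(), answer)
        return answer


cache = TTLCache(ttl_seconds=300)
model = lambda prompt: f"answer to: {prompt}"   # stand-in for a real model call
cache.get_or_call("What is our refund policy?", model)
cache.get_or_call("what is our  refund policy?", model)  # served from cache
print(cache.hits, cache.misses)
```

The TTL is the staleness dial: a longer value saves more calls but serves outdated answers for longer after the underlying policy or data changes.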
Govern & Operate
Data, Privacy & Security
In the enterprise, this is where AI projects most often die — not on capability, but on whether they can be deployed safely and compliantly. Security is an architectural concern, not a checkbox at the end.
Where does the data go?
- Confirm provider terms: no training on your data, retention windows, region/residency
- Redact/avoid sending PII & secrets; tokenize where possible
- Enforce per-user access control on RAG sources — never let a model surface docs a user can't see
- For the strictest data, self-host or use private/VPC deployments
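Per-user access control on RAG sources usually means filtering retrieved chunks against the caller's permissions before anything reaches the prompt. The sketch below assumes each chunk carries the access groups of its source document; `Chunk` and `authorized` are hypothetical names, and a production system would typically push this filter into the retrieval query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset[str]   # copied from the source document's ACL


def authorized(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the user may read; run this before building the prompt."""
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("Holiday calendar for all staff", frozenset({"everyone"})),
    Chunk("Executive compensation bands", frozenset({"hr-admins"})),
]

visible = authorized(retrieved, user_groups={"everyone", "engineering"})
print([c.text for c in visible])
# ['Holiday calendar for all staff']
```

Filtering before prompt construction matters because once restricted text is in the context window, no instruction to the model reliably keeps it out of the answer.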
New attack surface
- Prompt injection — malicious instructions hidden in inputs/docs hijack the model
- Data exfiltration — model coaxed into revealing context/system prompt/other users' data
- Insecure tool use — agent tricked into harmful actions
- Mitigate: input/output filtering, least-privilege tools, sandboxing, treat all model output as untrusted
Evaluation & Quality — How You Know It Works
I can't improve what I can't measure, and I can't govern a probabilistic system without systematic evaluation. Evals are to AI systems what tests are to traditional code — but graded, not binary.
| Eval type | What it does | Trade-off |
|---|---|---|
| Offline eval set | Run candidate prompts/models against a curated, labeled dataset | Fast, repeatable, regression-safe; only as good as the dataset's coverage |
| LLM-as-judge | Use a strong model to grade outputs at scale (relevance, faithfulness) | Scalable & cheap vs humans; the judge has biases & can be gamed — calibrate against humans |
| Human evaluation | Expert/SME review of outputs | Gold standard for nuance; slow, costly, doesn't scale — reserve for calibration & high stakes |
| Online / production | A-B tests, user feedback (thumbs), implicit signals | Real-world truth; noisy, lagging, needs telemetry & guardrails for safety |
| RAG-specific | Measure retrieval (recall/precision) & faithfulness/groundedness separately | Pinpoints whether failures are retrieval or generation; more harness to build |
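Measuring retrieval separately, as the RAG-specific row suggests, can start with precision@k and recall@k on a labeled set. The example below uses invented document ids for a single query; a real harness would average these over many queries.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    """Precision@k and recall@k for one query, given labeled relevant doc ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labeled example: ids the retriever returned, in rank order,
# and the ids a reviewer marked as containing the answer.
retrieved_ids = ["d7", "d2", "d9", "d4", "d1"]
relevant_ids = {"d2", "d4", "d8"}

p, r = precision_recall_at_k(retrieved_ids, relevant_ids, k=5)
print(f"precision@5={p:.2f} recall@5={r:.2f}")
# precision@5=0.40 recall@5=0.67
```

If recall is low, the generator never saw the answer and no prompt change will fix it; if recall is high but answers are still wrong, the problem is on the generation side.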
Guardrails — the runtime safety layer
Wrap the model with input and output guardrails: validate/clean inputs (block injection, profanity, PII), and validate outputs (schema/format checks, factuality/grounding checks, toxicity & policy filters, citation verification). Trade-off: each guardrail adds latency and can produce false positives that block legitimate responses — tune the strictness to the stakes.
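An output guardrail combining a schema check with citation verification might look like the sketch below. It assumes the model was instructed to reply as JSON with `answer` and `citations` fields; that reply format and the `check_output` helper are assumptions made for illustration.

```python
import json


def check_output(raw: str, retrieved_ids: set[str]) -> tuple[bool, str]:
    """Validate a reply expected as JSON: {"answer": str, "citations": [ids]}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return False, "missing 'answer' string"
    citations = data.get("citations")
    if not isinstance(citations, list) or not citations:
        return False, "no citations"
    unknown = [c for c in citations
               if not isinstance(c, str) or c not in retrieved_ids]
    if unknown:
        return False, f"cites sources that were not retrieved: {unknown}"
    return True, "ok"


context_ids = {"doc-1", "doc-2"}
print(check_output('{"answer": "30 days", "citations": ["doc-1"]}', context_ids))
print(check_output('{"answer": "30 days", "citations": ["doc-9"]}', context_ids))
print(check_output("Sure! The answer is 30 days.", context_ids))
```

This check confirms that cited sources were actually in the context, not that they support the answer; verifying support needs a faithfulness eval or a second model, which is where the latency cost mentioned above comes in.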
LLMOps in Production
A working prototype is ~10% of the job. Operating an AI system reliably over time — through model updates, data drift, and changing usage — is the other 90%, and where rigorous ops discipline pays off.
- Observability: trace every request (prompt, retrieved context, tool calls, tokens, cost, latency) — you can't debug what you can't see
- Versioning: pin model versions; version prompts & RAG configs; enable instant rollback
- Drift & quality monitoring: watch output quality, refusal rates, latency, and cost over time; alert on regressions
- Feedback loop: capture user signals (thumbs, edits, escalations) and feed them back into evals & fine-tuning data
- Human-in-the-loop: route low-confidence or high-stakes outputs to human review; design the handoff
- Graceful degradation: fallbacks for provider outages, rate limits, timeouts (cheaper model, cached answer, "try later")
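The graceful-degradation item in the list above can be sketched as an ordered fallback chain. The provider functions here are stand-ins (one simulates an outage), and `with_fallbacks` is a hypothetical helper; a production version would catch specific timeout and rate-limit errors and record each failure in tracing.

```python
from typing import Callable


def with_fallbacks(prompt: str,
                   providers: list[tuple[str, Callable[[str], str]]],
                   default: str = "The assistant is unavailable; please try later."):
    """Try providers in order; return (source, answer) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch timeout/rate-limit errors
            errors.append(f"{name}: {exc}")
    return "default", default  # log `errors` to your tracing system here


def primary(prompt: str) -> str:
    raise TimeoutError("provider timed out")  # simulated outage


def cheaper_model(prompt: str) -> str:
    return f"(fallback model) reply to: {prompt}"


print(with_fallbacks("Summarize ticket 42", [("primary", primary),
                                              ("cheaper", cheaper_model)]))
```

Returning the source alongside the answer lets the UI and the logs show when a degraded path served the request, which keeps quality monitoring honest during an outage.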
Decision Toolkit
Master Trade-Off Matrix
My quick reference for enterprise AI decisions. For each choice: when to lean each way, and the price of the lean.
| Decision | Lean A when… | Lean B when… | The cost you pay |
|---|---|---|---|
| GenAI vs classic ML/rules | Open-ended language/reasoning task | Structured, high-volume, must be auditable | Cost & nondeterminism, or limited flexibility |
| Prompt vs RAG vs fine-tune | General task / need fresh facts (prompt/RAG) | Need consistent behavior or lower cost (fine-tune) | Pipeline complexity, or stale knowledge + re-tune cost |
| RAG vs long context | Large/dynamic corpus, cost-sensitive | Small static docs, want simplicity | Retrieval complexity, or token cost & "lost in middle" |
| Closed API vs self-hosted | Speed to ship, best quality, low/spiky volume | Strict privacy, high steady volume, no lock-in | Data exposure & lock-in, or MLOps burden |
| Frontier vs small model | Hard reasoning, quality is the product | Routine task, cost & latency matter | High cost/latency, or lower ceiling — so route |
| Workflow vs autonomous agent | Path is known & enumerable | Open-ended, multi-step, novel tasks | Rigidity, or unpredictability + cost |
| Single vs multi-agent | One coherent context | Clearly separable specialized roles | Context bloat, or coordination overhead & cost |
| Latency vs quality | Interactive UX, real-time | Accuracy is critical, async is fine | Lower quality, or slower + costlier responses |
| Automation vs human-in-loop | Low stakes, high volume, tolerant of error | High stakes, regulated, costly mistakes | Risk of silent error, or throughput & cost |
| Build vs buy (AI features) | Core differentiator on your data | Commodity capability (chat, search, summarize) | Maintenance & expertise, or lock-in & fit |
Decision Framework I Use
How I take an enterprise AI initiative from idea to a defensible, production-ready design without chasing hype or burning budget.
1. Define the job & the bar. What business outcome, and what does "good enough" mean numerically (accuracy, latency, cost-per-request, acceptable error rate)? Without a target, you'll over-build forever.
2. Question whether you need GenAI at all. Could rules, search, or classic ML do this cheaper, faster, and auditably? I reach for the probabilistic hammer only where the problem is genuinely a nail for it.
3. Climb the adaptation ladder from the bottom. Start with prompting; add RAG for knowledge; fine-tune only if behavior/cost demands it. Stop at the lowest rung that hits the bar.
4. Pick a model strategy behind an abstraction layer. Choose closed vs open, frontier vs small, by your constraint corner (quality/latency/cost/governance), but keep models swappable.
5. Build the eval harness before optimizing. A representative dataset + automated scoring is the prerequisite to every later decision and to safe model upgrades.
6. Design the trust & safety layer. Grounding, guardrails, access control, human-in-the-loop, cost limits, and observability, all proportional to the use case's risk tier.
7. Pilot, measure, then harden for production. Validate value on real users with real evals; then invest in LLMOps (versioning, monitoring, rollback, drift) for the long haul.
8. Write the ADR & the revisit trigger. Record the decision, the trade-offs accepted, and the signal to reconsider, because in this field a better or cheaper option will appear within months.
Enterprise AI Anti-Patterns
The recurring, expensive mistakes — most stem from chasing capability without governing it, or reaching for complexity the problem didn't require.
AI for AI's sake
Bolting an LLM onto a problem better solved by rules, search, or a form. Complexity & cost with no real benefit.
Wrong tool
Fine-tuning to bake knowledge into weights, where it goes stale almost immediately, when RAG would be cheaper, fresher, and citable.
No right-sizing
Routing trivial classification through the most expensive model. Runaway cost & latency; no model cascade.
No eval
"Looks great in the demo" with no eval set, so quality silently regresses and no one can prove it works. The pilot that never reaches production.
Autonomy without need
Multi-agent frameworks for tasks a simple workflow handles. Unpredictable, costly, undebuggable.
No abstraction layer
Hard-wiring one provider's SDK throughout the app, so switching or comparing models means a rewrite.
Security last
Ignoring prompt injection, data leakage, access control, and compliance until launch — when they become blockers or breaches.
No cost SLO
No token accounting, limits, or alerts — until a loop or viral feature produces a five-figure surprise.
No grounding/HITL
Treating confident hallucinations as truth on high-stakes decisions, with no citations, checks, or human review.
Pre-Flight Checklist
I run this before committing to or shipping an AI system. An unchecked item is an undecided trade-off that will surface later, usually in production.
Problem & strategy
- Business outcome & numeric "good enough" bar defined
- Confirmed GenAI beats rules/classic ML for this job
- Lowest sufficient rung on the adaptation ladder chosen
- Model strategy fits the constraint corner (quality/latency/cost/gov)
- Model-abstraction layer in place; models are swappable
Quality & evaluation
- Representative eval set + automated scoring exists
- Retrieval quality measured separately from generation (if RAG)
- Regression-eval gate before adopting any new model version
- Hallucination mitigated: grounding, citations, output schemas
- Online feedback signals captured and fed back
Governance & security
- Data flow & provider terms reviewed (no-train, retention, residency)
- PII handling & per-user access control on knowledge sources
- Prompt-injection & output filtering guardrails in place
- Use case risk-tiered; human-in-loop where stakes are high
- Tool/agent actions least-privilege, approved & logged
Operations & cost
- Per-request tracing: tokens, cost, latency, context, tool calls
- Cost SLO: rate/spend limits, budget alerts, kill switch
- Right-sizing/routing & caching to control cost & latency
- Versioning & instant rollback for models, prompts, configs
- Drift/quality monitoring + graceful degradation on provider failure
Sources
References & Sources
Annotated bibliography behind this enterprise AI architecture note — model strategy, RAG, agents, inference economics, governance, evaluation, and LLMOps. Section tags (e.g. §04) show where each source informed the prose, tables, and diagrams. Constraint diamonds, scorecards, pipeline SVGs, and synthesis tables are my own unless noted.
Scope. Synthesis of peer-reviewed papers, industry frameworks, vendor documentation, and regulatory texts (May 2026). Model names, pricing, and API capabilities change monthly — verify against current vendor docs before committing to a one-way door. Scorecard star ratings are relative tendencies for shortlisting, not benchmarks. Not legal, compliance, or procurement advice.
Citations are numbered continuously [1]–[n] within this section.
Enterprise AI architect mandate & replaceability (§01, §16)
- Richards, M., & Ford, N., Fundamentals of Software Architecture. O'Reilly, 2020. Architectural trade-offs and driving characteristics — §01 mandate framing and §16 decision framework. — §01, §16.
- Sowell, T., Applied Economics (and related work). "There are no solutions, only trade-offs" — adapted throughout this note; §01 lead-rule theme. — §01, §15.
- Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. ML/AI systems require ops discipline beyond the model — §01 trust architecture and §14 LLMOps. NeurIPS proceedings — §01, §14, §17.
- Portkey / LiteLLM / OpenRouter documentation (industry pattern). Model-abstraction and gateway layers for multi-provider routing — §01 abstraction-layer callout. litellm.ai — §01, §06, §17.
- Nygard, M., "Documenting Architecture Decisions." Cognitect, 2011. ADR format — §16 step 8 and §18 checklist. cognitect.com — §16, §18.
Deterministic → probabilistic systems (§02)
- Bender, E. M., et al., "On the Dangers of Stochastic Parrots." FAccT, 2021. Non-deterministic, opaque LLM behavior — §02 shift table. DOI: 10.1145/3442188.3445922 — §02.
- Ji, Z., et al., "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 2023. Hallucination as inherent risk to manage — §02 defining-risk callout. DOI: 10.1145/3571730 — §02, §13.
- OpenAI Platform documentation — temperature, sampling, model versioning. Stochastic outputs and version pinning — §02 output/correctness rows and §14 versioning. platform.openai.com — §02, §14.
- Anthropic documentation — model behavior, streaming, safety. Latency and streaming UX — §02 latency row and §11 TTFT card. docs.anthropic.com — §02, §11.
- Google, "Responsible AI practices." Grounding, citations, human review on high-stakes outputs — §02 failure-mode row. ai.google — §02, §12, §13.
Quality · latency · cost · governance constraints (§03, §15)
- Brewer, E., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 2012. Distributed trade-off analogy for §03 constraint diamond (adapted to AI axes). — §03.
- European Parliament & Council, Regulation (EU) 2024/1689 (AI Act). Risk-tiered governance obligations — §03 governance corner and §12 compliance callout. EUR-Lex — §03, §12, §18.
- NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023. Risk-based AI governance — §12 risk tiering. nist.gov — §12, §16.
- ISO/IEC 42001:2023 — Artificial intelligence management system. Enterprise AI governance baseline — §12 compliance context. — §12, §18.
- Truong, L. (synthesis). Quality–latency–cost–governance constraint diamond — §03 figure and §15 matrix rows. LinhTruong.com — §03, §15.
Adaptation ladder: prompt · RAG · fine-tune · pre-train (§04, §05, §08)
- OpenAI, "Prompt engineering guide." Prompting as first rung — §04 ladder step 1. platform.openai.com — §04, §05.
- Anthropic, "Prompt engineering overview." System prompts, few-shot, chain-of-thought patterns — §04 ladder. docs.anthropic.com — §04.
- Lewis, P., et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. RAG as rung 2 — §04 ladder and §09 architecture. arxiv.org/abs/2005.11401 — §04, §05, §09, §13.
- Hu, E. J., et al., "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. Parameter-efficient fine-tuning — §04 rung 3 (LoRA/PEFT). arxiv.org/abs/2106.09685 — §04, §05.
- Rafailov, R., et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023. DPO alignment — §04 fine-tuning row. arxiv.org/abs/2305.18290 — §04.
- Hugging Face, PEFT & TRL documentation. SFT/DPO tooling — §04 adaptation table. huggingface.co/docs/peft — §04, §06.
- Meta AI, Llama model cards & continued-pretraining guidance. Domain adaptation rungs 4–5 — §04 ladder top steps. ai.meta.com/llama — §04, §06.
RAG vs fine-tuning vs long context (§05, §09, §15)
- OpenAI fine-tuning documentation. Fine-tune for behavior/format, not facts — §05 rule-of-thumb callout. platform.openai.com — §05, §17.
- Liu, N. F., et al., "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. Long-context degradation — §05 long-context table "lost in the middle" row. arxiv.org/abs/2307.03172 — §05, §09, §15.
- Google Gemini / Anthropic Claude documentation — context windows. When stuffing docs vs retrieving — §05 long-context trade-off. — §05, §15.
- Gao, Y., et al., "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv, 2023. RAG vs fine-tuning decision space — §05 tradeoff block. arxiv.org/abs/2312.10997 — §05, §09.
- LangChain & LlamaIndex documentation — RAG pipelines. Chunking, retrieval, augmentation patterns — §05 decision callout and §09 pipeline. python.langchain.com — §05, §09.
Closed API vs open-weight self-hosting (§06, §08, §15)
- OpenAI API & Azure OpenAI Service documentation — enterprise data handling, no-train terms. Managed API governance — §06 closed-model column and §12 data-flow bullets. learn.microsoft.com — §06, §12.
- Anthropic API — data retention & enterprise privacy. Provider terms checklist — §12 no-train/retention bullets. anthropic.com/privacy — §06, §12.
- Touvron, H., et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models." Meta, 2023. Open-weight self-host baseline — §06 self-hosted column. arxiv.org/abs/2307.09288 — §06, §08.
- Mistral AI model documentation. Open-weight enterprise deployments — §06 hybrid pattern. docs.mistral.ai — §06, §08.
- NVIDIA NIM / vLLM / TGI (Text Generation Inference) documentation. Self-hosted inference stacks — §06 MLOps burden and §11 serving. docs.vllm.ai — §06, §11, §14.
- McKinsey / a16z industry analyses (synthesized). API vs GPU cost crossover at utilization — §06 cost-crossover callout (order-of-magnitude; re-model locally). — §06, §11.
Model sizing, routing & distillation (§07, §11, §15)
- Chen, L., et al., "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv, 2023. Model cascades and routing — §07 cascade callout and §11 routing card. arxiv.org/abs/2305.05176 — §07, §11, §15.
- Hinton, G., Vinyals, O., & Dean, J., "Distilling the Knowledge in a Neural Network." NeurIPS Deep Learning Workshop, 2014. Distillation concept — §07 distillation bullet and §11 efficiency card. arxiv.org/abs/1503.02531 — §07, §11.
- Microsoft Research, Phi model series technical reports. Small language models for routine tasks — §07 frontier vs SLM figure. microsoft.com/research — §07, §08.
- Google DeepMind, Gemma model documentation. Mid/small open models — §08 scorecard rows. ai.google.dev/gemma — §07, §08.
- Speculative decoding literature (Leviathan et al.; Chen et al.). Latency reduction techniques — background for §11 speed levers. arxiv.org/abs/2211.17192 — §11.
RAG architecture, retrieval & vector stores (§09, §13)
- Robertson, S., & Zaragoza, H., "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. BM25 keyword retrieval — §09 hybrid-search row. — §09.
- Karpukhin, V., et al., "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP, 2020. Dense vector retrieval — §09 retrieval-method row. arxiv.org/abs/2004.04906 — §09.
- Reimers, N., & Gurevych, I., "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019. Embedding models for RAG — §09 embed stage. arxiv.org/abs/1908.10084 — §09.
- Cohere Rerank API documentation; cross-encoder re-ranking practice. Re-ranking trade-off — §09 re-ranking row. docs.cohere.com — §09.
- pgvector extension documentation (PostgreSQL). pgvector option — §09 vector-store table. github.com/pgvector/pgvector — §09.
- Pinecone, Weaviate, Qdrant, Milvus documentation. Dedicated vector DB trade-offs — §09 vector-store table. — §09.
- Elastic / OpenSearch kNN & hybrid search documentation. Search-engine + vector option — §09 vector-store table. elastic.co — §09.
- Es, S., et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL, 2024. Separate retrieval vs generation metrics — §09 figcaption and §13 RAG-specific eval. arxiv.org/abs/2309.15217 — §09, §13.
Agents, tools & orchestration (§10, §15, §17)
- Yao, S., et al., "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023. Plan→act→observe agent loop — §10 spectrum diagram. arxiv.org/abs/2210.03629 — §10.
- Schick, T., et al., "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023. Tool-use foundation — §10 router+tools step. arxiv.org/abs/2302.04761 — §10.
- OpenAI function calling / tools documentation. Structured tool interfaces — §10 workflow vs agent trade-offs. platform.openai.com — §10.
- Anthropic, Model Context Protocol (MCP) specification. Standardized tool/context interfaces — §10 govern-the-tools callout. modelcontextprotocol.io — §10, §12.
- LangGraph / Temporal / AWS Step Functions documentation (pattern). Deterministic workflows with LLM steps — §10 prefer-workflows figcaption. langchain-ai.github.io/langgraph — §10, §15.
- Wu, Q., et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv, 2023. Multi-agent coordination costs — §10 single vs multi-agent trade-off. arxiv.org/abs/2308.08155 — §10, §15, §17.
- Shinn, N., et al., "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS, 2023. Autonomous loop risks — §10 autonomy cost theme. arxiv.org/abs/2303.11366 — §10, §17.
Inference economics & serving (§11, §14, §18)
- Kwon, W., et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention." (vLLM). SOSP, 2023. Continuous batching & throughput — §11 batching card. arxiv.org/abs/2309.06180 — §06, §11.
- OpenAI prompt caching documentation; Anthropic prompt caching. Prompt caching cost/latency — §11 caching card. platform.openai.com — §11.
- FrugalGPT (Chen et al.) — semantic caching & cascade economics. Cost reduction patterns — §11 caching and routing cards. — §07, §11.
- Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. Quantization trade-offs — §11 quantization card. arxiv.org/abs/2210.17323 — §11.
- Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024. Quantized serving — §11 efficiency card. arxiv.org/abs/2306.00978 — §11.
- Google SRE Team, Site Reliability Engineering — SLOs & error budgets. Cost as first-class SLO — §11 cost-governance callout. sre.google — §11, §18.
Data governance, privacy & AI security (§12, §17, §18)
- OWASP Foundation, Top 10 for Large Language Model Applications (2023–2025). Prompt injection, insecure output, excessive agency — §12 AI-specific security and §17 governance anti-pattern. owasp.org — §12, §17, §18.
- Greshake, K., et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv, 2023. Prompt injection attacks — §12 prompt-injection bullet. arxiv.org/abs/2302.12173 — §12, §17.
- EU GDPR (Regulation 2016/679) & UK GDPR. PII handling, residency — §12 redact/avoid PII bullets. gdpr-info.eu — §12.
- Microsoft Azure OpenAI / Amazon Bedrock — VPC/private deployment patterns. Strictest-data self-host/private options — §12 privacy card. — §06, §12.
- NIST SP 800-53 / Zero Trust Architecture (SP 800-207). Least-privilege tool access — §10 & §12 least-privilege bullets. csrc.nist.gov — §10, §12.
- NeMo Guardrails / Guardrails AI / Llama Guard documentation. Input/output guardrails — §13 guardrails section. github.com/NVIDIA/NeMo-Guardrails — §13.
Evaluation, quality & LLM-as-judge (§13, §14, §18)
- Liang, P., et al., "Holistic Evaluation of Language Models (HELM)." Stanford CRFM, 2022+. Broad eval framing — §13 eval types table. crfm.stanford.edu/helm — §13.
- Zheng, L., et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS, 2023. LLM-as-judge trade-offs — §13 LLM-as-judge row. arxiv.org/abs/2306.05685 — §13.
- Es, S., et al., RAGAS framework documentation. Faithfulness, answer relevance, context precision — §13 RAG-specific eval. docs.ragas.io — §09, §13.
- LangSmith / Weights & Biases / Arize Phoenix documentation. Offline eval sets, tracing, regression gates — §13 build-eval-first callout and §14 observability. docs.smith.langchain.com — §13, §14.
- Breck, E., et al., "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." IEEE Big Data, 2017. ML testing discipline adapted to LLMOps — §13 eval-as-tests analogy. research.google — §13, §14.
- Human-in-the-loop AI literature (Holzinger et al.; industry HITL practice). High-stakes review — §13 guardrails stakes tuning and §18 checklist. — §13, §15, §18.
LLMOps & production operations (§14, §18)
- Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. Hidden technical debt and MLOps industry evolution; prototype vs production ops split — §14 opening sub. — §14, §17.
- Google, "Rules of Machine Learning: Best Practices for ML Engineering." ML engineering rules adapted to LLMOps — §14 checklist themes. developers.google.com — §14.
- OpenTelemetry GenAI semantic conventions (draft). Trace prompts, tokens, tool calls — §14 observability bullet. opentelemetry.io — §14, §18.
- MLflow / Weights & Biases model & prompt registry documentation. Versioning models, prompts, RAG configs — §14 versioning bullet. mlflow.org — §14, §18.
- Evidently AI / WhyLabs drift monitoring documentation. Quality & drift monitoring — §14 drift bullet. evidentlyai.com — §14.
- Nygard, M., Release It! (2nd ed.). Pragmatic Bookshelf, 2018. Graceful degradation, bulkheads — §14 graceful-degradation bullet. — §14, §18.
Decision framework, ADRs & anti-patterns (§15, §16, §17, §18)
- Thomson, J., "Architecture Decision Records." §16 ADR step and §18 operability checklist. adr.github.io — §16, §18.
- Amazon leadership — "Disagree and commit"; Type 1/2 decisions (Bezos). Reversible vs one-way doors — §04 principle 3 and §16 framework. — §04, §16.
- Beck, K., Extreme Programming Explained — YAGNI. Simplest sufficient design — §16 step 6 and §18 complexity-earned line. — §16, §18.
- Chinchilla scaling laws (Hoffmann et al., 2022). Background for model-size economics — §07–§08 sizing context. arxiv.org/abs/2203.15556 — §07, §08.
- Truong, L., Enterprise AI Architect Trade-Offs — personal working notes. May 2026. Master trade-off matrix, model-strategy scorecard, constraint diamond, RAG pipeline & agent spectrum diagrams, pre-flight checklist, synthesis prose. LinhTruong.com — all sections.