Personal notes · May 2026

Enterprise AI Architect Trade-Offs

Generative AI broke an assumption I relied on for years: that systems are deterministic. The core component is now probabilistic, cost is per-token, latency is measured in seconds, and the same input can yield different outputs. I wrote this note to map the trade-off space I use when building AI systems an enterprise can trust — across model strategy, retrieval, agents, inference economics, governance, and operations.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

The question behind this note: where does a probabilistic component belong in the stack — and what trade-offs make that choice governable, measurable, and replaceable? What follows is the framework I use to answer that.

Covers: LLMs · RAG · agents · LLMOps · governance · cost

Foundations

The Enterprise AI Architect Mandate — How I Frame It

You're still an architect — but the material has changed. I focus on delivering business value from AI reliably, safely, and economically, while managing a component that is non-deterministic, rapidly evolving, and prone to confident error.

Everything in AI architecture is a trade-off — and a moving one. The model I benchmark today may be superseded next quarter, so I design for replaceability, not for any single model.

Decides

Where AI belongs (and doesn't)

Which problems warrant a probabilistic model vs deterministic code or classic ML. The first trade-off is whether to use a generative model at all — often a rules engine or a fine-tuned classifier is cheaper, faster, and auditable.

Decides

The model & adaptation strategy

Closed API vs open weights, frontier vs small, prompt vs RAG vs fine-tune. These set the cost, latency, quality, and governance envelope of everything downstream.

Decides

The trust architecture

Evaluation, guardrails, observability, human-in-the-loop, and data governance. In the enterprise, trustworthiness is the product — an impressive demo that can't be governed never ships.

Abstraction layer I insist on: I never hard-wire an application to one vendor's SDK. I put a model-abstraction / gateway layer between the app and providers so we can swap models, route by task, fail over, and compare without rewrites. In a field this volatile, the ability to switch is itself an architectural feature.
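A minimal sketch of that gateway idea, assuming a hypothetical `ChatModel` interface and two stand-in providers (`EchoProvider`, `UpperProvider`). None of these names come from a real vendor SDK; the point is only that application code depends on the interface, and swapping the model behind a task becomes a registration change.

```python
from dataclasses import dataclass, field
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical provider-agnostic interface; not any vendor's real SDK."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in for vendor A."""

    def complete(self, prompt: str) -> str:
        return f"A:{prompt}"


class UpperProvider:
    """Stand-in for vendor B."""

    def complete(self, prompt: str) -> str:
        return f"B:{prompt.upper()}"


@dataclass
class Gateway:
    """Routes each task to whichever model is currently registered for it."""

    routes: dict[str, ChatModel] = field(default_factory=dict)

    def register(self, task: str, model: ChatModel) -> None:
        self.routes[task] = model

    def complete(self, task: str, prompt: str) -> str:
        return self.routes[task].complete(prompt)


gateway = Gateway()
gateway.register("summarize", EchoProvider())
before = gateway.complete("summarize", "hello")
gateway.register("summarize", UpperProvider())  # swap vendors; calling code is unchanged
after = gateway.complete("summarize", "hello")
```

In a real system the registry would more likely be driven by configuration, which is what makes routing by task and side-by-side comparison cheap.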

Foundations

The Fundamental Shift: Deterministic → Probabilistic

Classical software is deterministic: same input, same output, testable to exact equality. LLM-based systems are stochastic, opaque, and fallible. Every architectural instinct has to be re-examined through that lens.

Dimension	Traditional software	LLM-based system	Architectural consequence
Output	Deterministic, exact	Probabilistic, varies per call	Test distributions & quality, not equality; set temperature deliberately
Correctness	Right or wrong (verifiable)	A spectrum; can be confidently wrong	Evals with scores; guardrails; human review on high stakes
Latency	Milliseconds	Hundreds of ms to many seconds (streaming)	Stream tokens; design async UX; cache aggressively
Cost	Fixed compute, ~free per call	Per-token, scales with usage	Cost is a runtime variable to engineer, not a fixed line item
Failure mode	Crash / exception (loud)	Plausible hallucination (silent)	Grounding, citations, confidence signals, fallbacks
Behavior change	Only when you deploy	Can shift when the provider updates the model	Pin versions; regression-eval before adopting new models
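One way to read the first row of the table is as a change in what a test asserts. The sketch below assumes a hypothetical stochastic `fake_model` (seeded so the example is repeatable) and checks a pass rate against a threshold instead of checking a single output for equality.

```python
import random


def fake_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for a stochastic model: usually right, sometimes wrong."""
    return "Paris" if rng.random() < 0.9 else "Lyon"


def pass_rate(prompt: str, expected: str, n: int, seed: int = 0) -> float:
    """Sample the model n times and return the fraction of acceptable answers."""
    rng = random.Random(seed)
    hits = sum(fake_model(prompt, rng) == expected for _ in range(n))
    return hits / n


rate = pass_rate("Capital of France?", "Paris", n=200)
meets_bar = rate >= 0.8  # the bar is a threshold on a rate, not exact equality
```

The 0.8 threshold and the sample size are placeholders; in practice both would come from the numeric quality bar agreed for the use case.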

The defining risk: Hallucination is not a bug to be fully eliminated; it is an inherent property of the technology to be managed. Architect for it: ground responses in retrieved facts, cite sources, constrain outputs to schemas, and keep a human in the loop wherever a confident error is expensive. Treat every model output as "probably right" at best until it is verified.
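As an illustration of "constrain outputs to schemas" combined with a human fallback, here is a small sketch using only the standard library. The output contract (`answer` plus `sources`) and the `needs_review` status are assumptions made for the example, not a prescribed format.

```python
import json

REQUIRED = {"answer": str, "sources": list}  # hypothetical output contract


def parse_or_escalate(raw: str) -> dict:
    """Accept model output only if it is JSON matching the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "needs_review", "reason": "not valid JSON"}
    if not isinstance(data, dict):
        return {"status": "needs_review", "reason": "not a JSON object"}
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            return {"status": "needs_review", "reason": f"bad or missing '{key}'"}
    if not data["sources"]:
        return {"status": "needs_review", "reason": "no sources cited"}
    return {"status": "ok", "data": data}


good = parse_or_escalate('{"answer": "30 days", "sources": ["policy.pdf#p4"]}')
bad = parse_or_escalate("The refund window is probably 30 days.")
```

Anything that fails the contract is routed to review instead of being shown as an answer, which is the cheap part of managing confident errors.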

Foundations

The AI Constraint Set — Pick Your Corner

Just as distributed systems have CAP, enterprise AI systems live inside a tension between quality, latency, cost, and governance. I can't maximize all four at once — the architecture is the act of choosing which corner to occupy.

Want the highest quality? Use a frontier model (more cost, more latency). Need low latency and cost? Use a small model (less quality). Need strong governance? Self-host or use a private deployment (more cost, more ops). Name the corner your use case demands.

Quality

Accuracy, reasoning depth, instruction-following, factual grounding. Bought with bigger models, retrieval, and eval-driven iteration.

Latency

Time-to-first-token + tokens/sec. Bought with smaller models, streaming, caching, and shorter prompts.

Cost

Per-token API spend or GPU amortization. Bought down with smaller/cheaper models, caching, batching, routing.

Governance

Data privacy, residency, auditability, safety, compliance (e.g. EU AI Act). Often pushes toward private/self-hosted deployments.

Model Strategy

The Adaptation Ladder — How to Make a Model "Yours"

The single most important strategy decision: how do you specialize a general model for your domain? There is a ladder of techniques from cheap-and-fast to expensive-and-powerful. Climb only as high as the requirement forces you.

The 80/20 rule of enterprise AI: most business value is captured at rungs 1–2 (prompting + RAG). Fine-tuning is for style, format, latency/cost reduction, or narrow tasks — not for teaching facts (that's RAG's job). Training from scratch is almost never the right enterprise choice.

Technique	Best for	Cost / effort	Data freshness	Key risk
Prompt engineering	General tasks, fast prototypes, orchestration	Lowest	Live (via context)	Brittle prompts; context-window limits
RAG	Q&A over private/changing knowledge	Medium	Always fresh	Retrieval quality is the ceiling; chunking pain
Fine-tuning	Consistent style/format, narrow tasks, smaller-model substitution	Medium–high	Frozen at train time	Stale knowledge; re-tune treadmill; data drift
Continued pre-training	Deep domain language (legal, biomed, code)	High	Frozen	Large data + compute; catastrophic forgetting
From scratch	AI-native products, unique modality/scale	Extreme	Frozen	$millions; rarely beats adapting an open model

Model Strategy

RAG vs Fine-tuning vs Long Context

The most common architecture debate. They are not mutually exclusive — but they solve different problems, and confusing them is the most expensive AI mistake teams make.

RAG vs Fine-tuning: knowledge ⇄ behavior

RAG — change what it KNOWS

Inject fresh, private, citable facts at query time
Update knowledge by updating the index, no retraining
Reduces hallucination via grounding; auditable sources
Adds retrieval latency + a pipeline to maintain

Fine-tuning — change how it BEHAVES

Teach tone, format, structure, a narrow skill
Can let a small cheap model match a big one on a task
Knowledge is frozen at training time → goes stale
Needs labeled data + an ongoing re-tune cycle

Rule of thumb: Need facts (current, private, verifiable)? → RAG. Need form (consistent style/format/behavior) or lower cost/latency? → fine-tune. Need both? Do both: fine-tune for behavior, RAG for knowledge. Never fine-tune to teach facts — it's expensive, lossy, and stale.

...and the third option: just use a bigger context window

Frontier models now accept very large contexts, so for small/static corpora you can sometimes skip retrieval and stuff the documents straight into the prompt. The trade-off:

Dimension	RAG (retrieve top-k)	Long context (stuff it all)
Cost per query	Low (only relevant chunks)	High (pay for all tokens every call)
Latency	Lower	Higher (more tokens to process)
Scales to large corpora	Yes (millions of docs)	No (window-bounded)
Accuracy on big inputs	Depends on retrieval quality	"Lost in the middle" degradation
Simplicity	More moving parts	Dead simple

Decision: Tiny, static knowledge (a policy doc, a manual) → long context for simplicity. Large, dynamic, or cost-sensitive knowledge → RAG. If you do go long-context, combine it with prompt/semantic caching to blunt the per-call token cost.
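The cost row of the comparison can be made concrete with a little arithmetic. Every number below (price, corpus size, chunk size, top-k) is a hypothetical placeholder chosen for illustration, not a vendor quote.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.003  # hypothetical price in dollars, not a real quote


def input_cost(tokens: int) -> float:
    """Input-side cost of one call at the assumed price."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS


corpus_tokens = 200_000   # whole corpus stuffed into the prompt on every call
chunk_tokens = 500        # size of one retrieved chunk
top_k = 5                 # chunks retrieved per query
question_tokens = 100

long_context = input_cost(corpus_tokens + question_tokens)
rag = input_cost(top_k * chunk_tokens + question_tokens)
ratio = long_context / rag
```

Under these assumptions the long-context call costs roughly 77 times more input tokens per query than the retrieved version. The ratio shrinks quickly for small corpora, which is why stuffing a single policy document can still be the right call.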

Model Strategy

Closed API vs Open-Weight Self-Hosted

A defining build-vs-buy decision with major cost, control, and governance implications. The gap between frontier closed models and the best open-weight models has narrowed sharply — making this a live choice for most enterprises.

Managed API (closed) vs Self-Hosted (open weights): convenience ⇄ control

Closed model via API

Best-in-class quality, zero infra to run
Instant access, scales for you, fast to ship
Data leaves your boundary (mitigated by enterprise/no-train terms)
Per-token cost scales with usage; vendor lock-in & model deprecation risk

Open-weight, self-hosted

Full data control & residency; runs in your VPC/on-prem
Fixed GPU cost (cheap at high steady volume); no lock-in
Customizable & fine-tunable freely; version stability
You own the MLOps: GPUs, scaling, optimization, on-call

Rule: Start on a managed API to validate value fast. Move toward self-hosting when one of these bites: strict data-residency/privacy, very high steady volume (GPU amortization beats per-token), need for deep customization, or unacceptable lock-in. Many enterprises run a hybrid: APIs for frontier reasoning, self-hosted small models for high-volume routine tasks.

The cost crossover: Per-token APIs are cheapest at low/spiky volume; self-hosted GPUs are cheapest at high steady volume, but only if you keep them well-utilized. An idle reserved GPU is pure waste. Model the crossover point with realistic utilization, and include the fully-loaded cost of the MLOps team in the self-host column.
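A rough way to model the crossover point, treating self-hosting as a fixed monthly cost (GPUs plus the team) and the API as purely per-token. All figures, including the assumed cluster capacity, are hypothetical placeholders.

```python
def api_monthly_cost(tokens: float, price_per_million: float) -> float:
    """Per-token pricing: cost scales linearly with volume."""
    return tokens / 1e6 * price_per_million


def self_host_monthly_cost(gpu_cost: float, ops_cost: float) -> float:
    """Fixed regardless of traffic: reserved GPUs plus the team that runs them."""
    return gpu_cost + ops_cost


def break_even_tokens(gpu_cost: float, ops_cost: float, price_per_million: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    return self_host_monthly_cost(gpu_cost, ops_cost) / price_per_million * 1e6


# All figures are hypothetical placeholders.
crossover = break_even_tokens(gpu_cost=20_000, ops_cost=30_000, price_per_million=5.0)
utilization = crossover / 40e9  # assuming the cluster can serve 40B tokens per month
```

With these inputs the break-even sits at 10 billion tokens per month, or 25% utilization of the assumed cluster. Leaving the ops cost out of the calculation would move the crossover to 4 billion tokens and make self-hosting look cheaper than it is.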

Model Strategy

Frontier vs Small Models — Right-Sizing

Using a frontier model for everything is the most common source of runaway AI cost and latency. Match the model to the task; route deliberately.

Small/specialized models are dramatically cheaper and faster, and for routine, well-scoped tasks (classification, extraction, routing, simple drafting) they're often good enough. Reserve frontier models for genuinely hard reasoning.

Model routing / cascade: Architect a router that sends easy requests to a cheap small model and escalates only the hard ones to a frontier model (optionally letting the small model "give up" and hand off). Depending on the traffic mix and the price gap between tiers, this cascade pattern can cut cost by roughly 5–10× while preserving quality on the requests that need it. Pair it with model distillation (training a small model on a big model's outputs) to push more volume down-tier over time.
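A sketch of the cascade, with hypothetical stand-ins for both tiers. The "too hard" test here is a crude word count chosen only to keep the example runnable; a real router would use a classifier or the small model's own confidence signal.

```python
from typing import Optional


def small_model(request: str) -> Optional[str]:
    """Hypothetical cheap tier: returns None to 'give up' on hard requests."""
    if len(request.split()) > 8:
        return None
    return f"small:{request}"


def frontier_model(request: str) -> str:
    """Hypothetical expensive tier."""
    return f"frontier:{request}"


def cascade(request: str) -> tuple[str, str]:
    """Try the cheap tier first; escalate only when it declines."""
    answer = small_model(request)
    if answer is not None:
        return "small", answer
    return "frontier", frontier_model(request)


def blended_cost(share_small: float, cost_small: float, cost_frontier: float) -> float:
    """Average cost per request; escalated requests pay for both tiers."""
    return cost_small + (1 - share_small) * cost_frontier


blended = blended_cost(share_small=0.9, cost_small=1.0, cost_frontier=20.0)
```

With these placeholder numbers (90% of traffic resolved by the small tier, a 20× price gap) the blended cost is about 3 units per request against 20 for frontier-only. The saving is sensitive to both inputs, so it is worth measuring the real escalation rate before quoting a multiple.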

Model Strategy

Model-Strategy Scorecard

A comparative read of the main strategy options against what enterprises care about (★ relative, 1–5). Use it to shortlist; validate against your own data and evals.

Strategy	Output quality	Speed to ship	Run cost @ scale	Data control	Customizability	Ops burden
Frontier API + prompting	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★
Frontier API + RAG	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★
Fine-tuned mid model	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★
Self-hosted open model	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★
Small/distilled + routing	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★	★★★★★

How to read it: No row wins every column, and that is the trade-off. Frontier-API rows maximize quality & speed-to-ship but pay in run-cost & data control; self-hosted rows invert it. Most mature enterprises end up hybrid: frontier API for hard reasoning, RAG for knowledge, and self-hosted small models serving the high-volume tail.

System Architecture

RAG Architecture & Its Trade-Offs

RAG is the workhorse of enterprise GenAI — but "RAG" hides a dozen tuning decisions, each a trade-off. The quality of your retrieval is the hard ceiling on the quality of your answers.

Each stage is a tuning knob. The most common failure isn't the LLM — it's retrieval returning the wrong chunks. Measure retrieval quality (recall/precision) separately from generation quality.

RAG decision	Trade-off
Chunk size	Small chunks → precise retrieval but lost context. Large chunks → more context but noisy, costly. Tune per corpus; consider overlap.
Retrieval method	Pure vector (semantic) misses exact keywords; pure keyword (BM25) misses meaning. Hybrid search + re-ranking usually wins, at more complexity.
top-k (how many chunks)	More k → higher recall but more tokens (cost/latency) & "lost in the middle." Fewer k → cheaper but may miss the answer.
Re-ranking	A re-ranker boosts precision markedly but adds latency & a second model to run.
Freshness / indexing	Real-time indexing keeps answers current but costs pipeline complexity; batch reindex is simpler but stale.
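The chunk-size row mentions overlap as a way to soften the "lost context" problem. A minimal word-based sketch follows; real pipelines usually chunk by tokens or by document structure, so treat the word split as a simplifying assumption.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
```

Here each chunk repeats the last word of its predecessor, so a sentence that straddles a boundary is still retrievable from at least one chunk. More overlap means more stored vectors and more duplicated tokens at query time.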

Vector store selection

Option	Lean toward it when…	Trade-off
pgvector / existing DB extension	You already run Postgres; modest scale; want one system	Simplest ops; may not scale to billions of vectors or peak ANN throughput
Dedicated vector DB (managed or OSS)	Large scale, advanced filtering, high QPS	Best performance & features; another system to run/pay for
Search engine w/ vectors (Elastic/OpenSearch)	You need strong hybrid (keyword+vector) & already use it	Great hybrid; heavier to operate, tuning required

System Architecture

Agents & Orchestration — Autonomy vs Control

"Agentic" is the dominant 2025–2026 pattern, and the most over-applied. Every step of autonomy you grant buys capability and spends predictability, cost, and debuggability. Use the least autonomy that solves the problem.

Prefer workflows (you define the steps, the LLM fills them) over autonomous agents (the LLM decides the steps) unless the task genuinely needs open-ended planning. Autonomy compounds error, cost (many LLM calls), and the difficulty of guaranteeing behavior.

Workflow vs Autonomous Agent: control ⇄ flexibility

Deterministic workflow

Predictable, testable, cheaper
You own the control flow
Can't handle truly novel paths

Autonomous agent

Handles open-ended, multi-step tasks
Adapts its own plan
Costly, slow, hard to bound & debug

Rule: Default to workflows. Add autonomy only where the path can't be enumerated.

Single vs Multi-Agent: simplicity ⇄ specialization

Single agent

One context, simpler to reason about
Lower coordination cost
Context bloat on big tasks

Multi-agent

Specialized roles, parallelism, separation
Scales to complex problems
Coordination overhead, error propagation, cost explosion

Rule: Start single-agent. Split only when roles/contexts clearly separate.

Govern the tools, not just the model: An agent is only as safe as the tools it can call. Every tool/action is an attack surface and a blast radius. Apply least-privilege, require approval for high-impact actions (writes, payments, emails), sandbox execution, and log every tool call. Standardized tool interfaces (e.g. the Model Context Protocol) help, but governance is yours. An autonomous agent with write access and no guardrails is an incident waiting to happen.
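A sketch of a tool gate that applies three of those controls: an allow-list (least privilege), an approval requirement for high-impact tools, and an audit log of every attempt. `ToolPolicy`, `ToolGate`, and the tool names are hypothetical; sandboxing is out of scope for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolPolicy:
    """Hypothetical per-agent policy: which tools are allowed, which need approval."""

    allowed: set[str]
    needs_approval: set[str] = field(default_factory=set)


@dataclass
class ToolGate:
    policy: ToolPolicy
    tools: dict[str, Callable[..., str]]
    audit_log: list[dict] = field(default_factory=list)

    def call(self, name: str, approved: bool = False, **kwargs) -> str:
        """Run a tool only if policy permits it; record every attempt."""
        if name not in self.policy.allowed:
            outcome = "denied"
        elif name in self.policy.needs_approval and not approved:
            outcome = "pending_approval"
        else:
            outcome = "executed"
        self.audit_log.append({"tool": name, "args": kwargs, "outcome": outcome})
        if outcome != "executed":
            return outcome
        return self.tools[name](**kwargs)


gate = ToolGate(
    policy=ToolPolicy(allowed={"search_docs", "send_email"}, needs_approval={"send_email"}),
    tools={
        "search_docs": lambda query: f"results for {query}",
        "send_email": lambda to: f"sent to {to}",
        "delete_records": lambda table: f"deleted {table}",
    },
)
```

Note that `delete_records` exists as a tool but is not on the agent's allow-list, so the gate refuses it regardless of what the model asks for.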

System Architecture

Inference Economics — Latency, Throughput & Cost

In production, the model is the easy part; serving it economically is the hard part. These levers trade against each other, and getting them wrong is what turns a successful pilot into an unaffordable rollout.

Speed

Streaming & TTFT

Stream tokens so users see output immediately; time-to-first-token matters more to perceived speed than total time. Trades a little complexity for big UX gains.

Cost

Caching

Prompt caching (reuse repeated context) and semantic caching (serve answers to similar past queries) can cut cost & latency dramatically. Trade: cache staleness & invalidation.
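To make the staleness trade concrete, here is a deliberately simplified response cache. It matches on a normalized query string with a time-to-live, which is a stand-in for semantic caching (a real one would compare embeddings); the class name and the injectable clock are assumptions for the example.

```python
import time
from typing import Callable, Optional


class ResponseCache:
    """Exact-match cache on a normalized query, with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        """Lower-case and collapse whitespace so trivial variants share an entry."""
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # stale: force a fresh model call
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (self.clock(), answer)
```

The TTL is the invalidation policy in its crudest form: a shorter value means fresher answers and fewer cache hits.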

Efficiency

Quantization & distillation

Smaller/quantized models cut GPU cost & latency with modest quality loss. Distill a big model into a small one for hot paths. Trade: a quality dip you must eval.

Throughput

Batching

Continuous/dynamic batching raises GPU utilization and tokens/sec. Trade: individual request latency vs aggregate throughput — tune for your SLA.

Routing

Model cascades

Send easy traffic to cheap models, escalate the hard. Big cost savings; trade is routing logic + the risk of mis-routing a hard query down-tier.

Control

Token budgets

Cap output length, trim context, compress history. Every token is money & latency. Trade: terseness vs completeness.

Cost governance is mandatory: Per-token pricing means a runaway loop, a verbose prompt, or a viral feature can produce a shocking bill overnight. Build in per-user and per-tenant rate & spend limits, budget alerts, token accounting per feature, and a kill switch from day one. Treat cost as a first-class SLO alongside latency and quality.
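A sketch of the spend-limit part of that list: per-tenant accounting, an alert at a fraction of the limit, and a global kill switch. `BudgetGuard` and its thresholds are hypothetical, and rate limiting is left out to keep the example short.

```python
from collections import defaultdict


class BudgetGuard:
    """Tracks spend per tenant and refuses calls once a limit would be exceeded."""

    def __init__(self, tenant_limit: float, alert_fraction: float = 0.8):
        self.tenant_limit = tenant_limit
        self.alert_fraction = alert_fraction
        self.spend: dict[str, float] = defaultdict(float)
        self.alerts: list[str] = []
        self.kill_switch = False  # global stop, flipped by an operator

    def authorize(self, tenant: str, estimated_cost: float) -> bool:
        """Return True only if the call fits within the tenant's remaining budget."""
        if self.kill_switch:
            return False
        return self.spend[tenant] + estimated_cost <= self.tenant_limit

    def record(self, tenant: str, actual_cost: float) -> None:
        """Book the actual cost and raise a one-time alert near the limit."""
        self.spend[tenant] += actual_cost
        near_limit = self.spend[tenant] >= self.alert_fraction * self.tenant_limit
        if near_limit and tenant not in self.alerts:
            self.alerts.append(tenant)
```

Checking before the call and booking after it means an agent loop stops at the limit instead of at the invoice.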

Govern & Operate

Data, Privacy & Security

In the enterprise, this is where AI projects most often die — not on capability, but on whether they can be deployed safely and compliantly. Security is an architectural concern, not a checkbox at the end.

Data & privacy

Where does the data go?

Confirm provider terms: no training on your data, retention windows, region/residency
Redact/avoid sending PII & secrets; tokenize where possible
Enforce per-user access control on RAG sources — never let a model surface docs a user can't see
For the strictest data, self-host or use private/VPC deployments
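The access-control bullet above can be enforced between retrieval and prompt assembly. The sketch assumes each chunk carries ACL metadata (`allowed_groups`) copied from the source system at indexing time; that field and the `Chunk` type are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # hypothetical ACL metadata stored with each chunk


def filter_by_access(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks the requesting user is entitled to see."""
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("hr-001", "Salary bands for 2026", frozenset({"hr"})),
    Chunk("kb-042", "How to reset your password", frozenset({"hr", "staff"})),
]
visible = filter_by_access(retrieved, user_groups={"staff"})
```

Filtering before the chunks reach the prompt matters: once restricted text is in the context, no output filter can reliably keep the model from paraphrasing it.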

AI-specific security

New attack surface

Prompt injection — malicious instructions hidden in inputs/docs hijack the model
Data exfiltration — model coaxed into revealing context/system prompt/other users' data
Insecure tool use — agent tricked into harmful actions
Mitigate: input/output filtering, least-privilege tools, sandboxing, treat all model output as untrusted

Compliance & risk tiering: Regulations (e.g. the EU AI Act) tier obligations by risk. Classify each AI use case by impact: high-stakes decisions (credit, hiring, medical, legal) demand human oversight, documentation, explainability, and bias testing; low-stakes assistive features need far less. Match governance weight to risk: over-governing a copy-drafting tool wastes effort; under-governing a decisioning system invites disaster.

Govern & Operate

Evaluation & Quality — How You Know It Works

I can't improve what I can't measure, and I can't govern a probabilistic system without systematic evaluation. Evals are to AI systems what tests are to traditional code — but graded, not binary.

Eval type	What it does	Trade-off
Offline eval set	Run candidate prompts/models against a curated, labeled dataset	Fast, repeatable, regression-safe; only as good as the dataset's coverage
LLM-as-judge	Use a strong model to grade outputs at scale (relevance, faithfulness)	Scalable & cheap vs humans; the judge has biases & can be gamed — calibrate against humans
Human evaluation	Expert/SME review of outputs	Gold standard for nuance; slow, costly, doesn't scale — reserve for calibration & high stakes
Online / production	A-B tests, user feedback (thumbs), implicit signals	Real-world truth; noisy, lagging, needs telemetry & guardrails for safety
RAG-specific	Measure retrieval (recall/precision) & faithfulness/groundedness separately	Pinpoints whether failures are retrieval or generation; more harness to build
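For the RAG-specific row, retrieval can be scored on its own with labeled relevant documents per query. A minimal sketch of precision@k and recall@k for a single query, with made-up document ids:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@k and recall@k for one query, given labeled relevant doc ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall_at_k(
    retrieved=["d3", "d7", "d1", "d9"], relevant={"d1", "d2", "d3"}, k=4
)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67). If the answer was wrong and the missing document held the fact, the failure belongs to retrieval, not generation.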

Build the eval harness first: Before optimizing prompts or swapping models, build a representative eval set and an automated scoring pipeline. It turns "this feels better" into a number, makes model upgrades safe (regression-test before adopting), and is the only defensible way to claim quality to the business. Evals are the flywheel of every serious AI system.
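A stripped-down version of such a harness with a regression gate. The exact-match scorer, the `(input, expected)` case format, and the 0.02 tolerance are assumptions for illustration; real harnesses usually use graded scoring.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input, expected answer); hypothetical format


def score(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the normalized answer matches the label exactly."""
    correct = sum(
        model(question).strip().lower() == expected.lower() for question, expected in cases
    )
    return correct / len(cases)


def passes_gate(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Allow adoption only if the candidate does not regress beyond the tolerance."""
    return candidate >= baseline - tolerance


cases = [("2+2", "4"), ("capital of France", "Paris"), ("opposite of hot", "cold")]
current = {"2+2": "4", "capital of France": "Paris", "opposite of hot": "cold"}
candidate = {"2+2": "4", "capital of France": "paris ", "opposite of hot": "warm"}

baseline_score = score(lambda q: current[q], cases)
candidate_score = score(lambda q: candidate[q], cases)
adopt = passes_gate(candidate_score, baseline_score)
```

The candidate here gets one case wrong, so the gate blocks it. The same check applies whether the change is a new prompt, a new retrieval config, or a provider's new model version.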

Guardrails — the runtime safety layer

Wrap the model with input and output guardrails: validate/clean inputs (block injection, profanity, PII), and validate outputs (schema/format checks, factuality/grounding checks, toxicity & policy filters, citation verification). Trade-off: each guardrail adds latency and can produce false positives that block legitimate responses — tune the strictness to the stakes.
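As one example of an output guardrail, citation verification can be a cheap string check: every source the answer cites must be one that was actually retrieved. The `[doc-id]` citation format is an assumption made for this sketch.

```python
import re

CITATION = re.compile(r"\[(\w[\w-]*)\]")  # assumes citations look like [doc-id]


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag answers that cite nothing or cite sources that were never retrieved."""
    cited = set(CITATION.findall(answer))
    if not cited:
        return {"ok": False, "reason": "no citations"}
    unknown = cited - retrieved_ids
    if unknown:
        return {"ok": False, "reason": f"unknown sources: {sorted(unknown)}"}
    return {"ok": True, "reason": ""}


result = check_citations("Refunds take 30 days [kb-12].", {"kb-12", "kb-13"})
```

This only proves the citation points at a real retrieved document, not that the document supports the claim. A grounding check would sit on top of it, at additional latency.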

Govern & Operate

LLMOps in Production

A working prototype is ~10% of the job. Operating an AI system reliably over time — through model updates, data drift, and changing usage — is the other 90%, and where rigorous ops discipline pays off.

Treat prompts, models, and retrieval configs as versioned, evaluated artifacts. Every change — including a provider's silent model update — must pass the eval gate before reaching users.

Observability: trace every request (prompt, retrieved context, tool calls, tokens, cost, latency) — you can't debug what you can't see
Versioning: pin model versions; version prompts & RAG configs; enable instant rollback
Drift & quality monitoring: watch output quality, refusal rates, latency, and cost over time; alert on regressions
Feedback loop: capture user signals (thumbs, edits, escalations) and feed them back into evals & fine-tuning data
Human-in-the-loop: route low-confidence or high-stakes outputs to human review; design the handoff
Graceful degradation: fallbacks for provider outages, rate limits, timeouts (cheaper model, cached answer, "try later")
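The graceful-degradation item can be sketched as an ordered fallback chain: try each provider in turn, then a cached answer, then a "try later" message. `ProviderError` is a hypothetical exception standing in for outages, rate limits, and timeouts.

```python
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error type standing in for outages, rate limits, and timeouts."""


def with_fallbacks(
    prompt: str,
    providers: list[Callable[[str], str]],
    cached_answer: Optional[str] = None,
) -> tuple[str, str]:
    """Try providers in order, then a cached answer, then a polite failure."""
    for index, provider in enumerate(providers):
        try:
            return f"provider_{index}", provider(prompt)
        except ProviderError:
            continue  # fall through to the next, usually cheaper, option
    if cached_answer is not None:
        return "cache", cached_answer
    return "unavailable", "The assistant is temporarily unavailable. Please try again later."


def primary(prompt: str) -> str:
    """Stand-in for a provider that is currently failing."""
    raise ProviderError("rate limited")


def backup(prompt: str) -> str:
    """Stand-in for a cheaper fallback model."""
    return f"backup answer to: {prompt}"


source, answer = with_fallbacks("hi", [primary, backup])
```

Returning the source alongside the answer lets the trace record which tier actually served the request, which is useful when monitoring how often the system runs degraded.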

Decision Toolkit

Master Trade-Off Matrix

My quick reference for enterprise AI decisions. For each choice: when to lean each way, and the price of the lean.

Decision	Lean A when…	Lean B when…	The cost you pay
GenAI vs classic ML/rules	Open-ended language/reasoning task	Structured, high-volume, must be auditable	Cost & nondeterminism, or limited flexibility
Prompt vs RAG vs fine-tune	General task / need fresh facts (prompt/RAG)	Need consistent behavior or lower cost (fine-tune)	Pipeline complexity, or stale knowledge + re-tune cost
RAG vs long context	Large/dynamic corpus, cost-sensitive	Small static docs, want simplicity	Retrieval complexity, or token cost & "lost in middle"
Closed API vs self-hosted	Speed to ship, best quality, low/spiky volume	Strict privacy, high steady volume, no lock-in	Data exposure & lock-in, or MLOps burden
Frontier vs small model	Hard reasoning, quality is the product	Routine task, cost & latency matter	High cost/latency, or lower ceiling — so route
Workflow vs autonomous agent	Path is known & enumerable	Open-ended, multi-step, novel tasks	Rigidity, or unpredictability + cost
Single vs multi-agent	One coherent context	Clearly separable specialized roles	Context bloat, or coordination overhead & cost
Latency vs quality	Interactive UX, real-time	Accuracy is critical, async is fine	Lower quality, or slower + costlier responses
Automation vs human-in-loop	Low stakes, high volume, tolerant of error	High stakes, regulated, costly mistakes	Risk of silent error, or throughput & cost
Build vs buy (AI features)	Core differentiator on your data	Commodity capability (chat, search, summarize)	Maintenance & expertise, or lock-in & fit

Decision Toolkit

Decision Framework I Use

How I take an enterprise AI initiative from idea to a defensible, production-ready design without chasing hype or burning budget.

Define the job & the bar. What business outcome, and what does "good enough" mean numerically (accuracy, latency, cost-per-request, acceptable error rate)? Without a target, you'll over-build forever.
Question whether you need GenAI at all. Could rules, search, or classic ML do this cheaper, faster, and auditably? I reach for the probabilistic hammer only where the problem is genuinely a nail for it.
Climb the adaptation ladder from the bottom. Start with prompting; add RAG for knowledge; fine-tune only if behavior/cost demands it. Stop at the lowest rung that hits the bar.
Pick a model strategy behind an abstraction layer. Choose closed vs open, frontier vs small, by your constraint corner (quality/latency/cost/governance) — but keep models swappable.
Build the eval harness before optimizing. A representative dataset + automated scoring is the prerequisite to every later decision and to safe model upgrades.
Design the trust & safety layer. Grounding, guardrails, access control, human-in-the-loop, cost limits, and observability — proportional to the use case's risk tier.
Pilot, measure, then harden for production. Validate value on real users with real evals; then invest in LLMOps (versioning, monitoring, rollback, drift) for the long haul.
Write the ADR & the revisit trigger. Record the decision, the trade-offs accepted, and the signal to reconsider — because in this field, a better/cheaper option will appear within months.

The question I keep asking: I replace "which model is best?" with: "What's the cheapest, simplest, most governable approach that clears the quality bar for this specific job, and how will I know when it stops being the right one?" The answer is always "it depends," and it will change. Design for change.

Decision Toolkit

Enterprise AI Anti-Patterns

The recurring, expensive mistakes — most stem from chasing capability without governing it, or reaching for complexity the problem didn't require.

Solution looking for a problem

AI for AI's sake

Bolting an LLM onto a problem better solved by rules, search, or a form. Complexity & cost with no real benefit.

Fine-tuning to teach facts

Wrong tool

Burning money to bake knowledge into weights that immediately goes stale — when RAG would be cheaper, fresher, and citable.

Frontier model for everything

No right-sizing

Routing trivial classification through the most expensive model. Runaway cost & latency; no model cascade.

Demo-driven development

No eval

"Looks great in the demo" with no eval set, so quality silently regresses and no one can prove it works. The pilot that never reaches production.

Premature agents

Autonomy without need

Multi-agent frameworks for tasks a simple workflow handles. Unpredictable, costly, undebuggable.

Vendor lock-in by default

No abstraction layer

Hard-wiring one provider's SDK throughout the app, so switching or comparing models means a rewrite.

Governance as an afterthought

Security last

Ignoring prompt injection, data leakage, access control, and compliance until launch — when they become blockers or breaches.

Ignoring cost until the bill

No cost SLO

No token accounting, limits, or alerts — until a loop or viral feature produces a five-figure surprise.

Trusting output blindly

No grounding/HITL

Treating confident hallucinations as truth on high-stakes decisions, with no citations, checks, or human review.

Decision Toolkit

Pre-Flight Checklist

I run this before committing to or shipping an AI system. A blank line is an undecided trade-off that will surface later — usually in production.

Problem & strategy

Business outcome & numeric "good enough" bar defined
Confirmed GenAI beats rules/classic ML for this job
Lowest sufficient rung on the adaptation ladder chosen
Model strategy fits the constraint corner (quality/latency/cost/gov)
Model-abstraction layer in place; models are swappable

Quality & evaluation

Representative eval set + automated scoring exists
Retrieval quality measured separately from generation (if RAG)
Regression-eval gate before adopting any new model version
Hallucination mitigated: grounding, citations, output schemas
Online feedback signals captured and fed back

Governance & security

Data flow & provider terms reviewed (no-train, retention, residency)
PII handling & per-user access control on knowledge sources
Prompt-injection & output filtering guardrails in place
Use case risk-tiered; human-in-loop where stakes are high
Tool/agent actions least-privilege, approved & logged

Operations & cost

Per-request tracing: tokens, cost, latency, context, tool calls
Cost SLO: rate/spend limits, budget alerts, kill switch
Right-sizing/routing & caching to control cost & latency
Versioning & instant rollback for models, prompts, configs
Drift/quality monitoring + graceful degradation on provider failure

The one sentence I leave with: A successful enterprise AI system is not the one using the most powerful model; it is the simplest, most governable design that reliably clears the quality bar at an acceptable cost, with every probabilistic risk grounded, measured, and bounded, and it is built to swap its parts as the field moves beneath it.

Sources

References & Sources

Annotated bibliography behind this enterprise AI architecture note — model strategy, RAG, agents, inference economics, governance, evaluation, and LLMOps. Section tags (e.g. §04) show where each source informed the prose, tables, and diagrams. Constraint diamonds, scorecards, pipeline SVGs, and synthesis tables are my own unless noted.

Scope. Synthesis of peer-reviewed papers, industry frameworks, vendor documentation, and regulatory texts (May 2026). Model names, pricing, and API capabilities change monthly — verify against current vendor docs before committing to a one-way door. Scorecard star ratings are relative tendencies for shortlisting, not benchmarks. Not legal, compliance, or procurement advice.

Citations are numbered continuously [1]–[n] within this section.

Enterprise AI architect mandate & replaceability (§01, §16)

Richards, M., & Ford, N., Fundamentals of Software Architecture. O'Reilly, 2020. Architectural trade-offs and driving characteristics — §01 mandate framing and §16 decision framework. — §01, §16.
Sowell, T., Applied Economics (and related work). "There are no solutions, only trade-offs" — adapted throughout this note; §01 lead-rule theme. — §01, §15.
Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. ML/AI systems require ops discipline beyond the model — §01 trust architecture and §14 LLMOps. NeurIPS proceedings — §01, §14, §17.
Portkey / LiteLLM / OpenRouter documentation (industry pattern). Model-abstraction and gateway layers for multi-provider routing — §01 abstraction-layer callout. litellm.ai — §01, §06, §17.
Nygard, M., "Documenting Architecture Decisions." Cognitect, 2011. ADR format — §16 step 8 and §18 checklist. cognitect.com — §16, §18.

Deterministic → probabilistic systems (§02)

Bender, E. M., et al., "On the Dangers of Stochastic Parrots." FAccT, 2021. Non-deterministic, opaque LLM behavior — §02 shift table. DOI: 10.1145/3442188.3445922 — §02.
Ji, Z., et al., "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 2023. Hallucination as inherent risk to manage — §02 defining-risk callout. DOI: 10.1145/3571730 — §02, §13.
OpenAI Platform documentation — temperature, sampling, model versioning. Stochastic outputs and version pinning — §02 output/correctness rows and §14 versioning. platform.openai.com — §02, §14.
Anthropic documentation — model behavior, streaming, safety. Latency and streaming UX — §02 latency row and §11 TTFT card. docs.anthropic.com — §02, §11.
Google, "Responsible AI practices." Grounding, citations, human review on high-stakes outputs — §02 failure-mode row. ai.google — §02, §12, §13.

Quality · latency · cost · governance constraints (§03, §15)

Brewer, E., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 2012. Distributed trade-off analogy for §03 constraint diamond (adapted to AI axes). — §03.
European Parliament & Council, Regulation (EU) 2024/1689 (AI Act). Risk-tiered governance obligations — §03 governance corner and §12 compliance callout. EUR-Lex — §03, §12, §18.
NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023. Risk-based AI governance — §12 risk tiering. nist.gov — §12, §16.
ISO/IEC 42001:2023 — Artificial intelligence management system. Enterprise AI governance baseline — §12 compliance context. — §12, §18.
Truong, L. (synthesis). Quality–latency–cost–governance constraint diamond — §03 figure and §15 matrix rows. LinhTruong.com — §03, §15.

Adaptation ladder: prompt · RAG · fine-tune · pre-train (§04, §05, §08)

OpenAI, "Prompt engineering guide." Prompting as first rung — §04 ladder step 1. platform.openai.com — §04, §05.
Anthropic, "Prompt engineering overview." System prompts, few-shot, chain-of-thought patterns — §04 ladder. docs.anthropic.com — §04.
Lewis, P., et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. RAG as rung 2 — §04 ladder and §09 architecture. arxiv.org/abs/2005.11401 — §04, §05, §09, §13.
Hu, E. J., et al., "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. Parameter-efficient fine-tuning — §04 rung 3 (LoRA/PEFT). arxiv.org/abs/2106.09685 — §04, §05.
Rafailov, R., et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023. DPO alignment — §04 fine-tuning row. arxiv.org/abs/2305.18290 — §04.
Hugging Face, PEFT & TRL documentation. SFT/DPO tooling — §04 adaptation table. huggingface.co/docs/peft — §04, §06.
Meta AI, Llama model cards & continued-pretraining guidance. Domain adaptation rungs 4–5 — §04 ladder top steps. ai.meta.com/llama — §04, §06.

RAG vs fine-tuning vs long context (§05, §09, §15)

OpenAI fine-tuning documentation. Fine-tune for behavior/format, not facts — §05 rule-of-thumb callout. platform.openai.com — §05, §17.
Liu, N. F., et al., "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. Long-context degradation — §05 long-context table "lost in the middle" row. arxiv.org/abs/2307.03172 — §05, §09, §15.
Google Gemini / Anthropic Claude documentation — context windows. When to stuff documents into the context vs retrieve them — §05 long-context trade-off. — §05, §15.
Gao, Y., et al., "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv, 2023. RAG vs fine-tuning decision space — §05 tradeoff block. arxiv.org/abs/2312.10997 — §05, §09.
LangChain & LlamaIndex documentation — RAG pipelines. Chunking, retrieval, augmentation patterns — §05 decision callout and §09 pipeline. python.langchain.com — §05, §09.

Closed API vs open-weight self-hosting (§06, §08, §15)

OpenAI API & Azure OpenAI Service documentation — enterprise data handling, no-train terms. Managed API governance — §06 closed-model column and §12 data-flow bullets. learn.microsoft.com — §06, §12.
Anthropic API — data retention & enterprise privacy. Provider terms checklist — §12 no-train/retention bullets. anthropic.com/privacy — §06, §12.
Touvron, H., et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models." Meta, 2023. Open-weight self-host baseline — §06 self-hosted column. arxiv.org/abs/2307.09288 — §06, §08.
Mistral AI model documentation. Open-weight enterprise deployments — §06 hybrid pattern. docs.mistral.ai — §06, §08.
NVIDIA NIM / vLLM / TGI (Text Generation Inference) documentation. Self-hosted inference stacks — §06 MLOps burden and §11 serving. docs.vllm.ai — §06, §11, §14.
McKinsey / a16z industry analyses (synthesized). API vs GPU cost crossover at utilization — §06 cost-crossover callout (order-of-magnitude; re-model locally). — §06, §11.

Model sizing, routing & distillation (§07, §11, §15)

Chen, L., et al., "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv, 2023. Model cascades and routing — §07 cascade callout and §11 routing card. arxiv.org/abs/2305.05176 — §07, §11, §15.
Hinton, G., Vinyals, O., & Dean, J., "Distilling the Knowledge in a Neural Network." NeurIPS Deep Learning Workshop, 2014. Distillation concept — §07 distillation bullet and §11 efficiency card. arxiv.org/abs/1503.02531 — §07, §11.
Microsoft Research, Phi model series technical reports. Small language models for routine tasks — §07 frontier vs SLM figure. microsoft.com/research — §07, §08.
Google DeepMind, Gemma model documentation. Mid/small open models — §08 scorecard rows. ai.google.dev/gemma — §07, §08.
Speculative decoding literature (Leviathan et al.; Chen et al.). Latency reduction techniques — background for §11 speed levers. arxiv.org/abs/2211.17192 — §11.

RAG architecture, retrieval & vector stores (§09, §13)

Robertson, S., & Zaragoza, H., "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. BM25 keyword retrieval — §09 hybrid-search row. — §09.
Karpukhin, V., et al., "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP, 2020. Dense vector retrieval — §09 retrieval-method row. arxiv.org/abs/2004.04906 — §09.
Reimers, N., & Gurevych, I., "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019. Embedding models for RAG — §09 embed stage. arxiv.org/abs/1908.10084 — §09.
Cohere Rerank API documentation; cross-encoder re-ranking practice. Re-ranking trade-off — §09 re-ranking row. docs.cohere.com — §09.
pgvector extension documentation (PostgreSQL). pgvector option — §09 vector-store table. github.com/pgvector/pgvector — §09.
Pinecone, Weaviate, Qdrant, Milvus documentation. Dedicated vector DB trade-offs — §09 vector-store table. — §09.
Elastic / OpenSearch kNN & hybrid search documentation. Search-engine + vector option — §09 vector-store table. elastic.co — §09.
Es, S., et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL, 2024. Separate retrieval vs generation metrics — §09 figcaption and §13 RAG-specific eval. arxiv.org/abs/2309.15217 — §09, §13.

Agents, tools & orchestration (§10, §15, §17)

Yao, S., et al., "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023. Plan→act→observe agent loop — §10 spectrum diagram. arxiv.org/abs/2210.03629 — §10.
Schick, T., et al., "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023. Tool-use foundation — §10 router+tools step. arxiv.org/abs/2302.04761 — §10.
OpenAI function calling / tools documentation. Structured tool interfaces — §10 workflow vs agent trade-offs. platform.openai.com — §10.
Anthropic, Model Context Protocol (MCP) specification. Standardized tool/context interfaces — §10 govern-the-tools callout. modelcontextprotocol.io — §10, §12.
LangGraph / Temporal / AWS Step Functions documentation (pattern). Deterministic workflows with LLM steps — §10 prefer-workflows figcaption. langchain-ai.github.io/langgraph — §10, §15.
Wu, Q., et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv, 2023. Multi-agent coordination costs — §10 single vs multi-agent trade-off. arxiv.org/abs/2308.08155 — §10, §15, §17.
Shinn, N., et al., "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS, 2023. Autonomous loop risks — §10 autonomy cost theme. arxiv.org/abs/2303.11366 — §10, §17.

Inference economics & serving (§11, §14, §18)

Kwon, W., et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention." (vLLM). SOSP, 2023. Continuous batching & throughput — §11 batching card. arxiv.org/abs/2309.06180 — §06, §11.
OpenAI prompt caching documentation; Anthropic prompt caching. Prompt caching cost/latency — §11 caching card. platform.openai.com — §11.
FrugalGPT (Chen et al.) — semantic caching & cascade economics. Cost reduction patterns — §11 caching and routing cards. — §07, §11.
Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. Quantization trade-offs — §11 quantization card. arxiv.org/abs/2210.17323 — §11.
Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024. Quantized serving — §11 efficiency card. arxiv.org/abs/2306.00978 — §11.
Google SRE Team, Site Reliability Engineering — SLOs & error budgets. Cost as first-class SLO — §11 cost-governance callout. sre.google — §11, §18.

Data governance, privacy & AI security (§12, §17, §18)

OWASP Foundation, Top 10 for Large Language Model Applications (2023–2025). Prompt injection, insecure output, excessive agency — §12 AI-specific security and §17 governance anti-pattern. owasp.org — §12, §17, §18.
Greshake, K., et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv, 2023. Prompt injection attacks — §12 prompt-injection bullet. arxiv.org/abs/2302.12173 — §12, §17.
EU GDPR (Regulation 2016/679) & UK GDPR. PII handling, residency — §12 redact/avoid PII bullets. gdpr-info.eu — §12.
Microsoft Azure OpenAI / AWS Bedrock — VPC/private deployment patterns. Strictest-data self-host/private options — §12 privacy card. — §06, §12.
NIST SP 800-53 / Zero Trust Architecture (SP 800-207). Least-privilege tool access — §10 & §12 least-privilege bullets. csrc.nist.gov — §10, §12.
NeMo Guardrails / Guardrails AI / Llama Guard documentation. Input/output guardrails — §13 guardrails section. github.com/NVIDIA/NeMo-Guardrails — §13.

Evaluation, quality & LLM-as-judge (§13, §14, §18)

Liang, P., et al., "Holistic Evaluation of Language Models (HELM)." Stanford CRFM, 2022+. Broad eval framing — §13 eval types table. crfm.stanford.edu/helm — §13.
Zheng, L., et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS, 2023. LLM-as-judge trade-offs — §13 LLM-as-judge row. arxiv.org/abs/2306.05685 — §13.
Es, S., et al., RAGAS framework documentation. Faithfulness, answer relevance, context precision — §13 RAG-specific eval. docs.ragas.io — §09, §13.
LangSmith / Weights & Biases / Arize Phoenix documentation. Offline eval sets, tracing, regression gates — §13 build-eval-first callout and §14 observability. docs.smith.langchain.com — §13, §14.
Breck, E., et al., "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." IEEE Big Data, 2017. ML testing discipline adapted to LLMOps — §13 eval-as-tests analogy. research.google — §13, §14.
Human-in-the-loop AI literature (Holzinger et al.; industry HITL practice). High-stakes review — §13 guardrails stakes tuning and §18 checklist. — §13, §15, §18.

LLMOps & production operations (§14, §18)

Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015; MLOps industry evolution. Prototype vs production ops split — §14 opening sub. — §14, §17.
Google, "Rules of Machine Learning: Best Practices for ML Engineering." ML engineering rules adapted to LLMOps — §14 checklist themes. developers.google.com — §14.
OpenTelemetry GenAI semantic conventions (draft). Trace prompts, tokens, tool calls — §14 observability bullet. opentelemetry.io — §14, §18.
MLflow / Weights & Biases model & prompt registry documentation. Versioning models, prompts, RAG configs — §14 versioning bullet. mlflow.org — §14, §18.
Evidently AI / WhyLabs drift monitoring documentation. Quality & drift monitoring — §14 drift bullet. evidentlyai.com — §14.
Nygard, M. T., Release It! (2nd ed.). Pragmatic Bookshelf, 2018. Graceful degradation, bulkheads — §14 graceful-degradation bullet. — §14, §18.

Decision framework, ADRs & anti-patterns (§15, §16, §17, §18)

Nygard, M., "Documenting Architecture Decisions." 2011; ADR community resources. Architecture Decision Records — §16 ADR step and §18 operability checklist. adr.github.io — §16, §18.
Amazon leadership — "Disagree and commit"; Type 1/2 decisions (Bezos). Reversible vs one-way doors — §04 principle 3 and §16 framework. — §04, §16.
Beck, K., Extreme Programming Explained — YAGNI. Simplest sufficient design — §16 step 6 and §18 complexity-earned line. — §16, §18.
Chinchilla scaling laws (Hoffmann et al., 2022). Background for model-size economics — §07–§08 sizing context. arxiv.org/abs/2203.15556 — §07, §08.
Truong, L., Enterprise AI Architect Trade-Offs — personal working notes. May 2026. Master trade-off matrix, model-strategy scorecard, constraint diamond, RAG pipeline & agent spectrum diagrams, pre-flight checklist, synthesis prose. LinhTruong.com — all sections.

Before you cite externally. Model capabilities, API pricing, context-window limits, and regulatory deadlines change frequently — verify against current vendor docs and legal counsel. LLM-as-judge scores require calibration against human labels. Self-hosted GPU economics depend on utilization and staffing costs unique to your org. EU AI Act obligations phase in over time; classify your use cases with qualified advisors. Scorecard ★ ratings are illustrative, not benchmark results.