Personal notes · May 2026

Enterprise AI Architect Trade-Offs

Generative AI broke an assumption I relied on for years: that systems are deterministic. The core component is now probabilistic, cost is per-token, latency is measured in seconds, and the same input can yield different outputs. I wrote this note to map the trade-off space I use when building AI systems an enterprise can trust — across model strategy, retrieval, agents, inference economics, governance, and operations.

The question behind this note: where does a probabilistic component belong in the stack — and what trade-offs make that choice governable, measurable, and replaceable? What follows is the framework I use to answer that.
Covers: LLMs · RAG · agents · LLMOps · governance · cost
Written: May 2026
By: Linh Truong

Foundations

The Enterprise AI Architect Mandate — How I Frame It

You're still an architect — but the material has changed. I focus on delivering business value from AI reliably, safely, and economically, while managing a component that is non-deterministic, rapidly evolving, and prone to confident error.

Everything in AI architecture is a trade-off — and a moving one. The model I benchmark today may be superseded next quarter, so I design for replaceability, not for any single model.

Decides

Where AI belongs (and doesn't)

Which problems warrant a probabilistic model vs deterministic code or classic ML. The first trade-off is whether to use a generative model at all — often a rules engine or a fine-tuned classifier is cheaper, faster, and auditable.

Decides

The model & adaptation strategy

Closed API vs open weights, frontier vs small, prompt vs RAG vs fine-tune. These set the cost, latency, quality, and governance envelope of everything downstream.

Decides

The trust architecture

Evaluation, guardrails, observability, human-in-the-loop, and data governance. In the enterprise, trustworthiness is the product — an impressive demo that can't be governed never ships.

Abstraction layer I insist on: I never hard-wire an application to one vendor's SDK. I put a model-abstraction / gateway layer between the app and providers so we can swap models, route by task, fail over, and compare without rewrites. In a field this volatile, the ability to switch is itself an architectural feature.
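A minimal sketch of the gateway idea, assuming every provider can be wrapped as a plain `prompt -> text` callable. `ModelGateway`, `vendor_a`, and `vendor_b` are hypothetical names for illustration, and the lambdas stand in for real SDK adapters.

```python
from typing import Callable, Dict

# Assumption: each vendor client is wrapped to fit this one shape.
Provider = Callable[[str], str]


class ModelGateway:
    """Routes each task to a named provider so app code never imports a vendor SDK."""

    def __init__(self) -> None:
        self._providers: Dict[str, Provider] = {}
        self._routes: Dict[str, str] = {}

    def register(self, name: str, provider: Provider) -> None:
        self._providers[name] = provider

    def route(self, task: str, provider_name: str) -> None:
        # Fail early on a typo instead of at request time.
        if provider_name not in self._providers:
            raise KeyError(f"unknown provider: {provider_name}")
        self._routes[task] = provider_name

    def complete(self, task: str, prompt: str) -> str:
        return self._providers[self._routes[task]](prompt)


# Stub providers stand in for real model clients (hypothetical).
gateway = ModelGateway()
gateway.register("vendor_a", lambda p: f"[A] {p}")
gateway.register("vendor_b", lambda p: f"[B] {p}")

gateway.route("summarize", "vendor_a")
first = gateway.complete("summarize", "hello")

gateway.route("summarize", "vendor_b")  # the swap is a routing change, not a rewrite
second = gateway.complete("summarize", "hello")
```

The calling code (`gateway.complete("summarize", ...)`) is identical before and after the swap, which is the property the abstraction layer is meant to buy.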

Foundations

The Fundamental Shift: Deterministic → Probabilistic

Classical software is deterministic: same input, same output, testable to exact equality. LLM-based systems are stochastic, opaque, and fallible. Every architectural instinct has to be re-examined through that lens.

| Dimension | Traditional software | LLM-based system | Architectural consequence |
|---|---|---|---|
| Output | Deterministic, exact | Probabilistic, varies per call | Test distributions & quality, not equality; set temperature deliberately |
| Correctness | Right or wrong (verifiable) | A spectrum; can be confidently wrong | Evals with scores; guardrails; human review on high stakes |
| Latency | Milliseconds | Hundreds of ms to many seconds (streaming) | Stream tokens; design async UX; cache aggressively |
| Cost | Fixed compute, ~free per call | Per-token, scales with usage | Cost is a runtime variable to engineer, not a fixed line item |
| Failure mode | Crash / exception (loud) | Plausible hallucination (silent) | Grounding, citations, confidence signals, fallbacks |
| Behavior change | Only when you deploy | Can shift when the provider updates the model | Pin versions; regression-eval before adopting new models |
The defining risk: Hallucination is not a bug to be fully eliminated; it is an inherent property of the technology to be managed. Architect for it: ground responses in retrieved facts, cite sources, constrain outputs to schemas, and keep a human in the loop wherever a confident error is expensive. Treat every model output as "probably right" until verified.

Foundations

The AI Constraint Set — Pick Your Corner

Just as distributed systems have CAP, enterprise AI systems live inside a tension between quality, latency, cost, and governance. I can't maximize all four at once — the architecture is the act of choosing which corner to occupy.

[Figure: The AI constraint diamond. Every lever moves the others. Your AI system sits between four corners: Quality (accuracy · reasoning), Latency (speed · UX), Cost ($/token · GPU), Governance (privacy · safety).]
Want the highest quality? Use a frontier model (more cost, more latency). Need low latency and cost? Use a small model (less quality). Need strong governance? Self-host or use a private deployment (more cost, more ops). Name the corner your use case demands.
Quality

Accuracy, reasoning depth, instruction-following, factual grounding. Bought with bigger models, retrieval, and eval-driven iteration.

Latency

Time-to-first-token + tokens/sec. Bought with smaller models, streaming, caching, and shorter prompts.

Cost

Per-token API spend or GPU amortization. Bought down with smaller/cheaper models, caching, batching, routing.

Governance

Data privacy, residency, auditability, safety, compliance (e.g. EU AI Act). Often pushes toward private/self-hosted deployments.

Model Strategy

The Adaptation Ladder — How to Make a Model "Yours"

The single most important strategy decision: how do you specialize a general model for your domain? There is a ladder of techniques from cheap-and-fast to expensive-and-powerful. Climb only as high as the requirement forces you.

[Figure: The adaptation ladder. Start at the bottom; climb only when the rung below is exhausted. Cost, effort, and expertise rise with each rung.]
  1. Prompt engineering: fastest · cheapest · no training · iterate in minutes. Limited by the model's built-in knowledge & context window.
  2. RAG (retrieval-augmented): inject your private/fresh data at query time · grounded · auditable. Adds a retrieval pipeline & vector store to build & tune.
  3. Fine-tuning (LoRA/PEFT, SFT, DPO): teach style, format, narrow tasks · smaller model can match a big one. Needs labeled data, training pipeline, eval & re-tuning over time.
  4. Continued pre-training: inject deep domain knowledge into the weights · large data & compute. Rarely justified outside specialized domains.
  5. Train from scratch: $millions · only for AI-native companies with unique scale/data.
The 80/20 rule of enterprise AI: most business value is captured at rungs 1–2 (prompting + RAG). Fine-tuning is for style, format, latency/cost reduction, or narrow tasks — not for teaching facts (that's RAG's job). Training from scratch is almost never the right enterprise choice.
| Technique | Best for | Cost / effort | Data freshness | Key risk |
|---|---|---|---|---|
| Prompt engineering | General tasks, fast prototypes, orchestration | Lowest | Live (via context) | Brittle prompts; context-window limits |
| RAG | Q&A over private/changing knowledge | Medium | As fresh as the index | Retrieval quality is the ceiling; chunking pain |
| Fine-tuning | Consistent style/format, narrow tasks, smaller-model substitution | Medium–high | Frozen at train time | Stale knowledge; re-tune treadmill; data drift |
| Continued pre-training | Deep domain language (legal, biomed, code) | High | Frozen | Large data + compute; catastrophic forgetting |
| From scratch | AI-native products, unique modality/scale | Extreme | Frozen | $millions; rarely beats adapting an open model |

Model Strategy

RAG vs Fine-tuning vs Long Context

The most common architecture debate. They are not mutually exclusive — but they solve different problems, and confusing them is the most expensive AI mistake teams make.

RAG vs Fine-tuning: knowledge ⇄ behavior
RAG — change what it KNOWS
  • Inject fresh, private, citable facts at query time
  • Update knowledge by updating the index, no retraining
  • Reduces hallucination via grounding; auditable sources
  • Adds retrieval latency + a pipeline to maintain
VS
Fine-tuning — change how it BEHAVES
  • Teach tone, format, structure, a narrow skill
  • Can let a small cheap model match a big one on a task
  • Knowledge is frozen at training time → goes stale
  • Needs labeled data + an ongoing re-tune cycle
Rule of thumb: Need facts (current, private, verifiable)? → RAG. Need form (consistent style/format/behavior) or lower cost/latency? → fine-tune. Need both? Do both: fine-tune for behavior, RAG for knowledge. Never fine-tune to teach facts — it's expensive, lossy, and stale.

...and the third option: just use a bigger context window

Frontier models now accept very large contexts, so for small/static corpora you can sometimes skip retrieval and stuff the documents straight into the prompt. The trade-off:

| | RAG (retrieve top-k) | Long context (stuff it all) |
|---|---|---|
| Cost per query | Low (only relevant chunks) | High (pay for all tokens every call) |
| Latency | Lower | Higher (more tokens to process) |
| Scales to large corpora | Yes (millions of docs) | No (window-bounded) |
| Accuracy on big inputs | Depends on retrieval quality | "Lost in the middle" degradation |
| Simplicity | More moving parts | Dead simple |
Decision: Tiny, static knowledge (a policy doc, a manual) → long context for simplicity. Large, dynamic, or cost-sensitive knowledge → RAG. Combine with prompt/semantic caching to blunt the long-context cost.
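To make the cost-per-query row concrete, here is a rough arithmetic sketch. Every number in it (prices, corpus size, chunk size, top-k) is an illustrative assumption, not vendor pricing, and it counts token spend only, ignoring the embedding and vector-store cost that RAG adds.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Token cost of one call, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Illustrative assumptions only.
PRICE_IN, PRICE_OUT = 3.0, 15.0       # USD per million tokens (assumed)
CORPUS_TOKENS = 200_000               # whole corpus stuffed into every prompt
RAG_CONTEXT_TOKENS = 5 * 800          # top-5 chunks of roughly 800 tokens each
QUESTION_TOKENS, ANSWER_TOKENS = 100, 400

long_context = cost_per_query(CORPUS_TOKENS + QUESTION_TOKENS, ANSWER_TOKENS,
                              PRICE_IN, PRICE_OUT)
rag = cost_per_query(RAG_CONTEXT_TOKENS + QUESTION_TOKENS, ANSWER_TOKENS,
                     PRICE_IN, PRICE_OUT)

print(f"long context: ${long_context:.4f} per query")
print(f"RAG:          ${rag:.4f} per query")
print(f"ratio:        {long_context / rag:.1f}x")
```

Under these assumptions the input tokens dominate the long-context bill, which is why prompt caching changes the picture: cached input is typically billed at a discount, so rerun the numbers with your provider's cached-token price before deciding.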

Model Strategy

Closed API vs Open-Weight Self-Hosted

A defining build-vs-buy decision with major cost, control, and governance implications. The gap between frontier closed models and the best open-weight models has narrowed sharply — making this a live choice for most enterprises.

Managed API (closed) vs Self-Hosted (open weights): convenience ⇄ control
Closed model via API
  • Best-in-class quality, zero infra to run
  • Instant access, scales for you, fast to ship
  • Data leaves your boundary (mitigated by enterprise/no-train terms)
  • Per-token cost scales with usage; vendor lock-in & model deprecation risk
VS
Open-weight, self-hosted
  • Full data control & residency; runs in your VPC/on-prem
  • Fixed GPU cost (cheap at high steady volume); no lock-in
  • Customizable & fine-tunable freely; version stability
  • You own the MLOps: GPUs, scaling, optimization, on-call
Rule: Start on a managed API to validate value fast. Move toward self-hosting when one of these bites: strict data-residency/privacy, very high steady volume (GPU amortization beats per-token), need for deep customization, or unacceptable lock-in. Many enterprises run a hybrid: APIs for frontier reasoning, self-hosted small models for high-volume routine tasks.
The cost crossover: Per-token APIs are cheapest at low/spiky volume; self-hosted GPUs are cheapest at high steady volume, provided you keep them well-utilized. An idle reserved GPU is pure waste. Model the crossover point with realistic utilization, and include the fully-loaded cost of the MLOps team in the self-host column.
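The crossover can be modeled in a few lines. This is a sketch under stated assumptions: the API price, GPU rental cost, MLOps cost, and per-GPU throughput below are placeholders to replace with your own quotes and benchmarks.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def self_host_fixed_cost(gpu_count: int, usd_per_gpu_month: float,
                         ops_usd_month: float) -> float:
    """Monthly self-host cost: paid whether the GPUs are busy or idle."""
    return gpu_count * usd_per_gpu_month + ops_usd_month


def crossover_tokens(fixed_usd_month: float, api_usd_per_m_tokens: float) -> float:
    """Monthly token volume above which self-hosting beats the per-token API."""
    return fixed_usd_month / api_usd_per_m_tokens * 1_000_000


def utilization_needed(tokens_per_month: float, gpu_count: int,
                       tokens_per_gpu_second: float) -> float:
    """Fraction of total GPU capacity required to serve that volume."""
    capacity = gpu_count * tokens_per_gpu_second * SECONDS_PER_MONTH
    return tokens_per_month / capacity


# Placeholder inputs (assumptions, not benchmarks).
fixed = self_host_fixed_cost(gpu_count=4, usd_per_gpu_month=2_000,
                             ops_usd_month=25_000)
breakeven = crossover_tokens(fixed, api_usd_per_m_tokens=5.0)
utilization = utilization_needed(breakeven, gpu_count=4,
                                 tokens_per_gpu_second=1_500)

print(f"fixed cost:  ${fixed:,.0f}/month")
print(f"break-even:  {breakeven:,.0f} tokens/month")
print(f"utilization needed at break-even: {utilization:.0%}")
```

Note how the MLOps line item dominates the fixed cost in this example; leaving it out would move the break-even point far lower and flatter the self-host case.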

Model Strategy

Frontier vs Small Models — Right-Sizing

Using a frontier model for everything is the most common source of runaway AI cost and latency. Match the model to the task; route deliberately.

[Figure: Capability rises with size, but so do cost & latency. Axes: model size / capability, cost & latency. Points along the curve: small / SLM, mid-tier, frontier.]
Small/specialized models are dramatically cheaper and faster, and for routine, well-scoped tasks (classification, extraction, routing, simple drafting) they're often good enough. Reserve frontier models for genuinely hard reasoning.
Model routing / cascade: Architect a router that sends easy requests to a cheap small model and escalates only the hard ones to a frontier model (optionally, let the small model "give up" and hand off). Depending on how much of the traffic is easy, this cascade pattern can cut cost by as much as 5–10× while preserving quality on the requests that need it. Pair it with model distillation (train a small model on a big model's outputs) to push more volume down-tier over time.
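A minimal sketch of the cascade, assuming the small model signals "give up" by returning `None`. The two stub models and the word-count heuristic are hypothetical stand-ins; a real router would use a confidence score or a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    tier: str  # which model tier produced the answer


def cascade(query: str,
            small: Callable[[str], Optional[str]],
            large: Callable[[str], str]) -> Answer:
    """Try the cheap model first; escalate only when it declines."""
    draft = small(query)
    if draft is not None:
        return Answer(draft, "small")
    return Answer(large(query), "frontier")


# Hypothetical stubs: the small model handles short requests and gives up otherwise.
def small_stub(query: str) -> Optional[str]:
    return "label: billing" if len(query.split()) <= 8 else None


def large_stub(query: str) -> str:
    return "detailed multi-step analysis"


easy = cascade("refund for invoice 12", small_stub, large_stub)
hard = cascade("compare the indemnity clauses across these three supplier "
               "contracts and flag conflicts", small_stub, large_stub)
```

Logging `Answer.tier` per request gives the escalation rate, which is the number to watch when judging whether the router is mis-routing hard queries down-tier.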

Model Strategy

Model-Strategy Scorecard

A comparative read of the main strategy options against what enterprises care about (★ relative, 1–5). Use it to shortlist; validate against your own data and evals.

| Strategy | Output quality | Speed to ship | Run cost @ scale | Data control | Customizability | Ops burden |
|---|---|---|---|---|---|---|
| Frontier API + prompting | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Frontier API + RAG | ★★★★★ | ★★★★ | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★★ |
| Fine-tuned mid model | ★★★★ | ★★★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★★ |
| Self-hosted open model | ★★★★ | ★★★★★ | ★★★★ | ★★★★★ | ★★★★★ | ★★★★ |
| Small/distilled + routing | ★★★★★ | ★★★★★ | ★★★★★ | ★★★★ | ★★★★ | ★★★★★ |
How to read it: No row wins every column; that's the trade-off. Frontier-API rows maximize quality & speed-to-ship but pay in run-cost & data control; self-hosted rows invert it. Most mature enterprises end up hybrid: frontier API for hard reasoning, RAG for knowledge, and self-hosted small models routing the high-volume tail.

System Architecture

RAG Architecture & Its Trade-Offs

RAG is the workhorse of enterprise GenAI — but "RAG" hides a dozen tuning decisions, each a trade-off. The quality of your retrieval is the hard ceiling on the quality of your answers.

[Figure: RAG pipeline, indexing (offline) & retrieval (online). Index (offline): Documents → Chunk + clean → Embed → Vector DB / index. Query (online): User query → Embed + search → Re-rank top-k → Augment prompt → LLM → Answer + citations. The index serves retrieval.]
Each stage is a tuning knob. The most common failure isn't the LLM — it's retrieval returning the wrong chunks. Measure retrieval quality (recall/precision) separately from generation quality.
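Measuring retrieval on its own can be as simple as precision@k and recall@k over a labeled set of queries. A small sketch, assuming each query has a known set of relevant document IDs (the IDs below are made up for illustration):

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical labeled example: ranked retriever output vs ground truth.
retrieved_ids = ["d3", "d7", "d1", "d9"]
relevant_ids = {"d1", "d2", "d3"}

p3 = precision_at_k(retrieved_ids, relevant_ids, k=3)
r3 = recall_at_k(retrieved_ids, relevant_ids, k=3)
```

If recall@k is low, no prompt change will fix the answers, because the right chunk never reaches the model.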
| RAG decision | Trade-off |
|---|---|
| Chunk size | Small chunks → precise retrieval but lost context. Large chunks → more context but noisy, costly. Tune per corpus; consider overlap. |
| Retrieval method | Pure vector (semantic) misses exact keywords; pure keyword (BM25) misses meaning. Hybrid search + re-ranking usually wins, at more complexity. |
| top-k (how many chunks) | More k → higher recall but more tokens (cost/latency) & "lost in the middle." Fewer k → cheaper but may miss the answer. |
| Re-ranking | A re-ranker boosts precision markedly but adds latency & a second model to run. |
| Freshness / indexing | Real-time indexing keeps answers current but costs pipeline complexity; batch reindex is simpler but stale. |
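The chunk-size and overlap knobs can be illustrated with a word-based splitter. This is a deliberately simple sketch: production pipelines usually split on tokens and respect sentence or section boundaries, which this version does not.

```python
from typing import List


def chunk_words(text: str, size: int, overlap: int) -> List[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
```

Raising `overlap` reduces the chance that an answer is cut in half at a chunk boundary, at the price of more chunks to embed, store, and search.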

Vector store selection

| Option | Lean toward it when… | Trade-off |
|---|---|---|
| pgvector / existing DB extension | You already run Postgres; modest scale; want one system | Simplest ops; may not scale to billions of vectors or peak ANN throughput |
| Dedicated vector DB (managed or OSS) | Large scale, advanced filtering, high QPS | Best performance & features; another system to run/pay for |
| Search engine w/ vectors (Elastic/OpenSearch) | You need strong hybrid (keyword+vector) & already use it | Great hybrid; heavier to operate, tuning required |

System Architecture

Agents & Orchestration — Autonomy vs Control

"Agentic" is the dominant 2025–2026 pattern, and the most over-applied. Every step of autonomy you grant buys capability and spends predictability, cost, and debuggability. Use the least autonomy that solves the problem.

[Figure: The autonomy spectrum, from predictable · cheap · easy to test (left) to capable · costly · hard to control (right): Single call (one prompt) → Chain / workflow (fixed steps) → Router + tools (model picks tools) → Single agent loop (plan → act → observe) → Multi-agent (collaborating agents). Each step right adds power and removes predictability. Most "agent" use cases are best served by a deterministic workflow with LLM steps.]
Prefer workflows (you define the steps, the LLM fills them) over autonomous agents (the LLM decides the steps) unless the task genuinely needs open-ended planning. Autonomy compounds error, cost (many LLM calls), and the difficulty of guaranteeing behavior.
Workflow vs Autonomous Agent: control ⇄ flexibility
Deterministic workflow
  • Predictable, testable, cheaper
  • You own the control flow
  • Can't handle truly novel paths
VS
Autonomous agent
  • Handles open-ended, multi-step tasks
  • Adapts its own plan
  • Costly, slow, hard to bound & debug
Rule: Default to workflows. Add autonomy only where the path can't be enumerated.
Single vs Multi-Agent: simplicity ⇄ specialization
Single agent
  • One context, simpler to reason about
  • Lower coordination cost
  • Context bloat on big tasks
VS
Multi-agent
  • Specialized roles, parallelism, separation
  • Scales to complex problems
  • Coordination overhead, error propagation, cost explosion
Rule: Start single-agent. Split only when roles/contexts clearly separate.
Govern the tools, not just the model: An agent is only as safe as the tools it can call. Every tool/action is an attack surface and a blast radius. Apply least-privilege, require approval for high-impact actions (writes, payments, emails), sandbox execution, and log every tool call. Standardized tool interfaces (e.g. the Model Context Protocol) help, but governance is yours. An autonomous agent with write access and no guardrails is an incident waiting to happen.
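One way to sketch least-privilege, approval, and logging in a single choke point. `Tool`, `ToolGate`, and the two example tools are hypothetical names for illustration; sandboxing is out of scope here, and the `approver` callback stands in for whatever human-approval flow you use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    high_impact: bool = False  # writes, payments, emails


@dataclass
class ToolGate:
    allowed: Set[str]                      # least-privilege allowlist for this agent
    approver: Callable[[str, dict], bool]  # human (or policy) approval hook
    audit_log: List[dict] = field(default_factory=list)

    def call(self, tool: Tool, args: dict) -> str:
        if tool.name not in self.allowed:
            outcome = "denied"
        elif tool.high_impact and not self.approver(tool.name, args):
            outcome = "rejected"
        else:
            outcome = "executed"
        # Every attempt is logged, including the ones that were blocked.
        self.audit_log.append({"tool": tool.name, "args": args, "outcome": outcome})
        if outcome != "executed":
            raise PermissionError(f"{tool.name}: {outcome}")
        return tool.run(args)


search = Tool("search_docs", lambda args: "3 hits")
email = Tool("send_email", lambda args: "sent", high_impact=True)
delete = Tool("delete_records", lambda args: "deleted", high_impact=True)

# This agent may search and email, and the approver here always says no.
tool_gate = ToolGate(allowed={"search_docs", "send_email"},
                     approver=lambda name, args: False)
search_result = tool_gate.call(search, {"q": "refund policy"})
```

The agent loop only ever sees `tool_gate.call`, so adding a tool to an agent is an explicit allowlist change that can be reviewed like any other permission grant.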

System Architecture

Inference Economics — Latency, Throughput & Cost

In production, the model is the easy part; serving it economically is the hard part. These levers trade against each other, and getting them wrong is what turns a successful pilot into an unaffordable rollout.

Speed

Streaming & TTFT

Stream tokens so users see output immediately; time-to-first-token matters more to perceived speed than total time. Trades a little complexity for big UX gains.

Cost

Caching

Prompt caching (reuse repeated context) and semantic caching (serve answers to similar past queries) can cut cost & latency dramatically. Trade: cache staleness & invalidation.

Efficiency

Quantization & distillation

Smaller/quantized models cut GPU cost & latency with modest quality loss. Distill a big model into a small one for hot paths. Trade: a quality dip you must eval.

Throughput

Batching

Continuous/dynamic batching raises GPU utilization and tokens/sec. Trade: individual request latency vs aggregate throughput — tune for your SLA.

Routing

Model cascades

Send easy traffic to cheap models, escalate the hard. Big cost savings; trade is routing logic + the risk of mis-routing a hard query down-tier.

Control

Token budgets

Cap output length, trim context, compress history. Every token is money & latency. Trade: terseness vs completeness.

Cost governance is mandatory: Per-token pricing means a runaway loop, a verbose prompt, or a viral feature can produce a shocking bill overnight. Build in per-user/-tenant rate & spend limits, budget alerts, token accounting per feature, and a kill switch from day one. Treat cost as a first-class SLO alongside latency and quality.
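A minimal in-memory sketch of per-tenant spend limits with a global kill switch. `SpendGuard` and its limits are hypothetical; a real deployment would persist the counters, reset them per billing window, and raise budget alerts before the hard stop.

```python
from collections import defaultdict
from typing import DefaultDict


class SpendGuard:
    """Tracks spend per tenant and trips a global kill switch at a hard ceiling."""

    def __init__(self, tenant_limit_usd: float, global_limit_usd: float) -> None:
        self.tenant_limit = tenant_limit_usd
        self.global_limit = global_limit_usd
        self.spent: DefaultDict[str, float] = defaultdict(float)
        self.killed = False

    def authorize(self, tenant: str, estimated_usd: float) -> bool:
        """Check before calling the model; False means refuse the request."""
        if self.killed:
            return False
        return self.spent[tenant] + estimated_usd <= self.tenant_limit

    def record(self, tenant: str, actual_usd: float) -> None:
        """Account for the real cost after the call returns."""
        self.spent[tenant] += actual_usd
        if sum(self.spent.values()) >= self.global_limit:
            self.killed = True  # stays off until an operator resets it


guard = SpendGuard(tenant_limit_usd=1.0, global_limit_usd=1.5)
guard.record("tenant_a", 0.75)
a_over_budget = not guard.authorize("tenant_a", 0.5)   # 1.25 would exceed 1.0
b_allowed = guard.authorize("tenant_b", 0.75)
guard.record("tenant_b", 0.75)                          # total reaches 1.5
c_blocked_by_kill_switch = not guard.authorize("tenant_c", 0.01)
```

Checking before the call and recording after it matters: the estimate protects against an obviously oversized request, while the recorded actual cost is what trips the limits.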

Govern & Operate

Data, Privacy & Security

In the enterprise, this is where AI projects most often die — not on capability, but on whether they can be deployed safely and compliantly. Security is an architectural concern, not a checkbox at the end.

Data & privacy

Where does the data go?

  • Confirm provider terms: no training on your data, retention windows, region/residency
  • Redact/avoid sending PII & secrets; tokenize where possible
  • Enforce per-user access control on RAG sources — never let a model surface docs a user can't see
  • For the strictest data, self-host or use private/VPC deployments
AI-specific security

New attack surface

  • Prompt injection — malicious instructions hidden in inputs/docs hijack the model
  • Data exfiltration — model coaxed into revealing context/system prompt/other users' data
  • Insecure tool use — agent tricked into harmful actions
  • Mitigate: input/output filtering, least-privilege tools, sandboxing, treat all model output as untrusted
Compliance & risk tiering: Regulations (e.g. the EU AI Act) tier obligations by risk. Classify each AI use case by impact: high-stakes decisions (credit, hiring, medical, legal) demand human oversight, documentation, explainability, and bias testing; low-stakes assistive features need far less. Match governance weight to risk: over-governing a copy-drafting tool wastes effort; under-governing a decisioning system invites disaster.

Govern & Operate

Evaluation & Quality — How You Know It Works

I can't improve what I can't measure, and I can't govern a probabilistic system without systematic evaluation. Evals are to AI systems what tests are to traditional code — but graded, not binary.

| Eval type | What it does | Trade-off |
|---|---|---|
| Offline eval set | Run candidate prompts/models against a curated, labeled dataset | Fast, repeatable, regression-safe; only as good as the dataset's coverage |
| LLM-as-judge | Use a strong model to grade outputs at scale (relevance, faithfulness) | Scalable & cheap vs humans; the judge has biases & can be gamed, so calibrate against humans |
| Human evaluation | Expert/SME review of outputs | Gold standard for nuance; slow, costly, doesn't scale; reserve for calibration & high stakes |
| Online / production | A-B tests, user feedback (thumbs), implicit signals | Real-world truth; noisy, lagging, needs telemetry & guardrails for safety |
| RAG-specific | Measure retrieval (recall/precision) & faithfulness/groundedness separately | Pinpoints whether failures are retrieval or generation; more harness to build |
Build the eval harness first: Before optimizing prompts or swapping models, build a representative eval set and an automated scoring pipeline. It turns "this feels better" into a number, makes model upgrades safe (regression-test before adopting), and is the only defensible way to claim quality to the business. Evals are the flywheel of every serious AI system.
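The smallest useful harness is a dataset, a scorer, and a gate. The sketch below grades by substring match, which is a crude stand-in for the graders described in the table (LLM-as-judge, human review); the cases and the two stub systems are made up for illustration.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (prompt, substring the answer must contain)


def score(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases
                 if expected.lower() in system(prompt).lower())
    return passed / len(cases)


def passes_gate(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Regression gate: the candidate may not score below baseline minus tolerance."""
    return candidate >= baseline - tolerance


CASES: List[EvalCase] = [
    ("capital of France?", "Paris"),
    ("2+2?", "4"),
    ("color of a clear sky?", "blue"),
    ("opposite of hot?", "cold"),
]

# Stub systems stand in for "current prompt/model" and "proposed prompt/model".
baseline_answers = {"capital of France?": "Paris.", "2+2?": "4",
                    "color of a clear sky?": "Blue", "opposite of hot?": "warm"}
candidate_answers = {"capital of France?": "Paris", "2+2?": "4",
                     "color of a clear sky?": "grey", "opposite of hot?": "warm"}

baseline_score = score(lambda p: baseline_answers.get(p, ""), CASES)
candidate_score = score(lambda p: candidate_answers.get(p, ""), CASES)
ship_it = passes_gate(candidate_score, baseline_score)
```

Because outputs vary between calls, a real harness would usually run each case several times and gate on the pass rate rather than on a single sample.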

Guardrails — the runtime safety layer

Wrap the model with input and output guardrails: validate/clean inputs (block injection, profanity, PII), and validate outputs (schema/format checks, factuality/grounding checks, toxicity & policy filters, citation verification). Trade-off: each guardrail adds latency and can produce false positives that block legitimate responses — tune the strictness to the stakes.
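An output guardrail for the schema and citation checks might look like the sketch below. The expected shape (`answer` plus `citations`) and the source IDs are assumptions for illustration; toxicity, policy, and factuality checks would be separate stages.

```python
import json
from typing import List, Set, Tuple


def check_output(raw: str, known_sources: Set[str]) -> Tuple[bool, List[str]]:
    """Validate a model response: JSON shape first, then citation verification."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["output is not valid JSON"]
    if not isinstance(data, dict):
        return False, ["top-level JSON value is not an object"]

    problems: List[str] = []
    if not isinstance(data.get("answer"), str) or not data["answer"].strip():
        problems.append("missing or empty 'answer'")

    citations = data.get("citations")
    if not isinstance(citations, list) or not citations:
        problems.append("missing or empty 'citations'")
    else:
        for cite in citations:
            # A citation only counts if it points at a document we actually retrieved.
            if not isinstance(cite, str) or cite not in known_sources:
                problems.append(f"unverifiable citation: {cite!r}")
    return (not problems), problems


SOURCES = {"policy.pdf#p2", "faq.md#returns"}
good = check_output('{"answer": "30 days", "citations": ["policy.pdf#p2"]}', SOURCES)
bad_cite = check_output('{"answer": "30 days", "citations": ["blog.example"]}', SOURCES)
not_json = check_output("Sure! Returns are accepted within 30 days.", SOURCES)
```

Returning the list of problems, not just a boolean, makes the false-positive rate measurable: blocked responses can be sampled and reviewed to tune strictness.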

Govern & Operate

LLMOps in Production

A working prototype is ~10% of the job. Operating an AI system reliably over time — through model updates, data drift, and changing usage — is the other 90%, and where rigorous ops discipline pays off.

[Figure: The LLMOps loop (ship, observe, evaluate, improve): Build & prompt/RAG → Evaluate & gate → Deploy (versioned) → Observe & collect. Production data & feedback feed the next iteration of evals & prompts.]
Treat prompts, models, and retrieval configs as versioned, evaluated artifacts. Every change — including a provider's silent model update — must pass the eval gate before reaching users.
  • Observability: trace every request (prompt, retrieved context, tool calls, tokens, cost, latency) — you can't debug what you can't see
  • Versioning: pin model versions; version prompts & RAG configs; enable instant rollback
  • Drift & quality monitoring: watch output quality, refusal rates, latency, and cost over time; alert on regressions
  • Feedback loop: capture user signals (thumbs, edits, escalations) and feed them back into evals & fine-tuning data
  • Human-in-the-loop: route low-confidence or high-stakes outputs to human review; design the handoff
  • Graceful degradation: fallbacks for provider outages, rate limits, timeouts (cheaper model, cached answer, "try later")
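The graceful-degradation item can be sketched as an ordered fallback chain: try each provider, then a cached answer, then a polite "try later". The provider stubs are hypothetical, and the broad `except Exception` is a simplification; real code should catch the specific timeout, rate-limit, and outage errors of its client library.

```python
from typing import Callable, Dict, Sequence

UNAVAILABLE = "The assistant is temporarily unavailable. Please try again later."


def answer_with_fallback(prompt: str,
                         providers: Sequence[Callable[[str], str]],
                         cache: Dict[str, str]) -> str:
    """Try providers in order of preference, then the cache, then a static message."""
    for call in providers:
        try:
            result = call(prompt)
        except Exception:
            continue  # this provider failed; fall through to the next one
        cache[prompt] = result  # remember the last good answer for future outages
        return result
    return cache.get(prompt, UNAVAILABLE)


# Hypothetical stubs: the primary always times out, the cheaper model works.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def cheap_backup(prompt: str) -> str:
    return f"cheap: {prompt}"


answer_cache: Dict[str, str] = {}
first = answer_with_fallback("hi", [flaky_primary, cheap_backup], answer_cache)
second = answer_with_fallback("hi", [flaky_primary], answer_cache)             # served from cache
third = answer_with_fallback("new question", [flaky_primary], answer_cache)   # nothing left
```

Tracing which rung of the chain served each request keeps degraded answers visible in monitoring, so an outage does not quietly turn into a quality regression.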

Decision Toolkit

Master Trade-Off Matrix

My quick reference for enterprise AI decisions. For each choice: when to lean each way, and the price of the lean.

| Decision | Lean A when… | Lean B when… | The cost you pay |
|---|---|---|---|
| GenAI vs classic ML/rules | Open-ended language/reasoning task | Structured, high-volume, must be auditable | Cost & nondeterminism, or limited flexibility |
| Prompt vs RAG vs fine-tune | General task / need fresh facts (prompt/RAG) | Need consistent behavior or lower cost (fine-tune) | Pipeline complexity, or stale knowledge + re-tune cost |
| RAG vs long context | Large/dynamic corpus, cost-sensitive | Small static docs, want simplicity | Retrieval complexity, or token cost & "lost in middle" |
| Closed API vs self-hosted | Speed to ship, best quality, low/spiky volume | Strict privacy, high steady volume, no lock-in | Data exposure & lock-in, or MLOps burden |
| Frontier vs small model | Hard reasoning, quality is the product | Routine task, cost & latency matter | High cost/latency, or lower ceiling, so route |
| Workflow vs autonomous agent | Path is known & enumerable | Open-ended, multi-step, novel tasks | Rigidity, or unpredictability + cost |
| Single vs multi-agent | One coherent context | Clearly separable specialized roles | Context bloat, or coordination overhead & cost |
| Latency vs quality | Interactive UX, real-time | Accuracy is critical, async is fine | Lower quality, or slower + costlier responses |
| Automation vs human-in-loop | Low stakes, high volume, tolerant of error | High stakes, regulated, costly mistakes | Risk of silent error, or throughput & cost |
| Build vs buy (AI features) | Core differentiator on your data | Commodity capability (chat, search, summarize) | Maintenance & expertise, or lock-in & fit |

Decision Toolkit

Decision Framework I Use

How I take an enterprise AI initiative from idea to a defensible, production-ready design without chasing hype or burning budget.

  1. Define the job & the bar. What business outcome, and what does "good enough" mean numerically (accuracy, latency, cost-per-request, acceptable error rate)? Without a target, you'll over-build forever.
  2. Question whether you need GenAI at all. Could rules, search, or classic ML do this cheaper, faster, and auditably? I reach for the probabilistic hammer only where the problem is genuinely a nail for it.
  3. Climb the adaptation ladder from the bottom. Start with prompting; add RAG for knowledge; fine-tune only if behavior/cost demands it. Stop at the lowest rung that hits the bar.
  4. Pick a model strategy behind an abstraction layer. Choose closed vs open, frontier vs small, by your constraint corner (quality/latency/cost/governance) — but keep models swappable.
  5. Build the eval harness before optimizing. A representative dataset + automated scoring is the prerequisite to every later decision and to safe model upgrades.
  6. Design the trust & safety layer. Grounding, guardrails, access control, human-in-the-loop, cost limits, and observability — proportional to the use case's risk tier.
  7. Pilot, measure, then harden for production. Validate value on real users with real evals; then invest in LLMOps (versioning, monitoring, rollback, drift) for the long haul.
  8. Write the ADR & the revisit trigger. Record the decision, the trade-offs accepted, and the signal to reconsider — because in this field, a better/cheaper option will appear within months.
The question I keep asking: I replace "which model is best?" with: "What's the cheapest, simplest, most governable approach that clears the quality bar for this specific job, and how will I know when it stops being the right one?" The answer is always "it depends," and it will change. Design for change.

Decision Toolkit

Enterprise AI Anti-Patterns

The recurring, expensive mistakes — most stem from chasing capability without governing it, or reaching for complexity the problem didn't require.

Solution looking for a problem

AI for AI's sake

Bolting an LLM onto a problem better solved by rules, search, or a form. Complexity & cost with no real benefit.

Fine-tuning to teach facts

Wrong tool

Burning money to bake knowledge into weights that immediately goes stale — when RAG would be cheaper, fresher, and citable.

Frontier model for everything

No right-sizing

Routing trivial classification through the most expensive model. Runaway cost & latency; no model cascade.

Demo-driven development

No eval

"Looks great in the demo" with no eval set, so quality silently regresses and no one can prove it works. The pilot that never reaches production.

Premature agents

Autonomy without need

Multi-agent frameworks for tasks a simple workflow handles. Unpredictable, costly, undebuggable.

Vendor lock-in by default

No abstraction layer

Hard-wiring one provider's SDK throughout the app, so switching or comparing models means a rewrite.

Governance as an afterthought

Security last

Ignoring prompt injection, data leakage, access control, and compliance until launch — when they become blockers or breaches.

Ignoring cost until the bill

No cost SLO

No token accounting, limits, or alerts — until a loop or viral feature produces a five-figure surprise.

Trusting output blindly

No grounding/HITL

Treating confident hallucinations as truth on high-stakes decisions, with no citations, checks, or human review.

Decision Toolkit

Pre-Flight Checklist

I run this before committing to or shipping an AI system. An unchecked item is an undecided trade-off that will surface later, usually in production.

Problem & strategy

  • Business outcome & numeric "good enough" bar defined
  • Confirmed GenAI beats rules/classic ML for this job
  • Lowest sufficient rung on the adaptation ladder chosen
  • Model strategy fits the constraint corner (quality/latency/cost/gov)
  • Model-abstraction layer in place; models are swappable

Quality & evaluation

  • Representative eval set + automated scoring exists
  • Retrieval quality measured separately from generation (if RAG)
  • Regression-eval gate before adopting any new model version
  • Hallucination mitigated: grounding, citations, output schemas
  • Online feedback signals captured and fed back

Governance & security

  • Data flow & provider terms reviewed (no-train, retention, residency)
  • PII handling & per-user access control on knowledge sources
  • Prompt-injection & output filtering guardrails in place
  • Use case risk-tiered; human-in-loop where stakes are high
  • Tool/agent actions least-privilege, approved & logged

Operations & cost

  • Per-request tracing: tokens, cost, latency, context, tool calls
  • Cost SLO: rate/spend limits, budget alerts, kill switch
  • Right-sizing/routing & caching to control cost & latency
  • Versioning & instant rollback for models, prompts, configs
  • Drift/quality monitoring + graceful degradation on provider failure
The one sentence I leave with: A successful enterprise AI system is not the one using the most powerful model; it is the simplest, most governable design that reliably clears the quality bar at an acceptable cost, with every probabilistic risk grounded, measured, and bounded, and built to swap its parts as the field moves beneath it.

Sources

References & Sources

Annotated bibliography behind this enterprise AI architecture note — model strategy, RAG, agents, inference economics, governance, evaluation, and LLMOps. Section tags (e.g. §04) show where each source informed the prose, tables, and diagrams. Constraint diamonds, scorecards, pipeline SVGs, and synthesis tables are my own unless noted.

Scope. Synthesis of peer-reviewed papers, industry frameworks, vendor documentation, and regulatory texts (May 2026). Model names, pricing, and API capabilities change monthly — verify against current vendor docs before committing to a one-way door. Scorecard star ratings are relative tendencies for shortlisting, not benchmarks. Not legal, compliance, or procurement advice.

Citations are numbered continuously [1]–[n] within this section.

Enterprise AI architect mandate & replaceability (§01, §16)

  1. Richards, M., & Ford, N., Fundamentals of Software Architecture. O'Reilly, 2020. Architectural trade-offs and driving characteristics — §01 mandate framing and §16 decision framework. — §01, §16.
  2. Sowell, T., Applied Economics (and related work). "There are no solutions, only trade-offs" — adapted throughout this note; §01 lead-rule theme. — §01, §15.
  3. Sculley, D., et al., "Hidden Technical Debt in Machine Learning Systems." NeurIPS, 2015. ML/AI systems require ops discipline beyond the model — §01 trust architecture and §14 LLMOps. NeurIPS proceedings — §01, §14, §17.
  4. Portkey / LiteLLM / OpenRouter documentation (industry pattern). Model-abstraction and gateway layers for multi-provider routing — §01 abstraction-layer callout. litellm.ai — §01, §06, §17.
  5. Nygard, M., "Documenting Architecture Decisions." Cognitect, 2011. ADR format — §16 step 8 and §18 checklist. cognitect.com — §16, §18.

Deterministic → probabilistic systems (§02)

  1. Bender, E. M., et al., "On the Dangers of Stochastic Parrots." FAccT, 2021. Non-deterministic, opaque LLM behavior — §02 shift table. DOI: 10.1145/3442188.3445922 — §02.
  2. Ji, Z., et al., "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 2023. Hallucination as inherent risk to manage — §02 defining-risk callout. DOI: 10.1145/3571730 — §02, §13.
  3. OpenAI Platform documentation — temperature, sampling, model versioning. Stochastic outputs and version pinning — §02 output/correctness rows and §14 versioning. platform.openai.com — §02, §14.
  4. Anthropic documentation — model behavior, streaming, safety. Latency and streaming UX — §02 latency row and §11 TTFT card. docs.anthropic.com — §02, §11.
  5. Google, "Responsible AI practices." Grounding, citations, human review on high-stakes outputs — §02 failure-mode row. ai.google — §02, §12, §13.

Quality · latency · cost · governance constraints (§03, §15)

  1. Brewer, E., "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 2012. Distributed trade-off analogy for §03 constraint diamond (adapted to AI axes). — §03.
  2. European Parliament & Council, Regulation (EU) 2024/1689 (AI Act). Risk-tiered governance obligations — §03 governance corner and §12 compliance callout. EUR-Lex — §03, §12, §18.
  3. NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023. Risk-based AI governance — §12 risk tiering. nist.gov — §12, §16.
  4. ISO/IEC 42001:2023 — Artificial intelligence management system. Enterprise AI governance baseline — §12 compliance context. — §12, §18.
  5. Truong, L. (synthesis). Quality–latency–cost–governance constraint diamond — §03 figure and §15 matrix rows. LinhTruong.com — §03, §15.

Adaptation ladder: prompt · RAG · fine-tune · pre-train (§04, §05, §08)

  1. OpenAI, "Prompt engineering guide." Prompting as first rung — §04 ladder step 1. platform.openai.com — §04, §05.
  2. Anthropic, "Prompt engineering overview." System prompts, few-shot, chain-of-thought patterns — §04 ladder. docs.anthropic.com — §04.
  3. Lewis, P., et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. RAG as rung 2 — §04 ladder and §09 architecture. arxiv.org/abs/2005.11401 — §04, §05, §09, §13.
  4. Hu, E. J., et al., "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. Parameter-efficient fine-tuning — §04 rung 3 (LoRA/PEFT). arxiv.org/abs/2106.09685 — §04, §05.
  5. Rafailov, R., et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023. DPO alignment — §04 fine-tuning row. arxiv.org/abs/2305.18290 — §04.
  6. Hugging Face, PEFT & TRL documentation. SFT/DPO tooling — §04 adaptation table. huggingface.co/docs/peft — §04, §06.
  7. Meta AI, Llama model cards & continued-pretraining guidance. Domain adaptation rungs 4–5 — §04 ladder top steps. ai.meta.com/llama — §04, §06.

RAG vs fine-tuning vs long context (§05, §09, §15)

  1. OpenAI fine-tuning documentation. Fine-tune for behavior/format, not facts — §05 rule-of-thumb callout. platform.openai.com — §05, §17.
  2. Liu, N. F., et al., "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. Long-context degradation — §05 long-context table "lost in the middle" row. arxiv.org/abs/2307.03172 — §05, §09, §15.
  3. Google Gemini / Anthropic Claude documentation — context windows. When stuffing docs vs retrieving — §05 long-context trade-off. — §05, §15.
  4. Gao, Y., et al., "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv, 2023. RAG vs fine-tuning decision space — §05 tradeoff block. arxiv.org/abs/2312.10997 — §05, §09.
  5. LangChain & LlamaIndex documentation — RAG pipelines. Chunking, retrieval, augmentation patterns — §05 decision callout and §09 pipeline. python.langchain.com — §05, §09.

Closed API vs open-weight self-hosting (§06, §08, §15)

  1. OpenAI API & Azure OpenAI Service documentation — enterprise data handling, no-train terms. Managed API governance — §06 closed-model column and §12 data-flow bullets. learn.microsoft.com — §06, §12.
  2. Anthropic API — data retention & enterprise privacy. Provider terms checklist — §12 no-train/retention bullets. anthropic.com/privacy — §06, §12.
  3. Touvron, H., et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models." Meta, 2023. Open-weight self-host baseline — §06 self-hosted column. arxiv.org/abs/2307.09288 — §06, §08.
  4. Mistral AI model documentation. Open-weight enterprise deployments — §06 hybrid pattern. docs.mistral.ai — §06, §08.
  5. NVIDIA NIM / vLLM / TGI (Text Generation Inference) documentation. Self-hosted inference stacks — §06 MLOps burden and §11 serving. docs.vllm.ai — §06, §11, §14.
  6. McKinsey / a16z industry analyses (synthesized). API vs GPU cost crossover at utilization — §06 cost-crossover callout (order-of-magnitude; re-model locally). — §06, §11.

Model sizing, routing & distillation (§07, §11, §15)

  1. Chen, L., et al., "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv, 2023. Model cascades and routing — §07 cascade callout and §11 routing card. arxiv.org/abs/2305.05176 — §07, §11, §15.
  2. Hinton, G., Vinyals, O., & Dean, J., "Distilling the Knowledge in a Neural Network." NeurIPS Deep Learning Workshop, 2014. Distillation concept — §07 distillation bullet and §11 efficiency card. arxiv.org/abs/1503.02531 — §07, §11.
  3. Microsoft Research, Phi model series technical reports. Small language models for routine tasks — §07 frontier vs SLM figure. microsoft.com/research — §07, §08.
  4. Google DeepMind, Gemma model documentation. Mid/small open models — §08 scorecard rows. ai.google.dev/gemma — §07, §08.
  5. Speculative decoding literature (Leviathan et al.; Chen et al.). Latency reduction techniques — background for §11 speed levers. arxiv.org/abs/2211.17192 — §11.

RAG architecture, retrieval & vector stores (§09, §13)

  1. Robertson, S., & Zaragoza, H., "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. BM25 keyword retrieval — §09 hybrid-search row. — §09.
  2. Karpukhin, V., et al., "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP, 2020. Dense vector retrieval — §09 retrieval-method row. arxiv.org/abs/2004.04906 — §09.
  3. Reimers, N., & Gurevych, I., "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019. Embedding models for RAG — §09 embed stage. arxiv.org/abs/1908.10084 — §09.
  4. Cohere Rerank API documentation; cross-encoder re-ranking practice. Re-ranking trade-off — §09 re-ranking row. docs.cohere.com — §09.
  5. pgvector extension documentation (PostgreSQL). pgvector option — §09 vector-store table. github.com/pgvector/pgvector — §09.
  6. Pinecone, Weaviate, Qdrant, Milvus documentation. Dedicated vector DB trade-offs — §09 vector-store table. — §09.
  7. Elastic / OpenSearch kNN & hybrid search documentation. Search-engine + vector option — §09 vector-store table. elastic.co — §09.
  8. Es, S., et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL, 2024. Separate retrieval vs generation metrics — §09 figcaption and §13 RAG-specific eval. arxiv.org/abs/2309.15217 — §09, §13.

Agents, tools & orchestration (§10, §15, §17)

  1. Yao, S., et al., "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023. Plan→act→observe agent loop — §10 spectrum diagram. arxiv.org/abs/2210.03629 — §10.
  2. Schick, T., et al., "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023. Tool-use foundation — §10 router+tools step. arxiv.org/abs/2302.04761 — §10.
  3. OpenAI function calling / tools documentation. Structured tool interfaces — §10 workflow vs agent trade-offs. platform.openai.com — §10.
  4. Anthropic, Model Context Protocol (MCP) specification. Standardized tool/context interfaces — §10 govern-the-tools callout. modelcontextprotocol.io — §10, §12.
  5. LangGraph / Temporal / AWS Step Functions documentation (pattern). Deterministic workflows with LLM steps — §10 prefer-workflows figcaption. langchain-ai.github.io/langgraph — §10, §15.
  6. Wu, Q., et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv, 2023. Multi-agent coordination costs — §10 single vs multi-agent trade-off. arxiv.org/abs/2308.08155 — §10, §15, §17.
  7. Shinn, N., et al., "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS, 2023. Autonomous loop risks — §10 autonomy cost theme. arxiv.org/abs/2303.11366 — §10, §17.

Inference economics & serving (§11, §14, §18)

  1. Kwon, W., et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention." (vLLM). SOSP, 2023. Continuous batching & throughput — §11 batching card. arxiv.org/abs/2309.06180 — §06, §11.
  2. OpenAI prompt caching documentation; Anthropic prompt caching. Prompt caching cost/latency — §11 caching card. platform.openai.com — §11.
  3. FrugalGPT (Chen et al.) — semantic caching & cascade economics. Cost reduction patterns — §11 caching and routing cards. — §07, §11.
  4. Frantar, E., et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. Quantization trade-offs — §11 quantization card. arxiv.org/abs/2210.17323 — §11.
  5. Lin, J., et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024. Quantized serving — §11 efficiency card. arxiv.org/abs/2306.00978 — §11.
  6. Google SRE Team, Site Reliability Engineering — SLOs & error budgets. Cost as first-class SLO — §11 cost-governance callout. sre.google — §11, §18.

Data governance, privacy & AI security (§12, §17, §18)

  1. OWASP Foundation, Top 10 for Large Language Model Applications (2023–2025). Prompt injection, insecure output, excessive agency — §12 AI-specific security and §17 governance anti-pattern. owasp.org — §12, §17, §18.
  2. Greshake, K., et al., "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv, 2023. Prompt injection attacks — §12 prompt-injection bullet. arxiv.org/abs/2302.12173 — §12, §17.
  3. EU GDPR (Regulation 2016/679) & UK GDPR. PII handling, residency — §12 redact/avoid PII bullets. gdpr-info.eu — §12.
  4. Microsoft Azure OpenAI / AWS Bedrock — VPC/private deployment patterns. Strictest-data self-host/private options — §12 privacy card. — §06, §12.
  5. NIST SP 800-53 / Zero Trust Architecture (SP 800-207). Least-privilege tool access — §10 & §12 least-privilege bullets. csrc.nist.gov — §10, §12.
  6. NeMo Guardrails / Guardrails AI / Llama Guard documentation. Input/output guardrails — §13 guardrails section. github.com/NVIDIA/NeMo-Guardrails — §13.

Evaluation, quality & LLM-as-judge (§13, §14, §18)

  1. Liang, P., et al., "Holistic Evaluation of Language Models (HELM)." Stanford CRFM, 2022+. Broad eval framing — §13 eval types table. crfm.stanford.edu/helm — §13.
  2. Zheng, L., et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS, 2023. LLM-as-judge trade-offs — §13 LLM-as-judge row. arxiv.org/abs/2306.05685 — §13.
  3. Es, S., et al., RAGAS framework documentation. Faithfulness, answer relevance, context precision — §13 RAG-specific eval. docs.ragas.io — §09, §13.
  4. LangSmith / Weights & Biases / Arize Phoenix documentation. Offline eval sets, tracing, regression gates — §13 build-eval-first callout and §14 observability. docs.smith.langchain.com — §13, §14.
  5. Breck, E., et al., "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." IEEE Big Data, 2017. ML testing discipline adapted to LLMOps — §13 eval-as-tests analogy. research.google — §13, §14.
  6. Human-in-the-loop AI literature (Holzinger et al.; industry HITL practice). High-stakes review — §13 guardrails stakes tuning and §18 checklist. — §13, §15, §18.

LLMOps & production operations (§14, §18)

  1. Sculley et al. (2015) — hidden technical debt; MLOps industry evolution. Prototype vs production ops split — §14 opening sub. — §14, §17.
  2. Google, "Rules of Machine Learning: Best Practices for ML Engineering." ML engineering rules adapted to LLMOps — §14 checklist themes. developers.google.com — §14.
  3. OpenTelemetry GenAI semantic conventions (draft). Trace prompts, tokens, tool calls — §14 observability bullet. opentelemetry.io — §14, §18.
  4. MLflow / Weights & Biases model & prompt registry documentation. Versioning models, prompts, RAG configs — §14 versioning bullet. mlflow.org — §14, §18.
  5. Evidently AI / WhyLabs drift monitoring documentation. Quality & drift monitoring — §14 drift bullet. evidentlyai.com — §14.
  6. Nygard, Release It! (2nd ed.). Graceful degradation, bulkheads — §14 graceful-degradation bullet. — §14, §18.

Decision framework, ADRs & anti-patterns (§15, §16, §17, §18)

  1. Thomson, J., "Architecture Decision Records." adr.github.io — §16 ADR step and §18 operability checklist. — §16, §18.
  2. Amazon leadership — "Disagree and commit"; Type 1/2 decisions (Bezos). Reversible vs one-way doors — §04 principle 3 and §16 framework. — §04, §16.
  3. Beck, K., Extreme Programming Explained — YAGNI. Simplest sufficient design — §16 step 6 and §18 complexity-earned line. — §16, §18.
  4. Chinchilla scaling laws (Hoffmann et al., 2022). Background for model-size economics — §07–§08 sizing context. arxiv.org/abs/2203.15556 — §07, §08.
  5. Truong, L., Enterprise AI Architect Trade-Offs — personal working notes. May 2026. Master trade-off matrix, model-strategy scorecard, constraint diamond, RAG pipeline & agent spectrum diagrams, pre-flight checklist, synthesis prose. LinhTruong.com — all sections.

Before you cite externally. Model capabilities, API pricing, context-window limits, and regulatory deadlines change frequently — verify against current vendor docs and legal counsel. LLM-as-judge scores require calibration against human labels. Self-hosted GPU economics depend on utilization and staffing costs unique to your org. EU AI Act obligations phase in over time; classify your use cases with qualified advisors. Scorecard ★ ratings are illustrative, not benchmark results.