I wrote this note to keep LLM spend economically rational — from token-level mechanics through routing, caching, RAG, agents, and FinOps guardrails. It is my working blueprint for teams shipping AI at scale, not a vendor white paper.
The question behind this note: does every token, GPU-second, and vector lookup earn its keep? What follows is how I map the bill, design the request path, and keep unit economics from drifting as usage grows.
Inference spend is now the fastest-growing line item in most engineering budgets I see — often eclipsing compute, storage, and observability combined. My job on these systems is to keep AI economically rational: every token, GPU-second, and vector lookup either generates measurable user value or gets cut.
10–100×: cost delta between frontier & small models
50–90%: token spend cuttable via caching & routing
3–5×: inflation from naive agentic loops
< $0.01: target unit cost per common user action
What I keep coming back to: cost control is an architectural property, not a procurement exercise. Design it into the request path: route the cheapest model that meets the SLA, cache aggressively at every layer, compress context ruthlessly, batch where latency allows, and instrument every call to a cost ledger.
What's in this note
The full taxonomy of where dollars go in an LLM system.
Engineering trade-offs: latency vs. cost vs. quality vs. risk.
FinOps tooling, observability schemas, and budget guardrails.
Org design, vendor negotiation, and a 5-stage maturity model.
02 Cost Drivers Anatomy
Before optimizing, map the bill. AI cost decomposes into seven primary drivers. Most teams over-index on (1) and ignore (3)–(7), which is exactly where the avoidable waste lives.
Figure 1 — Seven primary drivers. The hub is total cost of ownership; each spoke needs its own owner and SLO.
Driver deep-dive
| # | Driver | Typical share | Primary levers | Common waste |
|---|---|---|---|---|
| 1 | Token inference | 50–80% | Routing, caching, compression, smaller models | Frontier model for trivial calls |
| 2 | Embeddings & vector store | 3–10% | Smaller embed dims, quantization, TTL | Re-embedding unchanged corpora |
| 3 | Fine-tuning / training | 2–15% | LoRA, distillation, early stopping | Tuning before prompt iteration plateau |
| 4 | GPU serving infra | 10–40% (if self-hosted) | vLLM, paged-attention, autoscale-to-zero | Idle reserved fleets |
| 5 | Egress & networking | 1–8% | Co-locate model + data, edge cache | Cross-region RAG hops |
| 6 | Observability & eval | 5–15% | Sampled tracing, batch evals, cheap judge models | 100% trace + frontier judge LLM |
| 7 | Human-in-the-loop | variable | Active learning, confidence-gated review | Reviewing every output |
03 Unit Economics & Formulas
Every product decision should resolve to a cost-per-action (CPA) number. If you can't compute CPA on demand, you cannot govern AI cost.
Core formula
CPA = Σ over calls c in action a:
(input_tokens(c) × P_in(model_c)
+ output_tokens(c) × P_out(model_c)
+ cached_tokens(c) × P_cache(model_c))
+ retrieval_cost(a)
+ tool_cost(a)
+ infra_amortization(a)
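A minimal sketch of the formula above in Python. The price table, model names, and token counts are hypothetical placeholders, not real list prices, and the sketch assumes input_tokens counts only uncached input so that cached tokens are not billed twice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    """USD per token for one model (hypothetical values, not list prices)."""
    p_in: float
    p_out: float
    p_cache: float

@dataclass(frozen=True)
class Call:
    model: str
    input_tokens: int       # assumption: uncached input tokens only
    output_tokens: int
    cached_tokens: int = 0  # tokens served from the provider prompt cache

def call_cost(call: Call, prices: dict[str, Price]) -> float:
    """Token cost of a single call, term by term as in the formula."""
    p = prices[call.model]
    return (call.input_tokens * p.p_in
            + call.output_tokens * p.p_out
            + call.cached_tokens * p.p_cache)

def cost_per_action(calls, prices, retrieval_cost=0.0, tool_cost=0.0,
                    infra_amortization=0.0) -> float:
    """Sum over all calls in the action, plus the non-token terms."""
    return (sum(call_cost(c, prices) for c in calls)
            + retrieval_cost + tool_cost + infra_amortization)

# Hypothetical price table, expressed in $ per token.
PRICES = {
    "small": Price(p_in=0.5e-6, p_out=2e-6, p_cache=0.05e-6),
    "frontier": Price(p_in=10e-6, p_out=40e-6, p_cache=1e-6),
}

# One user action made of two calls: a routing call and the final answer.
action = [
    Call("small", input_tokens=300, output_tokens=20),
    Call("frontier", input_tokens=500, output_tokens=800, cached_tokens=3500),
]
cpa = cost_per_action(action, PRICES, retrieval_cost=0.0002)
print(f"CPA = ${cpa:.5f}")
```

Keeping the per-call function separate from the per-action sum makes it easy to attribute cost to individual calls in a ledger while still reporting one CPA number per action.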
Six ratios I won't ship without
1. Cost per successful action ($/action)
Numerator is total LLM + infra spend. Denominator is successful outcomes (not raw API calls), defined by product. Retries, abandoned agentic loops, and rejections all inflate cost without value.
2. Tokens per action (in/out)
Track input and output independently. Output tokens are 3–5× more expensive than input on most APIs. A rising output/input ratio often signals prompt drift or verbose chain-of-thought.
3. Cache hit rate (%)
Covers prompt cache, semantic cache, and retrieval cache. Below 30% usually means you have not invested. Above 80% may mean you are serving stale answers, so pair it with a freshness SLO.
4. Model mix ($-weighted)
Share of spend by tier: frontier / mid / small / self-hosted. Healthy SaaS-style products run 10–25% frontier; the rest is mid or small.
5. Cost-to-Revenue (or Cost-to-Margin) ratio
For revenue-generating features: AI cost / incremental revenue. For internal tools: AI cost / (labor hours saved × loaded labor rate). Hard line: investigate any feature > 30% C/R.
6. Waste rate (%)
Share of tokens spent on calls that produced no shipped output: failed JSON parses, guardrail rejections, agent dead-ends, eval/test traffic in prod accounts. Target < 5%.
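Two of these ratios are easy to get subtly wrong, so here is a hedged sketch of how they might be computed from ledger rows. The LedgerRow fields and the sample rows are hypothetical; the key points are that the numerator keeps retries and dead ends while the denominator counts only successful actions, and that waste is measured in tokens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerRow:
    action_id: str
    cost_usd: float
    tokens: int
    shipped: bool  # did this call contribute to output the user received?

def cost_per_successful_action(rows, successful_actions: set[str]) -> float:
    """Total spend (including retries and dead ends) over successful outcomes."""
    total = sum(r.cost_usd for r in rows)
    if not successful_actions:
        return float("inf")
    return total / len(successful_actions)

def waste_rate(rows) -> float:
    """Share of tokens spent on calls that produced no shipped output."""
    total = sum(r.tokens for r in rows)
    wasted = sum(r.tokens for r in rows if not r.shipped)
    return wasted / total if total else 0.0

rows = [
    LedgerRow("a1", 0.010, 1000, True),
    LedgerRow("a2", 0.004, 400, False),   # failed JSON parse
    LedgerRow("a2", 0.006, 600, True),    # retry that shipped
    LedgerRow("a3", 0.020, 2000, False),  # abandoned agent loop
]
cpsa = cost_per_successful_action(rows, successful_actions={"a1", "a2"})
wr = waste_rate(rows)
print(f"cost per successful action = ${cpsa:.3f}, waste rate = {wr:.0%}")
```

In this toy data three actions were attempted but only two succeeded, so the abandoned loop in a3 raises the cost of the two that did.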
A worked example
| Scenario | Model | Input tok | Output tok | $ per call | 1M calls/mo |
|---|---|---|---|---|---|
| Naïve (frontier, no cache) | Frontier | 4,000 | 800 | $0.0760 | $76,000 |
| + Prompt caching (system reused) | Frontier | 4,000 (3,500 cached) | 800 | $0.0265 | $26,500 |
| + Route 70% to mid-tier | Mix | 4,000 | 800 | $0.0142 | $14,200 |
| + Context compression (−60% in) | Mix | 1,600 | 800 | $0.0091 | $9,100 |
| + Semantic cache (35% hit) | Mix | n/a | n/a | $0.0059 | $5,900 |
Illustrative blended rates. The point is not the absolute dollars — it's that stacked optimizations compound to ~13× without changing the user-visible product.
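The compounding claim can be checked directly from the monthly figures in the worked example. This short script only re-does the arithmetic on the table's own numbers; it makes no assumption about the underlying per-token rates.

```python
# Monthly cost at 1M calls, copied from the worked example, in stacking order.
monthly_cost = {
    "naive (frontier, no cache)": 76_000,
    "+ prompt caching": 26_500,
    "+ route 70% to mid-tier": 14_200,
    "+ context compression": 9_100,
    "+ semantic cache (35% hit)": 5_900,
}

steps = list(monthly_cost.items())
baseline = steps[0][1]
for (_, prev), (name, cur) in zip(steps, steps[1:]):
    # Per-step factor and cumulative factor versus the naive baseline.
    print(f"{name:<28} x{prev / cur:.2f} step, x{baseline / cur:.1f} cumulative")

total_factor = baseline / steps[-1][1]

# A 35% semantic-cache hit rate removes 35% of the remaining paid calls.
after_semantic_cache = 9_100 * (1 - 0.35)
```

The last step is also a useful sanity check: 65% of $9,100 is $5,915, which matches the table's $5,900 after rounding.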
04 Reference Cost-Aware Architecture
The architecture below pushes every request through layers that progressively get more expensive. The cheapest possible answer wins. Each layer has a clear cost owner.
Figure 2 — Layered request path: L1 exact cache → L2 semantic cache → L3 cost-aware router → tiered models, with retrieval and provider prompt cache feeding the same ledger.
Layer-by-layer cost intent
| Layer | Goal | Cost per hit | Owner |
|---|---|---|---|
| L1 Edge cache | Zero-LLM for hot, identical prompts | ~$0 | Platform |
| L2 Semantic cache | Reuse near-duplicate answers | ~$0.0001 (vector lookup) | AI Platform |
| L3 Router | Pick cheapest viable model | Tiny classifier or rules | AI Platform |
| Prompt cache | Cut input-token cost on stable prefix | 10–25% of normal input | Feature team |
| Retrieval | Bound context size; reduce hallucination | vector + rerank fee | Data / RAG team |
| Models | Generate final answer | per-token, tier-dependent | Feature team |
| Ledger | Attribute every dollar to a tenant/feature | logging only | FinOps |
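The layered path can be sketched as a small class in which each layer gets a chance to answer before anything more expensive is called. This is an illustrative skeleton, not a gateway implementation: RequestPath, semantic_lookup, route, and call_model are hypothetical names, and the lambdas below stand in for a real vector store, router, and provider client.

```python
import hashlib
from typing import Callable, Optional

class RequestPath:
    """Cheapest-first request path: exact cache, semantic cache, router, model."""

    def __init__(self,
                 semantic_lookup: Callable[[str, str], Optional[str]],
                 route: Callable[[str], str],
                 call_model: Callable[[str, str], str]):
        self.exact_cache: dict[str, str] = {}
        self.semantic_lookup = semantic_lookup
        self.route = route
        self.call_model = call_model
        self.ledger: list[dict] = []

    def answer(self, tenant: str, feature: str, prompt: str) -> str:
        # L1: exact match, keyed per tenant so answers never cross tenants.
        key = hashlib.sha256(f"{tenant}\x00{prompt}".encode("utf-8")).hexdigest()
        if key in self.exact_cache:
            return self._record(tenant, feature, "l1_exact_cache", self.exact_cache[key])
        # L2: near-duplicate lookup (stubbed by the caller).
        hit = self.semantic_lookup(tenant, prompt)
        if hit is not None:
            return self._record(tenant, feature, "l2_semantic_cache", hit)
        # L3: pick a model tier, then pay for generation.
        model = self.route(prompt)
        text = self.call_model(model, prompt)
        self.exact_cache[key] = text
        return self._record(tenant, feature, f"model:{model}", text)

    def _record(self, tenant: str, feature: str, layer: str, text: str) -> str:
        # Every answer, cached or not, lands in the same ledger.
        self.ledger.append({"tenant": tenant, "feature": feature, "layer": layer})
        return text

path = RequestPath(
    semantic_lookup=lambda tenant, prompt: None,  # stub: always a miss
    route=lambda prompt: "small" if len(prompt) < 200 else "frontier",
    call_model=lambda model, prompt: f"[{model}] answer",
)
first = path.answer("t1", "search", "What is our refund policy?")
second = path.answer("t1", "search", "What is our refund policy?")
print([row["layer"] for row in path.ledger])
```

Recording the serving layer on every ledger row is what later makes per-layer hit rates and cost-per-hit measurable.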
05 Model Routing & Cascades
Routing is the single highest-leverage optimization. Because most queries are easy, a well-designed router can cut spend 40–70% with little or no measurable quality loss. The goal: send each request to the cheapest model whose expected quality clears the SLA.
Three routing strategies
A. Rule-based router
Deterministic dispatch on request features: length, language, intent label, user tier, schema mode.
Pros: zero overhead, auditable. Cons: rigid; misses nuanced cases.
B. Classifier router
Small model (e.g., distilled BERT, Haiku-class) predicts "difficulty" and routes accordingly.
C. Cascade
Try the cheapest model first and escalate to a stronger model when confidence is low.
Watch out: Cascades can increase cost if your small model is wrong often. Always compare cascade cost-per-success vs. single-model cost-per-success on the same eval set before shipping.
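The cascade warning is easiest to see with numbers. The sketch below compares cost-per-success for a two-step cascade against a single large model on the same eval set. EvalItem, the per-call costs, and both toy eval sets are hypothetical; a real comparison would use your own labeled evals and measured prices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalItem:
    small_correct: bool
    small_confident: bool  # would the cascade accept the small model's answer?
    large_correct: bool

def single_model_cps(items, cost_large: float) -> float:
    """Cost per success when every request goes to the large model."""
    successes = sum(i.large_correct for i in items)
    return len(items) * cost_large / successes if successes else float("inf")

def cascade_cps(items, cost_small: float, cost_large: float) -> float:
    """Cost per success when the small model runs first and escalates on low confidence."""
    cost, successes = 0.0, 0
    for i in items:
        cost += cost_small                 # the small model is always paid for
        if i.small_confident:
            successes += i.small_correct   # accepted as-is, right or wrong
        else:
            cost += cost_large             # escalation pays for both models
            successes += i.large_correct
    return cost / successes if successes else float("inf")

# Mostly easy traffic with a cheap small model: the cascade wins.
easy = [EvalItem(True, True, True)] * 7 + [EvalItem(False, False, True)] * 3
single = single_model_cps(easy, cost_large=0.02)
good = cascade_cps(easy, cost_small=0.001, cost_large=0.02)

# Mostly hard traffic with a pricier small model: the cascade loses.
hard = [EvalItem(True, True, True)] + [EvalItem(False, False, True)] * 9
bad = cascade_cps(hard, cost_small=0.01, cost_large=0.02)

print(f"single={single:.3f} cascade(easy)={good:.3f} cascade(hard)={bad:.3f}")
```

The second case escalates nine times out of ten, so almost every request pays for two models and the cascade ends up more expensive than calling the large model directly.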
06 Caching Strategies
Caching is the cheapest dollar you'll ever save. A serious system runs at least four distinct caches.
Figure 4 — Four cache tiers, each with different latency, hit rates, and freshness risk.
Designing prompt caches that actually hit
Provider prompt caches (Anthropic, OpenAI, Google) only hit when the prefix is byte-identical. The prompt structure I enforce on teams I work with:
[ STABLE — cached ]
system instructions
tool / function definitions
large reference documents
few-shot examples
[ VOLATILE — not cached ]
user query
per-request retrieved snippets (or place AFTER stable block)
timestamps, user_id, session id
Engineering rule: sort prompt sections from most stable to most volatile. Never inject a timestamp at the top. Never reorder tool definitions between calls.
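One way to enforce the stable-first rule is to assemble prompts through a single builder that also returns a hash of the stable prefix, so a drifting prefix shows up in logs. This is a sketch under assumptions: build_prompt and its parameters are hypothetical, and sorting tools by name is just one possible way to guarantee a fixed order.

```python
import hashlib
import json

def build_prompt(system: str, tools: list[dict], reference_docs: list[str],
                 few_shot: list[str], user_query: str, snippets: list[str],
                 request_meta: dict) -> tuple[str, str]:
    """Return (full_prompt, sha256 of the stable prefix)."""
    stable = "\n\n".join([
        system,
        # Deterministic serialisation: fixed tool order, fixed key order.
        json.dumps(sorted(tools, key=lambda t: t["name"]),
                   sort_keys=True, separators=(",", ":")),
        *reference_docs,
        *few_shot,
    ])
    volatile = "\n\n".join([
        user_query,
        *snippets,
        json.dumps(request_meta, sort_keys=True),  # timestamps, ids go last
    ])
    prefix_hash = hashlib.sha256(stable.encode("utf-8")).hexdigest()
    return stable + "\n\n" + volatile, prefix_hash

p1, h1 = build_prompt("You are a support bot.", [{"name": "b"}, {"name": "a"}],
                      ["policy doc"], ["example 1"], "Where is my order?",
                      ["snippet X"], {"ts": 1})
p2, h2 = build_prompt("You are a support bot.", [{"name": "a"}, {"name": "b"}],
                      ["policy doc"], ["example 1"], "Cancel my plan",
                      ["snippet Y"], {"ts": 2})
print(h1 == h2)
```

Logging the prefix hash next to the provider's cached-token count gives a quick way to tell whether a low hit rate comes from prefix drift or from cache expiry.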
Semantic cache: when it's a trap
Don't use semantic cache for personalized, account-specific, or time-sensitive answers without per-tenant namespaces.
Don't set similarity threshold too low — false positives serve wrong answers and erode trust.
Do log every cache hit's similarity score and sample audit a slice weekly.
Do invalidate by content-hash when source documents change.
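The four rules above can be combined in a small in-memory sketch: per-tenant namespaces, a similarity threshold, a log of hit scores for auditing, and invalidation by source content hash. SemanticCache and its fields are hypothetical, the 0.92 threshold is an arbitrary illustration, and embeddings are assumed to be supplied by the caller; a production version would sit on a vector store rather than a Python list.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class Entry:
    embedding: list[float]
    answer: str
    source_hash: str  # hash of the source document the answer was built from

@dataclass
class SemanticCache:
    threshold: float = 0.92  # illustrative; tune on audited samples
    namespaces: dict[str, list[Entry]] = field(default_factory=dict)
    hit_log: list[tuple[str, float]] = field(default_factory=list)

    def put(self, tenant: str, embedding: list[float], answer: str, source_hash: str) -> None:
        self.namespaces.setdefault(tenant, []).append(Entry(embedding, answer, source_hash))

    def get(self, tenant: str, embedding: list[float]) -> Optional[str]:
        best, best_sim = None, -1.0
        for entry in self.namespaces.get(tenant, []):  # never search other tenants
            sim = cosine(embedding, entry.embedding)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is None or best_sim < self.threshold:
            return None
        self.hit_log.append((tenant, best_sim))  # keep scores for the weekly audit
        return best.answer

    def invalidate(self, source_hash: str) -> int:
        """Drop every entry derived from a changed source document."""
        removed = 0
        for tenant, entries in self.namespaces.items():
            kept = [e for e in entries if e.source_hash != source_hash]
            removed += len(entries) - len(kept)
            self.namespaces[tenant] = kept
        return removed

cache = SemanticCache()
cache.put("tenant_a", [1.0, 0.0], "answer A", source_hash="doc-v1")
near_hit = cache.get("tenant_a", [0.99, 0.05])
other_tenant = cache.get("tenant_b", [1.0, 0.0])
```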
07 Context & Prompt Compression
Context length is the second-biggest token lever after model choice. The discipline: send the minimum context needed for a correct answer, no more.
The compression toolbox
| Technique | How it works | Typical reduction | Risk |
|---|---|---|---|
| Chunk + retrieve top-k | RAG instead of stuffing | 70–95% | Recall loss if k too small |
| Reranking with cap | Keep top-k after rerank, hard-cap chars | 30–60% | Reranker latency & cost |
| Summarize-then-prompt | Small model summarizes long docs once, cache summary | 60–90% | Summary may drop critical detail |
| Sliding/decay memory | Drop or summarize older turns | 40–80% on chat | Loss of long-range coherence |
| Structured extraction first | Convert docs to JSON, pass JSON | 50–80% | Schema drift |
| Prompt minification | Remove filler, redundant instructions | 10–25% | Quality regressions if over-pruned |
| Output schema constraints | Force concise structured output | 30–70% on output | Need robust parsers |
| Stop sequences & max_tokens | Bound runaway generations | variable | Truncation |
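As an illustration of the "reranking with cap" row, here is a minimal sketch that keeps the top-k chunks by rerank score and then enforces a hard character budget. cap_context and its defaults are hypothetical, and the rerank scores are assumed to come from whatever reranker is already in the pipeline.

```python
def cap_context(scored_chunks: list[tuple[float, str]],
                top_k: int = 6, max_chars: int = 4000) -> list[str]:
    """Keep the best-scoring chunks, stopping at top_k or the character budget."""
    kept: list[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)[:top_k]
    for _, text in ranked:
        if used + len(text) > max_chars:
            break  # hard cap: never exceed the budget, even for a high-scoring chunk
        kept.append(text)
        used += len(text)
    return kept

chunks = [(0.9, "a" * 1500), (0.4, "b" * 1500), (0.8, "c" * 1500), (0.7, "d" * 1500)]
kept = cap_context(chunks, top_k=3, max_chars=4000)
print(len(kept), sum(len(t) for t in kept))
```

A character cap is a coarse proxy for tokens; swapping in a tokenizer count is a straightforward refinement when one is available.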
The asymmetry: output tokens cost more
On most APIs, output tokens are 3–5× the price of input. Yet engineers focus on shrinking input. Equally important:
Constrain output with a strict JSON/XML schema; reject and retry only with stronger constraints.
Use max_tokens as a hard guardrail, not a hope.
Discourage chain-of-thought in the visible answer. If reasoning is genuinely needed, give it a separate, capped thinking budget and keep the scratchpad out of the final shown text.
For listy answers, ask explicitly for "3 bullets, ≤ 12 words each."
Heuristic: if your average output is > 800 tokens, you almost certainly have either over-broad prompts or a UX that rewards verbosity. Both are fixable.
08 RAG Cost Optimization
RAG looks cheap because each call is small, but at scale embedding, storage, retrieval, and rerank fees stack. Worse, naive RAG inflates prompt size and the model bill it was meant to reduce.
Figure 5 — Six RAG stages; the highest-leverage levers are usually steps 4 (rerank cap) and 5 (model tier).
RAG cost anti-patterns
Embedding the whole corpus every nightly run. Use content hashes; re-embed only changed chunks.
Top-k = 20 by default. Measure recall; most workloads peak at 4–8.
1536-dim embeddings for short SKU descriptions. Use a smaller model and dimension.
Per-request reranking with a frontier model. Use a dedicated cross-encoder.
No tenant scoping. Larger index = more recall noise + more tokens fed in.
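The first anti-pattern has a simple fix that can be sketched in a few lines: hash each chunk and send only new or changed chunks to the embedding job. chunks_to_embed and the in-memory hash store are hypothetical; in practice the hashes would live next to the vectors, and a hash should only be recorded once the embedding call has actually succeeded.

```python
import hashlib

def chunks_to_embed(chunks: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return ids of chunks whose content changed since the last run."""
    changed = []
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(chunk_id) != digest:
            changed.append(chunk_id)
            seen_hashes[chunk_id] = digest  # simplification: assumes the embed succeeds
    return changed

store: dict[str, str] = {}
corpus = {"c1": "alpha", "c2": "beta"}
first_run = chunks_to_embed(corpus, store)    # everything is new
second_run = chunks_to_embed(corpus, store)   # nothing changed
corpus["c2"] = "beta v2"
third_run = chunks_to_embed(corpus, store)    # only the edited chunk
print(first_run, second_run, third_run)
```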
09 Agentic Cost Control
Agents are the most dangerous cost category because their cost is unbounded by default. A single user query can trigger dozens of model calls, tool calls, and retries. Naïve loops have killed budgets overnight.
The five non-negotiable bounds
1. Step budget
Hard cap on tool/LLM iterations per session (e.g., 10). Exceeding it terminates the loop and returns a graceful error or a checkpoint.
2. Token budget
Cumulative token budget per session, enforced by the runtime — not by the model's good intentions.
3. Dollar budget
Hard per-tenant and per-feature daily spend cap; breaching it returns 429 + alert.
4. Loop detector
Detect identical or near-identical tool calls; break with diagnostic.
5. Cheap planner, expensive executor
Plan with a small model. Execute steps with a small model where possible. Only escalate the synthesis step if needed.
+ Memory compaction
Periodically summarize the running scratchpad/context so each step's prompt doesn't grow linearly.
Without compaction, avg_input_tokens grows roughly linearly with each step, so the cumulative cost of a session grows quadratically with the number of steps. Always compact.
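The shape of that curve can be checked with a small simulation. The numbers are hypothetical: a 1,000-token starting context that gains 500 tokens per step, optionally compacted back to a 2,000-token ceiling.

```python
from typing import Optional

def total_input_tokens(steps: int, base: int, growth_per_step: int,
                       compact_to: Optional[int] = None) -> int:
    """Sum the input tokens sent across all steps of one agent session."""
    total, context = 0, base
    for _ in range(steps):
        total += context              # each step re-sends the whole context
        context += growth_per_step    # tool output and reasoning get appended
        if compact_to is not None and context > compact_to:
            context = compact_to      # summarize the scratchpad back down
    return total

uncompacted_10 = total_input_tokens(10, base=1000, growth_per_step=500)
uncompacted_20 = total_input_tokens(20, base=1000, growth_per_step=500)
compacted_20 = total_input_tokens(20, base=1000, growth_per_step=500, compact_to=2000)
print(uncompacted_10, uncompacted_20, compacted_20)
```

Doubling the step count from 10 to 20 multiplies uncompacted input tokens by about 3.5 rather than 2, while the compacted session grows only linearly once it reaches the ceiling. The sketch ignores the cost of the summarization calls themselves.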
A pattern I've seen: a customer-support agent without a step cap looped on a malformed tool response 187 times in one session. Cost: $42 for one ticket. Fix: hard cap = 8 and a loop detector. New cost: $0.04.
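A runtime-enforced budget for the bounds above might look like the following sketch. AgentBudget, its default limits, and the tool name are hypothetical. The dollar bound here is per session for brevity (the per-tenant daily cap would live in the gateway), and the loop detector only catches identical calls, not near-identical ones.

```python
import json
from collections import Counter
from typing import Optional

class AgentBudget:
    """Step, token, dollar, and repeated-call limits for one agent session."""

    def __init__(self, max_steps: int = 8, max_tokens: int = 50_000,
                 max_usd: float = 0.50, max_repeats: int = 2):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.max_usd, self.max_repeats = max_usd, max_repeats
        self.steps, self.tokens, self.usd = 0, 0, 0.0
        self.calls: Counter = Counter()

    def charge(self, tool: str, args: dict, tokens: int, usd: float) -> Optional[str]:
        """Record one step. Returns None to continue, or a stop reason."""
        signature = (tool, json.dumps(args, sort_keys=True))
        self.calls[signature] += 1
        self.steps += 1
        self.tokens += tokens
        self.usd += usd
        if self.calls[signature] > self.max_repeats:
            return f"loop detected: {tool} repeated with identical arguments"
        if self.steps > self.max_steps:
            return "step budget exhausted"
        if self.tokens > self.max_tokens:
            return "token budget exhausted"
        if self.usd > self.max_usd:
            return "dollar budget exhausted"
        return None

# Simulate an agent stuck retrying the same malformed tool call.
budget = AgentBudget(max_steps=8)
stop, steps_run = None, 0
while stop is None:
    stop = budget.charge("lookup_order", {"order_id": "123"}, tokens=900, usd=0.002)
    steps_run += 1
print(steps_run, stop)
```

The important design choice is that the runtime, not the model, decides when to stop: the loop ends on the third identical call regardless of what the model wants to do next.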
10 Batching & Async Inference
If your workload is not user-facing in real time, batching is a 40–60% discount you should not leave on the table.
Three batching modes
| Mode | Latency | Typical discount | Use when |
|---|---|---|---|
| Provider Batch API | up to 24h | ~50% | Nightly enrichment, evals, content gen pipelines |
| Micro-batching (server-side) | +5–50 ms | throughput, not price | Self-hosted vLLM / TGI fleets |
| Request coalescing | +latency (varies) | 20–40% via prompt cache reuse | Many users hitting same prefix |
Async, scheduled, and shadow workloads
Push evals, content moderation backfills, and re-embedding jobs to the Batch API.
Shadow-test cheaper models in parallel; never block the user path on them.
Use off-peak windows for self-hosted fleets to amortize fixed GPU costs.
11 Fine-tune vs. Prompt vs. Distill
The default impulse to "just fine-tune" is usually a cost mistake. Here is the decision order I enforce before anyone trains anything.
Figure 6 — Try prompts and RAG first; tune only after prompt iteration plateaus; distill only for high-volume, stable workloads.
Distillation — the underused weapon
For high-volume, well-defined tasks (intent classification, extraction, routing decisions, ticket triage), distill frontier-model behavior into a fine-tuned small model. Typical economics:
Capex: a few hundred to a few thousand dollars of frontier calls to generate the training set + GPU-hours for tuning.
Opex: 5–20× cheaper per call than the teacher, often with comparable quality on the narrow task.
Break-even: usually within weeks at > 1M calls/month.
Discipline: never tune until prompt iteration genuinely plateaus on a held-out eval set, and always re-baseline tuned models when a new generation of base models ships — yesterday's tuned 8B may be worse than today's base small model.
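The break-even claim can be turned into a one-line calculation. All inputs below are hypothetical: $3,000 of capex, a teacher at $0.004 per call, and a distilled student that is 10× cheaper, at 1M calls per month.

```python
def breakeven_months(capex_usd: float, teacher_cost_per_call: float,
                     student_cost_per_call: float, calls_per_month: int) -> float:
    """Months until per-call savings repay the one-off distillation cost."""
    monthly_saving = (teacher_cost_per_call - student_cost_per_call) * calls_per_month
    if monthly_saving <= 0:
        return float("inf")  # the student is not cheaper; never pays back
    return capex_usd / monthly_saving

months = breakeven_months(capex_usd=3_000, teacher_cost_per_call=0.004,
                          student_cost_per_call=0.0004, calls_per_month=1_000_000)
print(f"break-even after {months:.2f} months")
```

With these inputs the payback is under a month, consistent with the "within weeks" range. The sketch leaves out the serving and maintenance cost of the student model, which should be folded into student_cost_per_call in a real comparison.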
12 Self-Hosted Serving Economics
Self-hosting beats SaaS APIs only when utilization is high and predictable. Otherwise you pay for idle GPUs.
If your steady-state QPS is materially below break-even, stay on the API. If it's above and stable, consider self-hosting — and only then if you have the on-call expertise to operate it.
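One simple reading of "break-even QPS" is the request rate at which hourly API spend equals the hourly fixed cost of a self-hosted fleet, people included. The function below is a sketch of that reading only. All three inputs are hypothetical, and it assumes the fleet can actually serve the resulting load.

```python
def breakeven_qps(fleet_usd_per_hour: float, people_usd_per_hour: float,
                  api_usd_per_request: float) -> float:
    """QPS at which API spend per hour equals self-hosted fixed cost per hour."""
    fixed_per_hour = fleet_usd_per_hour + people_usd_per_hour
    return fixed_per_hour / (api_usd_per_request * 3600)

# Hypothetical inputs: $12/h of GPUs, $18/h of amortized on-call and MLOps time,
# and $0.002 per request on the managed API.
qps = breakeven_qps(fleet_usd_per_hour=12.0, people_usd_per_hour=18.0,
                    api_usd_per_request=0.002)
steady_state_qps = 1.5
stay_on_api = steady_state_qps < qps
print(f"break-even at {qps:.2f} QPS, stay on API: {stay_on_api}")
```

Because the comparison uses steady-state QPS, bursty traffic with a low average will look worse for self-hosting than its peak suggests, which matches the utilization point above.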
What actually moves self-hosted cost
| Lever | Effect | Notes |
|---|---|---|
| vLLM / TGI / TensorRT-LLM | 2–10× throughput vs. naive HF | Continuous batching, paged attention |
| FP8 / INT8 / INT4 quantization | 1.5–4× throughput, smaller memory | Measure quality regression on your evals |
| Speculative decoding | 1.5–3× latency improvement | Needs draft model |
| KV cache reuse / prefix sharing | Large gain on shared system prompts | Mirrors managed prompt cache |
| Autoscale to zero (LoRA hot-swap) | Eliminate idle cost for niche models | Cold start trade-off |
| Reserved / committed-use discounts | 30–60% off on-demand | Only commit to verified baseline |
| Spot / preemptible GPUs | 50–80% off | For batch / non-critical only |
| Co-locate model + data | Eliminates egress | Matters at scale |
Hidden cost: the SRE/MLOps team you need to run a serving fleet 24/7 is often more expensive than the GPUs themselves. Bake people cost into the build-vs-buy model.
13 Observability & FinOps
You cannot control what you cannot attribute. Every LLM call must be tagged at emission with enough metadata to roll up by tenant, feature, environment, and team.
Cloud FinOps: AWS Cost Explorer / CUR, GCP Billing BigQuery export, Azure Cost Management, plus Vantage, CloudHealth, Cloudability.
Custom ledger: a simple append-only table in your warehouse (BigQuery / Snowflake) is often the most trustworthy single source of truth — emit from a thin SDK wrapper around every provider call.
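A thin wrapper of the kind described might look like this sketch. metered_call, REQUIRED_TAGS, the record fields, and the stub provider are all hypothetical; the provider callable is assumed to return the text plus input and output token counts, and sink stands in for whatever appends to the warehouse table.

```python
import json
import time
from typing import Callable

REQUIRED_TAGS = ("tenant", "feature", "environment", "team")

def missing_tags(tags: dict) -> list[str]:
    """Tags that are absent or empty, in a fixed order."""
    return [t for t in REQUIRED_TAGS if not tags.get(t)]

def metered_call(provider_call: Callable[[str, str], tuple[str, int, int]],
                 model: str, prompt: str, tags: dict,
                 usd_per_in_token: float, usd_per_out_token: float,
                 sink: Callable[[str], None]) -> str:
    """Call the provider and append one ledger row; refuse untagged calls."""
    missing = missing_tags(tags)
    if missing:
        raise ValueError(f"refusing untagged LLM call, missing: {missing}")
    text, in_tok, out_tok = provider_call(model, prompt)
    sink(json.dumps({
        "ts": time.time(),
        "model": model,
        **{t: tags[t] for t in REQUIRED_TAGS},
        "input_tokens": in_tok,
        "output_tokens": out_tok,
        "cost_usd": in_tok * usd_per_in_token + out_tok * usd_per_out_token,
    }))
    return text

# Stub provider and an in-memory sink, for illustration only.
ledger_rows: list[str] = []
reply = metered_call(
    provider_call=lambda model, prompt: ("ok", 1200, 150),
    model="mid-tier", prompt="Summarize this ticket.",
    tags={"tenant": "acme", "feature": "triage", "environment": "prod", "team": "support"},
    usd_per_in_token=3e-6, usd_per_out_token=15e-6,
    sink=ledger_rows.append,
)
record = json.loads(ledger_rows[0])
print(record["cost_usd"])
```

Rejecting untagged calls at the wrapper is what makes the attribution requirement enforceable rather than aspirational.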
14 Quotas, Budgets & Guardrails
Soft guidance is not control. Cost guardrails must be enforced in the request path, not in a wiki page.
The three-tier guardrail stack
| Tier | Trigger | Action | Owner |
|---|---|---|---|
| 1. Soft alert | 80% of feature budget | Slack + dashboard banner | Feature team |
| 2. Hard throttle | 100% of feature budget | Rate-limit or downgrade model tier | AI Platform |
| 3. Circuit break | 2× burst, or per-tenant DoS | Block + 429 + page on-call | SRE / FinOps |
Concrete guardrails worth shipping
Per-tenant daily $ cap (with override workflow).
Per-feature monthly $ cap tied to product manager sign-off.
Max tokens per request at the SDK wrapper level — refuse if exceeded.
Max steps per agent session with graceful fallback.
Production-only model allowlist — prevent staging models leaking to prod.
Cost-anomaly autoresponder — auto-page on > N% deviation from forecast.
Eval-traffic isolation — separate API key + budget; never pollute prod cost reports.
Design principle: fail closed on cost, fail open on quality. Better to throttle than to surprise the CFO.
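The three tiers can be expressed as one decision function evaluated in the request path. This is a sketch: Action and guardrail are hypothetical names, and burst_ratio is my reading of "2× burst" as the current spend rate divided by the expected rate.

```python
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    SOFT_ALERT = "soft_alert"        # notify, keep serving
    HARD_THROTTLE = "hard_throttle"  # rate-limit or downgrade model tier
    CIRCUIT_BREAK = "circuit_break"  # block with 429 and page on-call

def guardrail(spend_usd: float, budget_usd: float, burst_ratio: float = 1.0) -> Action:
    """Map current spend against the feature budget to a guardrail tier."""
    if burst_ratio >= 2.0:
        return Action.CIRCUIT_BREAK   # checked first: bursts can precede budget breach
    if spend_usd >= budget_usd:
        return Action.HARD_THROTTLE
    if spend_usd / budget_usd >= 0.8:
        return Action.SOFT_ALERT
    return Action.ALLOW

for spend, burst in [(50, 1.0), (85, 1.0), (100, 1.0), (10, 2.5)]:
    print(spend, burst, guardrail(spend, budget_usd=100, burst_ratio=burst).value)
```

Checking the burst condition before the budget conditions means a runaway tenant is stopped even while total spend is still far below the monthly cap.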
15 Vendor & Contract Strategy
List prices are starting points, not endings. Above ~$50K/month of spend, real money is in commercial structure.
Negotiation levers (in order of impact)
Volume commitments with quarterly true-ups — 10–30% off list at meaningful scale.
Multi-year deals with model-class portability — protect against being locked to a deprecated SKU.
Even if one vendor is your strategic primary, maintain a tested fallback for two reasons:
Resilience: outages happen; cost spikes during incidents are real.
Leverage: a vendor that knows you can route away will price differently.
Hide vendor specifics behind an internal LLM gateway with a unified schema; this also gives you the natural place to apply routing, caching, and budgets.
16 Org Model & RACI
Cost control fails when no one owns it. The default state is "AI is everyone's tool and no one's bill."
| Activity | Responsible (R) | Accountable (A) | Consulted (C) | Informed (I) |
|---|---|---|---|---|
| Model selection per feature | Feature lead | AI Platform | FinOps | CTO |
| Routing / caching infra | AI Platform | Platform lead | SRE, Security | Eng leadership |
| Budget setting | Finance + PM | VP Eng | Feature leads | CFO |
| Quota enforcement | AI Platform | SRE | Feature teams | FinOps |
| Vendor contracts | Procurement | CTO | AI Platform, Legal | CFO |
| Cost anomaly response | On-call SRE | FinOps | Feature owner | VP Eng |
| Quarterly cost review | FinOps | VP Eng | Feature leads | Exec staff |
The "AI Platform" team charter
If you do not have one, propose it. A small (3–8 person) AI Platform team owns:
The LLM gateway / SDK wrapper (single point of vendor swapping, routing, caching, budgeting).
The cost ledger and dashboards.
Shared retrieval infrastructure.
Eval and red-team harnesses.
Production guardrails (PII, abuse, cost).
This team's ROI is measured as $ saved per $ spent on the team; at scale, in the systems I have seen, it is consistently > 10×.
17 AI Cost Maturity Model
Figure 7 — Five stages; advancing one stage typically halves unit cost.
18 Anti-Patterns & How to Diagnose Them
| Anti-pattern | Smell | Fix |
|---|---|---|
| "Frontier by default" | > 60% spend on top-tier model | Introduce router + cascade; baseline cheaper tier |
| Volatile prompt prefix | Prompt cache hit < 10% | Move volatile content to end; freeze tool schemas |
| RAG top-k = 20 | Avg input tokens > 6K with no need | Measure recall; rerank + cap |
| Verbose outputs | Output tokens > input tokens | Schema-constrained outputs, max_tokens |
| Unbounded agents | Long-tail of $$$ sessions | Step + token + $ caps; loop detector |
| Re-embed everything nightly | Embedding bill grows with no corpus growth | Hash-based incremental embedding |
| Eval traffic in prod budget | Spend up on weekends/CI | Separate API key; separate ledger bucket |
| No per-tenant attribution | Can't answer "who is the top spender?" | Mandatory tenant_id tag on every call |
| Tune-before-prompt-plateau | $10K tuning runs that win < 2% | Pause tuning; run prompt sprint first |
| Self-host at low utilization | GPU fleet at 15% load | Move back to managed API; revisit at scale |
19 What I keep on my desk
Not commandments — a checklist I re-read before sign-off. If every item is yes, the system is cost-disciplined enough to scale.
Visibility
Every LLM call is tagged with tenant, feature, environment, team, model.
There is a single ledger (warehouse table) that is the source of truth for AI spend.
Cost-per-successful-action is published per feature, refreshed daily.
Forecast-to-month-end exists, with alerting at deviation thresholds.
Architecture
All provider calls go through an internal LLM gateway.
The gateway implements: routing, prompt caching, semantic caching, budgets, retries.
Every prompt is structured stable-first, volatile-last.
RAG uses hybrid retrieval with reranking and a hard top-k cap.
Agents have hard caps on steps, tokens, dollars, and identical-tool-call loops.
Governance
Each feature has a named budget owner and a monthly cap.
Soft/hard/circuit-break guardrails are wired in the request path.
Eval traffic uses separate API keys and its own budget bucket.
A quarterly cost review covers spend, unit economics, and vendor posture.
Continuous improvement
Routing thresholds re-evaluated monthly on a held-out set.
New model releases are benchmarked for $/quality before promotion.
High-volume stable tasks are reviewed annually for distillation.
An "AI cost office hours" or equivalent exists; teams know who to ask.
20 Glossary
Terms used in this note
| Term | Meaning |
|---|---|
| CPA (Cost per Action) | Fully loaded LLM + infra cost divided by successful user outcomes. |
| Cascade | Routing pattern: try cheapest model, escalate on low confidence. |
| Prompt cache | Provider-side caching of stable prompt prefixes, billed at a fraction of normal input. |
| Semantic cache | Application-side cache keyed by vector similarity to prior prompts. |
| LoRA / PEFT | Parameter-efficient fine-tuning; small adapters on a frozen base model. |
| Distillation | Training a smaller model to mimic a larger model's behavior on a target task. |
| vLLM / TGI | High-throughput inference servers with paged attention and continuous batching. |
| FinOps | Operational discipline of bringing financial accountability to variable cloud/AI spend. |
| Budget bucket | A pre-allocated dollar pool tied to a team/feature, enforced at the gateway. |
| Waste rate | Tokens spent on calls that produced no shipped value (failed parses, dead agent loops, etc.). |
21 References & Sources
Annotated bibliography behind the cost-driver map, unit-economics formulas, cost-aware architecture, routing cascades, four cache layers, RAG pipeline levers, agentic bounds, batching economics, fine-tune/distill decision tree, self-hosted break-even, FinOps telemetry schema, guardrail stack, vendor posture, RACI, maturity model, anti-patterns, and desk checklist. Section tags (e.g. §05) show where each source is used. Diagrams and worked examples are my synthesis unless noted.
Scope. Synthesis of FinOps, LLM-inference, RAG, and platform-engineering sources (May 2026). Hero KPI ranges (10–100× model cost delta, 50–90% savings from caching/routing, 3–5× agentic inflation) blend FrugalGPT, provider pricing docs, and field benchmarks — directional ranges, not guarantees for your workload. Dollar figures in §03 worked example use illustrative blended rates; verify against current list prices before budgeting. Not vendor, financial, or procurement advice.
Citations are numbered continuously [1]–[n] within this section.
FinOps Foundation, FinOps Framework & FinOps Principles. 2024–25. Capabilities model for bringing financial accountability to variable cloud spend — backbone for §13 FinOps cadence, §14 budget guardrails, and §17 maturity stages 3–4. finops.org/framework — §13, §14, §17.
FinOps Foundation, FinOps for AI working group & community guidance. 2024–25. Applying FinOps disciplines to GPU and inference billing — §01 hero framing and §13 ledger/dashboard requirements. finops.org — §01, §13.
AWS, Cost and Usage Report (CUR) & Cost Explorer documentation. Granular attribution of cloud/GPU spend — §13 tooling landscape and §02 driver 4 (serving infra). docs.aws.amazon.com/cur — §02, §13.
OpenAI, API Pricing & usage documentation. 2025–26. Input/output token rates, Batch API discount, prompt caching — §03 CPA formula, §03 worked example, §06 L3 cache, §10 batching. openai.com/api/pricing — §03, §06, §10.
Anthropic, API Pricing & Prompt Caching documentation. 2025–26. Cached-input pricing (≈10–25% of base input), batch discounts — §06 prompt-cache layer and §03 output/input asymmetry. docs.anthropic.com/prompt-caching — §03, §06, §07.
Google Cloud, Vertex AI Generative AI pricing & Context caching. 2025–26. Gemini context-cache TTL and token billing — §06 four-layer cache and §04 architecture prompt-cache node. cloud.google.com/vertex-ai/pricing — §04, §06.
Stanford HAI, 2025 AI Index Report — Technical Performance & Economy chapters. 2025. Inference-cost declines, model tiers, and adoption economics — background for §01 hero 10–100× tier delta and §05 monthly re-eval guidance. hai.stanford.edu/ai-index — §01, §05, §11.
McKinsey & Company, The Economic Potential of Generative AI: The Next Productivity Frontier. June 2023. Gen-AI cost as a scaling constraint for enterprises — §01 “fastest-growing line item” context and §15 vendor negotiation at scale. mckinsey.com — §01, §15.
Cost-aware routing, cascades & model selection (§05, FIG 3, §18)
Chen, L. et al., “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” arXiv:2305.05176, 2023. LLM cascade and prompt-adaptation strategies; reported large cost reductions with quality preservation — §01 hero 50–90% range, §05 cascade routing, §05 callout on cost-per-success. arxiv.org/abs/2305.05176 — §01, §05, hero.
Ong, I. et al., “RouteLLM: Learning to Route LLMs with Preference Data.” arXiv:2406.18665, 2024. Learned routers sending queries to strong vs. weak models — §05 classifier router and §04 L3 cost-aware router. arxiv.org/abs/2406.18665 — §04, §05.
Shnitzer, T. et al., “Large Language Model Routing with Benchmark Datasets.” arXiv:2309.15789, 2023. Router design and evaluation methodology — §05 routing-policy calibration steps. arxiv.org/abs/2309.15789 — §05.
Yan, E., “Patterns for Building LLM-based Systems & Products.” eugeneyan.com, 2023–25 (ongoing). Practical patterns for routing, eval, and cost-quality trade-offs — cited in §05 design steps and §08 RAG anti-patterns mindset. eugeneyan.com/writing/llm-patterns — §05, §08, §19.
Lewis, P. et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Foundational RAG architecture — §08 pipeline stages and §07 chunk-and-retrieve compression lever. arxiv.org/abs/2005.11401 — §07, §08.
Liu, N. F. et al., “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024. Models under-use middle context — supports §07 hard top-k caps and §08 rerank-cap lever over stuffing. arxiv.org/abs/2307.03172 — §07, §08.
Es, S. et al., “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG eval metrics (faithfulness, recall) — §08 stage-6 eval and §13 judge-model cost control. arxiv.org/abs/2309.15217 — §08, §13.
Yao, S. et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool-use loops that multiply LLM calls — §09 cost-shape formula and step-budget rationale. arxiv.org/abs/2210.03629 — §09.
LangChain / LangGraph documentation — agent runtimes, recursion limits & checkpoints. 2024–25. Step caps and state management patterns — §09 five non-negotiable bounds and loop detector. langchain-ai.github.io/langgraph — §09, §14.
Shinn, N. et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Iterative self-correction loops — background for §09 quadratic cost curve without compaction. arxiv.org/abs/2303.11366 — §09.
Helicone & Langfuse case studies on agent cost spikes. 2024–25. Runaway session patterns and guardrail responses — §09 “pattern I've seen” callout and §13 anomaly dashboard. helicone.ai/blog — §09, §13.
Batching, async inference & throughput (§10, §12)
OpenAI, Batch API documentation. ~50% discount, 24-hour completion window — §10 provider Batch API row and async workload bullets. platform.openai.com/docs/guides/batch — §10.
Yu, G.-I. et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Continuous batching concepts — precursor to modern inference servers in §10 micro-batching row. usenix.org/osdi22 — §10, §12.
Fine-tuning, LoRA, distillation & the tune-last discipline (§11, FIG 6)
Hu, E. J. et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Parameter-efficient fine-tuning — §11 LoRA/PEFT card and §02 driver 3 amortization. arxiv.org/abs/2106.09685 — §02, §11.
Hinton, G., Vinyals, O., & Dean, J., “Distilling the Knowledge in a Neural Network.” NeurIPS 2014 Deep Learning Workshop. Knowledge distillation foundation — §11 distillation economics and FIG 6 decision tree. arxiv.org/abs/1503.02531 — §11.
OpenAI, Model distillation guide & fine-tuning API documentation. 2024–25. Teacher-student workflows and when distillation pays — §11 break-even bullets and §18 tune-before-prompt anti-pattern. platform.openai.com/docs/guides/distillation — §11, §18.
Dettmers, T. et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. Lower-cost fine-tune runs — §11 LoRA card capex range. arxiv.org/abs/2305.14314 — §11.
Kwon, W. et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). PagedAttention and throughput gains — §12 vLLM row and §04 self-hosted OSS tier. arxiv.org/abs/2309.06180 — §04, §12.
vLLM project documentation. 2024–25. Continuous batching, prefix caching, quantization — §12 lever table and §10 micro-batching. docs.vllm.ai — §10, §12.
NVIDIA, TensorRT-LLM documentation. 2024–25. FP8/INT quantization and speculative decoding — §12 quantization and speculative-decoding rows. nvidia.github.io/TensorRT-LLM — §12.
Hugging Face, Text Generation Inference (TGI) documentation. Production serving stack — §12 vLLM/TGI comparison and glossary entry. huggingface.co/docs/tgi — §12, §20.
NVIDIA & cloud-provider GPU pricing pages (H100/H200, reserved vs spot). 2025–26. Break-even inputs for §12 formula and §18 self-host-at-low-utilization anti-pattern. nvidia.com/data-center — §12, §18.
Braintrust — eval + production logging for LLM apps. 2024–25. Cost-per-success tracking — §03 CPA denominator and §13 unit-economics dashboard. braintrust.dev/docs — §03, §13.
Arize AI, Phoenix — open-source LLM observability. 2024–25. Tracing and eval for RAG — §08 eval stage and §13 tooling landscape. docs.arize.com/phoenix — §08, §13.
Datadog, LLM Observability product documentation. 2024–25. Enterprise tracing and cost correlation — §13 tooling landscape and §16 AI Platform charter. docs.datadoghq.com/llm_observability — §13, §16.
OpenTelemetry, generative-AI semantic conventions (draft). 2024–25. Standard spans for model calls — §13 minimum telemetry schema design. opentelemetry.io/gen-ai — §13.
Guardrails, quotas, platform teams & vendor posture (§14–§16, §15)
Google SRE Book — chapters on monitoring, alerting & capacity planning. O'Reilly, 2016–18. Circuit-breaker and throttle patterns — §14 three-tier guardrail stack and §14 circuit-break row. sre.google/sre-book — §14.
Team Topologies (Skelton & Pais) — platform team as internal product. 2019. AI Platform charter in §16 — gateway ownership and enablement vs feature-team autonomy. — §16.
Partnership on AI & NIST, AI Risk Management Framework — resource allocation & monitoring. 2023. Governance patterns for production AI — §14 prod model allowlist and §16 guardrails ownership. nist.gov/ai-rmf — §14, §16.
Truong, L., AI Cost Control — personal working notes. May 2026. Original diagrams (FIG 1–7), cost-aware architecture, maturity model, anti-pattern table, desk checklist, and §03 stacked-optimization example. LinhTruong.com — all sections.
📎 Before you quote externally
Provider list prices and cache-discount percentages change frequently — re-read OpenAI, Anthropic, and Google pricing pages before citing §03 dollar figures. The §03 13× stacked-optimization example is illustrative compounding, not a guaranteed outcome. FIG 1–7 driver shares and cache hit-rate bands are field heuristics; measure on your own traffic. Self-hosted break-even depends on utilization, labor cost, and reserved-GPU terms — run your own model before committing capex.
What I'd tell a team on day one: the edge is not picking the smartest model — it's designing systems whose unit economics improve as scale grows. Optimize for compounding: every routing improvement, every cache hit, every byte trimmed from a prompt pays you back every day the system runs.