Personal notes · May 2026

AI Cost Control

I wrote this note to keep LLM spend economically rational — from token-level mechanics through routing, caching, RAG, agents, and FinOps guardrails. It is my working blueprint for teams shipping AI at scale, not a vendor white paper.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

The question behind this note: does every token, GPU-second, and vector lookup earn its keep? What follows is how I map the bill, design the request path, and keep unit economics from drifting as usage grows.

Topics: FinOps for AI · Token Economics · Caching · Routing · Batching · RAG Optimization · Inference Architecture · Observability · Governance & Guardrails

01 What I'm tracking

Inference spend is now the fastest-growing line item in most engineering budgets I see — often eclipsing compute, storage, and observability combined. My job on these systems is to keep AI economically rational: every token, GPU-second, and vector lookup either generates measurable user value or gets cut.

10–100×: cost delta between frontier and small models
50–90%: token spend cuttable via caching and routing
3–5×: inflation from naive agentic loops
< $0.01: target unit cost per common user action

What I keep coming back to: cost control is an architectural property, not a procurement exercise. Design it into the request path: route the cheapest model that meets the SLA, cache aggressively at every layer, compress context ruthlessly, batch where latency allows, and instrument every call to a cost ledger.

What's in this note

The full taxonomy of where dollars go in an LLM system.
Architectural patterns: routing, caching, RAG, agents, batching, serving.
Engineering trade-offs: latency vs. cost vs. quality vs. risk.
FinOps tooling, observability schemas, and budget guardrails.
Org design, vendor negotiation, and a 5-stage maturity model.

02 Cost Drivers Anatomy

Before optimizing, map the bill. AI cost decomposes into seven primary drivers. Most teams over-index on (1) and ignore (3)–(7), which is exactly where the avoidable waste lives.

Figure 1 — Seven primary drivers. The hub is total cost of ownership; each spoke needs its own owner and SLO.

Driver deep-dive

Driver	Typical share	Primary levers	Common waste
Token inference	50–80%	Routing, caching, compression, smaller models	Frontier model for trivial calls
Embeddings & vector store	3–10%	Smaller embed dims, quantization, TTL	Re-embedding unchanged corpora
Fine-tuning / training	2–15%	LoRA, distillation, early stopping	Tuning before prompt iteration plateau
GPU serving infra	10–40% (if self-hosted)	vLLM, paged-attention, autoscale-to-zero	Idle reserved fleets
Egress & networking	1–8%	Co-locate model + data, edge cache	Cross-region RAG hops
Observability & eval	5–15%	Sampled tracing, batch evals, cheap judge models	100% trace + frontier judge LLM
Human-in-the-loop	variable	Active learning, confidence-gated review	Reviewing every output

03 Unit Economics & Formulas

Every product decision should resolve to a cost-per-action (CPA) number. If you can't compute CPA on demand, you cannot govern AI cost.

Core formula

CPA = Σ over calls c in action a:
      (input_tokens(c)  × P_in(model_c)
     +  output_tokens(c) × P_out(model_c)
     +  cached_tokens(c) × P_cache(model_c))
    +  retrieval_cost(a)
    +  tool_cost(a)
    +  infra_amortization(a)
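A minimal sketch of the formula above, written out so the terms can be checked by hand. The `Price` and `Call` types and the per-million-token rates are hypothetical placeholders, not real list prices. I also assume `input_tokens` counts uncached input only; if your provider reports cached tokens inside the input count, subtract them first so they are not billed twice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    """USD per one million tokens (hypothetical rates, not list prices)."""
    p_in: float
    p_out: float
    p_cache: float

@dataclass(frozen=True)
class Call:
    model: str
    input_tokens: int    # uncached input tokens only (assumption)
    output_tokens: int
    cached_tokens: int   # input tokens served from the provider prompt cache

def call_cost(call: Call, prices: dict[str, Price]) -> float:
    """Token cost of one call in USD."""
    p = prices[call.model]
    return (call.input_tokens * p.p_in
            + call.output_tokens * p.p_out
            + call.cached_tokens * p.p_cache) / 1_000_000

def cost_per_action(calls, prices, retrieval_cost=0.0, tool_cost=0.0,
                    infra_amortization=0.0) -> float:
    """CPA: sum of per-call token costs plus the per-action overheads."""
    return (sum(call_cost(c, prices) for c in calls)
            + retrieval_cost + tool_cost + infra_amortization)

PRICES = {"frontier": Price(10.0, 40.0, 1.0), "small": Price(0.5, 2.0, 0.05)}
calls = [Call("small", 300, 20, 0),          # e.g. a routing or classification call
         Call("frontier", 500, 400, 3500)]   # e.g. the answer, with a cached prefix
cpa = cost_per_action(calls, PRICES, retrieval_cost=0.0002)
print(f"CPA = ${cpa:.5f}")
```

With these made-up rates the frontier call dominates the action cost, which is the usual shape: the action's CPA is mostly decided by which calls reach the expensive tier.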

Six ratios I won't ship without

1. Cost per successful action ($/action)

Numerator is total LLM + infra spend. Denominator is successful outcomes (not raw API calls), defined by product. Retries, abandoned agentic loops, and rejections all inflate cost without value.

2. Tokens per action (in/out)

Track input and output independently. Output tokens are 3–5× more expensive than input on most APIs. A rising output/input ratio often signals prompt drift or verbose chain-of-thought.

3. Cache hit rate (%)

Prompt cache + semantic cache + retrieval cache. Below 30% means you have not invested. Above 80% means you may be serving stale answers, so pair it with a freshness SLO.

4. Model mix ($-weighted)

Share of spend by tier: frontier, mid, small, self-hosted. Healthy SaaS-style products run 10–25% frontier; the rest is mid or small.

5. Cost-to-Revenue (or Cost-to-Margin) ratio

For revenue-generating features: AI cost / incremental revenue. For internal tools: AI cost / (labor hours saved × loaded labor rate). Hard line: investigate any feature with C/R above 30%.

6. Waste rate (%)

Share of tokens spent on calls that produced no shipped output: failed JSON parses, guardrail rejections, agent dead-ends, eval/test traffic in prod accounts. Target < 5%.
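To make the difference between cost-per-call and cost-per-successful-action concrete, here is a small sketch that derives both, plus the waste rate, from ledger rows. The row fields (`action_id`, `outcome`, and so on) and the sample values are hypothetical. I assume an action counts as successful if at least one of its calls ended in `"success"`, and that every token on a non-successful call is waste.

```python
def unit_economics(rows):
    """Compute cost-per-call, cost-per-successful-action and waste rate."""
    total_cost = sum(r["cost_usd"] for r in rows)
    total_tokens = sum(r["input_tokens"] + r["output_tokens"] for r in rows)
    wasted_tokens = sum(r["input_tokens"] + r["output_tokens"]
                        for r in rows if r["outcome"] != "success")
    successful_actions = {r["action_id"] for r in rows if r["outcome"] == "success"}
    return {
        "cost_per_call": total_cost / len(rows),
        "cost_per_successful_action": (total_cost / len(successful_actions)
                                       if successful_actions else float("inf")),
        "waste_rate": wasted_tokens / total_tokens if total_tokens else 0.0,
    }

rows = [
    # Action a1 needed a retry after a failed JSON parse.
    {"action_id": "a1", "outcome": "parse_error", "input_tokens": 1000, "output_tokens": 200, "cost_usd": 0.010},
    {"action_id": "a1", "outcome": "success", "input_tokens": 1000, "output_tokens": 200, "cost_usd": 0.010},
    {"action_id": "a2", "outcome": "success", "input_tokens": 500, "output_tokens": 100, "cost_usd": 0.005},
    # Action a3 was rejected by a guardrail and never shipped.
    {"action_id": "a3", "outcome": "guardrail_reject", "input_tokens": 400, "output_tokens": 0, "cost_usd": 0.002},
]
metrics = unit_economics(rows)
print(metrics)
```

In this toy ledger the cost per successful action is twice the cost per call, which is the gap the first ratio is meant to expose.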

A worked example

Scenario	Model	Input tok	Output tok	$ per call	1M calls/mo
Naïve (frontier, no cache)	Frontier	4,000	800	$0.0760	$76,000
+ Prompt caching (system reused)	Frontier	4,000 (3,500 cached)	800	$0.0265	$26,500
+ Route 70% to mid-tier	Mix	4,000	800	$0.0142	$14,200
+ Context compression (−60% in)	Mix	1,600	800	$0.0091	$9,100
+ Semantic cache (35% hit)	Mix	—	—	$0.0059	$5,900

Illustrative blended rates. The point is not the absolute dollars — it's that stacked optimizations compound to ~13× without changing the user-visible product.
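The last row and the ~13× figure can be reproduced from the table's own per-call numbers. This sketch assumes a semantic-cache hit costs roughly nothing (the vector lookup is ignored), so the blended cost is the previous row times the miss rate.

```python
monthly_calls = 1_000_000

# Per-call costs copied from the table (illustrative blended rates).
per_call = {
    "naive": 0.0760,
    "prompt_cache": 0.0265,
    "routing": 0.0142,
    "compression": 0.0091,
}

# Assumption: a semantic-cache hit is ~free, so only misses pay the model cost.
semantic_hit_rate = 0.35
per_call["semantic_cache"] = per_call["compression"] * (1 - semantic_hit_rate)

monthly = {stage: cost * monthly_calls for stage, cost in per_call.items()}
overall_reduction = per_call["naive"] / per_call["semantic_cache"]

for stage, cost in monthly.items():
    print(f"{stage:15s} ${cost:>10,.0f}/mo")
print(f"overall reduction: {overall_reduction:.1f}x")
```

The unrounded last row comes out slightly above the table's $5,900 because the table rounds the per-call cost to four decimals.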

04 Reference Cost-Aware Architecture

The architecture below pushes every request through layers that progressively get more expensive. The cheapest possible answer wins. Each layer has a clear cost owner.

Figure 2 — Layered request path: L1 exact cache → L2 semantic cache → L3 cost-aware router → tiered models, with retrieval and provider prompt cache feeding the same ledger.

Layer-by-layer cost intent

Layer	Goal	Cost per hit	Owner
L1 Edge cache	Zero-LLM for hot, identical prompts	~$0	Platform
L2 Semantic cache	Reuse near-duplicate answers	~$0.0001 (vector lookup)	AI Platform
L3 Router	Pick cheapest viable model	Tiny classifier or rules	AI Platform
Prompt cache	Cut input-token cost on stable prefix	10–25% of normal input	Feature team
Retrieval	Bound context size; reduce hallucination	vector + rerank fee	Data / RAG team
Models	Generate final answer	per-token, tier-dependent	Feature team
Ledger	Attribute every dollar to a tenant/feature	logging only	FinOps
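A minimal sketch of the layered path, assuming in-process stand-ins for every layer. The function name `answer`, the layer labels, and the stub models are hypothetical; a real gateway would put a shared cache and real provider clients behind the same shape.

```python
import hashlib
from typing import Callable, Optional

def answer(prompt: str,
           exact_cache: dict[str, str],
           semantic_lookup: Callable[[str], Optional[str]],
           route: Callable[[str], str],
           models: dict[str, Callable[[str], str]],
           ledger: list[dict]) -> str:
    """Try the cheapest layer first; record which layer served the request."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in exact_cache:                      # L1: identical prompt, no LLM call
        ledger.append({"layer": "L1_exact", "model": None})
        return exact_cache[key]
    cached = semantic_lookup(prompt)            # L2: near-duplicate answer
    if cached is not None:
        ledger.append({"layer": "L2_semantic", "model": None})
        return cached
    tier = route(prompt)                        # L3: pick the cheapest viable tier
    result = models[tier](prompt)
    exact_cache[key] = result
    ledger.append({"layer": "L3_model", "model": tier})
    return result

# Stubs standing in for real components (hypothetical).
ledger: list[dict] = []
cache: dict[str, str] = {}
models = {"small": lambda p: "small:" + p, "frontier": lambda p: "frontier:" + p}
route = lambda p: "frontier" if len(p) > 200 else "small"
no_semantic_hit = lambda p: None

first = answer("reset my password", cache, no_semantic_hit, route, models, ledger)
second = answer("reset my password", cache, no_semantic_hit, route, models, ledger)
print([row["layer"] for row in ledger])
```

Every branch writes to the ledger, including cache hits, so hit rates by layer fall out of the same table that carries spend.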

05 Model Routing & Cascades

Routing is usually the single highest-leverage optimization. A well-designed router can cut spend 40–70% with little or no measurable quality loss, because most queries are easy. The goal: send each request to the cheapest model whose expected quality clears the SLA.

Three routing strategies

A. Rule-based router

Deterministic dispatch on request features: length, language, intent label, user tier, schema mode.

Pros: Zero overhead, auditable.
Cons: Rigid; misses nuanced cases.
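A rule-based router can be as small as the sketch below. The `Request` fields, intent labels, tier names, and the length threshold are all hypothetical; the point is only that the dispatch is deterministic and easy to audit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    prompt: str
    intent: str            # label from an upstream intent classifier (assumed)
    user_tier: str         # e.g. "free", "pro", "enterprise"
    needs_json: bool = False

def route(req: Request) -> str:
    """Return the model tier for a request using fixed, ordered rules."""
    if req.intent in {"greeting", "faq", "classification"}:
        return "small"                      # trivial intents never leave the small tier
    if req.user_tier == "enterprise" and req.intent == "analysis":
        return "frontier"                   # paid tier plus a hard task
    if len(req.prompt) > 8_000 or req.needs_json:
        return "mid"                        # long inputs or strict schema mode
    return "small"                          # default to the cheapest tier

print(route(Request("hi there", "greeting", "free")))
print(route(Request("compare Q3 churn drivers", "analysis", "enterprise")))
```

Because the rules are ordered, the first match wins; keeping the cheapest tier as the default means a missing rule fails cheap rather than expensive.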

B. Classifier router

Small model (e.g., distilled BERT, Haiku-class) predicts "difficulty" and routes accordingly.

Pros: Adaptive, cheap (< $0.0002/call).
Cons: Needs labeled traffic; can drift.

C. Cascade / fallback

Try small model first; escalate to mid/frontier only if confidence/judge says "insufficient."

Pros: Self-correcting; great for long tail.
Cons: Adds latency; needs a cheap judge.

Figure 3 — A typical 3-tier cascade: small model handles the majority; a cheap judge gates escalation; frontier model reserved for the long tail.

Designing the routing policy

Label the workload. Sample 1–5K real prompts, annotate difficulty / required capability / latency tolerance.
Calibrate thresholds. For each candidate model, measure quality on samples; pick the cheapest that meets the SLA.
Decide escalation signal. Options: log-prob/entropy, judge model verdict, JSON schema validation, regex/tool-call success, user thumbs-down.
Bound the cascade. Hard cap on escalations per request (e.g., 1) to prevent cost runaway.
Continuously re-evaluate. Cheap models improve quickly; re-run routing eval monthly.

Watch out: Cascades can increase cost if your small model is wrong often. Always compare cascade cost-per-success vs. single-model cost-per-success on the same eval set before shipping.
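The comparison in the warning above can be sketched as expected-value arithmetic. This is a simplified model under stated assumptions: the judge is perfect (it escalates exactly the small model's failures), there is at most one escalation, and the costs and success rates are made-up numbers rather than measurements.

```python
def cascade_cost_per_success(c_small, c_judge, c_frontier, p_small, p_frontier):
    """Expected cost per successful answer for a two-tier cascade.

    Every request pays the small model and the judge; the fraction the
    small model gets wrong (1 - p_small) also pays the frontier model.
    """
    expected_cost = c_small + c_judge + (1 - p_small) * c_frontier
    expected_success = p_small + (1 - p_small) * p_frontier
    return expected_cost / expected_success

def single_cost_per_success(cost, p_success):
    """Expected cost per successful answer for one model used alone."""
    return cost / p_success

frontier_only = single_cost_per_success(0.02, 0.95)
good_small = cascade_cost_per_success(0.001, 0.0005, 0.02, p_small=0.80, p_frontier=0.95)
weak_small = cascade_cost_per_success(0.001, 0.0005, 0.02, p_small=0.05, p_frontier=0.95)

print(f"frontier only:            ${frontier_only:.4f} per success")
print(f"cascade, small right 80%: ${good_small:.4f} per success")
print(f"cascade, small right 5%:  ${weak_small:.4f} per success")
```

With a small model that is right 80% of the time the cascade is several times cheaper per success; when it is right only 5% of the time, the cascade costs more than calling the frontier model directly, because almost every request pays for all three calls.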

06 Caching Strategies

Caching is the cheapest dollar you'll ever save. A serious system runs at least four distinct caches.

Figure 4 — Four cache tiers, each with different latency, hit rates, and freshness risk.

Designing prompt caches that actually hit

Provider prompt caches (Anthropic, OpenAI, Google) only hit when the prefix is byte-identical. The prompt structure I enforce on teams I work with:

[ STABLE — cached ]
  system instructions
  tool / function definitions
  large reference documents
  few-shot examples

[ VOLATILE — not cached ]
  user query
  per-request retrieved snippets (or place AFTER stable block)
  timestamps, user_id, session id

Engineering rule: sort prompt sections from most stable to most volatile. Never inject a timestamp at the top. Never reorder tool definitions between calls.
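One way to enforce the stable-first rule is to assemble prompts through a single helper and fingerprint the stable prefix. This is a sketch with hypothetical function names; canonicalizing tool definitions by sorting them is my own choice here, not a provider requirement, and freezing the list order works just as well.

```python
import hashlib
import json

def build_prompt(system: str, tools: list[dict], reference_docs: list[str],
                 few_shot: list[str], user_query: str,
                 snippets: list[str]) -> tuple[str, str]:
    """Return (stable_prefix, full_prompt) with the volatile parts last."""
    stable_parts = [
        system,
        # Canonical order and key order so the serialized bytes never vary.
        json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True),
        *reference_docs,
        *few_shot,
    ]
    stable = "\n\n".join(stable_parts)
    volatile = "\n\n".join([*snippets, user_query])
    return stable, stable + "\n\n" + volatile

def prefix_fingerprint(stable: str) -> str:
    """Short hash of the stable prefix; log it with every call."""
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()[:16]

tools_a = [{"name": "search_kb"}, {"name": "create_ticket"}]
tools_b = list(reversed(tools_a))   # same tools, different order

stable_a, full_a = build_prompt("You are a support agent.", tools_a,
                                ["Refund policy v3"], ["Q: hi\nA: hello"],
                                "Where is my order?", ["order 123 shipped"])
stable_b, full_b = build_prompt("You are a support agent.", tools_b,
                                ["Refund policy v3"], ["Q: hi\nA: hello"],
                                "Cancel my plan", [])
print(prefix_fingerprint(stable_a) == prefix_fingerprint(stable_b))
```

If the number of distinct fingerprints per feature grows over time, something volatile has leaked into the prefix, and the prompt cache hit rate will show it shortly after.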

Semantic cache: when it's a trap

Don't use semantic cache for personalized, account-specific, or time-sensitive answers without per-tenant namespaces.
Don't set the similarity threshold too low: false positives serve wrong answers and erode trust.
Do log every cache hit's similarity score and audit a sampled slice weekly.
Do invalidate by content-hash when source documents change.
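The four rules above fit in a small in-memory sketch. The class name, the 0.92 threshold, and the toy two-dimensional vectors are hypothetical; embeddings are assumed to be computed by the caller, and a real deployment would use a vector index instead of a linear scan.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold            # hypothetical; tune on audited samples
        self._entries = defaultdict(list)     # tenant_id -> [(vector, answer, source_hash)]
        self.hit_log = []                     # (tenant_id, similarity) for weekly audits

    def put(self, tenant_id, vector, answer, source_hash):
        self._entries[tenant_id].append((vector, answer, source_hash))

    def get(self, tenant_id, vector):
        """Return the best answer in this tenant's namespace, or None."""
        best_score, best_answer = 0.0, None
        for vec, answer, _ in self._entries[tenant_id]:
            score = cosine(vector, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            self.hit_log.append((tenant_id, best_score))
            return best_answer
        return None

    def invalidate(self, tenant_id, source_hash):
        """Drop entries derived from a source document whose content hash changed."""
        self._entries[tenant_id] = [e for e in self._entries[tenant_id]
                                    if e[2] != source_hash]

cache = SemanticCache()
cache.put("acme-corp", [1.0, 0.0], "Refunds take 5 days.", source_hash="doc-v1")
hit = cache.get("acme-corp", [0.99, 0.05])      # near-duplicate, same tenant
other_tenant = cache.get("globex", [1.0, 0.0])  # same vector, different tenant
unrelated = cache.get("acme-corp", [0.0, 1.0])  # same tenant, unrelated query
```

Keying the store by tenant first means a cross-tenant hit is structurally impossible rather than merely unlikely.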

07 Context & Prompt Compression

Context length is the second-biggest token lever after model choice. The discipline: send the minimum context needed for a correct answer, no more.

The compression toolbox

Technique	How it works	Typical reduction	Risk
Chunk + retrieve top-k	RAG instead of stuffing	70–95%	Recall loss if k too small
Reranking with cap	Keep top-k after rerank, hard-cap chars	30–60%	Reranker latency & cost
Summarize-then-prompt	Small model summarizes long docs once, cache summary	60–90%	Summary may drop critical detail
Sliding/decay memory	Drop or summarize older turns	40–80% on chat	Loss of long-range coherence
Structured extraction first	Convert docs to JSON, pass JSON	50–80%	Schema drift
Prompt minification	Remove filler, redundant instructions	10–25%	Quality regressions if over-pruned
Output schema constraints	Force concise structured output	30–70% on output	Need robust parsers
Stop sequences & max_tokens	Bound runaway generations	variable	Truncation
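As one concrete instance of the "reranking with cap" row, the sketch below keeps the top-k chunks by reranker score and then enforces a hard character budget. The function name and default limits are hypothetical, and the scores are assumed to come from whatever reranker you already run.

```python
def cap_context(scored_chunks, top_k=6, max_chars=4000):
    """Keep the best-scoring chunks that fit inside a hard character budget.

    scored_chunks: iterable of (score, text) pairs from a reranker.
    """
    kept, used = [], 0
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)[:top_k]
    for _score, text in ranked:
        if used + len(text) > max_chars:
            continue          # skip a chunk that would break the cap, try smaller ones
        kept.append(text)
        used += len(text)
    return kept

chunks = [
    (0.90, "a" * 3000),
    (0.80, "b" * 2000),   # would push the total to 5000 chars, so it is skipped
    (0.70, "c" * 500),
    (0.10, "d" * 10),     # falls outside top_k=3
]
context = cap_context(chunks, top_k=3, max_chars=4000)
print([len(c) for c in context])
```

Characters are used as a cheap stand-in for tokens here; swapping in a tokenizer count changes the unit, not the logic.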

The asymmetry: output tokens cost more

On most APIs, output tokens are 3–5× the price of input. Yet engineers focus on shrinking input. Equally important:

Constrain output with a strict JSON/XML schema; reject and retry only with stronger constraints.
Use max_tokens as a hard guardrail, not a hope.
Discourage chain-of-thought in the visible answer unless step-by-step reasoning is genuinely needed; if it is, give it a separate thinking budget and keep the scratchpad out of the final shown text.
For listy answers, ask explicitly for "3 bullets, ≤ 12 words each."

Heuristic: if your average output is > 800 tokens, you almost certainly have either over-broad prompts or a UX that rewards verbosity. Both are fixable.

08 RAG Cost Optimization

RAG looks cheap because each call is small, but at scale embedding, storage, retrieval, and rerank fees stack. Worse, naive RAG inflates prompt size and the model bill it was meant to reduce.

Figure 5 — Six RAG stages; the highest-leverage levers are usually steps 4 (rerank cap) and 5 (model tier).

RAG cost anti-patterns

Embedding the whole corpus every nightly run. Use content hashes; re-embed only changed chunks.
Top-k = 20 by default. Measure recall; most workloads peak at 4–8.
1536-dim embeddings for short SKU descriptions. Use a smaller model and dimension.
Per-request reranking with a frontier model. Use a dedicated cross-encoder.
No tenant scoping. Larger index = more recall noise + more tokens fed in.
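For the first anti-pattern, hash-based incremental embedding can be sketched in a few lines. The function name and the in-memory hash store are hypothetical; in practice the stored hashes would live next to the vectors in the index metadata.

```python
import hashlib

def chunks_to_embed(chunks: dict[str, str], stored_hashes: dict[str, str]):
    """Compare content hashes and return only the work that is actually needed.

    Returns (ids_to_embed, ids_to_delete, new_hashes).
    """
    to_embed, new_hashes = [], {}
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[chunk_id] = digest
        if stored_hashes.get(chunk_id) != digest:   # new or changed chunk
            to_embed.append(chunk_id)
    deleted = [cid for cid in stored_hashes if cid not in chunks]
    return to_embed, deleted, new_hashes

# Night 1: nothing stored yet, so everything is embedded.
corpus = {"sku-1": "red mug", "sku-2": "blue mug"}
first_run, _, hashes = chunks_to_embed(corpus, {})

# Night 2: one description changed; only that chunk is re-embedded.
corpus_v2 = {"sku-1": "red mug", "sku-2": "blue mug, 12 oz"}
second_run, deleted, hashes_v2 = chunks_to_embed(corpus_v2, hashes)
print(first_run, second_run, deleted)
```

The embedding bill then tracks the rate of change of the corpus rather than its size.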

09 Agentic Cost Control

Agents are the most dangerous cost category because their cost is unbounded by default. A single user query can trigger dozens of model calls, tool calls, and retries. Naïve loops have killed budgets overnight.

The five non-negotiable bounds

1. Step budget

Hard cap on tool/LLM iterations per session (e.g., 10). Exceeding it terminates the loop and returns a graceful error or a checkpoint.

2. Token budget

Cumulative token budget per session, enforced by the runtime — not by the model's good intentions.

3. Dollar budget

Hard per-tenant and per-feature daily spend cap; breaching it returns 429 + alert.

4. Loop detector

Detect identical or near-identical tool calls; break with diagnostic.

5. Cheap planner, expensive executor

Plan with a small model and execute steps with a small model where possible. Escalate to a more expensive model only when needed, typically for the final synthesis step.

+ Memory compaction

Periodically summarize the running scratchpad/context so each step's prompt doesn't grow linearly.
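A runtime-enforced version of the step, token, and dollar bounds plus a loop detector might look like the sketch below. The class and method names and the default limits are hypothetical. For brevity the dollar cap here is per session, whereas the bound above is per tenant and per feature per day, and the detector only catches identical tool calls, not near-identical ones.

```python
import json

class BudgetExceeded(Exception):
    """Raised by the runtime when a session breaches one of its hard bounds."""

class SessionBudget:
    def __init__(self, max_steps=10, max_tokens=50_000, max_usd=0.50, max_repeats=2):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.max_usd, self.max_repeats = max_usd, max_repeats
        self.steps, self.tokens, self.usd = 0, 0, 0.0
        self._seen = {}   # (tool name, canonical args) -> call count

    def before_step(self):
        """Call before every LLM or tool iteration; refuses once a bound is spent."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step budget exceeded")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if self.usd >= self.max_usd:
            raise BudgetExceeded("dollar budget exceeded")
        self.steps += 1

    def check_tool_call(self, name, args):
        """Loop detector: break when the same call repeats too often."""
        key = (name, json.dumps(args, sort_keys=True))
        self._seen[key] = self._seen.get(key, 0) + 1
        if self._seen[key] > self.max_repeats:
            raise BudgetExceeded(f"loop detected: {name} repeated with identical arguments")

    def record_usage(self, tokens, usd):
        self.tokens += tokens
        self.usd += usd

def run_session(budget, planned_calls):
    """Drive a fake agent; planned_calls is a list of (tool, args, tokens, usd)."""
    try:
        for tool, args, tokens, usd in planned_calls:
            budget.before_step()
            budget.check_tool_call(tool, args)
            budget.record_usage(tokens, usd)
    except BudgetExceeded as exc:
        return f"stopped: {exc}"
    return "completed"

# A runaway session that keeps retrying the same tool call (made-up numbers).
runaway = [("search_kb", {"q": "refund"}, 1200, 0.004)] * 187
budget = SessionBudget(max_steps=8)
result = run_session(budget, runaway)
print(result, budget.steps, budget.tokens)
```

The bounds live in the runtime and raise exceptions, so the model cannot talk its way past them; the caller decides whether to return a graceful error or a checkpoint.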

Cost shape of agentic systems

cost(session) ≈ n_steps × (avg_input_tokens × P_in + avg_output_tokens × P_out)
              + n_tool_calls × avg_tool_cost
              + n_subagent_invocations × avg_subagent_cost

Without compaction, each step re-sends the accumulated context, so avg_input_tokens grows roughly linearly with the step count and total session cost grows quadratically. Always compact.
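The quadratic shape is easy to see by simulating input tokens per step. This sketch assumes each step appends a fixed number of tokens to the context and that compaction replaces the accumulated history with a fixed-size summary; the function name and all the token counts are hypothetical.

```python
def session_input_tokens(steps, base_tokens, tokens_added_per_step,
                         compact_every=None, summary_tokens=0):
    """Total input tokens billed across a session.

    Each step re-sends the whole current context, then the context grows.
    With compact_every=N, the history is replaced by a summary every N steps.
    """
    total, context = 0, base_tokens
    for step in range(1, steps + 1):
        total += context                      # this step's prompt
        context += tokens_added_per_step      # tool output and reasoning appended
        if compact_every and step % compact_every == 0:
            context = base_tokens + summary_tokens
    return total

no_compaction_40 = session_input_tokens(40, 1000, 500)
no_compaction_80 = session_input_tokens(80, 1000, 500)
compacted_40 = session_input_tokens(40, 1000, 500, compact_every=5, summary_tokens=300)

print(no_compaction_40, no_compaction_80, compacted_40)
```

Doubling the step count from 40 to 80 roughly quadruples the uncompacted total, while compacting every five steps keeps the 40-step session at about a fifth of the uncompacted cost under these assumptions.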

A pattern I've seen: a customer-support agent without a step cap looped on a malformed tool response 187 times in one session. Cost: $42 for one ticket. Fix: hard cap = 8 and a loop detector. New cost: $0.04.

10 Batching & Async Inference

If your workload is not user-facing in real time, batching is a 40–60% discount you should not leave on the table.

Three batching modes

Mode	Latency	Typical discount	Use when
Provider Batch API	up to 24h	~50%	Nightly enrichment, evals, content gen pipelines
Micro-batching (server-side)	+5–50 ms	throughput, not price	Self-hosted vLLM / TGI fleets
Request coalescing	adds latency that depends on the coalescing window	20–40% via prompt cache reuse	Many users hitting same prefix

Async, scheduled, and shadow workloads

Push evals, content moderation backfills, and re-embedding jobs to the Batch API.
Shadow-test cheaper models in parallel; never block the user path on them.
Use off-peak windows for self-hosted fleets to amortize fixed GPU costs.

11 Fine-tune vs. Prompt vs. Distill

The default impulse to "just fine-tune" is usually a cost mistake. Here is the decision order I enforce before anyone trains anything.

Figure 6 — Try prompts and RAG first; tune only after prompt iteration plateaus; distill only for high-volume, stable workloads.

Distillation — the underused weapon

For high-volume, well-defined tasks (intent classification, extraction, routing decisions, ticket triage), distill frontier-model behavior into a fine-tuned small model. Typical economics:

Capex: a few hundred to a few thousand dollars of frontier calls to generate the training set + GPU-hours for tuning.
Opex: 5–20× cheaper per call than the teacher, often with comparable quality on the narrow task.
Break-even: usually within weeks at > 1M calls/month.
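The break-even claim is simple arithmetic, sketched here with made-up numbers: $3,000 of capex, 1M calls per month, and a student that is 10× cheaper per call than the teacher. The function name and every figure are hypothetical.

```python
def break_even_months(capex_usd, calls_per_month,
                      teacher_cost_per_call, student_cost_per_call):
    """Months until the one-off distillation cost is repaid by per-call savings."""
    monthly_saving = calls_per_month * (teacher_cost_per_call - student_cost_per_call)
    if monthly_saving <= 0:
        return float("inf")   # the student is not cheaper, so it never pays back
    return capex_usd / monthly_saving

months = break_even_months(capex_usd=3_000, calls_per_month=1_000_000,
                           teacher_cost_per_call=0.010,
                           student_cost_per_call=0.001)
print(f"break-even after {months:.2f} months")
```

At these rates the payback is about ten days; at a tenth of the volume it stretches to a bit over three months, which is why distillation is reserved for high-volume, stable tasks.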

Discipline: never tune until prompt iteration genuinely plateaus on a held-out eval set, and always re-baseline tuned models when a new generation of base models ships — yesterday's tuned 8B may be worse than today's base small model.

12 Self-Hosted Serving Economics

Self-hosting beats SaaS APIs only when utilization is high and predictable. Otherwise you pay for idle GPUs.

The break-even formula

break_even_qps = (gpu_hourly_cost × n_gpus) / (api_price_per_call × 3600 × utilization)

If your steady-state QPS is materially below break-even, stay on the API. If it's above and stable, consider self-hosting — and only then if you have the on-call expertise to operate it.
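The formula above, as a function. I read `utilization` as the fraction of provisioned GPU hours that do useful work (an assumption, since the formula does not define it), so lower utilization raises the QPS you need before self-hosting pays. The GPU price and API price in the example are hypothetical.

```python
def break_even_qps(gpu_hourly_cost, n_gpus, api_price_per_call, utilization):
    """QPS at which the fleet's hourly cost equals the API bill for the same traffic."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    fleet_cost_per_hour = gpu_hourly_cost * n_gpus
    api_cost_per_qps_hour = api_price_per_call * 3600 * utilization
    return fleet_cost_per_hour / api_cost_per_qps_hour

# Hypothetical fleet: 8 GPUs at $4/hour, versus an API at $0.002 per call.
half_used = break_even_qps(4.0, 8, 0.002, utilization=0.5)
fully_used = break_even_qps(4.0, 8, 0.002, utilization=1.0)
print(f"break-even at 50% utilization:  {half_used:.1f} QPS")
print(f"break-even at 100% utilization: {fully_used:.1f} QPS")
```

This ignores the people cost of running the fleet, so treat the result as a lower bound on the traffic needed.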

What actually moves self-hosted cost

Lever	Effect	Notes
vLLM / TGI / TensorRT-LLM	2–10× throughput vs. naive HF	Continuous batching, paged attention
FP8 / INT8 / INT4 quantization	1.5–4× throughput, smaller memory	Measure quality regression on your evals
Speculative decoding	1.5–3× latency improvement	Needs draft model
KV cache reuse / prefix sharing	Large gain on shared system prompts	Mirrors managed prompt cache
Autoscale to zero (LoRA hot-swap)	Eliminate idle cost for niche models	Cold start trade-off
Reserved / committed-use discounts	30–60% off on-demand	Only commit to verified baseline
Spot / preemptible GPUs	50–80% off	For batch / non-critical only
Co-locate model + data	Eliminates egress	Matters at scale

Hidden cost: the SRE/MLOps team you need to run a serving fleet 24/7 is often more expensive than the GPUs themselves. Bake people cost into the build-vs-buy model.

13 Observability & FinOps

You cannot control what you cannot attribute. Every LLM call must be tagged at emission with enough metadata to roll up by tenant, feature, environment, and team.

Minimum required telemetry per call

{
  "request_id":      "uuid",
  "timestamp":       "2026-05-18T14:22:01Z",
  "tenant_id":       "acme-corp",
  "user_id_hash":    "...",
  "feature":         "support.summary",
  "environment":     "prod",
  "team":            "support-platform",
  "model":           "claude-sonnet-4-6",
  "provider":        "anthropic",
  "input_tokens":     3421,
  "output_tokens":    412,
  "cached_tokens":   2980,
  "cache_layer":     "provider_prompt",
  "tool_calls":      [{"name":"search_kb","cost":0.0001}],
  "latency_ms":       1840,
  "cost_usd":         0.0061,
  "budget_bucket":   "support-platform/prod/monthly",
  "outcome":         "success",
  "quality_signal":  {"thumbs":"up"}
}
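A thin wrapper can refuse to emit a row unless the attribution tags are present and can compute `cost_usd` at emission time. This is a sketch: the function name and the per-million-token prices are hypothetical, only a subset of the fields above is shown, and I assume `cached_tokens` is a subset of `input_tokens`, as the sample record suggests.

```python
import uuid
from datetime import datetime, timezone

REQUIRED_TAGS = ("tenant_id", "feature", "environment", "team", "model")

def make_ledger_row(tags: dict, usage: dict, price_per_mtok: dict) -> dict:
    """Build one ledger row; fail loudly if attribution tags are missing."""
    missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    # Assumption: cached tokens are counted inside input_tokens.
    uncached = usage["input_tokens"] - usage["cached_tokens"]
    cost = (uncached * price_per_mtok["input"]
            + usage["cached_tokens"] * price_per_mtok["cached"]
            + usage["output_tokens"] * price_per_mtok["output"]) / 1_000_000
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        **tags,
        **usage,
        "cost_usd": round(cost, 6),
    }

tags = {"tenant_id": "acme-corp", "feature": "support.summary",
        "environment": "prod", "team": "support-platform",
        "model": "example-mid-tier"}          # placeholder model name
usage = {"input_tokens": 3421, "output_tokens": 412, "cached_tokens": 2980}
prices = {"input": 3.0, "cached": 0.3, "output": 15.0}   # hypothetical USD per 1M tokens

row = make_ledger_row(tags, usage, prices)
print(row["cost_usd"])
```

Raising on a missing tag at emission is deliberate: an untagged call is spend nobody can be asked about later.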

Four dashboards I require before launch

1. Spend trajectory

Today, week-to-date, month-to-date vs. budget; forecast to month-end; top 10 features by % growth.

2. Unit economics

$ per action by feature; trend; outliers; cost-per-successful-action vs. cost-per-call.

3. Efficiency

Cache hit rates by layer; model mix; tokens in/out; waste rate; routing accuracy.

4. Anomalies

Spikes by tenant/feature; runaway agents; suspect prompts; budget-bucket breaches.

Tooling landscape

LLM-native: Langfuse, Helicone, Arize Phoenix, Braintrust, OpenLLMetry, Datadog LLM Observability.
Cloud FinOps: AWS Cost Explorer / CUR, GCP Billing BigQuery export, Azure Cost Management, plus Vantage, CloudHealth, Cloudability.
Custom ledger: a simple append-only table in your warehouse (BigQuery / Snowflake) is often the most trustworthy single source of truth — emit from a thin SDK wrapper around every provider call.

14 Quotas, Budgets & Guardrails

Soft guidance is not control. Cost guardrails must be enforced in the request path, not in a wiki page.

The three-tier guardrail stack

Tier	Trigger	Action	Owner
1. Soft alert	80% of feature budget	Slack + dashboard banner	Feature team
2. Hard throttle	100% of feature budget	Rate-limit or downgrade model tier	AI Platform
3. Circuit break	2× burst, or per-tenant DoS	Block + 429 + page on-call	SRE / FinOps
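The three tiers reduce to a small decision function that the gateway can call on every request. The function name and return labels are hypothetical, and I interpret "2× burst" as the current spend rate reaching twice a baseline rate, which is an assumption about how burst is measured.

```python
def guardrail_action(spend_usd, budget_usd, current_rate, baseline_rate):
    """Map budget state to a guardrail tier, most severe first.

    current_rate / baseline_rate: spend per unit time now versus normal.
    """
    if baseline_rate > 0 and current_rate >= 2 * baseline_rate:
        return "circuit_break"      # block, return 429, page on-call
    if spend_usd >= budget_usd:
        return "hard_throttle"      # rate-limit or downgrade the model tier
    if spend_usd >= 0.8 * budget_usd:
        return "soft_alert"         # Slack message and dashboard banner
    return "allow"

print(guardrail_action(700, 1000, current_rate=1.0, baseline_rate=1.0))
print(guardrail_action(850, 1000, current_rate=1.0, baseline_rate=1.0))
print(guardrail_action(1000, 1000, current_rate=1.0, baseline_rate=1.0))
print(guardrail_action(100, 1000, current_rate=5.0, baseline_rate=2.0))
```

The burst check runs first so that a runaway tenant is cut off even when the monthly budget still looks healthy.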

Concrete guardrails worth shipping

Per-tenant daily $ cap (with override workflow).
Per-feature monthly $ cap tied to product manager sign-off.
Max tokens per request at the SDK wrapper level — refuse if exceeded.
Max steps per agent session with graceful fallback.
Production-only model allowlist — prevent staging models leaking to prod.
Cost-anomaly autoresponder — auto-page on > N% deviation from forecast.
Eval-traffic isolation — separate API key + budget; never pollute prod cost reports.

Design principle: fail closed on cost, fail open on quality. Better to throttle than to surprise the CFO.

15 Vendor & Contract Strategy

List prices are starting points, not endings. Above ~$50K/month of spend, real money is in commercial structure.

Negotiation levers (in order of impact)

Volume commitments with quarterly true-ups — 10–30% off list at meaningful scale.
Multi-year deals with model-class portability — protect against being locked to a deprecated SKU.
Reserved throughput / provisioned capacity — predictable price, predictable latency.
Cross-product bundles when the vendor sells more than inference (eval, embeddings, fine-tuning).
Free / discounted dev & eval traffic — surprisingly often granted.
Right-to-export embeddings / tuned weights — exit clause that protects future negotiation.

Multi-provider posture

Even if one vendor is your strategic primary, maintain a tested fallback for two reasons:

Resilience: outages happen; cost spikes during incidents are real.
Leverage: a vendor that knows you can route away will price differently.

Hide vendor specifics behind an internal LLM gateway with a unified schema; this also gives you the natural place to apply routing, caching, and budgets.

16 Org Model & RACI

Cost control fails when no one owns it. The default state is "AI is everyone's tool and no one's bill."

Activity	Responsible (R)	Accountable (A)	Consulted (C)	Informed (I)
Model selection per feature	Feature lead	AI Platform	FinOps	CTO
Routing / caching infra	AI Platform	Platform lead	SRE, Security	Eng leadership
Budget setting	Finance + PM	VP Eng	Feature leads	CFO
Quota enforcement	AI Platform	SRE	Feature teams	FinOps
Vendor contracts	Procurement	CTO	AI Platform, Legal	CFO
Cost anomaly response	On-call SRE	FinOps	Feature owner	VP Eng
Quarterly cost review	FinOps	VP Eng	Feature leads	Exec staff

The "AI Platform" team charter

If you do not have one, propose it. A small (3–8 person) AI Platform team owns:

The LLM gateway / SDK wrapper (single point of vendor swapping, routing, caching, budgeting).
The cost ledger and dashboards.
Shared retrieval infrastructure.
Eval and red-team harnesses.
Production guardrails (PII, abuse, cost).

This team's ROI is measured as $ saved per $ spent on the team, and at scale it is consistently > 10×.

17 AI Cost Maturity Model

Figure 7 — Five stages; advancing one stage typically halves unit cost.

18 Anti-Patterns & How to Diagnose Them

Anti-pattern	Smell	Fix
"Frontier by default"	> 60% spend on top-tier model	Introduce router + cascade; baseline cheaper tier
Volatile prompt prefix	Prompt cache hit < 10%	Move volatile content to end; freeze tool schemas
RAG top-k = 20	Avg input tokens > 6K with no need	Measure recall; rerank + cap
Verbose outputs	Output tokens > input tokens	Schema-constrained outputs, max_tokens
Unbounded agents	Long-tail of $$$ sessions	Step + token + $ caps; loop detector
Re-embed everything nightly	Embedding bill grows with no corpus growth	Hash-based incremental embedding
Eval traffic in prod budget	Spend up on weekends/CI	Separate API key; separate ledger bucket
No per-tenant attribution	Can't answer "who is the top spender?"	Mandatory `tenant_id` tag on every call
Tune-before-prompt-plateau	$10K tuning runs that win < 2%	Pause tuning; run prompt sprint first
Self-host at low utilization	GPU fleet at 15% load	Move back to managed API; revisit at scale

19 What I keep on my desk

Not commandments — a checklist I re-read before sign-off. If every item is yes, the system is cost-disciplined enough to scale.

Visibility

Every LLM call is tagged with tenant, feature, environment, team, model.
There is a single ledger (warehouse table) that is the source of truth for AI spend.
Cost-per-successful-action is published per feature, refreshed daily.
Forecast-to-month-end exists, with alerting at deviation thresholds.

Architecture

All provider calls go through an internal LLM gateway.
The gateway implements: routing, prompt caching, semantic caching, budgets, retries.
Every prompt is structured stable-first, volatile-last.
RAG uses hybrid retrieval with reranking and a hard top-k cap.
Agents have hard caps on steps, tokens, dollars, and identical-tool-call loops.

Governance

Each feature has a named budget owner and a monthly cap.
Soft/hard/circuit-break guardrails are wired in the request path.
Eval traffic uses separate API keys and its own budget bucket.
A quarterly cost review covers spend, unit economics, and vendor posture.

Continuous improvement

Routing thresholds re-evaluated monthly on a held-out set.
New model releases are benchmarked for $/quality before promotion.
High-volume stable tasks are reviewed annually for distillation.
An "AI cost office hours" or equivalent exists; teams know who to ask.

20 Glossary

Terms used in this note

Term	Meaning
CPA (Cost per Action)	Fully loaded LLM + infra cost divided by successful user outcomes.
Cascade	Routing pattern: try cheapest model, escalate on low confidence.
Prompt cache	Provider-side caching of stable prompt prefixes, billed at a fraction of normal input.
Semantic cache	Application-side cache keyed by vector similarity to prior prompts.
LoRA / PEFT	Parameter-efficient fine-tuning; small adapters on a frozen base model.
Distillation	Training a smaller model to mimic a larger model's behavior on a target task.
vLLM / TGI	High-throughput inference servers with paged attention and continuous batching.
FinOps	Operational discipline of bringing financial accountability to variable cloud/AI spend.
Budget bucket	A pre-allocated dollar pool tied to a team/feature, enforced at the gateway.
Waste rate	Tokens spent on calls that produced no shipped value (failed parses, dead agent loops, etc.).

21 References & Sources

Annotated bibliography behind the cost-driver map, unit-economics formulas, cost-aware architecture, routing cascades, four cache layers, RAG pipeline levers, agentic bounds, batching economics, fine-tune/distill decision tree, self-hosted break-even, FinOps telemetry schema, guardrail stack, vendor posture, RACI, maturity model, anti-patterns, and desk checklist. Section tags (e.g. §05) show where each source is used. Diagrams and worked examples are my synthesis unless noted.

Scope. Synthesis of FinOps, LLM-inference, RAG, and platform-engineering sources (May 2026). Hero KPI ranges (10–100× model cost delta, 50–90% savings from caching/routing, 3–5× agentic inflation) blend FrugalGPT, provider pricing docs, and field benchmarks — directional ranges, not guarantees for your workload. Dollar figures in §03 worked example use illustrative blended rates; verify against current list prices before budgeting. Not vendor, financial, or procurement advice.

Citations are numbered continuously [1]–[n] within this section.

FinOps, cloud cost accountability & AI spend governance (§01, §13–§14, §16–§17)

FinOps Foundation, FinOps Framework & FinOps Principles. 2024–25. Capabilities model for bringing financial accountability to variable cloud spend — backbone for §13 FinOps cadence, §14 budget guardrails, and §17 maturity stages 3–4. finops.org/framework — §13, §14, §17.
FinOps Foundation, FinOps for AI working group & community guidance. 2024–25. Applying FinOps disciplines to GPU and inference billing — §01 hero framing and §13 ledger/dashboard requirements. finops.org — §01, §13.
AWS, Cost and Usage Report (CUR) & Cost Explorer documentation. Granular attribution of cloud/GPU spend — §13 tooling landscape and §02 driver 4 (serving infra). docs.aws.amazon.com/cur — §02, §13.
Google Cloud, BigQuery billing export & Cloud Billing APIs. Warehouse-native cost rollups — §13 custom-ledger pattern and §16 FinOps RACI. cloud.google.com/billing — §13, §16.

LLM token economics, pricing & inference cost trends (§01, §03, hero KPIs)

OpenAI, API Pricing & usage documentation. 2025–26. Input/output token rates, Batch API discount, prompt caching — §03 CPA formula, §03 worked example, §06 L3 cache, §10 batching. openai.com/api/pricing — §03, §06, §10.
Anthropic, API Pricing & Prompt Caching documentation. 2025–26. Cached-input pricing (≈10–25% of base input), batch discounts — §06 prompt-cache layer and §03 output/input asymmetry. docs.anthropic.com/prompt-caching — §03, §06, §07.
Google Cloud, Vertex AI Generative AI pricing & Context caching. 2025–26. Gemini context-cache TTL and token billing — §06 four-layer cache and §04 architecture prompt-cache node. cloud.google.com/vertex-ai/pricing — §04, §06.
Stanford HAI, 2025 AI Index Report — Technical Performance & Economy chapters. 2025. Inference-cost declines, model tiers, and adoption economics — background for §01 hero 10–100× tier delta and §05 monthly re-eval guidance. hai.stanford.edu/ai-index — §01, §05, §11.
McKinsey & Company, The Economic Potential of Generative AI: The Next Productivity Frontier. June 2023. Gen-AI cost as a scaling constraint for enterprises — §01 “fastest-growing line item” context and §15 vendor negotiation at scale. mckinsey.com — §01, §15.

Cost-aware routing, cascades & model selection (§05, FIG 3, §18)

Chen, L. et al., “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” arXiv:2305.05176, 2023. LLM cascade and prompt-adaptation strategies; reported large cost reductions with quality preservation — §01 hero 50–90% range, §05 cascade routing, §05 callout on cost-per-success. arxiv.org/abs/2305.05176 — §01, §05, hero.
Ong, I. et al., “RouteLLM: Learning to Route LLMs with Preference Data.” arXiv:2406.18665, 2024. Learned routers sending queries to strong vs. weak models — §05 classifier router and §04 L3 cost-aware router. arxiv.org/abs/2406.18665 — §04, §05.
Shnitzer, T. et al., “Large Language Model Routing with Benchmark Datasets.” arXiv:2309.15789, 2023. Router design and evaluation methodology — §05 routing-policy calibration steps. arxiv.org/abs/2309.15789 — §05.
Yan, E., “Patterns for Building LLM-based Systems & Products.” eugeneyan.com, 2023–25 (ongoing). Practical patterns for routing, eval, and cost-quality trade-offs — cited in §05 design steps and §08 RAG anti-patterns mindset. eugeneyan.com/writing/llm-patterns — §05, §08, §19.

Caching — prompt, semantic & retrieval layers (§06, FIG 4, §04)

Anthropic, Prompt caching — implementation guide & pricing. Byte-identical prefix requirements, TTL, cache breakpoints — §06 stable/volatile prompt structure and FIG 4 L3. docs.anthropic.com — §04, §06.
OpenAI, Prompt caching (automatic caching on supported models). 2024–25. Provider-side prefix reuse — §06 engineering rule on stable-first prompt ordering. platform.openai.com/docs/guides/prompt-caching — §06.
Zilliz / GPTCache project, “GPTCache: An open-source semantic cache for LLM applications.” 2023–25. Vector-similarity caching patterns and false-positive risks — §06 L2 semantic cache and semantic-cache trap bullets. github.com/zilliztech/GPTCache — §06, §18.
Redis, Vector search & caching best-practices documentation. Exact-match and TTL caches at the edge — §04 L1 edge cache and §06 L1 exact-match layer. redis.io/docs — §04, §06.

Context compression, long-context limits & RAG economics (§07–§08, FIG 5)

Lewis, P. et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Foundational RAG architecture — §08 pipeline stages and §07 chunk-and-retrieve compression lever. arxiv.org/abs/2005.11401 — §07, §08.
Liu, N. F. et al., “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024. Models under-use middle context — supports §07 hard top-k caps and §08 rerank-cap lever over stuffing. arxiv.org/abs/2307.03172 — §07, §08.
Es, S. et al., “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG eval metrics (faithfulness, recall) — §08 stage-6 eval and §13 judge-model cost control. arxiv.org/abs/2309.15217 — §08, §13.
TruLens documentation — LLM app evaluation & tracing. 2024–25. Feedback functions and RAG triad — §08 eval stage and §13 observability tooling list. trulens.org — §08, §13.
Pinecone, Weaviate, Qdrant — vector-database sizing & embedding-dimension guides. 2024–25. Storage and query cost trade-offs — §08 embed stage (dim 256–512 vs 1536) and §02 driver 2. pinecone.io/learn — §02, §08.

Agentic systems, loops & unbounded cost risk (§09, hero 3–5×, §18)

Yao, S. et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool-use loops that multiply LLM calls — §09 cost-shape formula and step-budget rationale. arxiv.org/abs/2210.03629 — §09.
LangChain / LangGraph documentation — agent runtimes, recursion limits & checkpoints. 2024–25. Step caps and state management patterns — §09 five non-negotiable bounds and loop detector. langchain-ai.github.io/langgraph — §09, §14.
Shinn, N. et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Iterative self-correction loops — background for §09 quadratic cost curve without compaction. arxiv.org/abs/2303.11366 — §09.
Helicone & Langfuse case studies on agent cost spikes. 2024–25. Runaway session patterns and guardrail responses — §09 “pattern I've seen” callout and §13 anomaly dashboard. helicone.ai/blog — §09, §13.

Batching, async inference & throughput (§10, §12)

OpenAI, Batch API documentation. ~50% discount, 24-hour completion window — §10 provider Batch API row and async workload bullets. platform.openai.com/docs/guides/batch — §10.
Anthropic, Message Batches API. 2024–25. Async batch processing for non-latency-sensitive workloads — §10 batching modes table. docs.anthropic.com/message-batches — §10.
Yu, G.-I. et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Continuous batching concepts — precursor to modern inference servers in §10 micro-batching row. usenix.org/osdi22 — §10, §12.

Fine-tuning, LoRA, distillation & the tune-last discipline (§11, FIG 6)

Hu, E. J. et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Parameter-efficient fine-tuning — §11 LoRA/PEFT card and §02 driver 3 amortization. arxiv.org/abs/2106.09685 — §02, §11.
Hinton, G., Vinyals, O., & Dean, J., “Distilling the Knowledge in a Neural Network.” NeurIPS 2014 Deep Learning Workshop. Knowledge distillation foundation — §11 distillation economics and FIG 6 decision tree. arxiv.org/abs/1503.02531 — §11.
OpenAI, Model distillation guide & fine-tuning API documentation. 2024–25. Teacher-student workflows and when distillation pays — §11 break-even bullets and §18 tune-before-prompt anti-pattern. platform.openai.com/docs/guides/distillation — §11, §18.
Dettmers, T. et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. Lower-cost fine-tune runs — §11 LoRA card capex range. arxiv.org/abs/2305.14314 — §11.

Self-hosted serving, GPU economics & inference stacks (§12, §02 driver 4)

Kwon, W. et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). PagedAttention and throughput gains — §12 vLLM row and §04 self-hosted OSS tier. arxiv.org/abs/2309.06180 — §04, §12.
vLLM project documentation. 2024–25. Continuous batching, prefix caching, quantization — §12 lever table and §10 micro-batching. docs.vllm.ai — §10, §12.
NVIDIA, TensorRT-LLM documentation. 2024–25. FP8/INT quantization and speculative decoding — §12 quantization and speculative-decoding rows. nvidia.github.io/TensorRT-LLM — §12.
Hugging Face, Text Generation Inference (TGI) documentation. Production serving stack — §12 vLLM/TGI comparison and glossary entry. huggingface.co/docs/tgi — §12, §20.
NVIDIA & cloud-provider GPU pricing pages (H100/H200, reserved vs spot). 2025–26. Break-even inputs for §12 formula and §18 self-host-at-low-utilization anti-pattern. nvidia.com/data-center — §12, §18.

Observability, LLM tracing & cost ledgers (§13, §04 ledger, §19)

Langfuse documentation — LLM tracing, cost tracking & evaluations. 2024–25. Per-call metadata schema aligned with §13 telemetry JSON — §13 tooling and §19 visibility checklist. langfuse.com/docs — §04, §13, §19.
Helicone — LLM observability & cost analytics. 2024–25. Gateway proxy pattern for attribution — §04 cost ledger and §13 LLM-native tooling list. docs.helicone.ai — §04, §13.
Braintrust — eval + production logging for LLM apps. 2024–25. Cost-per-success tracking — §03 CPA denominator and §13 unit-economics dashboard. braintrust.dev/docs — §03, §13.
Arize AI, Phoenix — open-source LLM observability. 2024–25. Tracing and eval for RAG — §08 eval stage and §13 tooling landscape. docs.arize.com/phoenix — §08, §13.
Datadog, LLM Observability product documentation. 2024–25. Enterprise tracing and cost correlation — §13 tooling landscape and §16 AI Platform charter. docs.datadoghq.com/llm_observability — §13, §16.
OpenTelemetry, generative-AI semantic conventions (draft). 2024–25. Standard spans for model calls — §13 minimum telemetry schema design. opentelemetry.io/gen-ai — §13.

Guardrails, quotas, platform teams & vendor posture (§14–§16)

Google SRE Book — chapters on monitoring, alerting & capacity planning. O'Reilly, 2016–18. Circuit-breaker and throttle patterns — §14 three-tier guardrail stack and circuit-break row. sre.google/sre-book — §14.
Skelton, M. & Pais, M., Team Topologies. IT Revolution, 2019. Platform team as internal product — §16 AI Platform charter, gateway ownership, and enablement vs feature-team autonomy. — §16.
Partnership on AI & NIST, AI Risk Management Framework — resource allocation & monitoring. 2023. Governance patterns for production AI — §14 prod model allowlist and §16 guardrails ownership. nist.gov/ai-rmf — §14, §16.
Flexera / Gartner cloud-cost negotiation research (enterprise licensing). 2023–25. Volume-commitment and true-up structures — §15 negotiation levers 1–2. flexera.com/blog/finops — §15.

Author synthesis

Truong, L., AI Cost Control — personal working notes. May 2026. Original diagrams (FIG 1–7), cost-aware architecture, maturity model, anti-pattern table, desk checklist, and §03 stacked-optimization example. LinhTruong.com — all sections.

📎 Before you quote externally

Provider list prices and cache-discount percentages change frequently — re-read OpenAI, Anthropic, and Google pricing pages before citing §03 dollar figures. The §03 13× stacked-optimization example is illustrative compounding, not a guaranteed outcome. FIG 1–7 driver shares and cache hit-rate bands are field heuristics; measure on your own traffic. Self-hosted break-even depends on utilization, labor cost, and reserved-GPU terms — run your own model before committing capex.
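A minimal sketch of what "run your own model" can look like for the self-hosted break-even. This is not the §12 formula; it is a simplified stand-in under stated assumptions. The field names, the 730-hour month, and the single blended API price per million tokens are all hypothetical inputs for illustration. Replace them with your own reserved-GPU terms, labor allocation, and measured throughput.

```python
from dataclasses import dataclass, replace

HOURS_PER_MONTH = 730.0  # assumption: average hours in a calendar month


@dataclass(frozen=True)
class SelfHostInputs:
    """Hypothetical inputs; every field is an assumption to be measured."""
    gpu_hourly_usd: float       # reserved or on-demand price per GPU-hour
    gpus: int                   # GPUs in the serving fleet
    labor_monthly_usd: float    # ops/engineering cost attributed to serving
    peak_tokens_per_sec: float  # fleet throughput at full load
    utilization: float          # fraction of peak actually served, in (0, 1]


def monthly_cost_usd(x: SelfHostInputs) -> float:
    """Fixed monthly cost: GPUs are paid for whether or not they are busy."""
    return x.gpu_hourly_usd * x.gpus * HOURS_PER_MONTH + x.labor_monthly_usd


def self_host_usd_per_mtok(x: SelfHostInputs) -> float:
    """Fully loaded self-hosted cost in USD per million tokens served."""
    if not 0.0 < x.utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens = x.peak_tokens_per_sec * x.utilization * 3600.0 * HOURS_PER_MONTH
    return monthly_cost_usd(x) / tokens * 1_000_000


def break_even_utilization(x: SelfHostInputs, api_usd_per_mtok: float) -> float:
    """Utilization at which self-hosting matches the API price.

    A result above 1.0 means the fleet cannot beat the API even at full load.
    """
    peak_mtok = x.peak_tokens_per_sec * 3600.0 * HOURS_PER_MONTH / 1_000_000
    return monthly_cost_usd(x) / (api_usd_per_mtok * peak_mtok)


if __name__ == "__main__":
    # Illustrative numbers only, not quotes from any provider.
    fleet = SelfHostInputs(
        gpu_hourly_usd=2.0,
        gpus=8,
        labor_monthly_usd=10_000.0,
        peak_tokens_per_sec=20_000.0,
        utilization=0.5,
    )
    print(f"self-host: ${self_host_usd_per_mtok(fleet):.3f} per Mtok")
    print(f"break-even utilization vs $1.00/Mtok API: "
          f"{break_even_utilization(fleet, 1.0):.1%}")
```

Because the cost is fixed and the tokens scale with utilization, cost per token rises sharply as utilization falls. That is the mechanism behind the self-host-at-low-utilization anti-pattern cited above.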

What I'd tell a team on day one: the edge is not picking the smartest model — it's designing systems whose unit economics improve as scale grows. Optimize for compounding: every routing improvement, every cache hit, every byte trimmed from a prompt pays you back every day the system runs.