Personal notes · May 2026

AI Cost Control

I wrote this note to keep LLM spend economically rational — from token-level mechanics through routing, caching, RAG, agents, and FinOps guardrails. It is my working blueprint for teams shipping AI at scale, not a vendor white paper.

The question behind this note: does every token, GPU-second, and vector lookup earn its keep? What follows is how I map the bill, design the request path, and keep unit economics from drifting as usage grows.
Topics: FinOps for AI · Token Economics · Caching · Routing · Batching · RAG Optimization · Inference Architecture · Observability · Governance & Guardrails
By: Linh Truong

01 What I'm tracking

Inference spend is now the fastest-growing line item in most engineering budgets I see — often eclipsing compute, storage, and observability combined. My job on these systems is to keep AI economically rational: every token, GPU-second, and vector lookup either generates measurable user value or gets cut.

10–100×: cost delta between frontier & small models
50–90%: token spend cuttable via caching & routing
3–5×: inflation from naive agentic loops
< $0.01: target unit cost per common user action

What I keep coming back to: cost control is an architectural property, not a procurement exercise. Design it into the request path: route the cheapest model that meets the SLA, cache aggressively at every layer, compress context ruthlessly, batch where latency allows, and instrument every call to a cost ledger.


02 Cost Drivers Anatomy

Before optimizing, map the bill. AI cost decomposes into seven primary drivers. Most teams over-index on (1) and ignore (3)–(7), which is exactly where the avoidable waste lives.

[Figure: Where AI Dollars Go: 7 Cost Drivers. Hub: Total AI Cost (TCO).]
  1. Token Inference: input + output tokens × model rate; usually 50–80% of spend.
  2. Embeddings & Vector DB: index build, re-embed, ANN queries; storage + per-query reads.
  3. Fine-tuning & Training: SFT, DPO, LoRA, eval runs; capex-like, amortize over usage.
  4. GPU / Serving Infra: H100/H200, vLLM, autoscaling idle; reserved vs spot vs on-demand.
  5. Egress & Networking: cross-region, multi-cloud transfer; hidden in big RAG/streaming.
  6. Observability & Eval: tracing, judge-LLMs, dataset storage; often 5–15% of total.
  7. Human-in-the-Loop: labelers, reviewers, red-teamers.
Figure 1 — Seven primary drivers. The hub is total cost of ownership; each spoke needs its own owner and SLO.

Driver deep-dive

| Driver | Typical share | Primary levers | Common waste |
| --- | --- | --- | --- |
| Token inference | 50–80% | Routing, caching, compression, smaller models | Frontier model for trivial calls |
| Embeddings & vector store | 3–10% | Smaller embed dims, quantization, TTL | Re-embedding unchanged corpora |
| Fine-tuning / training | 2–15% | LoRA, distillation, early stopping | Tuning before prompt iteration plateau |
| GPU serving infra | 10–40% (if self-hosted) | vLLM, paged-attention, autoscale-to-zero | Idle reserved fleets |
| Egress & networking | 1–8% | Co-locate model + data, edge cache | Cross-region RAG hops |
| Observability & eval | 5–15% | Sampled tracing, batch evals, cheap judge models | 100% trace + frontier judge LLM |
| Human-in-the-loop | variable | Active learning, confidence-gated review | Reviewing every output |

03 Unit Economics & Formulas

Every product decision should resolve to a cost-per-action (CPA) number. If you can't compute CPA on demand, you cannot govern AI cost.

Core formula

CPA = Σ over calls c in action a:
      (input_tokens(c)  × P_in(model_c)
     +  output_tokens(c) × P_out(model_c)
     +  cached_tokens(c) × P_cache(model_c))
    +  retrieval_cost(a)
    +  tool_cost(a)
    +  infra_amortization(a)
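
The formula above can be sketched in Python. This is a minimal sketch, assuming `input_tokens` counts only uncached input (so cached tokens are not billed twice). The `ModelPrice` and `Call` types, the `"mid"` tier name, and the per-token rates are hypothetical placeholders, not real list prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per token. The rates used below are made-up placeholders."""
    p_in: float
    p_out: float
    p_cache: float


@dataclass(frozen=True)
class Call:
    model: str
    input_tokens: int      # assumption: uncached input tokens only
    output_tokens: int
    cached_tokens: int = 0


def cost_per_action(calls, prices, retrieval_cost=0.0, tool_cost=0.0,
                    infra_amortization=0.0):
    """Sum token cost over every call in one action, then add per-action costs."""
    llm_cost = sum(
        c.input_tokens * prices[c.model].p_in
        + c.output_tokens * prices[c.model].p_out
        + c.cached_tokens * prices[c.model].p_cache
        for c in calls
    )
    return llm_cost + retrieval_cost + tool_cost + infra_amortization


# Hypothetical rates: $3 / $15 / $0.30 per million tokens.
PRICES = {"mid": ModelPrice(p_in=3e-6, p_out=15e-6, p_cache=0.3e-6)}
calls = [Call("mid", input_tokens=500, output_tokens=400, cached_tokens=3500)]
cpa = cost_per_action(calls, PRICES, retrieval_cost=0.0002)
print(f"CPA = ${cpa:.5f}")
```

Keeping the price table separate from the call records means the same ledger can be re-priced when rates change.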

Six ratios I won't ship without

1. Cost per successful action ($/action)

Numerator is total LLM + infra spend. Denominator is successful outcomes (not raw API calls), defined by product. Retries, abandoned agentic loops, and rejections all inflate cost without value.

2. Tokens per action (in/out)

Track input and output independently. Output tokens are 3–5× more expensive than input on most APIs. A rising output/input ratio often signals prompt drift or verbose chain-of-thought.

3. Cache hit rate (%)

Measured across prompt cache, semantic cache, and retrieval cache. Below 30% means you have not invested in caching. Above 80% means you may be serving stale answers, so pair it with a freshness SLO.

4. Model mix ($-weighted)

Share of spend by tier: frontier, mid, small, self-hosted. Healthy SaaS-style products run 10–25% frontier; the rest is mid or small.

5. Cost-to-Revenue (or Cost-to-Margin) ratio

For revenue-generating features: AI cost / incremental revenue. For internal tools: AI cost / (labor hours saved × loaded labor rate). Hard line: investigate any feature with C/R > 30%.

6. Waste rate (%)

Share of tokens spent on calls that produced no shipped output: failed JSON parses, guardrail rejections, agent dead-ends, eval/test traffic in prod accounts. Target < 5%.
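
Several of these ratios fall straight out of a per-call ledger. The sketch below is illustrative only: it assumes each record is one action attempt with `cost_usd`, `input_tokens`, `output_tokens`, and `outcome` fields, and that any outcome other than `"success"` counts as waste. The `unit_ratios` helper and the sample records are hypothetical.

```python
def unit_ratios(records):
    """Compute cost per call, cost per successful action, and waste rate."""
    total_cost = sum(r["cost_usd"] for r in records)
    total_tokens = sum(r["input_tokens"] + r["output_tokens"] for r in records)
    successes = [r for r in records if r["outcome"] == "success"]
    wasted_tokens = sum(
        r["input_tokens"] + r["output_tokens"]
        for r in records
        if r["outcome"] != "success"
    )
    return {
        "cost_per_call": total_cost / len(records) if records else 0.0,
        # Denominator is successful outcomes, not raw calls.
        "cost_per_successful_action": (
            total_cost / len(successes) if successes else float("inf")
        ),
        "waste_rate": wasted_tokens / total_tokens if total_tokens else 0.0,
    }


records = [
    {"cost_usd": 0.01, "input_tokens": 900, "output_tokens": 100, "outcome": "success"},
    {"cost_usd": 0.01, "input_tokens": 900, "output_tokens": 100, "outcome": "success"},
    {"cost_usd": 0.01, "input_tokens": 900, "output_tokens": 100, "outcome": "success"},
    {"cost_usd": 0.01, "input_tokens": 900, "output_tokens": 100, "outcome": "parse_error"},
]
ratios = unit_ratios(records)
print(ratios)
```

In this toy data one failed call in four pushes cost per successful action about a third above cost per call, which is why the two are tracked separately.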

A worked example

| Scenario | Model | Input tok | Output tok | $ per call | 1M calls/mo |
| --- | --- | --- | --- | --- | --- |
| Naïve (frontier, no cache) | Frontier | 4,000 | 800 | $0.0760 | $76,000 |
| + Prompt caching (system reused) | Frontier | 4,000 (3,500 cached) | 800 | $0.0265 | $26,500 |
| + Route 70% to mid-tier | Mix | 4,000 | 800 | $0.0142 | $14,200 |
| + Context compression (−60% in) | Mix | 1,600 | 800 | $0.0091 | $9,100 |
| + Semantic cache (35% hit) | Mix | | | $0.0059 | $5,900 |

Illustrative blended rates. The point is not the absolute dollars — it's that stacked optimizations compound to ~13× without changing the user-visible product.
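
To check the compounding, the per-call figures from the table can be replayed in a few lines. This sketch uses only the table's own illustrative numbers; the `steps` list and variable names are mine.

```python
# Per-call cost after each stacked optimization, copied from the table.
steps = [
    ("naive (frontier, no cache)", 0.0760),
    ("+ prompt caching", 0.0265),
    ("+ route 70% to mid-tier", 0.0142),
    ("+ context compression", 0.0091),
    ("+ semantic cache (35% hit)", 0.0059),
]

baseline = steps[0][1]
for name, per_call in steps:
    monthly = per_call * 1_000_000          # 1M calls per month
    print(f"{name:<28} ${per_call:.4f}/call  ${monthly:>9,.0f}/mo  "
          f"{baseline / per_call:4.1f}x cheaper than baseline")

# The last row follows from the one before it:
# with a 35% hit rate, only 65% of calls still pay the model.
semantic_cache_cost = round(0.0091 * (1 - 0.35), 4)
total_reduction = baseline / steps[-1][1]
```

The final ratio works out to roughly 12.9×, which is the "~13×" quoted above.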

04 Reference Cost-Aware Architecture

The architecture below pushes every request through layers that progressively get more expensive. The cheapest possible answer wins. Each layer has a clear cost owner.

[Figure: Cost-Aware Request Path: Cheapest Answer Wins]
  User / client request
  → L1: Edge / HTTP exact-match cache
  → L2: Semantic cache (vector similarity + TTL)
  → L3: Cost-aware router (classify difficulty, pick cheapest model, enforce budget)
  → Model tiers: Frontier (Opus-class; hard tasks, highest $) · Mid (Sonnet-class; default workhorse) · Small (Haiku-class; classification, extraction) · Self-hosted OSS (high-volume, predictable) · Rules / heuristics (no-LLM fast path)
  Supporting nodes: RAG retrieval (hybrid + rerank cap); provider prompt cache (5-min TTL · 1h ext)
  Cost Ledger & Observability Bus, per request: tenant · feature · model · tokens (in/out/cache) · latency · outcome · $ cost · budget bucket
Figure 2 — Layered request path: L1 exact cache → L2 semantic cache → L3 cost-aware router → tiered models, with retrieval and provider prompt cache feeding the same ledger.

Layer-by-layer cost intent

| Layer | Goal | Cost per hit | Owner |
| --- | --- | --- | --- |
| L1 Edge cache | Zero-LLM for hot, identical prompts | ~$0 | Platform |
| L2 Semantic cache | Reuse near-duplicate answers | ~$0.0001 (vector lookup) | AI Platform |
| L3 Router | Pick cheapest viable model | Tiny classifier or rules | AI Platform |
| Prompt cache | Cut input-token cost on stable prefix | 10–25% of normal input | Feature team |
| Retrieval | Bound context size; reduce hallucination | vector + rerank fee | Data / RAG team |
| Models | Generate final answer | per-token, tier-dependent | Feature team |
| Ledger | Attribute every dollar to a tenant/feature | logging only | FinOps |
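
The layering can be sketched as a single function that tries the cheapest layer first. This is a simplified illustration, not a production gateway: `semantic_lookup`, `route`, and `call_model` are hypothetical callables supplied by the caller, the exact cache is an in-memory dict, and retrieval and the provider prompt cache are left out.

```python
import hashlib
from typing import Callable, Optional


def answer(prompt: str,
           exact_cache: dict,
           semantic_lookup: Callable[[str], Optional[str]],
           route: Callable[[str], str],
           call_model: Callable[[str, str], str],
           ledger: list) -> str:
    """Serve from the cheapest layer that can answer; log every outcome."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    if key in exact_cache:                       # L1: exact-match cache
        ledger.append({"layer": "L1_exact", "model": None})
        return exact_cache[key]

    hit = semantic_lookup(prompt)                # L2: semantic cache
    if hit is not None:
        ledger.append({"layer": "L2_semantic", "model": None})
        return hit

    model = route(prompt)                        # L3: pick the cheapest viable tier
    output = call_model(model, prompt)
    exact_cache[key] = output                    # warm L1 for identical repeats
    ledger.append({"layer": "model", "model": model})
    return output


# Stubs standing in for real infrastructure.
ledger, cache = [], {}
no_semantic_hit = lambda prompt: None
always_small = lambda prompt: "small"
stub_model = lambda model, prompt: f"[{model}] answer"

first = answer("reset my password", cache, no_semantic_hit, always_small, stub_model, ledger)
second = answer("reset my password", cache, no_semantic_hit, always_small, stub_model, ledger)
print([entry["layer"] for entry in ledger])
```

Every path appends to the ledger, including cache hits, so hit rates by layer can be derived from the same log as spend.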

05 Model Routing & Cascades

Routing is the single highest-leverage optimization. A well-designed router can cut spend 40–70% with little or no measurable quality loss, because most queries are easy. The goal: send each request to the cheapest model whose expected quality clears the SLA.

Three routing strategies

A. Rule-based router

Deterministic dispatch on request features: length, language, intent label, user tier, schema mode.

Pros: Zero overhead, auditable.
Cons: Rigid; misses nuanced cases.

B. Classifier router

Small model (e.g., distilled BERT, Haiku-class) predicts "difficulty" and routes accordingly.

Pros: Adaptive, cheap (< $0.0002/call).
Cons: Needs labeled traffic; can drift.

C. Cascade / fallback

Try small model first; escalate to mid/frontier only if confidence/judge says "insufficient."

Pros: Self-correcting; great for long tail.
Cons: Adds latency; needs a cheap judge.

[Figure: Cascade Routing: Confidence-Gated Escalation]
  Request → Small model (Haiku-class · $): ~70% of traffic ends here
  → Cheap judge (verify · ~$0.0005): pass / escalate
  → if low conf → Mid model (Sonnet-class · $$): ~25% of traffic
  → if hard / fail → Frontier model (Opus-class · $$$): ~5%, the hard tail
  → Response + ledger (tokens, $, model, gate)
Figure 3 — A typical 3-tier cascade: small model handles the majority; a cheap judge gates escalation; frontier model reserved for the long tail.

Designing the routing policy

  1. Label the workload. Sample 1–5K real prompts, annotate difficulty / required capability / latency tolerance.
  2. Calibrate thresholds. For each candidate model, measure quality on samples; pick the cheapest that meets the SLA.
  3. Decide escalation signal. Options: log-prob/entropy, judge model verdict, JSON schema validation, regex/tool-call success, user thumbs-down.
  4. Bound the cascade. Hard cap on escalations per request (e.g., 1) to prevent cost runaway.
  5. Continuously re-evaluate. Cheap models improve quickly; re-run routing eval monthly.
Watch out: Cascades can increase cost if your small model is wrong often. Always compare cascade cost-per-success vs. single-model cost-per-success on the same eval set before shipping.
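
A bounded cascade (steps 3 and 4 above) can be sketched as follows. This is a minimal sketch under stated assumptions: `tiers` is a cheapest-first list of `(name, generate)` pairs, `judge` is any cheap pass/fail check, and all model functions here are stubs rather than real API calls.

```python
def cascade(prompt, tiers, judge, max_escalations=1):
    """Try tiers cheapest-first; stop when the judge passes or the cap is hit.

    Returns (answer, names_of_tiers_tried).
    """
    if not tiers:
        raise ValueError("at least one model tier is required")
    tried = []
    for i, (name, generate) in enumerate(tiers):
        result = generate(prompt)
        tried.append(name)
        # No further escalation allowed: accept the answer without judging it.
        out_of_escalations = i >= max_escalations or i == len(tiers) - 1
        if out_of_escalations or judge(prompt, result):
            return result, tried


def stub_tier(name, text, call_log):
    """Build a fake model that records when it is called."""
    def generate(prompt):
        call_log.append(name)
        return text
    return generate


call_log = []
tiers = [
    ("small", stub_tier("small", "unsure", call_log)),
    ("mid", stub_tier("mid", "solid", call_log)),
    ("frontier", stub_tier("frontier", "best", call_log)),
]
judge = lambda prompt, result: result != "unsure"   # stand-in for a cheap judge

result, path = cascade("hard question", tiers, judge, max_escalations=1)
print(result, path)
```

With `max_escalations=1` the frontier tier is never reached in this run, which is the cost cap doing its job. Whether the last answer should be returned unjudged or replaced by a fallback message is a product decision this sketch does not make.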

06 Caching Strategies

Caching is the cheapest dollar you'll ever save. A serious system runs at least four distinct caches.

[Figure: Four Cache Layers in an LLM System]
  L1 · Exact-Match: HTTP / Redis, key = hash(prompt). Latency: < 5 ms. Hit rate: 5–30%. Cost saved: 100% of call. Use for: docs, FAQs, idempotent tools.
  L2 · Semantic: vector similarity threshold. Latency: 10–50 ms. Hit rate: 15–45%. Cost saved: 100% of call. Risk: stale or wrong answers.
  L3 · Provider Prompt Cache: stable prefix, 5-min / 1-hour TTL. Latency: faster TTFT. Hit rate: high if prompt structured. Cost saved: 75–90% of input tokens. Use for: system, tools, RAG header.
  L4 · Retrieval / Tool Cache: vector hits & tool results. Latency: avoid re-embed / re-call. Hit rate: 40–70% for hot queries. Cost saved: vector + tool fees. TTL by data volatility.
Figure 4 — Four cache tiers, each with different latency, hit rates, and freshness risk.

Designing prompt caches that actually hit

Provider prompt caches (Anthropic, OpenAI, Google) only hit when the prefix is byte-identical. The prompt structure I enforce on teams I work with:

[ STABLE — cached ]
  system instructions
  tool / function definitions
  large reference documents
  few-shot examples

[ VOLATILE — not cached ]
  user query
  per-request retrieved snippets (or place AFTER stable block)
  timestamps, user_id, session id
Engineering rule: sort prompt sections from most stable to most volatile. Never inject a timestamp at the top. Never reorder tool definitions between calls.
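
One way to enforce the stable-first ordering is to build the prompt in two parts and fingerprint the stable part. The sketch below is an assumption-laden illustration: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, tools are plain dicts with a `name` key, and sorting tools by name is just one way to keep their order fixed between calls.

```python
import hashlib
import json


def build_prompt(system, tools, reference_docs, few_shots, user_query, snippets):
    """Return (stable_prefix, volatile_suffix), ordered most-stable first."""
    stable = "\n\n".join([
        system,
        # Canonical order and key order so the serialized tools never shift.
        json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True),
        *reference_docs,
        *few_shots,
    ])
    # Per-request content goes last so it cannot invalidate the cached prefix.
    volatile = "\n\n".join([*snippets, user_query])
    return stable, volatile


def prefix_fingerprint(stable_prefix):
    """Hash of the stable block; if this changes between calls, the cache misses."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()


tools = [{"name": "search_kb"}, {"name": "create_ticket"}]
stable_a, volatile_a = build_prompt(
    "You are a support assistant.", tools, ["<policy doc>"], ["<example>"],
    "How do I reset my password?", ["<retrieved snippet>"])
stable_b, volatile_b = build_prompt(
    "You are a support assistant.", list(reversed(tools)), ["<policy doc>"], ["<example>"],
    "Where is my invoice?", [])

print(prefix_fingerprint(stable_a) == prefix_fingerprint(stable_b))
```

Logging the fingerprint alongside each call makes a drifting prefix easy to spot: a prompt-cache hit rate that drops while the fingerprint keeps changing points at the prompt builder rather than the provider.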

Semantic cache: when it's a trap

07 Context & Prompt Compression

Context length is the second-biggest token lever after model choice. The discipline: send the minimum context needed for a correct answer, no more.

The compression toolbox

| Technique | How it works | Typical reduction | Risk |
| --- | --- | --- | --- |
| Chunk + retrieve top-k | RAG instead of stuffing | 70–95% | Recall loss if k too small |
| Reranking with cap | Keep top-k after rerank, hard-cap chars | 30–60% | Reranker latency & cost |
| Summarize-then-prompt | Small model summarizes long docs once, cache summary | 60–90% | Summary may drop critical detail |
| Sliding/decay memory | Drop or summarize older turns | 40–80% on chat | Loss of long-range coherence |
| Structured extraction first | Convert docs to JSON, pass JSON | 50–80% | Schema drift |
| Prompt minification | Remove filler, redundant instructions | 10–25% | Quality regressions if over-pruned |
| Output schema constraints | Force concise structured output | 30–70% on output | Need robust parsers |
| Stop sequences & max_tokens | Bound runaway generations | variable | Truncation |
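
As one concrete instance of the "reranking with cap" row, the sketch below keeps the highest-scored chunks that fit a token budget. It is illustrative only: `cap_context` is a hypothetical helper, scores are assumed to come from a reranker, and the whitespace word count is a crude stand-in for a real tokenizer.

```python
def cap_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the best-scored chunks that fit within max_tokens.

    chunks: list of (score, text). Chunks that would overflow the budget are
    skipped, so a smaller, lower-ranked chunk can still be included.
    """
    kept, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue
        kept.append(text)
        used += cost
    return kept, used


chunks = [
    (0.50, "h i"),
    (0.90, "a b c"),
    (0.80, "d e f g"),
]
kept, used = cap_context(chunks, max_tokens=5)
print(kept, used)
```

A hard cap turns context size from something that drifts with corpus growth into a number someone chose, which is the point of the discipline described above.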

The asymmetry: output tokens cost more

On most APIs, output tokens are 3–5× the price of input. Yet engineers focus on shrinking input. Equally important:

Heuristic: if your average output is > 800 tokens, you almost certainly have either over-broad prompts or a UX that rewards verbosity. Both are fixable.

08 RAG Cost Optimization

RAG looks cheap because each call is small, but at scale embedding, storage, retrieval, and rerank fees stack. Worse, naive RAG inflates prompt size and the model bill it was meant to reduce.

[Figure: RAG Pipeline: Cost Levers at Each Stage]
  1. Ingest: chunk size 300–800 tok; dedupe near-identical; drop boilerplate. Lever: smaller, cleaner corpus saves embed + store.
  2. Embed: pick smallest dim that meets recall target; quantize (int8 / binary). Lever: dim 256–512 vs 1536 cuts storage 3–6×.
  3. Retrieve: hybrid BM25 + vector; tenant-scoped index; cache hot queries. Lever: k_initial → k_final, avoid over-fetch.
  4. Rerank: cross-encoder on top-k; cap final tokens fed; optional, skip if precision OK. Lever: smaller k_final → cheaper LLM; biggest win on LLM bill.
  5. Generate: cite, schema, max_tokens; cache stable system block; route by query difficulty. Lever: model tier + output cap; final LLM cost.
  6. Eval: Recall@k; faithfulness; $ per answer. Lever: cheap judge model.
  Optimize across the chain; a single stage in isolation may not move the bill.
Figure 5 — Six RAG stages; the highest-leverage levers are usually steps 4 (rerank cap) and 5 (model tier).

RAG cost anti-patterns

09 Agentic Cost Control

Agents are the most dangerous cost category because their cost is unbounded by default. A single user query can trigger dozens of model calls, tool calls, and retries. Naïve loops have killed budgets overnight.

The five non-negotiable bounds

1. Step budget

Hard cap on tool/LLM iterations per session (e.g., 10). Exceeding it terminates the loop and returns a graceful error or a checkpoint.

2. Token budget

Cumulative token budget per session, enforced by the runtime — not by the model's good intentions.

3. Dollar budget

Hard per-tenant and per-feature daily spend cap; breaching it returns 429 + alert.

4. Loop detector

Detect identical or near-identical tool calls; break with diagnostic.

5. Cheap planner, expensive executor

Plan with a small model. Execute steps with a small model where possible. Only escalate the synthesis step if needed.

+ Memory compaction

Periodically summarize the running scratchpad/context so each step's prompt doesn't grow linearly.
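
Bounds 1 through 4 can live in one small runtime object that the agent loop must charge on every step. This is a sketch, not a full agent runtime: `AgentBudget` and `BudgetExceeded` are hypothetical names, the default limits are arbitrary, the dollar cap is applied per session for brevity (the note's cap is per tenant and feature per day), and the loop detector only catches byte-identical tool calls, not near-identical ones.

```python
import json


class BudgetExceeded(RuntimeError):
    """Raised by the runtime when a session breaks one of its bounds."""


class AgentBudget:
    def __init__(self, max_steps=10, max_tokens=50_000, max_usd=0.50, max_repeats=2):
        self.max_steps, self.max_tokens = max_steps, max_tokens
        self.max_usd, self.max_repeats = max_usd, max_repeats
        self.steps, self.tokens, self.usd = 0, 0, 0.0
        self._seen = {}   # (tool name, canonical args) -> times called

    def charge(self, tool_name, tool_args, tokens, usd):
        """Record one step; raise as soon as any bound is broken."""
        self.steps += 1
        self.tokens += tokens
        self.usd += usd

        signature = (tool_name, json.dumps(tool_args, sort_keys=True))
        self._seen[signature] = self._seen.get(signature, 0) + 1
        if self._seen[signature] > self.max_repeats:
            raise BudgetExceeded(
                f"loop detected: {tool_name} repeated with identical arguments")
        if self.steps > self.max_steps:
            raise BudgetExceeded("step budget exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if self.usd > self.max_usd:
            raise BudgetExceeded("dollar budget exceeded")


budget = AgentBudget(max_steps=8)
stop_reason = None
try:
    for _ in range(187):                      # a loop on a malformed tool response
        budget.charge("fetch_order", {"id": 42}, tokens=1_200, usd=0.004)
except BudgetExceeded as exc:
    stop_reason = str(exc)
print(budget.steps, stop_reason)
```

The caller catches `BudgetExceeded` and returns a graceful error or checkpoint; the model never gets a vote on whether to continue.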

Cost shape of agentic systems

cost(session) ≈ Σ steps × (avg_input_tokens + avg_output_tokens) × price
              + Σ tool_calls × tool_cost
              + Σ subagent_invocations × subagent_cost

Without compaction, avg_input_tokens grows with each step. The cost curve becomes quadratic. Always compact.
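
The quadratic shape is easy to see numerically. The sketch below assumes a toy model in which every step appends a fixed number of tokens to the context; `session_input_tokens` and its default sizes are hypothetical, chosen only to make the arithmetic visible.

```python
def session_input_tokens(steps, base=1_000, growth_per_step=500, compact_to=None):
    """Total input tokens billed across a session.

    Each step re-sends the whole context, which grows by growth_per_step.
    If compact_to is set, the context is summarized back down to that size
    whenever it grows past it.
    """
    total, context = 0, base
    for _ in range(steps):
        total += context
        context += growth_per_step
        if compact_to is not None and context > compact_to:
            context = compact_to
    return total


no_compaction_10 = session_input_tokens(10)                    # 32,500
no_compaction_20 = session_input_tokens(20)                    # 115,000
compacted_10 = session_input_tokens(10, compact_to=2_000)      # 18,500
compacted_20 = session_input_tokens(20, compact_to=2_000)      # 38,500

print(no_compaction_20 / no_compaction_10)   # doubling steps costs ~3.5x
print(compacted_20 / compacted_10)           # with compaction, ~2.1x
```

Doubling the step count more than triples input tokens without compaction, while the compacted session grows close to linearly. The sketch ignores the cost of the summarization calls themselves, which a real comparison would need to include.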

A pattern I've seen: a customer-support agent without a step cap looped on a malformed tool response 187 times in one session. Cost: $42 for one ticket. Fix: hard cap = 8 and a loop detector. New cost: $0.04.

10 Batching & Async Inference

If your workload is not user-facing in real time, batching is a 40–60% discount you should not leave on the table.

Three batching modes

| Mode | Latency | Typical discount | Use when |
| --- | --- | --- | --- |
| Provider Batch API | up to 24h | ~50% | Nightly enrichment, evals, content gen pipelines |
| Micro-batching (server-side) | +5–50 ms | throughput, not price | Self-hosted vLLM / TGI fleets |
| Request coalescing | added latency, varies | 20–40% via prompt cache reuse | Many users hitting same prefix |

Async, scheduled, and shadow workloads

11 Fine-tune vs. Prompt vs. Distill

The default impulse to "just fine-tune" is usually a cost mistake. Here is the decision order I enforce before anyone trains anything.

[Figure: Decision Tree: When Tuning Actually Pays Off]
  Start: quality below target? Measure with eval set first.
  1. Improve prompt: few-shots, schema, retrieval (cheapest, fastest).
  2. Add RAG / tools: freshness, factuality (no model change).
  3. LoRA / PEFT fine-tune: style, format, narrow domain (moderate capex).
  4. Distill to small model: high-volume, stable task (biggest unit savings).
  Re-evaluate quality, latency, $/action; pick lowest that meets SLA.
Figure 6 — Try prompts and RAG first; tune only after prompt iteration plateaus; distill only for high-volume, stable workloads.

Distillation — the underused weapon

For high-volume, well-defined tasks (intent classification, extraction, routing decisions, ticket triage), distill frontier-model behavior into a fine-tuned small model. Typical economics:

Discipline: never tune until prompt iteration genuinely plateaus on a held-out eval set, and always re-baseline tuned models when a new generation of base models ships — yesterday's tuned 8B may be worse than today's base small model.

12 Self-Hosted Serving Economics

Self-hosting beats SaaS APIs only when utilization is high and predictable. Otherwise you pay for idle GPUs.

The break-even formula

break_even_qps = (gpu_hourly_cost × n_gpus) / (api_price_per_call × 3600 × utilization)

If your steady-state QPS is materially below break-even, stay on the API. If it's above and stable, consider self-hosting — and only then if you have the on-call expertise to operate it.
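
The break-even formula translates directly into a function. The numbers in the example are invented for illustration (a hypothetical 8-GPU fleet at $4 per GPU-hour against a $0.002 API call), and `utilization` is read here as the fraction of provisioned capacity actually serving traffic.

```python
def break_even_qps(gpu_hourly_cost, n_gpus, api_price_per_call, utilization):
    """QPS at which a self-hosted fleet costs the same as paying per API call."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if api_price_per_call <= 0:
        raise ValueError("api_price_per_call must be positive")
    fleet_cost_per_hour = gpu_hourly_cost * n_gpus
    api_cost_per_qps_hour = api_price_per_call * 3600 * utilization
    return fleet_cost_per_hour / api_cost_per_qps_hour


# Hypothetical inputs, not quotes from any provider.
qps_at_half_util = break_even_qps(4.0, 8, 0.002, utilization=0.5)
qps_at_full_util = break_even_qps(4.0, 8, 0.002, utilization=1.0)
print(f"{qps_at_half_util:.2f} QPS at 50% utilization")
print(f"{qps_at_full_util:.2f} QPS at 100% utilization")
```

Halving utilization doubles the break-even point, which is the formula's way of saying idle GPUs are what sink self-hosting. People cost is not in this function and would raise the threshold further.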

What actually moves self-hosted cost

| Lever | Effect | Notes |
| --- | --- | --- |
| vLLM / TGI / TensorRT-LLM | 2–10× throughput vs. naive HF | Continuous batching, paged attention |
| FP8 / INT8 / INT4 quantization | 1.5–4× throughput, smaller memory | Measure quality regression on your evals |
| Speculative decoding | 1.5–3× latency improvement | Needs draft model |
| KV cache reuse / prefix sharing | Large gain on shared system prompts | Mirrors managed prompt cache |
| Autoscale to zero (LoRA hot-swap) | Eliminate idle cost for niche models | Cold start trade-off |
| Reserved / committed-use discounts | 30–60% off on-demand | Only commit to verified baseline |
| Spot / preemptible GPUs | 50–80% off | For batch / non-critical only |
| Co-locate model + data | Eliminates egress | Matters at scale |

Hidden cost: the SRE/MLOps team you need to run a serving fleet 24/7 is often more expensive than the GPUs themselves. Bake people cost into the build-vs-buy model.

13 Observability & FinOps

You cannot control what you cannot attribute. Every LLM call must be tagged at emission with enough metadata to roll up by tenant, feature, environment, and team.

Minimum required telemetry per call

{
  "request_id":      "uuid",
  "timestamp":       "2026-05-18T14:22:01Z",
  "tenant_id":       "acme-corp",
  "user_id_hash":    "...",
  "feature":         "support.summary",
  "environment":     "prod",
  "team":            "support-platform",
  "model":           "claude-sonnet-4-6",
  "provider":        "anthropic",
  "input_tokens":     3421,
  "output_tokens":    412,
  "cached_tokens":   2980,
  "cache_layer":     "provider_prompt",
  "tool_calls":      [{"name":"search_kb","cost":0.0001}],
  "latency_ms":       1840,
  "cost_usd":         0.0061,
  "budget_bucket":   "support-platform/prod/monthly",
  "outcome":         "success",
  "quality_signal":  {"thumbs":"up"}
}
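
Records shaped like the one above can be rolled up by any combination of tags. The sketch below is a minimal illustration: `rollup` is a hypothetical helper, records are plain dicts, and rejecting untagged records outright is one possible policy, not the only one.

```python
from collections import defaultdict


def rollup(records, keys=("tenant_id", "feature")):
    """Sum cost_usd per combination of the given tag keys.

    Raises if a record is missing a tag, since untagged spend
    cannot be attributed to anyone.
    """
    totals = defaultdict(float)
    for record in records:
        missing = [k for k in keys if not record.get(k)]
        if missing:
            raise ValueError(
                f"request {record.get('request_id')} is missing tags: {missing}")
        totals[tuple(record[k] for k in keys)] += record["cost_usd"]
    return dict(totals)


records = [
    {"request_id": "r1", "tenant_id": "acme-corp", "feature": "support.summary", "cost_usd": 0.0061},
    {"request_id": "r2", "tenant_id": "acme-corp", "feature": "support.summary", "cost_usd": 0.0040},
    {"request_id": "r3", "tenant_id": "acme-corp", "feature": "search.answer", "cost_usd": 0.0100},
]
by_feature = rollup(records)
by_tenant = rollup(records, keys=("tenant_id",))
print(by_feature)
print(by_tenant)
```

In practice this aggregation would run in a warehouse query rather than application code; the point is that the tags must exist at emission time for any rollup to be possible.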

Four dashboards I require before launch

1. Spend trajectory

Today, week-to-date, month-to-date vs. budget; forecast to month-end; top 10 features by % growth.

2. Unit economics

$ per action by feature; trend; outliers; cost-per-successful-action vs. cost-per-call.

3. Efficiency

Cache hit rates by layer; model mix; tokens in/out; waste rate; routing accuracy.

4. Anomalies

Spikes by tenant/feature; runaway agents; suspect prompts; budget-bucket breaches.

Tooling landscape

14 Quotas, Budgets & Guardrails

Soft guidance is not control. Cost guardrails must be enforced in the request path, not in a wiki page.

The three-tier guardrail stack

| Tier | Trigger | Action | Owner |
| --- | --- | --- | --- |
| 1. Soft alert | 80% of feature budget | Slack + dashboard banner | Feature team |
| 2. Hard throttle | 100% of feature budget | Rate-limit or downgrade model tier | AI Platform |
| 3. Circuit break | 2× burst, or per-tenant DoS | Block + 429 + page on-call | SRE / FinOps |
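
The three tiers map onto a small decision function that a gateway could call before each request. This is a sketch under assumptions: `guardrail_action` is a hypothetical helper, and "2× burst" is interpreted here as the recent spend rate reaching twice a baseline rate, which is my reading rather than a definition from the table.

```python
def guardrail_action(spend_usd, budget_usd, recent_rate=0.0, baseline_rate=0.0):
    """Return the strictest guardrail tier that currently applies."""
    # Tier 3: burst at 2x baseline or more (assumed interpretation of "2x burst").
    if baseline_rate > 0 and recent_rate >= 2 * baseline_rate:
        return "circuit_break"      # block + 429 + page on-call
    # Tier 2: budget fully consumed.
    if spend_usd >= budget_usd:
        return "hard_throttle"      # rate-limit or downgrade model tier
    # Tier 1: 80% of budget consumed.
    if spend_usd >= 0.8 * budget_usd:
        return "soft_alert"         # Slack + dashboard banner
    return "allow"


for spend in (50, 80, 100):
    print(spend, guardrail_action(spend, budget_usd=100))
print(guardrail_action(10, 100, recent_rate=5.0, baseline_rate=2.0))
```

Checking the strictest tier first means a runaway tenant is blocked even when the monthly budget still has headroom.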

Concrete guardrails worth shipping

Design principle: fail closed on cost, fail open on quality. Better to throttle than to surprise the CFO.

15 Vendor & Contract Strategy

List prices are starting points, not endings. Above ~$50K/month of spend, real money is in commercial structure.

Negotiation levers (in order of impact)

  1. Volume commitments with quarterly true-ups — 10–30% off list at meaningful scale.
  2. Multi-year deals with model-class portability — protect against being locked to a deprecated SKU.
  3. Reserved throughput / provisioned capacity — predictable price, predictable latency.
  4. Cross-product bundles when the vendor sells more than inference (eval, embeddings, fine-tuning).
  5. Free / discounted dev & eval traffic — surprisingly often granted.
  6. Right-to-export embeddings / tuned weights — exit clause that protects future negotiation.

Multi-provider posture

Even if one vendor is your strategic primary, maintain a tested fallback for two reasons:

Hide vendor specifics behind an internal LLM gateway with a unified schema; this also gives you the natural place to apply routing, caching, and budgets.

16 Org Model & RACI

Cost control fails when no one owns it. The default state is "AI is everyone's tool and no one's bill."

| Activity | R | A | C | I |
| --- | --- | --- | --- | --- |
| Model selection per feature | Feature lead | AI Platform | FinOps | CTO |
| Routing / caching infra | AI Platform | Platform lead | SRE, Security | Eng leadership |
| Budget setting | Finance + PM | VP Eng | Feature leads | CFO |
| Quota enforcement | AI Platform | SRE | Feature teams | FinOps |
| Vendor contracts | Procurement | CTO | AI Platform, Legal | CFO |
| Cost anomaly response | On-call SRE | FinOps | Feature owner | VP Eng |
| Quarterly cost review | FinOps | VP Eng | Feature leads | Exec staff |

The "AI Platform" team charter

If you do not have one, propose it. A small (3–8 person) AI Platform team owns:

This team's ROI is measured as $ saved per $ spent on the team, and at scale it is consistently > 10×.

17 AI Cost Maturity Model

[Figure: Five Stages of AI Cost Maturity]
  1. Ad hoc: no attribution; frontier-only; bill surprises monthly. Symptom: "why so high?"
  2. Measured: per-feature tagging; dashboards exist; some prompt caching. Symptom: visibility, no control.
  3. Optimized: routing in place; multi-layer caches; budgets enforced. Symptom: stable unit cost.
  4. Governed: FinOps cadence; quarterly cost reviews; vendor leverage. Symptom: predictable, defended.
  5. Adaptive: auto-tuning routers; cost-aware experiments; self-distilling models. Symptom: cost ↓ as scale ↑.
  Most teams I audit sit at stage 2; I aim to reach stage 4 within a year, stage 5 within two.
Figure 7 — Five stages; advancing one stage typically halves unit cost.

18 Anti-Patterns & How to Diagnose Them

| Anti-pattern | Smell | Fix |
| --- | --- | --- |
| "Frontier by default" | > 60% spend on top-tier model | Introduce router + cascade; baseline cheaper tier |
| Volatile prompt prefix | Prompt cache hit < 10% | Move volatile content to end; freeze tool schemas |
| RAG top-k = 20 | Avg input tokens > 6K with no need | Measure recall; rerank + cap |
| Verbose outputs | Output tokens > input tokens | Schema-constrained outputs, max_tokens |
| Unbounded agents | Long-tail of $$$ sessions | Step + token + $ caps; loop detector |
| Re-embed everything nightly | Embedding bill grows with no corpus growth | Hash-based incremental embedding |
| Eval traffic in prod budget | Spend up on weekends/CI | Separate API key; separate ledger bucket |
| No per-tenant attribution | Can't answer "who is the top spender?" | Mandatory tenant_id tag on every call |
| Tune-before-prompt-plateau | $10K tuning runs that win < 2% | Pause tuning; run prompt sprint first |
| Self-host at low utilization | GPU fleet at 15% load | Move back to managed API; revisit at scale |
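
One fix from the table, hash-based incremental embedding, can be sketched in a few lines. This is an illustration under assumptions: `docs_to_embed` is a hypothetical helper, the corpus is a dict of document id to text, and the previous run's hashes are assumed to be persisted somewhere between runs.

```python
import hashlib


def docs_to_embed(corpus, seen_hashes):
    """Return (ids needing re-embedding, updated hash map).

    corpus: {doc_id: text}
    seen_hashes: {doc_id: sha256 hex digest} saved from the previous run
    """
    to_embed, new_hashes = [], {}
    for doc_id, text in corpus.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[doc_id] = digest
        if seen_hashes.get(doc_id) != digest:    # new or changed document
            to_embed.append(doc_id)
    return to_embed, new_hashes


# Night 1: everything is new.
corpus = {"faq-1": "How to reset a password", "faq-2": "How to export data"}
first_batch, hashes = docs_to_embed(corpus, seen_hashes={})

# Night 2: one document edited, one added, one unchanged.
corpus["faq-2"] = "How to export data as CSV"
corpus["faq-3"] = "How to close an account"
second_batch, hashes = docs_to_embed(corpus, hashes)
print(first_batch, second_batch)
```

Only the changed and new documents reach the embedding API on the second run, so the embedding bill tracks corpus churn instead of corpus size. Deleted documents would also need removing from the index, which this sketch leaves out.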

19 What I keep on my desk

Not commandments — a checklist I re-read before sign-off. If every item is yes, the system is cost-disciplined enough to scale.

Visibility

Architecture

Governance

Continuous improvement

20 Glossary

Terms used in this note

| Term | Meaning |
| --- | --- |
| CPA (Cost per Action) | Fully loaded LLM + infra cost divided by successful user outcomes. |
| Cascade | Routing pattern: try cheapest model, escalate on low confidence. |
| Prompt cache | Provider-side caching of stable prompt prefixes, billed at a fraction of normal input. |
| Semantic cache | Application-side cache keyed by vector similarity to prior prompts. |
| LoRA / PEFT | Parameter-efficient fine-tuning; small adapters on a frozen base model. |
| Distillation | Training a smaller model to mimic a larger model's behavior on a target task. |
| vLLM / TGI | High-throughput inference servers with paged attention and continuous batching. |
| FinOps | Operational discipline of bringing financial accountability to variable cloud/AI spend. |
| Budget bucket | A pre-allocated dollar pool tied to a team/feature, enforced at the gateway. |
| Waste rate | Tokens spent on calls that produced no shipped value (failed parses, dead agent loops, etc.). |

21 References & Sources

Annotated bibliography behind the cost-driver map, unit-economics formulas, cost-aware architecture, routing cascades, four cache layers, RAG pipeline levers, agentic bounds, batching economics, fine-tune/distill decision tree, self-hosted break-even, FinOps telemetry schema, guardrail stack, vendor posture, RACI, maturity model, anti-patterns, and desk checklist. Section tags (e.g. §05) show where each source is used. Diagrams and worked examples are my synthesis unless noted.

Scope. Synthesis of FinOps, LLM-inference, RAG, and platform-engineering sources (May 2026). Hero KPI ranges (10–100× model cost delta, 50–90% savings from caching/routing, 3–5× agentic inflation) blend FrugalGPT, provider pricing docs, and field benchmarks — directional ranges, not guarantees for your workload. Dollar figures in §03 worked example use illustrative blended rates; verify against current list prices before budgeting. Not vendor, financial, or procurement advice.

Citations are numbered [1]–[n] within each topic group below.

FinOps, cloud cost accountability & AI spend governance (§01, §13–§14, §16–§17)

  1. FinOps Foundation, FinOps Framework & FinOps Principles. 2024–25. Capabilities model for bringing financial accountability to variable cloud spend — backbone for §13 FinOps cadence, §14 budget guardrails, and §17 maturity stages 3–4. finops.org/framework — §13, §14, §17.
  2. FinOps Foundation, FinOps for AI working group & community guidance. 2024–25. Applying FinOps disciplines to GPU and inference billing — §01 hero framing and §13 ledger/dashboard requirements. finops.org — §01, §13.
  3. AWS, Cost and Usage Report (CUR) & Cost Explorer documentation. Granular attribution of cloud/GPU spend — §13 tooling landscape and §02 driver 4 (serving infra). docs.aws.amazon.com/cur — §02, §13.
  4. Google Cloud, BigQuery billing export & Cloud Billing APIs. Warehouse-native cost rollups — §13 custom-ledger pattern and §16 FinOps RACI. cloud.google.com/billing — §13, §16.

LLM token economics, pricing & inference cost trends (§01, §03, hero KPIs)

  1. OpenAI, API Pricing & usage documentation. 2025–26. Input/output token rates, Batch API discount, prompt caching — §03 CPA formula, §03 worked example, §06 L3 cache, §10 batching. openai.com/api/pricing — §03, §06, §10.
  2. Anthropic, API Pricing & Prompt Caching documentation. 2025–26. Cached-input pricing (≈10–25% of base input), batch discounts — §06 prompt-cache layer and §03 output/input asymmetry. docs.anthropic.com/prompt-caching — §03, §06, §07.
  3. Google Cloud, Vertex AI Generative AI pricing & Context caching. 2025–26. Gemini context-cache TTL and token billing — §06 four-layer cache and §04 architecture prompt-cache node. cloud.google.com/vertex-ai/pricing — §04, §06.
  4. Stanford HAI, 2025 AI Index Report — Technical Performance & Economy chapters. 2025. Inference-cost declines, model tiers, and adoption economics — background for §01 hero 10–100× tier delta and §05 monthly re-eval guidance. hai.stanford.edu/ai-index — §01, §05, §11.
  5. McKinsey & Company, The Economic Potential of Generative AI: The Next Productivity Frontier. June 2023. Gen-AI cost as a scaling constraint for enterprises — §01 “fastest-growing line item” context and §15 vendor negotiation at scale. mckinsey.com — §01, §15.

Cost-aware routing, cascades & model selection (§05, FIG 3, §18)

  1. Chen, L. et al., “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” arXiv:2305.05176, 2023. LLM cascade and prompt-adaptation strategies; reported large cost reductions with quality preservation — §01 hero 50–90% range, §05 cascade routing, §05 callout on cost-per-success. arxiv.org/abs/2305.05176 — §01, §05, hero.
  2. Ong, I. et al., “RouteLLM: Learning to Route LLMs with Preference Data.” arXiv:2406.18647, 2024. Learned routers sending queries to strong vs. weak models — §05 classifier router and §04 L3 cost-aware router. arxiv.org/abs/2406.18647 — §04, §05.
  3. Shnitzer, T. et al., “Large Language Model Routing with Benchmark Datasets.” arXiv:2309.15789, 2023. Router design and evaluation methodology — §05 routing-policy calibration steps. arxiv.org/abs/2309.15789 — §05.
  4. Yan, E., “Patterns for Building LLM-based Systems & Products.” eugeneyan.com, 2023–25 (ongoing). Practical patterns for routing, eval, and cost-quality trade-offs — cited in §05 design steps and §08 RAG anti-patterns mindset. eugeneyan.com/writing/llm-patterns — §05, §08, §19.

Caching — prompt, semantic & retrieval layers (§06, FIG 4, §04)

  1. Anthropic, Prompt caching — implementation guide & pricing. Byte-identical prefix requirements, TTL, cache breakpoints — §06 stable/volatile prompt structure and FIG 4 L3. docs.anthropic.com — §04, §06.
  2. OpenAI, Prompt caching (automatic caching on supported models). 2024–25. Provider-side prefix reuse — §06 engineering rule on stable-first prompt ordering. platform.openai.com/docs/guides/prompt-caching — §06.
  3. Zilliz / GPTCache project, “GPTCache: An open-source semantic cache for LLM applications.” 2023–25. Vector-similarity caching patterns and false-positive risks — §06 L2 semantic cache and semantic-cache trap bullets. github.com/zilliztech/GPTCache — §06, §18.
  4. Redis, Vector search & caching best-practices documentation. Exact-match and TTL caches at the edge — §04 L1 edge cache and §06 L1 exact-match layer. redis.io/docs — §04, §06.

Context compression, long-context limits & RAG economics (§07–§08, FIG 5)

  1. Lewis, P. et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. Foundational RAG architecture — §08 pipeline stages and §07 chunk-and-retrieve compression lever. arxiv.org/abs/2005.11401 — §07, §08.
  2. Liu, N. F. et al., “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024. Models under-use middle context — supports §07 hard top-k caps and §08 rerank-cap lever over stuffing. arxiv.org/abs/2307.03172 — §07, §08.
  3. Es, S. et al., “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” 2023. RAG eval metrics (faithfulness, recall) — §08 stage-6 eval and §13 judge-model cost control. arxiv.org/abs/2309.15217 — §08, §13.
  4. TruLens documentation — LLM app evaluation & tracing. 2024–25. Feedback functions and RAG triad — §08 eval stage and §13 observability tooling list. trulens.org — §08, §13.
  5. Pinecone, Weaviate, Qdrant — vector-database sizing & embedding-dimension guides. 2024–25. Storage and query cost trade-offs — §08 embed stage (dim 256–512 vs 1536) and §02 driver 2. pinecone.io/learn — §02, §08.

Agentic systems, loops & unbounded cost risk (§09, hero 3–5×, §18)

  1. Yao, S. et al., “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. Tool-use loops that multiply LLM calls — §09 cost-shape formula and step-budget rationale. arxiv.org/abs/2210.03629 — §09.
  2. LangChain / LangGraph documentation — agent runtimes, recursion limits & checkpoints. 2024–25. Step caps and state management patterns — §09 five non-negotiable bounds and loop detector. langchain-ai.github.io/langgraph — §09, §14.
  3. Shinn, N. et al., “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. Iterative self-correction loops — background for §09 quadratic cost curve without compaction. arxiv.org/abs/2303.11366 — §09.
  4. Helicone & Langfuse case studies on agent cost spikes. 2024–25. Runaway session patterns and guardrail responses — §09 “pattern I've seen” callout and §13 anomaly dashboard. helicone.ai/blog — §09, §13.

Batching, async inference & throughput (§10, §12)

  1. OpenAI, Batch API documentation. ~50% discount, 24-hour completion window — §10 provider Batch API row and async workload bullets. platform.openai.com/docs/guides/batch — §10.
  2. Anthropic, Message Batches API. 2024–25. Async batch processing for non-latency-sensitive workloads — §10 batching modes table. docs.anthropic.com/message-batches — §10.
  3. Yu, G.-I. et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models.” OSDI 2022. Continuous batching concepts — precursor to modern inference servers in §10 micro-batching row. usenix.org/osdi22 — §10, §12.

Fine-tuning, LoRA, distillation & the tune-last discipline (§11, FIG 6)

  1. Hu, E. J. et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. Parameter-efficient fine-tuning — §11 LoRA/PEFT card and §02 driver 3 amortization. arxiv.org/abs/2106.09685 — §02, §11.
  2. Hinton, G., Vinyals, O., & Dean, J., “Distilling the Knowledge in a Neural Network.” NeurIPS 2014 Deep Learning Workshop. Knowledge distillation foundation — §11 distillation economics and FIG 6 decision tree. arxiv.org/abs/1503.02531 — §11.
  3. OpenAI, Model distillation guide & fine-tuning API documentation. 2024–25. Teacher-student workflows and when distillation pays — §11 break-even bullets and §18 tune-before-prompt anti-pattern. platform.openai.com/docs/guides/distillation — §11, §18.
  4. Dettmers, T. et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. Lower-cost fine-tune runs — §11 LoRA card capex range. arxiv.org/abs/2305.14314 — §11.

Self-hosted serving, GPU economics & inference stacks (§12, §02 driver 4)

  1. Kwon, W. et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023 (vLLM). PagedAttention and throughput gains — §12 vLLM row and §04 self-hosted OSS tier. arxiv.org/abs/2309.06180 — §04, §12.
  2. vLLM project documentation. 2024–25. Continuous batching, prefix caching, quantization — §12 lever table and §10 micro-batching. docs.vllm.ai — §10, §12.
  3. NVIDIA, TensorRT-LLM documentation. 2024–25. FP8/INT quantization and speculative decoding — §12 quantization and speculative-decoding rows. nvidia.github.io/TensorRT-LLM — §12.
  4. Hugging Face, Text Generation Inference (TGI) documentation. Production serving stack — §12 vLLM/TGI comparison and glossary entry. huggingface.co/docs/tgi — §12, §20.
  5. NVIDIA & cloud-provider GPU pricing pages (H100/H200, reserved vs spot). 2025–26. Break-even inputs for §12 formula and §18 self-host-at-low-utilization anti-pattern. nvidia.com/data-center — §12, §18.

Observability, LLM tracing & cost ledgers (§13, §04 ledger, §19)

  1. Langfuse documentation — LLM tracing, cost tracking & evaluations. 2024–25. Per-call metadata schema aligned with §13 telemetry JSON — §13 tooling and §19 visibility checklist. langfuse.com/docs — §04, §13, §19.
  2. Helicone — LLM observability & cost analytics. 2024–25. Gateway proxy pattern for attribution — §04 cost ledger and §13 LLM-native tooling list. docs.helicone.ai — §04, §13.
  3. Braintrust — eval + production logging for LLM apps. 2024–25. Cost-per-success tracking — §03 CPA denominator and §13 unit-economics dashboard. braintrust.dev/docs — §03, §13.
  4. Arize AI, Phoenix — open-source LLM observability. 2024–25. Tracing and eval for RAG — §08 eval stage and §13 tooling landscape. docs.arize.com/phoenix — §08, §13.
  5. Datadog, LLM Observability product documentation. 2024–25. Enterprise tracing and cost correlation — §13 tooling landscape and §16 AI Platform charter. docs.datadoghq.com/llm_observability — §13, §16.
  6. OpenTelemetry, generative-AI semantic conventions (draft). 2024–25. Standard spans for model calls — §13 minimum telemetry schema design. opentelemetry.io/gen-ai — §13.

Guardrails, quotas, platform teams & vendor posture (§14–§16)

  1. Google SRE Book — chapters on monitoring, alerting & capacity planning. O'Reilly, 2016–18. Circuit-breaker and throttle patterns — §14 three-tier guardrail stack and circuit-break row. sre.google/sre-book — §14.
  2. Skelton, M. & Pais, M., Team Topologies — platform team as internal product. IT Revolution, 2019. Basis for the §16 AI Platform charter: gateway ownership and enablement vs feature-team autonomy — §16.
  3. NIST, AI Risk Management Framework (AI RMF 1.0) — resource allocation & monitoring. 2023. Governance patterns for production AI — §14 prod model allowlist and §16 guardrails ownership. nist.gov/ai-rmf — §14, §16.
  4. Flexera / Gartner cloud-cost negotiation research (enterprise licensing). 2023–25. Volume-commitment and true-up structures — §15 negotiation levers 1–2. flexera.com/blog/finops — §15.

Author synthesis

  1. Truong, L., AI Cost Control — personal working notes. May 2026. Original diagrams (FIG 1–7), cost-aware architecture, maturity model, anti-pattern table, desk checklist, and §03 stacked-optimization example. LinhTruong.com — all sections.
📎 Before you quote externally
Provider list prices and cache-discount percentages change frequently, so re-read the OpenAI, Anthropic, and Google pricing pages before citing §03 dollar figures. The §03 13× stacked-optimization example is illustrative compounding, not a guaranteed outcome. FIG 1–7 driver shares and cache hit-rate bands are field heuristics; measure on your own traffic. Self-hosted break-even depends on utilization, labor cost, and reserved-GPU terms, so run your own break-even numbers before committing capex.
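For the self-hosted caveat above, a minimal sketch of what "running your own numbers" can look like. This is not the §12 formula. It is a simplified stand-in that assumes the cluster is a fixed monthly cost (reserved GPUs that never scale to zero) and that throughput scales linearly with utilization. Every name (`SelfHostInputs`, `self_host_cost_per_mtok`, `break_even_utilization`) and every figure in the demo is hypothetical; substitute your own contract rates and measured throughput.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30                  # assumption: 30-day month
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


@dataclass
class SelfHostInputs:
    """Hypothetical inputs; none of these values come from a provider price list."""
    gpu_hourly_usd: float       # blended hourly rate per GPU
    num_gpus: int               # GPUs kept running all month
    peak_tokens_per_sec: float  # measured cluster throughput at 100% utilization
    monthly_labor_usd: float    # share of on-call / platform engineering time

    @property
    def monthly_cost_usd(self) -> float:
        # Fixed cost: paid whether or not the GPUs are busy.
        return self.gpu_hourly_usd * self.num_gpus * HOURS_PER_MONTH + self.monthly_labor_usd

    @property
    def monthly_tokens_at_full_load(self) -> float:
        return self.peak_tokens_per_sec * SECONDS_PER_MONTH


def self_host_cost_per_mtok(inp: SelfHostInputs, utilization: float) -> float:
    """USD per million tokens served at an average utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served = inp.monthly_tokens_at_full_load * utilization
    return inp.monthly_cost_usd / tokens_served * 1_000_000


def break_even_utilization(inp: SelfHostInputs, api_usd_per_mtok: float) -> float:
    """Utilization at which self-hosting matches the API price.

    A result above 1.0 means the cluster cannot beat the API even when saturated.
    """
    api_cost_at_full_load = api_usd_per_mtok * inp.monthly_tokens_at_full_load / 1_000_000
    return inp.monthly_cost_usd / api_cost_at_full_load


if __name__ == "__main__":
    # Illustrative numbers only.
    inputs = SelfHostInputs(
        gpu_hourly_usd=2.50,
        num_gpus=8,
        peak_tokens_per_sec=20_000,
        monthly_labor_usd=15_000,
    )
    threshold = break_even_utilization(inputs, api_usd_per_mtok=1.00)
    print(f"break-even utilization: {threshold:.1%}")
    for u in (0.2, 0.5, 0.8):
        print(f"utilization {u:.0%}: ${self_host_cost_per_mtok(inputs, u):.2f} per 1M tokens")
```

The useful output is the threshold rather than any single cost figure: if sustained utilization sits below it, the API is the cheaper option under these assumptions, which is the low-utilization anti-pattern flagged in §18.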
What I'd tell a team on day one: the edge is not picking the smartest model — it's designing systems whose unit economics improve as scale grows. Optimize for compounding: every routing improvement, every cache hit, every byte trimmed from a prompt pays you back every day the system runs.