Generative AI — strategy & architecture I actually use
One long page on how LLM systems get built today: transformer and diffusion intuition, the stack as I sketch it (retrieval, tools, agents, adapters), and what breaks in production—evals, latency, bills, and guardrails. I use this when coaching teams and when I forget whether LoRA or RAG was the right fix last time.
Generative AI here means models that generate new content—text, code, images, audio, video, structured fields—conditioned on prompts, context, and tools. In practice I see most real stacks hang on three threads: a foundation model, retrieval and/or tools for facts and actions, and orchestration when the task is more than one step.
In production this is never "one API call." Each layer has its own failure modes, its own vendors, and its own line items in your observability tool.
Figure 2 — The 7-layer GenAI stack. Treat each as an independent, swappable component.
Models come and go; your harness should not
New checkpoints land all the time. What tends to stay useful is boring infrastructure: eval datasets, chunking and retrieval, the tool contracts, prompt versions, and how you learn from user thumbs-down. I plan interfaces so swapping a model is a config change, not a rewrite.
4. Choosing a Model
Decision tree
Figure 3 — Map task profile → model class. Names are representative of 2026 tiers.
Trade-off cheatsheet
| Dimension | Hosted frontier | Hosted small | Open-weights (self-host) |
|---|---|---|---|
| Quality on hard tasks | Best | Adequate | Catching up; varies |
| Cost per 1M tokens | $$$ | $ | ¢ at scale (need GPUs) |
| Latency | 100–500 ms TTFT | 50–200 ms | You control it |
| Privacy / data residency | Vendor dependent | Vendor dependent | Full control |
| Fine-tuning | Limited / managed | Limited | Full (LoRA, full FT) |
| Best for | Agents, reasoning, code | Bulk, latency-sensitive | Regulated, edge, niche domains |
Don't pick a model first
Pick the eval first. Without a way to measure quality on your task, model comparisons are vibes. See §9.
5. Prompting & Context Engineering
"Prompt engineering" has matured into context engineering: structuring the full input — system prompt, tools, examples, retrieved docs, output schema, and conversation history — so the model can succeed.
The anatomy of a production prompt
# 1. Role / persona — who the model is
SYSTEM: You are a senior support engineer at Acme. Be concise and cite sources.
# 2. Capabilities & constraints — what it can/can't do
- Use the search_kb tool before answering policy questions.
- Never invent SKUs. If unsure, say "I don't know."
# 3. Output contract — structured shape
Respond as JSON: {"answer": str, "citations": [str], "confidence": 0..1}
# 4. Few-shot examples — calibrate behavior
EXAMPLE 1: ...
EXAMPLE 2: ...
# 5. Retrieved context — fresh, task-specific
<documents>
  <doc id="kb-184">...</doc>
</documents>
# 6. User turn
USER: How do I return a damaged item?
Techniques that actually move the needle
| Tag | Technique | Why it helps |
|---|---|---|
| High ROI | Structured output | Force JSON Schema / tool-use mode. Stops the model from being chatty and lets downstream code parse safely. Use response_format or function calling (a validation sketch follows this table). |
| High ROI | Decomposition | Break complex tasks into a planner → workers → verifier pipeline. Each step has its own prompt and eval. |
| Medium ROI | Few-shot with hard cases | 2–6 examples that include the edge cases your model gets wrong. Don't waste shots on easy cases. |
| Medium ROI | XML / Markdown delimiters | Wrap variable inputs in tags (<document>, <user_query>) so the model can reliably refer to them and injected instructions are easier to contain. |
| Reasoning | Chain-of-thought / scratchpad | Ask the model to think before answering. Reasoning models just need the room; others need an explicit prompt. Hide the CoT from the user and return only the final answer. |
| Caching | Prompt caching | Put stable content (system prompt, tool defs, big docs) at the start and mark it cacheable. 5–10× cost reduction and roughly 2× latency improvement on repeated calls. |
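To show what "lets downstream code parse safely" looks like in practice, here is a minimal sketch that defines the output contract from the prompt-anatomy example as a Pydantic model and validates whatever the model returns. It assumes your provider's JSON/schema mode gets you most of the way; validation catches the rest. `call_model()` is a placeholder for your own client, not a real SDK call.

```python
from pydantic import BaseModel, Field, ValidationError

class SupportAnswer(BaseModel):
    answer: str
    citations: list[str]
    confidence: float = Field(ge=0.0, le=1.0)

def call_model(prompt: str) -> str:
    raise NotImplementedError  # your provider call, ideally with JSON/schema mode enabled

def answer_question(prompt: str, retries: int = 2) -> SupportAnswer:
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            return SupportAnswer.model_validate_json(raw)   # parse + schema-check in one step
        except ValidationError as err:
            # Feed the validation errors back so the model can repair its own output.
            prompt = f"{prompt}\n\nYour last reply was invalid: {err}. Return only valid JSON."
    raise RuntimeError("Model never produced a valid SupportAnswer")
```

The retry-with-errors loop is cheap insurance: most schema failures disappear after one repair turn, and anything that survives it should page a human rather than a parser.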
6. Retrieval-Augmented Generation (RAG)
RAG grounds a model in your data. It's the cheapest and safest way to make a generic LLM speak in your domain — and the most commonly misimplemented part of the stack.
Figure 4 — The full RAG pipeline. Most failures are in chunking, hybrid search, and re-ranking — not in the LLM call.
RAG best practices
Hybrid retrieval (semantic + keyword) beats pure vector by 10–30% on most enterprise corpora; a minimal fusion sketch follows this list.
Re-rank top-50 candidates down to top-5 with a cross-encoder (e.g., bge-reranker, Cohere Rerank).
Chunk semantically, not by character count. Respect headings, tables, and code blocks.
Attach metadata to every chunk: source, date, author, permissions — enables filtering and ACL enforcement.
Citations are non-negotiable. The user must be able to verify every claim.
Evaluate retrieval independently: recall@k and MRR before you blame the LLM.
Refresh strategy: incremental upserts, content hashes, and a way to delete stale docs.
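To make the hybrid-retrieval point concrete, here is a minimal reciprocal-rank-fusion (RRF) sketch that merges a keyword ranking and a vector ranking into one candidate list. It assumes you already have two ranked lists of document IDs from your own BM25 index and vector store; `k=60` is the conventional RRF constant, not a value specific to this article.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: top results from a keyword index and a vector index for one query.
keyword_hits = ["kb-184", "kb-042", "kb-311", "kb-077"]
vector_hits = ["kb-042", "kb-184", "kb-500", "kb-311"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused[:5])   # candidates to hand to the cross-encoder re-ranker
```

Keep the fusion step dumb and deterministic; pass its top-50 to the re-ranker and evaluate recall@k on the fused list, not on either retriever alone.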
When NOT to use RAG
If your data fits in the context window and is read end-to-end (e.g., a single contract), just pass it in with prompt caching. RAG adds engineering surface; don't introduce a vector DB for 50 documents.
7. Agents & Tool Use
An agent is an LLM in a loop with tools and memory. It plans, calls functions, observes results, and iterates until done. 2026's defining shift was making agents reliable enough for production.
Figure 5 — Agent reference architecture. The LLM is the policy; everything else is engineering.
Tool design rules
Return typed, structured results with error codes — not free-form strings.
Make destructive tools require confirmation or a separate "executor" agent gated by a human.
Use the Model Context Protocol (MCP) to ship tools as reusable servers across agents and IDEs.
Cap tool budget, time, and tokens per task. Always. A minimal loop sketch follows this list.
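The sketch below shows the shape of that budget-capped loop: the model proposes a tool call, the runtime executes it and feeds back a typed result, and a hard cap on steps ends the loop no matter what. `propose_action` and `TOOLS` are stand-ins for your own model client and tool registry, not a real library API.

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    ok: bool
    code: str        # machine-readable status code, not prose
    data: dict

def lookup_order(order_id: str) -> ToolResult:
    # Placeholder tool: a real one would call your order service.
    return ToolResult(ok=True, code="OK", data={"order_id": order_id, "status": "shipped"})

TOOLS = {"lookup_order": lookup_order}   # tool registry (assumed)
MAX_STEPS = 5                            # hard cap on tool calls per task

def run_agent(task: str, propose_action) -> str:
    """propose_action(task, history) -> ("tool", name, args) or ("answer", text).

    propose_action is your LLM call; it is assumed here, not a real SDK function.
    """
    history: list[tuple] = []
    for _ in range(MAX_STEPS):
        kind, *rest = propose_action(task, history)
        if kind == "answer":
            return rest[0]
        name, args = rest
        result = TOOLS[name](**args) if name in TOOLS else ToolResult(False, "UNKNOWN_TOOL", {})
        history.append((name, args, result))     # observation fed back on the next turn
    return "Stopped: tool budget exhausted."      # fail closed and escalate to a human
```

Everything risky lives outside the model: the registry decides what can run, and the loop decides when to stop.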
The autonomy / reliability trade-off
Every increment of autonomy multiplies error modes. Start with a tool-augmented call, prove value, then promote to a loop. Don't build a multi-agent swarm before you've shipped a single-agent prototype.
8. Fine-tuning, LoRA & Adaptation
Fine-tuning is the last lever, not the first. It is justified when prompting + RAG plateau and you have repeatable, well-labeled examples of the desired behavior.
The adaptation ladder (cheap → expensive)
| Technique | What it changes | Cost | When |
|---|---|---|---|
| Better prompt + examples | Nothing in the model | $ | Always start here |
| RAG | Adds knowledge | $$ | Stale or proprietary facts |
| Tool use / agents | Adds capabilities | $$ | Need actions or fresh data |
| LoRA / QLoRA fine-tune | ~0.1–2% of weights | $$$ | Style, format, narrow domain |
| Full SFT | All weights | $$$$ | Large, distinctive datasets |
| DPO / RLAIF | Preference alignment | $$$$ | Subtle quality / safety tuning |
| Pre-training | Whole model | $$$$$ | Almost never — only labs |
LoRA in one picture
Figure 6 — LoRA adds a low-rank update on top of frozen weights. QLoRA = same idea on a 4-bit quantized base.
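Numerically, the figure boils down to a frozen weight matrix plus a scaled low-rank update. A tiny NumPy sketch of the forward pass, with made-up dimensions:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
alpha = 16                           # LoRA scaling hyperparameter

W = np.random.randn(d, d) * 0.02     # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01     # trainable: r x d
B = np.zeros((d, r))                 # trainable: d x r, zero-init so the update starts at 0

x = np.random.randn(d)
h = W @ x + (alpha / r) * (B @ (A @ x))   # base path + low-rank adapter path
```

Only A and B are trained, roughly 2·d·r parameters per adapted matrix instead of d², which is why adapters are cheap to store and swap at serving time.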
Data is the product
500–5,000 high-quality examples usually beat 100k noisy ones.
Hold out a real eval set before you start training — never sample it from training data.
De-duplicate aggressively. Near-duplicate examples make training metrics look better than they are without improving generalization (a quick dedup sketch follows this list).
Track schema versions and labeling guidelines like code.
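A cheap near-duplicate check I'd run before any training, sketched below: hash exact duplicates, then flag pairs whose word-shingle Jaccard similarity crosses a threshold. The 0.8 threshold is illustrative, and at larger scale you would swap the quadratic loop for MinHash/LSH.

```python
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def find_near_duplicates(examples: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    pairs: list[tuple[int, int]] = []
    # Exact duplicates first: identical content hashes.
    seen: dict[str, int] = {}
    for i, text in enumerate(examples):
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen:
            pairs.append((seen[h], i))
        seen.setdefault(h, i)
    # Near duplicates: shingle overlap (O(n^2), fine for a few thousand examples).
    sigs = [shingles(t) for t in examples]
    for i, j in combinations(range(len(examples)), 2):
        if jaccard(sigs[i], sigs[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```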
Default playbook
Use QLoRA on a strong open-weights base (e.g., Llama-3.x-70B or Qwen-2.x) with a small, curated dataset. Serve multiple adapters from one base via vLLM's multi-LoRA support.
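A hedged sketch of that playbook with the Hugging Face peft library; the model name, rank, and target modules below are placeholders to adjust for whatever base you actually pick, and you should check the peft/transformers docs current at the time you run it.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit (the "Q" in QLoRA); the base weights stay frozen.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",          # placeholder checkpoint, swap for your base
    quantization_config=bnb,
)

# Attach a small trainable adapter on the attention projections.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()       # typically well under 1% of total weights
```

Train the adapter with your usual SFT trainer; at serving time one base can host several adapters, as the playbook above notes.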
9. Evaluation & Observability
Without measurements you're guessing. I treat evals as non-optional: same role as regression tests, except the “function” is stochastic. They catch when a shiny prompt tweak helps three demos and quietly wrecks everything else.
Layered evaluation
| Layer | Scope | What to check |
|---|---|---|
| Unit | Component-level | Prompt regressions, retrieval recall@k, tool argument accuracy. Fast, deterministic-ish, run on every PR. |
| Integration | End-to-end | Full task scenarios. Mix of programmatic checks (does JSON parse?) and LLM-as-judge with rubrics. |
| Online | Production | User feedback (👍/👎, edits, retries), implicit signals (copy, dwell), and shadow-mode A/B tests. |
Building an eval set the right way
Mine real traffic. Sample 200–500 actual user inputs, stratified by intent and difficulty.
Label with rubrics, not vibes. Define 3–5 quality dimensions (correctness, completeness, tone, safety, citation).
Freeze a golden set. Version it. Never let it leak into training. A minimal harness sketch follows this list.
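A minimal shape for running that golden set, assuming each example is a JSON line with an input and a few programmatic checks; `generate()` stands in for whatever client or pipeline you actually call.

```python
import json

def generate(prompt: str) -> str:
    raise NotImplementedError   # your model/pipeline call goes here (assumed)

def run_golden_set(path: str) -> float:
    passed, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)          # {"input": ..., "must_contain": [...]}
            output = generate(example["input"])
            ok = True
            try:
                json.loads(output)               # does the output contract parse at all?
            except ValueError:
                ok = False
            ok = ok and all(s.lower() in output.lower()
                            for s in example.get("must_contain", []))
            passed += ok
            total += 1
    return passed / max(total, 1)

# score = run_golden_set("evals/golden_v3.jsonl")   # version the file like code
```

Programmatic checks like these are the fast layer; rubric-based LLM-as-judge scoring sits on top for the dimensions a substring test can't see.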
Log what you'd need to replay a bad answer: messages, retrieval hits, tool calls, intermediate notes if any, final output, token counts, cost, latency. The OpenTelemetry GenAI semantic conventions are a usable common shape; pipe traces into whatever you already run (Langfuse, Phoenix, Datadog, etc.).
Rule of thumb
If you can't reproduce a bad output from logs alone, your tracing is broken — fix it before adding features.
10. Deployment & Inference
Hosted vs. self-hosted decision
| Question | Lean hosted | Lean self-host |
|---|---|---|
| QPS / cost profile | Bursty, low-medium volume | Sustained, high volume (>~50 QPS) |
| Data sensitivity | Vendor BAA / region OK | Strict residency, air-gapped |
| Quality ceiling | Need frontier model | Open-weights suffices |
| Team | Small, no GPU ops | Has SRE / GPU expertise |
Self-hosting essentials
vLLM or TensorRT-LLM for throughput. SGLang for complex programs. Ollama / llama.cpp for dev/edge. A minimal vLLM sketch follows this list.
Continuous batching + paged KV cache are non-negotiable for throughput.
Quantization: FP8 (H100/H200) or INT4 (AWQ/GPTQ) for serving — measure quality loss on your evals, don't trust generic claims.
Speculative decoding + prefix caching can improve latency 2–3× for repeated-prefix workloads (agents, RAG).
Autoscaling by KV-cache pressure, not CPU. Warm pools for time-sensitive endpoints.
GPU choice: H100/H200 for general, B200 for frontier scale, MI300X for memory-heavy, L40S / RTX 6000 Ada for smaller models.
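To make the throughput-engine point concrete, a minimal offline-batch sketch with vLLM; the model name is a placeholder, and in production you would more likely run the OpenAI-compatible server (`vllm serve <model>`) behind your gateway.

```python
from vllm import LLM, SamplingParams

# Continuous batching and the paged KV cache come from the engine itself.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = [
    "Summarize the return policy for damaged items.",
    "Draft a one-line status update for ticket 4821.",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```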
Latency budget anatomy
Figure 7 — Where the milliseconds go in a typical request.
Always stream
Streaming the first token in <500 ms feels instant. A non-streamed 5-second wait feels broken — even if total time is identical.
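To make the streaming point measurable, a small sketch that times TTFT and total latency while flushing tokens as they arrive; `stream_tokens()` is a stand-in for your client's streaming iterator, not a real API.

```python
import sys
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    raise NotImplementedError   # your provider's streaming call goes here (assumed)

def answer_streaming(prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    for token in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()   # TTFT: the number the user actually feels
        sys.stdout.write(token)
        sys.stdout.flush()                          # show partial output immediately
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else total
    print(f"\n[ttft={ttft * 1000:.0f} ms, total={total * 1000:.0f} ms]")
```

Log both numbers per request; TTFT is what to alert on for interactive endpoints, total time is what drives cost.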
11. Safety, Security & Compliance
GenAI introduces a new threat model. The OWASP Top 10 for LLM Applications is the baseline; build defense-in-depth around it.
PR pipeline: run unit evals (fast subset) on every change to prompts/, pipelines/, or model config.
Nightly: full eval set across candidate models; publish a quality+cost+latency report.
Release gate: no regression on golden metrics; cost delta within budget; safety classifiers green. A pytest-style gate sketch follows this list.
Canary in prod: shadow-mode new prompt/model for 5–10% traffic; auto-rollback on metric breach.
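The release gate can be a plain pytest test. The sketch below assumes the golden-set harness from §9 lives at a module path of your choosing and that a baseline score is checked into the repo; the paths and the 2-point tolerance are illustrative.

```python
import json

from evals.harness import run_golden_set   # the harness sketched in section 9 (assumed module path)

BASELINE_FILE = "evals/baseline_score.json"   # updated only on deliberate releases
TOLERANCE = 0.02                               # allowed regression before the gate fails

def test_no_golden_set_regression():
    with open(BASELINE_FILE, encoding="utf-8") as f:
        baseline = json.load(f)["score"]
    current = run_golden_set("evals/golden_v3.jsonl")
    assert current >= baseline - TOLERANCE, (
        f"Golden-set score regressed: {current:.3f} vs baseline {baseline:.3f}"
    )
```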
14. Strategy & 90-Day Roadmap
Figure 9 — A defensible 90-day path from idea to GA.
Strategic principles
Start with the workflow, not the model. Map the human task end-to-end first.
Ship narrow, deepen later. A great single feature beats a mediocre platform.
Own your evals and your data. Models are rented; these are owned.
Design for model swaps. Quarterly upgrades are normal; avoid hard-coding one provider's quirks (a thin-gateway sketch follows this list).
Keep a human in the loop early. Use confidence + HITL to ship before fully autonomous.
Measure cost like latency. Both go in dashboards, alerts, and PR reviews.
Bet on open standards (MCP, OTel-GenAI, JSON Schema) over single-vendor SDKs.
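In practice, "design for model swaps" is usually just a thin interface that every feature calls instead of a vendor SDK. A minimal sketch; the names here are mine, not a known library.

```python
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system: str, user: str, max_tokens: int = 512) -> str: ...

class VendorAModel:
    """Adapter wrapping one provider's SDK; the only file that knows its quirks."""
    def complete(self, system: str, user: str, max_tokens: int = 512) -> str:
        raise NotImplementedError   # call the vendor SDK here

def draft_reply(model: ChatModel, ticket_text: str) -> str:
    # Features depend on the Protocol, so swapping vendors is a config change.
    return model.complete(system="You are a support engineer.", user=ticket_text)
```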
15. Common Pitfalls & Anti-Patterns
| Anti-pattern | Why it hurts |
|---|---|
| "Let's fine-tune!" | Skipping prompting, RAG, and evals. Fine-tuning a poorly understood task wastes weeks and bakes mistakes into weights. |
| Vibe-driven prompts | Iterating on prompts without an eval set. Improvements on three favorite examples often regress everything else. |
| Multi-agent swarm too early | Five "specialist" agents arguing in a loop. Cost explodes; success rate drops; debugging is impossible. |
| Trusting tool descriptions blindly | Untyped, ambiguous tools cause the model to hallucinate arguments. Tools are an API — design them like one. |
| No streaming, no cancel | A 7-second wait for a JSON blob feels broken. Always stream; always allow cancellation. |
| Stuffing the context window | "More context = better" is false past a point — performance drops in the middle (the "lost in the middle" effect). Retrieve precisely. |
| Single-vendor lock-in | Coding straight to one vendor's SDK shape. Put a thin gateway or interface in front so a model change doesn't fork your codebase. |
| Logging prompts but not outcomes | Without user feedback / outcomes tied to each call, you can never close the loop on quality. |
16. Glossary & Resources
Glossary (the terms you'll hear weekly)
| Term | Meaning |
|---|---|
| Token | Subword unit; ~4 chars of English. Billing and context limits are in tokens. |
| Context window | Max tokens the model can attend to in one call. |
| TTFT / TPOT | Time to first token / time per output token. Latency KPIs. |
| KV cache | Per-token key/value tensors cached during generation; defines the memory cost of long context. |
| Embedding | Vector representation of text/image used for similarity search. |
| RAG | Retrieval-Augmented Generation — fetch relevant docs and pass them to the model. |
| LoRA / QLoRA | Parameter-efficient fine-tuning that trains a low-rank adapter on top of frozen weights. |
| SFT / DPO / RLHF / RLAIF | Supervised fine-tuning, plus the preference-optimization and feedback-based methods used for alignment. |
| MoE | Mixture-of-Experts — only a subset of parameters is active per token. |
| Speculative decoding | A draft model proposes tokens, a large model verifies — 2–3× speedup. |
| MCP | Model Context Protocol — open standard for connecting tools/data to LLM clients. |
| Guardrails | Input/output filters enforcing safety, schema, and policy. |
| Eval | Automated test of model/system quality on a labeled dataset. |
| HITL | Human-in-the-loop — required approval/intervention for sensitive actions. |
Recommended reading paths
Foundations
"Attention Is All You Need" (Vaswani et al.)
The Illustrated Transformer (Alammar)
Karpathy's "Zero to Hero" series
Lilian Weng's blog (attention, agents, diffusion)
Systems & serving
vLLM & PagedAttention paper
FlashAttention 2/3 papers
The Llama / Mixtral technical reports
NVIDIA TensorRT-LLM docs
Practice
Anthropic engineering blog (agents, prompts, safety)
OpenAI cookbook
OWASP Top 10 for LLM Applications
NIST AI RMF & ISO/IEC 42001
One-page strategy summary
Models rent; process owns. I still start narrow—one workflow, one eval set, a frontier model to get the shape of the answers right—then add guardrails, tracing, and cost tricks (cache, cascade, smaller model on the hot path) once something real is in front of users. Swap-friendly layers matter because the name on the best checkpoint will change; your tickets and datasets shouldn't have to.
17. References
Figures and the narrative in §1–§16 are mine unless noted inline; primary papers and standards are listed below. Author: Linh Truong · LinhTruong.com.
Transformers, attention, and long-context mechanics
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017). arXiv:1706.03762. https://arxiv.org/abs/1706.03762
Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS (2022). arXiv:2205.14135.
Dao, Tri. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR (2024). arXiv:2307.08691.
Gu, Albert, and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752 (2023). (SSM / “Mamba-class” hybrids referenced in §2.)
Rombach, Robin, et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR (2022). arXiv:2112.10752. (Stable Diffusion family lineage.)
Retrieval-augmented generation and context effects
Lewis, Patrick, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS (2020). arXiv:2005.11401.
Liu, Nelson F., et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (2023). (Cited in §15 anti-pattern on stuffing context.)
Alignment, preference optimization, and adaptation
Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” NeurIPS (2022). arXiv:2203.02155. (InstructGPT / RLHF pipeline commonly referenced as “RLHF”.)
Rafailov, Rafael, et al. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS (2023). arXiv:2305.18290.
Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR (2022). arXiv:2106.09685.
Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS (2023). arXiv:2305.14314.
Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (2022). (Anthropic alignment lineage; useful alongside provider safety docs.)
Prompting, reasoning, tools, and agents
Wei, Jason, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS (2022). arXiv:2201.11903.
Yao, Shunyu, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR (2023). arXiv:2210.03629.
Schick, Timo, et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS (2023). arXiv:2302.04761.
Inference systems: throughput, paging, and speculation
Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP (2023). arXiv:2309.06180. (vLLM.)
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast Inference from Transformers via Speculative Decoding.” ICML (2023). arXiv:2211.17192.
Chen, Charlie, et al. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv:2302.01318 (2023). (Related speculative decoding line.)
Open protocols, telemetry, and integration
Anthropic et al. Model Context Protocol (MCP) — specification for connecting tools and data sources to model clients. https://modelcontextprotocol.io
ISO/IEC JTC 1/SC 42. ISO/IEC 42001 — Artificial intelligence management system (AIMS) standard (commercial standard; cite official ISO catalog entry for formal work).
Pedagogy and illustrated explainers (non-archival but standard entry points): see the reading paths in §16, i.e. Alammar's The Illustrated Transformer, Karpathy's "Zero to Hero" series, and Lilian Weng's blog.
Model families — cite technical reports for specific architectures
Touvron, Hugo, et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv:2302.13971 (2023); and follow-on Llama 2 / Llama 3 reports from Meta AI (retrieve the report matching the checkpoint you benchmark).
Mistral AI / partners. Mistral / Mixtral technical reports and blog releases — use the official PDF or post corresponding to the weight set under discussion.
OpenAI, Google DeepMind, Anthropic. System cards, model cards, and safety reports for GPT, Gemini, and Claude families — authoritative for capability, training, and usage policies; URLs change per release.