Generative AI — strategy & architecture I actually use
One long page on how LLM systems get built today: transformer and diffusion intuition, the stack as I sketch it (retrieval, tools, agents, adapters), and what breaks in production—evals, latency, bills, and guardrails. I use this when coaching teams and when I forget whether LoRA or RAG was the right fix last time.
Generative AI here means models that generate new content—text, code, images, audio, video, structured fields—conditioned on prompts, context, and tools. In practice I see most real stacks hang on three threads: a foundation model, retrieval and/or tools for facts and actions, and orchestration when the task is more than one step.
In production this is never "one API call." Each layer has its own failure modes, its own vendors, and its own line items in your observability tool.
Figure 2 — The 7-layer GenAI stack. Treat each as an independent, swappable component.
Models come and go; your harness should not
New checkpoints land all the time. What tends to stay useful is boring infrastructure: eval datasets, chunking and retrieval, the tool contracts, prompt versions, and how you learn from user thumbs-down. I plan interfaces so swapping a model is a config change, not a rewrite.
4. Choosing a Model
Decision tree
Figure 3 — Map task profile → model class. Names are representative of 2026 tiers.
Trade-off cheatsheet
| Dimension | Hosted frontier | Hosted small | Open-weights (self-host) |
|---|---|---|---|
| Quality on hard tasks | Best | Adequate | Catching up; varies |
| Cost per 1M tokens | $$$ | $ | ¢ at scale (need GPUs) |
| Latency | 100–500 ms TTFT | 50–200 ms | You control it |
| Privacy / data residency | Vendor dependent | Vendor dependent | Full control |
| Fine-tuning | Limited / managed | Limited | Full (LoRA, full FT) |
| Best for | Agents, reasoning, code | Bulk, latency-sensitive | Regulated, edge, niche domains |
Don't pick a model first
Pick the eval first. Without a way to measure quality on your task, model comparisons are vibes. See §9.
5. Prompting & Context Engineering
"Prompt engineering" has matured into context engineering: structuring the full input — system prompt, tools, examples, retrieved docs, output schema, and conversation history — so the model can succeed.
The anatomy of a production prompt
# 1. Role / persona — who the model is
SYSTEM: You are a senior support engineer at Acme. Be concise and cite sources.
# 2. Capabilities & constraints — what it can/can't do
- Use the search_kb tool before answering policy questions.
- Never invent SKUs. If unsure, say "I don't know."
# 3. Output contract — structured shape
Respond as JSON: {"answer": str, "citations": [str], "confidence": 0..1}
# 4. Few-shot examples — calibrate behavior
EXAMPLE 1: ...
EXAMPLE 2: ...
# 5. Retrieved context — fresh, task-specific
<documents>
  <doc id="kb-184">...</doc>
</documents>
# 6. User turn
USER: How do I return a damaged item?
Techniques that actually move the needle
| Tag | Technique | Why it helps |
|---|---|---|
| High ROI | Structured output | Force JSON Schema / tool-use mode. Stops the model from being chatty and lets downstream code parse safely. Use response_format or function calling (a validation sketch follows this table). |
| High ROI | Decomposition | Break complex tasks into a planner → workers → verifier pipeline. Each step has its own prompt and eval. |
| Medium ROI | Few-shot with hard cases | 2–6 examples that include the edge cases your model gets wrong. Don't waste shots on easy cases. |
| Medium ROI | XML / Markdown delimiters | Wrap variable inputs in tags (<document>, <user_query>) so the model can reliably refer to them and injected instructions are easier to contain. |
| Reasoning | Chain-of-thought / scratchpad | Ask the model to think before answering. Reasoning models just need the room; others need an explicit prompt. Hide the CoT from the user and return only the final answer. |
| Caching | Prompt caching | Put stable content (system prompt, tool defs, big docs) at the start and mark it cacheable. 5–10× cost reduction and roughly 2× latency improvement on repeated calls. |
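To show what "lets downstream code parse safely" looks like in practice, here is a minimal sketch that defines the output contract from the prompt-anatomy example as a Pydantic model and validates whatever the model returns. It assumes your provider's JSON/schema mode gets you most of the way; validation catches the rest. `call_model()` is a placeholder for your own client, not a real SDK call.

```python
from pydantic import BaseModel, Field, ValidationError

class SupportAnswer(BaseModel):
    answer: str
    citations: list[str]
    confidence: float = Field(ge=0.0, le=1.0)

def call_model(prompt: str) -> str:
    raise NotImplementedError  # your provider call, ideally with JSON/schema mode enabled

def answer_question(prompt: str, retries: int = 2) -> SupportAnswer:
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            return SupportAnswer.model_validate_json(raw)   # parse + schema-check in one step
        except ValidationError as err:
            # Feed the validation errors back so the model can repair its own output.
            prompt = f"{prompt}\n\nYour last reply was invalid: {err}. Return only valid JSON."
    raise RuntimeError("Model never produced a valid SupportAnswer")
```

The retry-with-errors loop is cheap insurance: most schema failures disappear after one repair turn, and anything that survives it should page a human rather than a parser.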
6. Retrieval-Augmented Generation (RAG)
RAG grounds a model in your data. It's the cheapest and safest way to make a generic LLM speak in your domain — and the most commonly misimplemented part of the stack.
Figure 4 — The full RAG pipeline. Most failures are in chunking, hybrid search, and re-ranking — not in the LLM call.
RAG best practices
Hybrid retrieval (semantic + keyword) beats pure vector by 10–30% on most enterprise corpora; a minimal fusion sketch follows this list.
Re-rank top-50 candidates down to top-5 with a cross-encoder (e.g., bge-reranker, Cohere Rerank).
Chunk semantically, not by character count. Respect headings, tables, and code blocks.
Attach metadata to every chunk: source, date, author, permissions — enables filtering and ACL enforcement.
Citations are non-negotiable. The user must be able to verify every claim.
Evaluate retrieval independently: recall@k and MRR before you blame the LLM.
Refresh strategy: incremental upserts, content hashes, and a way to delete stale docs.
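To make the hybrid-retrieval point concrete, here is a minimal reciprocal-rank-fusion (RRF) sketch that merges a keyword ranking and a vector ranking into one candidate list. It assumes you already have two ranked lists of document IDs from your own BM25 index and vector store; `k=60` is the conventional RRF constant, not a value specific to this article.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: top results from a keyword index and a vector index for one query.
keyword_hits = ["kb-184", "kb-042", "kb-311", "kb-077"]
vector_hits = ["kb-042", "kb-184", "kb-500", "kb-311"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused[:5])   # candidates to hand to the cross-encoder re-ranker
```

Keep the fusion step dumb and deterministic; pass its top-50 to the re-ranker and evaluate recall@k on the fused list, not on either retriever alone.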
When NOT to use RAG
If your data fits in the context window and is read end-to-end (e.g., a single contract), just pass it in with prompt caching. RAG adds engineering surface; don't introduce a vector DB for 50 documents.
7. Agents & Tool Use
An agent is an LLM in a loop with tools and memory. It plans, calls functions, observes results, and iterates until done. 2026's defining shift was making agents reliable enough for production.
Figure 5 — Agent reference architecture. The LLM is the policy; everything else is engineering.
Tool design rules
Return typed, structured results with error codes — not free-form strings.
Make destructive tools require confirmation or a separate "executor" agent gated by a human.
Use the Model Context Protocol (MCP) to ship tools as reusable servers across agents and IDEs.
Cap tool budget, time, and tokens per task. Always. A minimal loop sketch follows this list.
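The sketch below shows the shape of that budget-capped loop: the model proposes a tool call, the runtime executes it and feeds back a typed result, and a hard cap on steps ends the loop no matter what. `propose_action` and `TOOLS` are stand-ins for your own model client and tool registry, not a real library API.

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    ok: bool
    code: str        # machine-readable status code, not prose
    data: dict

def lookup_order(order_id: str) -> ToolResult:
    # Placeholder tool: a real one would call your order service.
    return ToolResult(ok=True, code="OK", data={"order_id": order_id, "status": "shipped"})

TOOLS = {"lookup_order": lookup_order}   # tool registry (assumed)
MAX_STEPS = 5                            # hard cap on tool calls per task

def run_agent(task: str, propose_action) -> str:
    """propose_action(task, history) -> ("tool", name, args) or ("answer", text).

    propose_action is your LLM call; it is assumed here, not a real SDK function.
    """
    history: list[tuple] = []
    for _ in range(MAX_STEPS):
        kind, *rest = propose_action(task, history)
        if kind == "answer":
            return rest[0]
        name, args = rest
        result = TOOLS[name](**args) if name in TOOLS else ToolResult(False, "UNKNOWN_TOOL", {})
        history.append((name, args, result))     # observation fed back on the next turn
    return "Stopped: tool budget exhausted."      # fail closed and escalate to a human
```

Everything risky lives outside the model: the registry decides what can run, and the loop decides when to stop.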
The autonomy / reliability trade-off
Every increment of autonomy multiplies error modes. Start with a tool-augmented call, prove value, then promote to a loop. Don't build a multi-agent swarm before you've shipped a single-agent prototype.
8. Fine-tuning, LoRA & Adaptation
Fine-tuning is the last lever, not the first. It is justified when prompting + RAG plateau and you have repeatable, well-labeled examples of the desired behavior.
The adaptation ladder (cheap → expensive)
| Technique | What it changes | Cost | When |
|---|---|---|---|
| Better prompt + examples | Nothing in the model | $ | Always start here |
| RAG | Adds knowledge | $$ | Stale or proprietary facts |
| Tool use / agents | Adds capabilities | $$ | Need actions or fresh data |
| LoRA / QLoRA fine-tune | ~0.1–2% of weights | $$$ | Style, format, narrow domain |
| Full SFT | All weights | $$$$ | Large, distinctive datasets |
| DPO / RLAIF | Preference alignment | $$$$ | Subtle quality / safety tuning |
| Pre-training | Whole model | $$$$$ | Almost never — only labs |
LoRA in one picture
Figure 6 — LoRA adds a low-rank update on top of frozen weights. QLoRA = same idea on a 4-bit quantized base.
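Numerically, the figure boils down to a frozen weight matrix plus a scaled low-rank update. A tiny NumPy sketch of the forward pass, with made-up dimensions:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
alpha = 16                           # LoRA scaling hyperparameter

W = np.random.randn(d, d) * 0.02     # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01     # trainable: r x d
B = np.zeros((d, r))                 # trainable: d x r, zero-init so the update starts at 0

x = np.random.randn(d)
h = W @ x + (alpha / r) * (B @ (A @ x))   # base path + low-rank adapter path
```

Only A and B are trained, roughly 2·d·r parameters per adapted matrix instead of d², which is why adapters are cheap to store and swap at serving time.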
Data is the product
500–5,000 high-quality examples usually beat 100k noisy ones.
Hold out a real eval set before you start training — never sample it from training data.
De-duplicate aggressively. Near-duplicate examples make training metrics look better than they are without improving generalization (a quick dedup sketch follows this list).
Track schema versions and labeling guidelines like code.
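A cheap near-duplicate check I'd run before any training, sketched below: hash exact duplicates, then flag pairs whose word-shingle Jaccard similarity crosses a threshold. The 0.8 threshold is illustrative, and at larger scale you would swap the quadratic loop for MinHash/LSH.

```python
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def find_near_duplicates(examples: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    pairs: list[tuple[int, int]] = []
    # Exact duplicates first: identical content hashes.
    seen: dict[str, int] = {}
    for i, text in enumerate(examples):
        h = hashlib.sha256(text.encode()).hexdigest()
        if h in seen:
            pairs.append((seen[h], i))
        seen.setdefault(h, i)
    # Near duplicates: shingle overlap (O(n^2), fine for a few thousand examples).
    sigs = [shingles(t) for t in examples]
    for i, j in combinations(range(len(examples)), 2):
        if jaccard(sigs[i], sigs[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```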
Default playbook
Use QLoRA on a strong open-weights base (e.g., Llama-3.x-70B or Qwen-2.x) with a small, curated dataset. Serve multiple adapters from one base via vLLM's multi-LoRA support.
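A hedged sketch of that playbook with the Hugging Face peft library; the model name, rank, and target modules below are placeholders to adjust for whatever base you actually pick, and you should check the peft/transformers docs current at the time you run it.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit (the "Q" in QLoRA); the base weights stay frozen.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",          # placeholder checkpoint, swap for your base
    quantization_config=bnb,
)

# Attach a small trainable adapter on the attention projections.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()       # typically well under 1% of total weights
```

Train the adapter with your usual SFT trainer; at serving time one base can host several adapters, as the playbook above notes.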
9. Evaluation & Observability
Without measurements you're guessing. I treat evals as non-optional: same role as regression tests, except the “function” is stochastic. They catch when a shiny prompt tweak helps three demos and quietly wrecks everything else.
Layered evaluation
| Layer | Scope | What to check |
|---|---|---|
| Unit | Component-level | Prompt regressions, retrieval recall@k, tool argument accuracy. Fast, deterministic-ish, run on every PR. |
| Integration | End-to-end | Full task scenarios. Mix of programmatic checks (does JSON parse?) and LLM-as-judge with rubrics. |
| Online | Production | User feedback (👍/👎, edits, retries), implicit signals (copy, dwell), and shadow-mode A/B tests. |
Building an eval set the right way
Mine real traffic. Sample 200–500 actual user inputs, stratified by intent and difficulty.
Label with rubrics, not vibes. Define 3–5 quality dimensions (correctness, completeness, tone, safety, citation).
Freeze a golden set. Version it. Never let it leak into training. A minimal harness sketch follows this list.
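A minimal shape for running that golden set, assuming each example is a JSON line with an input and a few programmatic checks; `generate()` stands in for whatever client or pipeline you actually call.

```python
import json

def generate(prompt: str) -> str:
    raise NotImplementedError   # your model/pipeline call goes here (assumed)

def run_golden_set(path: str) -> float:
    passed, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)          # {"input": ..., "must_contain": [...]}
            output = generate(example["input"])
            ok = True
            try:
                json.loads(output)               # does the output contract parse at all?
            except ValueError:
                ok = False
            ok = ok and all(s.lower() in output.lower()
                            for s in example.get("must_contain", []))
            passed += ok
            total += 1
    return passed / max(total, 1)

# score = run_golden_set("evals/golden_v3.jsonl")   # version the file like code
```

Programmatic checks like these are the fast layer; rubric-based LLM-as-judge scoring sits on top for the dimensions a substring test can't see.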
Log what you'd need to replay a bad answer: messages, retrieval hits, tool calls, intermediate notes if any, final output, token counts, cost, latency. The OpenTelemetry GenAI semantic conventions are a usable common shape; pipe traces into whatever you already run (Langfuse, Phoenix, Datadog, etc.).
Rule of thumb
If you can't reproduce a bad output from logs alone, your tracing is broken — fix it before adding features.
10. Deployment & Inference
Hosted vs. self-hosted decision
| Question | Lean hosted | Lean self-host |
|---|---|---|
| QPS / cost profile | Bursty, low-medium volume | Sustained, high volume (>~50 QPS) |
| Data sensitivity | Vendor BAA / region OK | Strict residency, air-gapped |
| Quality ceiling | Need frontier model | Open-weights suffices |
| Team | Small, no GPU ops | Has SRE / GPU expertise |
Self-hosting essentials
vLLM or TensorRT-LLM for throughput. SGLang for complex programs. Ollama / llama.cpp for dev/edge. A minimal vLLM sketch follows this list.
Continuous batching + paged KV cache are non-negotiable for throughput.
Quantization: FP8 (H100/H200) or INT4 (AWQ/GPTQ) for serving — measure quality loss on your evals, don't trust generic claims.
Speculative decoding + prefix caching can improve latency 2–3× for repeated-prefix workloads (agents, RAG).
Autoscaling by KV-cache pressure, not CPU. Warm pools for time-sensitive endpoints.
GPU choice: H100/H200 for general, B200 for frontier scale, MI300X for memory-heavy, L40S / RTX 6000 Ada for smaller models.
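To make the throughput-engine point concrete, a minimal offline-batch sketch with vLLM; the model name is a placeholder, and in production you would more likely run the OpenAI-compatible server (`vllm serve <model>`) behind your gateway.

```python
from vllm import LLM, SamplingParams

# Continuous batching and the paged KV cache come from the engine itself.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = [
    "Summarize the return policy for damaged items.",
    "Draft a one-line status update for ticket 4821.",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```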
Latency budget anatomy
Figure 7 — Where the milliseconds go in a typical request.
Always stream
Streaming the first token in <500 ms feels instant. A non-streamed 5-second wait feels broken — even if total time is identical.
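To make the streaming point measurable, a small sketch that times TTFT and total latency while flushing tokens as they arrive; `stream_tokens()` is a stand-in for your client's streaming iterator, not a real API.

```python
import sys
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    raise NotImplementedError   # your provider's streaming call goes here (assumed)

def answer_streaming(prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    for token in stream_tokens(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()   # TTFT: the number the user actually feels
        sys.stdout.write(token)
        sys.stdout.flush()                          # show partial output immediately
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else total
    print(f"\n[ttft={ttft * 1000:.0f} ms, total={total * 1000:.0f} ms]")
```

Log both numbers per request; TTFT is what to alert on for interactive endpoints, total time is what drives cost.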
11. Safety, Security & Compliance
GenAI introduces a new threat model. The OWASP Top 10 for LLM Applications is the baseline; build defense-in-depth around it.
PR pipeline: run unit evals (fast subset) on every change to prompts/, pipelines/, or model config.
Nightly: full eval set across candidate models; publish a quality+cost+latency report.
Release gate: no regression on golden metrics; cost delta within budget; safety classifiers green. A pytest-style gate sketch follows this list.
Canary in prod: shadow-mode new prompt/model for 5–10% traffic; auto-rollback on metric breach.
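The release gate can be a plain pytest test. The sketch below assumes the golden-set harness from §9 lives at a module path of your choosing and that a baseline score is checked into the repo; the paths and the 2-point tolerance are illustrative.

```python
import json

from evals.harness import run_golden_set   # the harness sketched in section 9 (assumed module path)

BASELINE_FILE = "evals/baseline_score.json"   # updated only on deliberate releases
TOLERANCE = 0.02                               # allowed regression before the gate fails

def test_no_golden_set_regression():
    with open(BASELINE_FILE, encoding="utf-8") as f:
        baseline = json.load(f)["score"]
    current = run_golden_set("evals/golden_v3.jsonl")
    assert current >= baseline - TOLERANCE, (
        f"Golden-set score regressed: {current:.3f} vs baseline {baseline:.3f}"
    )
```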
14. Strategy & 90-Day Roadmap
Figure 9 — A defensible 90-day path from idea to GA.
Strategic principles
Start with the workflow, not the model. Map the human task end-to-end first.
Ship narrow, deepen later. A great single feature beats a mediocre platform.
Own your evals and your data. Models are rented; these are owned.
Design for model swaps. Quarterly upgrades are normal; avoid hard-coding one provider's quirks (a thin-gateway sketch follows this list).
Keep a human in the loop early. Use confidence + HITL to ship before fully autonomous.
Measure cost like latency. Both go in dashboards, alerts, and PR reviews.
Bet on open standards (MCP, OTel-GenAI, JSON Schema) over single-vendor SDKs.
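In practice, "design for model swaps" is usually just a thin interface that every feature calls instead of a vendor SDK. A minimal sketch; the names here are mine, not a known library.

```python
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system: str, user: str, max_tokens: int = 512) -> str: ...

class VendorAModel:
    """Adapter wrapping one provider's SDK; the only file that knows its quirks."""
    def complete(self, system: str, user: str, max_tokens: int = 512) -> str:
        raise NotImplementedError   # call the vendor SDK here

def draft_reply(model: ChatModel, ticket_text: str) -> str:
    # Features depend on the Protocol, so swapping vendors is a config change.
    return model.complete(system="You are a support engineer.", user=ticket_text)
```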
15. Common Pitfalls & Anti-Patterns
| Anti-pattern | Why it hurts |
|---|---|
| "Let's fine-tune!" | Skipping prompting, RAG, and evals. Fine-tuning a poorly understood task wastes weeks and bakes mistakes into weights. |
| Vibe-driven prompts | Iterating on prompts without an eval set. Improvements on three favorite examples often regress everything else. |
| Multi-agent swarm too early | Five "specialist" agents arguing in a loop. Cost explodes; success rate drops; debugging is impossible. |
| Trusting tool descriptions blindly | Untyped, ambiguous tools cause the model to hallucinate arguments. Tools are an API — design them like one. |
| No streaming, no cancel | A 7-second wait for a JSON blob feels broken. Always stream; always allow cancellation. |
| Stuffing the context window | "More context = better" is false past a point — performance drops in the middle (the "lost in the middle" effect). Retrieve precisely. |
| Single-vendor lock-in | Coding straight to one vendor's SDK shape. Put a thin gateway or interface in front so a model change doesn't fork your codebase. |
| Logging prompts but not outcomes | Without user feedback / outcomes tied to each call, you can never close the loop on quality. |
16. Glossary & Resources
Glossary (the terms you'll hear weekly)
| Term | Meaning |
|---|---|
| Token | Subword unit; ~4 chars of English. Billing and context limits are in tokens. |
| Context window | Max tokens the model can attend to in one call. |
| TTFT / TPOT | Time to first token / time per output token. Latency KPIs. |
| KV cache | Per-token key/value tensors cached during generation; defines the memory cost of long context. |
| Embedding | Vector representation of text/image used for similarity search. |
| RAG | Retrieval-Augmented Generation — fetch relevant docs and pass them to the model. |
| LoRA / QLoRA | Parameter-efficient fine-tuning that trains a low-rank adapter on top of frozen weights. |
| SFT / DPO / RLHF / RLAIF | Supervised fine-tuning, plus the preference-optimization and feedback-based methods used for alignment. |
| MoE | Mixture-of-Experts — only a subset of parameters is active per token. |
| Speculative decoding | A draft model proposes tokens, a large model verifies — 2–3× speedup. |
| MCP | Model Context Protocol — open standard for connecting tools/data to LLM clients. |
| Guardrails | Input/output filters enforcing safety, schema, and policy. |
| Eval | Automated test of model/system quality on a labeled dataset. |
| HITL | Human-in-the-loop — required approval/intervention for sensitive actions. |
Recommended reading paths
Foundations
"Attention Is All You Need" (Vaswani et al.)
The Illustrated Transformer (Alammar)
Karpathy's "Zero to Hero" series
Lilian Weng's blog (attention, agents, diffusion)
Systems & serving
vLLM & PagedAttention paper
FlashAttention 2/3 papers
The Llama / Mixtral technical reports
NVIDIA TensorRT-LLM docs
Practice
Anthropic engineering blog (agents, prompts, safety)
OpenAI cookbook
OWASP Top 10 for LLM Applications
NIST AI RMF & ISO/IEC 42001
One-page strategy summary
Models rent; process owns. I still start narrow—one workflow, one eval set, a frontier model to get the shape of the answers right—then add guardrails, tracing, and cost tricks (cache, cascade, smaller model on the hot path) once something real is in front of users. Swap-friendly layers matter because the name on the best checkpoint will change; your tickets and datasets shouldn't have to.
17. References
Figures and the narrative in §1–§16 are mine unless noted inline; primary papers and standards are listed below. Author: Linh Truong · LinhTruong.com.
Transformers, attention, and long-context mechanics
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017). arXiv:1706.03762. https://arxiv.org/abs/1706.03762
Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS (2022). arXiv:2205.14135.
Dao, Tri. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR (2024). arXiv:2307.08691.
Gu, Albert, and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752 (2023). (SSM / “Mamba-class” hybrids referenced in §2.)
Rombach, Robin, et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR (2022). arXiv:2112.10752. (Stable Diffusion family lineage.)
Retrieval-augmented generation and context effects
Lewis, Patrick, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS (2020). arXiv:2005.11401.
Liu, Nelson F., et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (2023). (Cited in §15 anti-pattern on stuffing context.)
Alignment, preference optimization, and adaptation
Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” NeurIPS (2022). arXiv:2203.02155. (InstructGPT / RLHF pipeline commonly referenced as “RLHF”.)
Rafailov, Rafael, et al. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS (2023). arXiv:2305.18290.
Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR (2022). arXiv:2106.09685.
Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS (2023). arXiv:2305.14314.
Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (2022). (Anthropic alignment lineage; useful alongside provider safety docs.)
Prompting, reasoning, tools, and agents
Wei, Jason, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS (2022). arXiv:2201.11903.
Yao, Shunyu, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR (2023). arXiv:2210.03629.
Schick, Timo, et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS (2023). arXiv:2302.04761.
Inference systems: throughput, paging, and speculation
Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP (2023). arXiv:2309.06180. (vLLM.)
Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast Inference from Transformers via Speculative Decoding.” ICML (2023). arXiv:2211.17192.
Chen, Charlie, et al. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv:2302.01318 (2023). (Related speculative decoding line.)
Open protocols, telemetry, and integration
Anthropic et al. Model Context Protocol (MCP) — specification for connecting tools and data sources to model clients. https://modelcontextprotocol.io
ISO/IEC JTC 1/SC 42. ISO/IEC 42001 — Artificial intelligence management system (AIMS) standard (commercial standard; cite official ISO catalog entry for formal work).
Pedagogy and illustrated explainers (non-archival but standard entry points): see the reading paths in §16, i.e. Alammar's The Illustrated Transformer, Karpathy's "Zero to Hero" series, and Lilian Weng's blog.
Model families — cite technical reports for specific architectures
Touvron, Hugo, et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv:2302.13971 (2023); and follow-on Llama 2 / Llama 3 reports from Meta AI (retrieve the report matching the checkpoint you benchmark).
Mistral AI / partners. Mistral / Mixtral technical reports and blog releases — use the official PDF or post corresponding to the weight set under discussion.
OpenAI, Google DeepMind, Anthropic. System cards, model cards, and safety reports for GPT, Gemini, and Claude families — authoritative for capability, training, and usage policies; URLs change per release.