Developer notes · May 2026

Generative AI — strategy & architecture I actually use

One long page on how LLM systems get built today: transformer and diffusion intuition, the stack as I sketch it (retrieval, tools, agents, adapters), and what breaks in production—evals, latency, bills, and guardrails. I use this when coaching teams and when I forget whether LoRA or RAG was the right fix last time.

Source: please link to LinhTruong.com if you redistribute or adapt this file—not stale exports with the author line stripped.

Last revised: May 2026 For: engineers shipping LLM-backed products
LLMsMultimodalRAG AgentsFine-tuningEvals ServingSafetyCost

1. The Generative AI Landscape (2026)

Generative AI here means models that generate new content—text, code, images, audio, video, structured fields—conditioned on prompts, context, and tools. In practice I see most real stacks hang on three threads: a foundation model, retrieval and/or tools for facts and actions, and orchestration when the task is more than one step.

~1M+
Effective context window (frontier)
10–100×
Inference cost drop since 2023
<200ms
Time-to-first-token (optimized)
70%+
Enterprise apps using RAG

Capability matrix

Text

LLMs

Reasoning, code, summarization, extraction, classification, translation.

Image

Diffusion

Text-to-image, editing, in/outpainting, ControlNet conditioning.

Audio

Speech / Music

TTS, STT, voice cloning, music generation, real-time conversational APIs.

Video / 3D

Sora-class

Long-form video, 3D scene synthesis, NeRF/Gaussian splats.

2. Foundations: Transformers & Diffusion

The Transformer in one diagram

Tokens + Pos. Embedding Transformer Block × N (e.g., 32–120 layers) Multi-Head Self-Attention Q · Kᵀ / √dₖ → softmax → · V RoPE · GQA · FlashAttention Add & RMSNorm residual FFN / MoE SwiGLU · GeGLU Mixture-of-Experts routing Add & RMSNorm residual KV Cache paged · prefix-shared · quantized enables fast autoregressive decoding Logits → sampler Sampling temperature · top-p min-p · spec. decoding
Figure 1 — Decoder-only Transformer block, the architecture behind modern LLMs (GPT, Claude, Llama, Gemini).

Attention, diffusion — what to remember

Self-attention

Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

Cost is O(n²·d) in sequence length — the reason long context is hard. FlashAttention, GQA, and sliding-window attention bring it down in practice.

Diffusion (forward / reverse)

Forward: gradually add Gaussian noise x_t = √ᾱ_t·x₀ + √(1-ᾱ_t)·ε. Reverse: a U-Net/DiT predicts the noise. Latent diffusion runs in a compressed VAE latent space — the core trick behind Stable Diffusion / Sora.

What changed since 2023

3. The GenAI stack (as layers)

In production this is never “one API call.” Each layer has its own failure modes, vendors, and whatever your observability tool ends up labeling.

User Experience Chat UI · streaming · citations · structured outputs · voice · canvas/artifacts Application & Agent Orchestration Workflow graph · planner · tool router · memory · guardrails (LangGraph, Agents SDK, MCP) Retrieval & Context Chunking · embeddings · vector + BM25 hybrid · re-ranker · context cache Model Gateway Routing · fallback · caching · rate-limit · cost ledger · PII redaction (LiteLLM, Portkey, custom) Foundation Models Claude · GPT · Gemini · Llama · Mistral · open-weights · fine-tuned adapters (LoRA) Infrastructure & Serving vLLM · TensorRT-LLM · TGI · GPU pools · KV cache · autoscaling · GPU choice (H100/H200/B200/MI300) Evals, Safety & Observability Offline + online evals · tracing · red-team · drift · cost telemetry · feedback loops
Figure 2 — The 7-layer GenAI stack. Treat each as an independent, swappable component.

Models come and go; your harness should not

New checkpoints land all the time. What tends to stay useful is boring infrastructure: eval datasets, chunking and retrieval, the tool contracts, prompt versions, and how you learn from user thumbs-down. I plan interfaces so swapping a model is a config change, not a rewrite.

4. Choosing a Model

Decision tree

Start: what's the task? latency · accuracy · privacy · cost Frontier reasoning / agents code, multi-step, long context Bulk / classification extraction, tagging, summaries On-device / private offline, regulated, edge Multimodal generation image · video · audio · 3D Claude Opus 4.7 / Sonnet 4.6 GPT-5 / o-series Gemini 2.x Pro Claude Haiku 4.5 GPT-5-mini · Gemini Flash Llama 3.x / Mistral Small Llama 3.x-8B / 70B Phi-4 · Qwen · Gemma Quantized (4–8 bit) via Ollama Flux · SDXL · Imagen Veo / Sora · Runway ElevenLabs · Suno Always benchmark 2–3 candidates on your own eval set before committing.
Figure 3 — Map task profile → model class. Names are representative of 2026 tiers.

Trade-off cheatsheet

DimensionHosted frontierHosted smallOpen-weights (self-host)
Quality on hard tasksBestAdequateCatching up; varies
Cost per 1M tokens$$$$¢ at scale (need GPUs)
Latency100–500ms TTFT50–200msYou control it
Privacy / data residencyVendor dependentVendor dependentFull control
Fine-tuningLimited / managedLimitedFull (LoRA, full FT)
Best forAgents, reasoning, codeBulk, latency-sensitiveRegulated, edge, niche domains

Don't pick a model first

Pick the eval first. Without a way to measure quality on your task, model comparisons are vibes. See §9.

5. Prompting & Context Engineering

"Prompt engineering" has matured into context engineering: structuring the full input — system prompt, tools, examples, retrieved docs, output schema, and conversation history — so the model can succeed.

The anatomy of a production prompt

# 1. Role / persona — who the model is
SYSTEM: You are a senior support engineer at Acme. Be concise and cite sources.

# 2. Capabilities & constraints — what it can/can't do
- Use the search_kb tool before answering policy questions.
- Never invent SKUs. If unsure, say "I don't know."

# 3. Output contract — structured shape
Respond as JSON: {"answer": str, "citations": [str], "confidence": 0..1}

# 4. Few-shot examples — calibrate behavior
EXAMPLE 1: ...
EXAMPLE 2: ...

# 5. Retrieved context — fresh, task-specific
<documents>
  <doc id="kb-184">...</doc>
</documents>

# 6. User turn
USER: How do I return a damaged item?

Techniques that actually move the needle

High ROI

Structured output

Force JSON Schema / tool-use mode. Stops the model from being chatty and lets downstream code parse safely. Use response_format or function-calling.

High ROI

Decomposition

Break complex tasks into a planner → workers → verifier pipeline. Each step has its own prompt and eval.

Medium ROI

Few-shot with hard cases

2–6 examples that include the edge cases your model gets wrong. Don't waste shots on easy cases.

Medium ROI

XML / Markdown delimiters

Wrap variable inputs in tags (<document>, <user_query>) so the model can reliably refer to them and you can prevent prompt injection.

Reasoning

Chain-of-thought / scratchpad

Ask the model to think before answering. For reasoning models, just give them the room; for others, prompt explicitly. Hide CoT from the user, return only the final answer.

Caching

Prompt caching

Put stable content (system prompt, tool defs, big docs) at the start and mark it cacheable. 5–10× cost reduction and 2× latency improvement on repeated calls.

Output contracts (must-have)

// JSON Schema enforced at the API layer
const schema = {
  "type": "object",
  "properties": {
    "intent": { "enum": ["refund", "status", "other"] },
    "order_id": { "type": "string", "pattern": "^ORD-\\d{6}$" },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
  },
  "required": ["intent", "confidence"],
  "additionalProperties": false
};

6. Retrieval-Augmented Generation (RAG)

RAG grounds a model in your data. It's the cheapest and safest way to make a generic LLM speak in your domain — and the most commonly misimplemented part of the stack.

A. Ingestion (offline / batch) Source docs PDFs · web · DBs · APIs Parse / clean layout · OCR · tables Chunk semantic · overlap · metadata Embed cohere · openai · bge Index → Vector DB + BM25 pgvector · pinecone · qdrant · weaviate B. Query time (online) User query + history · filters Query rewrite HyDE · multi-query · decomp Hybrid retrieve vector + BM25 + metadata Re-rank cross-encoder · top-k → top-n Context pack & cite dedupe · token budget · IDs C. Generate & verify LLM call prompt + retrieved docs Citation check does answer match sources? Guardrails PII · policy · refusal Stream → user log: query, docs, answer, feedback index built once, queried millions of times
Figure 4 — The full RAG pipeline. Most failures are in chunking, hybrid search, and re-ranking — not in the LLM call.

RAG best practices

When NOT to use RAG

If your data fits in the context window and is read end-to-end (e.g., a single contract), just pass it in with prompt caching. RAG adds engineering surface; don't introduce a vector DB for 50 documents.

7. Agents & Tool Use

An agent is an LLM in a loop with tools and memory. It plans, calls functions, observes results, and iterates until done. 2026's defining shift was making agents reliable enough for production.

LLM plan · decide tool call · stop Tools search · code · DB · API · MCP Memory short · long · episodic Planner ReAct · Plan-Execute · graph Guardrails budgets · policies · HITL Environment sandbox · filesystem · browser Observation tool output · errors · state Critic / Verifier self-check · test runs User / HITL approve · clarify
Figure 5 — Agent reference architecture. The LLM is the policy; everything else is engineering.

Patterns from simple to complex

PatternUse whenRisk
Tool-augmented call (one shot, 1–3 tools)Lookups, calculations, simple actionsLow
ReAct loop (think → act → observe, repeat)Research, debugging, Q&A over dataMedium — needs step cap
Plan-then-execute (planner → workers)Multi-step tasks with clear sub-goalsMedium
Graph / state machine (LangGraph, Agents SDK)Workflows with conditionals and retriesMedium — high engineering cost
Multi-agent (specialists + orchestrator)Truly orthogonal sub-tasks; ablate firstHigh — coordination overhead, cost balloon
Computer use (browser / desktop control)Apps without APIsHigh — sandbox + permission gating mandatory

Tool design (the biggest agent-quality lever)

The autonomy / reliability trade-off

Every increment of autonomy multiplies error modes. Start with a tool-augmented call, prove value, then promote to a loop. Don't build a multi-agent swarm before you've shipped a single-agent prototype.

8. Fine-tuning, LoRA & Adaptation

Fine-tuning is the last lever, not the first. It is justified when prompting + RAG plateau and you have repeatable, well-labeled examples of the desired behavior.

The adaptation ladder (cheap → expensive)

TechniqueWhat it changesCostWhen
Better prompt + examplesNothing in the model$Always start here
RAGAdds knowledge$$Stale or proprietary facts
Tool use / agentsAdds capabilities$$Need actions or fresh data
LoRA / QLoRA fine-tune~0.1–2% of weights$$$Style, format, narrow domain
Full SFTAll weights$$$$Large, distinctive datasets
DPO / RLAIFPreference alignment$$$$Subtle quality / safety tuning
Pre-trainingWhole model$$$$$Almost never — only labs

LoRA in one picture

Frozen W d × d (huge) unchanged + A d × r (r=8–64) × B r × d = Effective W' trained A·B is ~0.1–2% of params Multiple LoRA adapters can be hot-swapped at serve time, one base model.
Figure 6 — LoRA adds a low-rank update on top of frozen weights. QLoRA = same idea on a 4-bit quantized base.

Data is the product

Default playbook

Use QLoRA on a strong open-weights base (e.g., Llama-3.x-70B or Qwen-2.x) with a small, curated dataset. Serve multiple adapters from one base via vLLM's multi-LoRA support.

9. Evaluation & Observability

Without measurements you're guessing. I treat evals as non-optional: same role as regression tests, except the “function” is stochastic. They catch when a shiny prompt tweak helps three demos and quietly wrecks everything else.

Layered evaluation

Unit

Component-level

Prompt regressions, retrieval recall@k, tool argument accuracy. Fast, deterministic-ish, run on every PR.

Integration

End-to-end

Full task scenarios. Mix of programmatic checks (does JSON parse?) and LLM-as-judge with rubrics.

Online

Production

User feedback (👍/👎, edits, retries), implicit signals (copy, dwell), and shadow-mode A/B tests.

Building an eval set the right way

  1. Mine real traffic. Sample 200–500 actual user inputs, stratified by intent and difficulty.
  2. Label with rubrics, not vibes. Define 3–5 quality dimensions (correctness, completeness, tone, safety, citation).
  3. Freeze a golden set. Version it. Never let it leak into training.
  4. Add adversarial cases. Jailbreaks, ambiguous queries, edge tools, multilingual.
  5. Automate scoring with a strong judge model + spot human checks for calibration.
  6. Track over time. Run on every prompt/model change; gate releases on quality + cost + latency.

Metrics that matter

LayerMetricTooling
Retrievalrecall@k, MRR, hit-rateRagas, custom
Generation qualityFaithfulness, answer-correctness, helpfulnessLLM-as-judge, Ragas, Promptfoo
AgentsTask success, steps-to-success, tool-arg precisionLangSmith, Braintrust, custom
SafetyToxicity, PII leak, jailbreak rateLlama Guard, Granite Guardian, custom red-team
Operationalp50/p95 latency, TTFT, tokens/sec, cost/reqOpenTelemetry, Langfuse, Phoenix
BusinessCSAT, deflection rate, conversion upliftProduct analytics

Tracing every call

Log what you'd need to replay a bad answer: messages, retrieval hits, tool calls, intermediate notes if any, final output, token counts, cost, latency. The OpenTelemetry GenAI semantic conventions are a usable common shape; pipe traces into whatever you already run (Langfuse, Phoenix, Datadog, etc.).

Rule of thumb

If you can't reproduce a bad output from logs alone, your tracing is broken — fix it before adding features.

10. Deployment & Inference

Hosted vs. self-hosted decision

QuestionLean hostedLean self-host
QPS / cost profileBursty, low-medium volumeSustained, high volume (>~50 QPS)
Data sensitivityVendor BAA / region OKStrict residency, air-gapped
Quality ceilingNeed frontier modelOpen-weights suffices
TeamSmall, no GPU opsHas SRE / GPU expertise

Self-hosting essentials

Latency budget anatomy

Network in / auth ~20–60 ms Retrieval + re-rank ~50–250 ms LLM prefill (input tokens) ~50–500 ms — biggest variable Decode (output tokens) tokens × ~10–30 ms/token Post / stream ~10–40 ms Stream first token ASAP (TTFT). Use shorter prompts & prompt caching to shrink prefill.
Figure 7 — Where the milliseconds go in a typical request.

Always stream

Streaming the first token in <500 ms feels instant. A non-streamed 5-second wait feels broken — even if total time is identical.

11. Safety, Security & Compliance

GenAI introduces a new threat model. The OWASP Top 10 for LLM Applications is the baseline; build defense-in-depth around it.

Top risks & mitigations

RiskWhat it looks likeMitigation
Prompt injectionUntrusted content overrides instructionsTrust boundaries, structured I/O, content tagging, allow-list tools, output filters
Data exfiltration via toolsAgent leaks secrets through browser/email toolPer-tool data classification, egress allow-list, redaction, HITL for sensitive ops
HallucinationConfident, wrong outputRAG with citations, faithfulness eval, refusal calibration, abstain options
Jailbreak / policy violationUnsafe content slipping throughSafety classifier (Llama Guard / Granite Guardian), prompt hardening, red-team
Sensitive data in promptsPII in logs and provider historyRedaction at gateway, regional routing, BAA / DPA, zero-retention modes
Model / supply-chainTampered open-weights, dependency CVEsSigned model artifacts, SBOM, model registry, scan tool descriptions
Cost / DoSToken bomb, infinite loopsPer-user budgets, max-steps, timeout, abuse detection
Copyright / IPGenerated content reproduces training dataOutput filters, indemnity terms, attribution, retrieval-grounding

The trust-boundary diagram

UNTRUSTED user input · web pages · emails PDFs · tool results from internet SANITIZE / TAG wrap in <untrusted>…</untrusted> strip secrets · classify · scan TRUSTED CORE system prompt · tool registry keys · policy · approved data Rule: instructions in untrusted content are data, never policy.
Figure 8 — Always separate untrusted content from instructions.

Compliance checklist

12. Economics & Cost Engineering

GenAI is the first system where unit costs scale with quality of thinking, not just request count. A careless feature can 100× a bill overnight.

Cost levers, ranked

LeverTypical savingsNotes
Prompt / context caching40–90% on stable prefixesBest ROI; put stable content first
Model cascading (small → big)30–70%Cheap model handles easy cases; escalate only when needed
Shorter outputs (max_tokens, schemas)20–50%Decode tokens cost more than input
Batch APIs (async)~50% off listFor non-interactive workloads
Tighter retrieval (fewer / smaller chunks)20–40%Re-ranking + dedup
Fine-tuned small model for hot path50–90%Only after you've proven the task with a frontier model
Caching final answers (semantic key)highly variableFor repeat queries; watch staleness

The unit-economics worksheet

# Per request
input_tokens   = 3,500   # system + retrieved docs + user
output_tokens  = 450
cache_hit_rate = 0.75

cost_in  = (1 - cache_hit_rate) * input_tokens * PRICE_IN
        + cache_hit_rate * input_tokens * PRICE_CACHE_READ
cost_out =      output_tokens * PRICE_OUT
req_cost = cost_in + cost_out

# Margin sanity
gross_margin = (revenue_per_request - req_cost) / revenue_per_request
# Target >70% before scaling marketing.

Instrument cost per user, feature, and model from the first real traffic

Aggregate-only cost dashboards hide which feature is bleeding money. Tag every call.

13. Developer Workflow & Tooling

Reference toolbox (pick one per slot)

LayerOptions
App frameworkLangChain, LlamaIndex, Pydantic AI, Vercel AI SDK, Mastra, plain SDK
Agent / graphLangGraph, OpenAI Agents SDK, CrewAI, Claude Agent SDK, smolagents
Vector storepgvector, Qdrant, Pinecone, Weaviate, Vespa, OpenSearch, Turbopuffer
EmbeddingsOpenAI text-embedding-3, Cohere v3, BGE-M3, Voyage, Nomic
Re-rankersCohere Rerank, BGE Reranker, ColBERT, Jina
GatewayLiteLLM, Portkey, Helicone, Cloudflare AI Gateway
ServingvLLM, TGI, TensorRT-LLM, SGLang, Ollama, llama.cpp
Evals / tracingLangfuse, Phoenix, Braintrust, LangSmith, Promptfoo, Ragas
GuardrailsLlama Guard 3, Granite Guardian, NeMo Guardrails, Guardrails AI
Coding assistantClaude Code, Cursor, Windsurf, GitHub Copilot, Cline
Tool protocolModel Context Protocol (MCP), OpenAPI, gRPC

Reference repo layout

app/
  prompts/                  # versioned, reviewable
    customer_support.v3.md
  schemas/                  # JSON Schemas for tool I/O and outputs
  tools/                    # typed wrappers — one file per tool
    search_kb.ts
    create_ticket.ts
  pipelines/
    rag.ts                  # retrieval pipeline
    support_agent.ts        # orchestration graph
  gateway/                  # model routing, fallback, cost ledger
  guardrails/               # input/output filters
evals/
  datasets/                 # golden sets, frozen versions
  judges/                   # rubric prompts
  run.ts                    # CI entrypoint
ops/
  dashboards/               # grafana, langfuse
  alerts.yml
infra/
  vllm.yaml                 # if self-hosting
  terraform/
docs/
  model-cards/
  threat-model.md

CI for prompts & models

  1. Pre-commit: lint prompts, validate JSON Schemas, regenerate types.
  2. PR pipeline: run unit evals (fast subset) on every change to prompts/, pipelines/, or model config.
  3. Nightly: full eval set across candidate models; publish a quality+cost+latency report.
  4. Release gate: no regression on golden metrics; cost delta within budget; safety classifiers green.
  5. Canary in prod: shadow-mode new prompt/model for 5–10% traffic; auto-rollback on metric breach.

14. Strategy & 90-Day Roadmap

Day 0 Day 30 Day 60 Day 90 Phase 1 · Validate (0–30) • Pick ONE concrete use case + KPI • Build minimal eval set (50–100 cases) • Prototype with hosted frontier model • Add basic RAG if data-bound • Ship to ~10 internal users • Measure: quality, latency, cost, CSAT Gate to Phase 2: KPI hit + clear demand Phase 2 · Harden (30–60) • Expand eval set + adversarial cases • Add gateway: routing, caching, fallback • Guardrails: PII, jailbreak, schema • Tracing + cost ledger per user/feature • Beta to broader cohort, A/B vs baseline • Optimize: cascade, caching, output shape Gate: SLOs hit, unit economics >70% margin Phase 3 · Scale (60–90) • GA launch with rollout controls • Consider fine-tune for hot path • Multi-model routing + model swap drills • Feedback loops: 👍/👎 → dataset • Quarterly model refresh playbook • Plan agentic / multimodal expansion Outcome: swap models without a rewrite
Figure 9 — A defensible 90-day path from idea to GA.

Strategic principles

15. Common Pitfalls & Anti-Patterns

Anti-pattern

"Let's fine-tune!"

Skipping prompting, RAG, and evals. Fine-tuning a poorly understood task wastes weeks and bakes mistakes into weights.

Anti-pattern

Vibe-driven prompts

Iterating on prompts without an eval set. Improvements on three favorite examples often regress everything else.

Anti-pattern

Multi-agent swarm too early

Five "specialist" agents arguing in a loop. Cost explodes; success rate drops; debugging is impossible.

Anti-pattern

Trusting tool descriptions blindly

Untyped, ambiguous tools cause the model to hallucinate arguments. Tools are an API — design them like one.

Anti-pattern

No streaming, no cancel

A 7-second wait for a JSON blob feels broken. Always stream; always allow cancellation.

Anti-pattern

Stuffing the context window

"More context = better" is false past a point — performance drops in the middle (the "lost in the middle" effect). Retrieve precisely.

Anti-pattern

Single-vendor lock-in

Coding straight to one vendor's SDK shape. Put a thin gateway or interface in front so a model change doesn't fork your codebase.

Anti-pattern

Logging prompts but not outcomes

Without user feedback / outcomes tied to each call, you can never close the loop on quality.

16. Glossary & Resources

Glossary (the terms you'll hear weekly)

TermMeaning
TokenSubword unit; ~4 chars of English. Billing and context limits are in tokens.
Context windowMax tokens the model can attend to in one call.
TTFT / TPOTTime to first token / time per output token. Latency KPIs.
KV cachePer-token key/value tensors cached during generation; defines memory cost of long context.
EmbeddingVector representation of text/image used for similarity search.
RAGRetrieval-Augmented Generation — fetch relevant docs and pass them to the model.
LoRA / QLoRAParameter-efficient fine-tuning that trains a low-rank adapter on top of frozen weights.
SFT / DPO / RLHF / RLAIFSupervised fine-tuning; preference optimization variants for alignment.
MoEMixture-of-Experts — only a subset of parameters active per token.
Speculative decodingA draft model proposes tokens, a large model verifies — 2–3× speedup.
MCPModel Context Protocol — open standard for connecting tools/data to LLM clients.
GuardrailsInput/output filters enforcing safety, schema, and policy.
EvalAutomated test of model/system quality on a labeled dataset.
HITLHuman-in-the-loop — required approval/intervention for sensitive actions.

Recommended reading paths

Foundations

  • "Attention Is All You Need" (Vaswani et al.)
  • The Illustrated Transformer (Alammar)
  • Karpathy's "Zero to Hero" series
  • Lilian Weng's blog (attention, agents, diffusion)

Systems & serving

  • vLLM & PagedAttention paper
  • FlashAttention 2/3 papers
  • The Llama / Mixtral technical reports
  • NVIDIA TensorRT-LLM docs

Practice

  • Anthropic engineering blog (agents, prompts, safety)
  • OpenAI cookbook
  • OWASP Top 10 for LLM Applications
  • NIST AI RMF & ISO/IEC 42001

One-page strategy summary

Models rent; process owns. I still start narrow—one workflow, one eval set, a frontier model to get the shape of the answers right—then add guardrails, tracing, and cost tricks (cache, cascade, smaller model on the hot path) once something real is in front of users. Swap-friendly layers matter because the name on the best checkpoint will change; your tickets and datasets shouldn't have to.

17. References

Figures and the narrative in §1–§16 are mine unless noted inline; primary papers and standards are listed below. Author: Linh Truong · LinhTruong.com.

Transformers, attention, and long-context mechanics

  1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017). arXiv:1706.03762. https://arxiv.org/abs/1706.03762
  2. Dao, Tri, et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS (2022). arXiv:2205.14135.
  3. Dao, Tri. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” ICLR (2024). arXiv:2307.08691.
  4. Gu, Albert, and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752 (2023). (SSM / “Mamba-class” hybrids referenced in §2.)

Diffusion, latent diffusion, and generative media

  1. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models.” NeurIPS (2020). arXiv:2006.11239.
  2. Rombach, Robin, et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR (2022). arXiv:2112.10752. (Stable Diffusion family lineage.)

Retrieval-augmented generation and context effects

  1. Lewis, Patrick, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS (2020). arXiv:2005.11401.
  2. Liu, Nelson F., et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (2023). (Cited in §15 anti-pattern on stuffing context.)

Alignment, preference optimization, and adaptation

  1. Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” NeurIPS (2022). arXiv:2203.02155. (InstructGPT / RLHF pipeline commonly referenced as “RLHF”.)
  2. Rafailov, Rafael, et al. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” NeurIPS (2023). arXiv:2305.18290.
  3. Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR (2022). arXiv:2106.09685.
  4. Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS (2023). arXiv:2305.14314.
  5. Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv:2212.08073 (2022). (Anthropic alignment lineage; useful alongside provider safety docs.)

Prompting, reasoning, tools, and agents

  1. Wei, Jason, et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS (2022). arXiv:2201.11903.
  2. Yao, Shunyu, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR (2023). arXiv:2210.03629.
  3. Schick, Timo, et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS (2023). arXiv:2302.04761.

Inference systems: throughput, paging, and speculation

  1. Kwon, Woosuk, et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP (2023). arXiv:2309.06185. (vLLM.)
  2. Leviathan, Yaniv, Matan Kalman, and Yossi Matias. “Fast Inference from Transformers via Speculative Decoding.” ICML (2023). arXiv:2211.17192.
  3. Chen, Charlie, et al. “Accelerating Large Language Model Decoding with Speculative Sampling.” arXiv:2302.01318 (2023). (Related speculative decoding line.)

Open protocols, telemetry, and integration

  1. Anthropic et al. Model Context Protocol (MCP) — specification for connecting tools and data sources to model clients. https://modelcontextprotocol.io
  2. OpenTelemetry Project. Semantic Conventions for Generative AI — tracing and observability attributes for LLM calls (evolving; check current release). https://opentelemetry.io/docs/specs/semconv/gen-ai/

Safety, security, and governance

  1. OWASP Foundation. OWASP Top 10 for Large Language Model Applications (project page and release artifacts). https://owasp.org/www-project-top-10-for-large-language-model-applications/
  2. National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023. https://www.nist.gov/itl/ai-risk-management-framework
  3. ISO/IEC JTC 1/SC 42. ISO/IEC 42001 — Artificial intelligence management system (AIMS) standard (commercial standard; cite official ISO catalog entry for formal work).

Pedagogy and illustrated explainers (non-archival but standard entry points)

  1. Alammar, Jay. “The Illustrated Transformer.” Jalammar.github.io (blog). https://jalammar.github.io/illustrated-transformer/
  2. Karpathy, Andrej. “Neural Networks: Zero to Hero” (video/code course). https://karpathy.ai/zero-to-hero.html
  3. Weng, Lilian. Lil’ Log (blog) — long-form notes on LLMs, agents, diffusion, and RLHF. https://lilianweng.github.io/

Model families — cite technical reports for specific architectures

  1. Touvron, Hugo, et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv:2302.13971 (2023); and follow-on Llama 2 / Llama 3 reports from Meta AI (retrieve the report matching the checkpoint you benchmark).
  2. Mistral AI / partners. Mistral / Mixtral technical reports and blog releases — use the official PDF or post corresponding to the weight set under discussion.
  3. OpenAI, Google DeepMind, Anthropic. System cards, model cards, and safety reports for GPT, Gemini, and Claude families — authoritative for capability, training, and usage policies; URLs change per release.