Linh Truong · LLM training · May 2026

Training a Large Language Model (LLM) — end-to-end

How I think about building and shipping an LLM in 2026: data pipelines, transformer internals, distributed training (FSDP / 3D parallelism), post-training (SFT, DPO, RLHF, RLVR), evals, infra, and the cost math that kills projects when you ignore it.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

Last revised: May 2026 · For: ML engineers, infra engineers, applied researchers

Pre-training · Tokenization · Transformer · FSDP · 3D Parallelism · MoE · RLHF · DPO · RLVR · LoRA · Evals · Inference

1. The 2026 LLM Training Landscape

In 2026, I treat training an LLM less as a single act and more as a pipeline of three pipelines: a pre-training run that turns raw web-scale text into a base model, a mid-training stage that injects domain and reasoning data, and a post-training stage where the model learns instructions, preferences, and verifiable reasoning. Most teams will not pre-train from scratch — they will fine-tune or post-train an open base (Llama, Mistral, Qwen, DeepSeek, Gemma). The math, the data, and the engineering, however, are the same.

15T+

Pre-training tokens (frontier)

~20×

Chinchilla tokens/param ratio

$10M–$500M

Frontier pre-train spend

<$100

LoRA SFT on a 7B (cloud)

Frontier spend and LoRA cost bands are directional—see §23 and primary sources before you quote them.

What changed since 2023

Scale

Data > params

Compute-optimal trends past Chinchilla: train smaller models on far more tokens for cheaper inference.

Architecture

MoE is default at scale

Sparse Mixture-of-Experts (DeepSeek-V3, Mixtral, GPT-class) gives 4–10× more total parameters at the same per-token compute.

Alignment

RLVR & preference learning

DPO and verifier-based RL (GRPO, RLVR) largely replaced classic PPO-RLHF for most teams.

Context

1M+ tokens routine

YaRN, RoPE scaling, and ring/flash attention make 128K–1M practical without quality collapse.

Reasoning

Test-time compute

o1/R1-style models trained with long chains-of-thought via RL on verifiable rewards.

Open weights

Strong open bases

Llama 3/4, Qwen 2.5/3, DeepSeek-V3, Mistral, Gemma 2/3 — most apps start here.

2. The End-to-End Training Pipeline

Before tearing into any single stage, see the whole flow. Every box below has its own section.

Figure 1. The canonical LLM training pipeline from raw data to served model. Stage 2 is where 95% of the FLOPs go; stage 4 is where 95% of the perceived quality is decided.

Scope

I focus on training—data, model, optimizer, loss, infra, alignment, evals. Inference appears only where it constrains training (vocab, context, quantization, distillation). RAG and agents are out of scope here.

3. Data: Sourcing, Cleaning, Mixing

Data is the most under-appreciated lever in LLM training. A 2024 result still holds in 2026: better data beats more parameters, often by a wide margin. Most of the gap between an open base and a frontier closed model is data quality, not compute.

3.1 Sources (and their gotchas)

Source	Typical share	Pros	Watch out for
Web crawl (CommonCrawl, FineWeb, RedPajama, DCLM)	50–70%	Massive, diverse	Toxicity, PII, boilerplate, near-duplicates, machine-generated slop
Code (The Stack v2, GitHub mirror, StackExchange)	10–20%	Improves reasoning, structure	License hygiene (GPL/AGPL), exfiltrated secrets
Books & papers (Books3-like, arXiv, PubMed, Project Gutenberg)	5–10%	Long-form coherence	Copyright exposure, OCR artifacts
Wikipedia & reference	2–5%	High signal, factual	Small absolute volume — often upsampled
Math & STEM (OpenMathInstruct, ProofPile, FineMath)	2–5%	Boosts reasoning	Hard to dedupe vs benchmarks (contamination)
Multilingual (MADLAD-400, CulturaX)	10–30%	Cross-lingual transfer	Quality varies wildly by language
Synthetic (model-generated, distilled)	5–30%	Controllable, fills gaps	Mode collapse, model-collapse if recursive

3.2 The cleaning pipeline

Figure 2. Cleaning funnel for a single CommonCrawl dump. Numbers are order-of-magnitude; FineWeb-Edu, DCLM and RedPajama-v2 publish the actual ratios.

3.3 Decontamination — the silent killer

Decontaminate before you train, not after

Strip n-gram overlap (13-gram is the common cutoff) against every benchmark you plan to report — MMLU, HumanEval, GSM8K, MATH, BBH, IFEval, AGIEval, the lot. Contamination is the most common reason a model "looks SOTA" in the paper and falls over in production. Maintain a frozen contamination index in the data team's repo.
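The overlap check above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercase normalization; the function names are hypothetical, and a production pipeline would normally hash the n-grams and normalize punctuation as well.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]

def ngrams(text: str, n: int = 13) -> Set[NGram]:
    """Word-level n-grams of `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_contamination_index(benchmark_texts: Iterable[str], n: int = 13) -> Set[NGram]:
    """Union of all n-grams found in the benchmark items you plan to report."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(doc: str, index: Set[NGram], n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the index."""
    return not ngrams(doc, n).isdisjoint(index)
```

Building the index once and freezing it keeps the check reproducible across data refreshes.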

3.4 Data mixing & curriculum

The mixture weight per source is itself a hyperparameter. Two practical recipes:

DoReMi (Domain Reweighting with Minimax Optimization) — train a small proxy model with group-robust optimization to learn per-domain weights that minimize the worst-case excess loss across domains (measured against a small reference model), then train the big model with those weights.
Curriculum/annealing — start with diverse web data, then shift gradually toward high-quality math/code/reasoning in the last 10–20% of tokens (the "annealing" phase popularized by Llama 3 and DeepSeek). The LR is decayed in parallel.
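The annealing recipe can be expressed as a mixture that stays fixed for most of training and then interpolates toward a high-quality target. The sketch below assumes a linear ramp over the last 20% of tokens; the function names, the ramp shape, and the example weights are illustrative assumptions, not values from any published run.

```python
import random

def mixture_at(progress, start, end, anneal_from=0.8):
    """Per-source sampling weights at `progress` (fraction of tokens seen, 0..1).

    Weights equal `start` until `anneal_from`, then move linearly to `end`.
    `start` and `end` must share the same keys.
    """
    if progress <= anneal_from:
        return dict(start)
    t = (progress - anneal_from) / (1.0 - anneal_from)
    mixed = {k: (1 - t) * start[k] + t * end[k] for k in start}
    total = sum(mixed.values())
    return {k: v / total for k, v in mixed.items()}

def sample_source(weights, rng):
    """Draw the source of the next document according to the current weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical mixtures for illustration only
start = {"web": 0.7, "code": 0.2, "math": 0.1}
end = {"web": 0.3, "code": 0.4, "math": 0.3}
print(mixture_at(0.9, start, end))
```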

4. Tokenization & Vocab

The tokenizer freezes before pre-training and is effectively unchangeable. A bad tokenizer leaves performance on the table forever — especially for non-English, code, and math.

4.1 Algorithm choice

Algorithm	Used by	Notes
Byte-Pair Encoding (BPE)	GPT-2/3/4, Llama, Mistral	Default. Use byte-level BPE (no UNK).
SentencePiece Unigram	T5, Gemma, mT5	Probabilistic; good for many languages
WordPiece	BERT	Legacy for new LLMs
Tiktoken (cl100k/o200k)	OpenAI models	Highly tuned BPE; ~100K–200K vocab

4.2 Vocab size trade-offs

Small (32K)

Llama 2 era

Cheaper embed/lm-head, more tokens per word, slower for non-English.

Mid (128K)

Llama 3 / Qwen

Sweet spot for multilingual + code. ~15% fewer tokens per doc vs 32K.

Large (200K+)

GPT-4o / Gemini

Excellent compression, but embedding table eats memory and may bloat small models.

4.3 Critical details devs miss

Digit splitting: split each digit into its own token (Llama-style) — dramatically helps arithmetic.
Pre-tokenization regex: the GPT-4 split pattern (contractions, whitespace, numbers) is widely copied; use it.
Byte fallback: guarantees no UNK tokens; mandatory for code and arbitrary Unicode.
Special tokens: reserve <|im_start|>, <|im_end|>, tool tags, thinking tags now — adding them later requires embedding surgery.
Train the tokenizer on the data mix you will train on — including code and target languages, not just English web.
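To make the digit-splitting point concrete, here is a minimal pre-tokenizer sketch. The regex is a simplified, hypothetical pattern written for this example (it is not the GPT-4 split pattern); it isolates every digit, keeps letter runs together, and is lossless, so joining the pieces restores the input.

```python
import re

# Simplified illustrative pattern: single digits, letter runs,
# punctuation runs, whitespace runs, underscore runs.
PRETOKENIZE = re.compile(r"\d|[^\W\d_]+|[^\w\s]+|\s+|_+")

def pre_tokenize(text: str) -> list:
    """Split text into pre-tokens; each digit becomes its own piece."""
    return PRETOKENIZE.findall(text)

pieces = pre_tokenize("Total: 1234 USD")
print(pieces)
```

Because BPE merges never cross pre-token boundaries, a split like this prevents the tokenizer from learning arbitrary multi-digit chunks.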

4.4 Reference snippet — train a BPE tokenizer

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tok = Tokenizer(models.BPE(unk_token=None))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder       = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=128_000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
)
tok.train(files=["shard_0.txt", "shard_1.txt", ...], trainer=trainer)
tok.save("tokenizer.json")

5. Architecture: Transformer & Beyond

Almost every modern LLM is a decoder-only transformer with a handful of upgrades. Below is the canonical block, then the modifications that actually matter in 2026.

Figure 3. Modern decoder-only transformer block. Llama-3, Qwen-3, and Mistral are all variations of this skeleton.

5.1 Component-by-component cheatsheet

Component	2020 default	2026 default	Why
Position encoding	Learned absolute	RoPE (+ YaRN/NTK scaling)	Relative, extrapolates, length-generalizes
Normalization	LayerNorm post	RMSNorm pre-norm	Stable training, fewer params
Activation	GELU	SwiGLU	~1% perplexity gain
Attention	MHA	GQA (or MLA in DeepSeek)	Cuts KV cache 4–8×
Bias terms	Everywhere	None (or QK only)	Cleaner, marginally faster
Tied embeddings	Sometimes	Often tied (small models)	Saves vocab×d_model params
MoE	Rare	Mixture of Experts at scale	Sparse compute, top-k routing
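The KV-cache saving from GQA follows directly from the head counts. A small sketch, assuming BF16 cache entries (2 bytes) and a 32-layer model with 128-dimensional heads; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)  # every head keeps its own K/V
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)   # 4 query heads share one K/V
print(mha, gqa, mha / gqa)
```

With 8 KV heads instead of 32, the cache shrinks by 4×, which is the low end of the range quoted in the table.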

5.2 The "shape" math (sanity-check your config)

# Llama-3-style 7B reference shape
n_layers   = 32
d_model    = 4096
n_heads    = 32
n_kv_heads = 8          # GQA: 4 Q heads share one KV head
d_ff       = 14336       # = 3.5 × d_model (SwiGLU uses 3 matrices: gate, up, down)
vocab      = 128_000

# Parameter count (rough):
# embeddings  : vocab * d_model                          = 524M
# per layer   : 2.5 * d_model^2  (attn proj; K/V are 4× smaller under GQA)
#             + 3 * d_model * d_ff  (SwiGLU MLP)         ≈ 218M
# total       : 524M + 32 * 218M  ≈ 7.5B  (tied embeddings; an untied lm-head adds another 524M)
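The count above can be checked mechanically. The helper below is a sketch with a hypothetical name; it assumes no bias terms, two RMSNorm weight vectors per layer plus a final norm, and GQA-sized K/V projections, which is how the per-layer figure of roughly 218M arises.

```python
def count_params(n_layers, d_model, n_heads, n_kv_heads, d_ff, vocab, tied_embeddings=True):
    """Rough parameter count for a Llama-style decoder (no biases)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O full size; K and V shrunk by GQA
    mlp = 3 * d_model * d_ff                             # gate, up, down
    norms = 2 * d_model                                  # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab * d_model
    lm_head = 0 if tied_embeddings else embed
    total = embed + n_layers * per_layer + d_model + lm_head  # + final norm
    return {"embed": embed, "per_layer": per_layer, "total": total}

shape = dict(n_layers=32, d_model=4096, n_heads=32, n_kv_heads=8, d_ff=14336, vocab=128_000)
tied = count_params(**shape)
untied = count_params(**shape, tied_embeddings=False)
print(tied["total"] / 1e9, untied["total"] / 1e9)
```

Running the check on a config before launching a job is a cheap way to catch a mistyped `d_ff` or head count.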

5.3 Mixture of Experts (MoE) — when to reach for it

MoE replaces the MLP in each block with E experts and a router that activates top-k per token (typically k=1 or 2). Active parameters per token are far below total parameters, so you train a "huge" model at the FLOPs of a small one.

Pro: 4–10× more parameters at the same training cost; better quality/FLOP.
Con: hard to fine-tune; needs expert-parallel comms; load-balancing loss; memory-heavy at inference.
Examples: DeepSeek-V3 (671B total / 37B active), Mixtral 8×22B, Qwen-MoE, Snowflake Arctic.
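The routing step can be illustrated without any framework. This sketch picks the top-k experts for one token and renormalizes their gate weights with a softmax over the selected logits only; that normalization choice is an assumption (implementations differ), and the function name is hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)           # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

routes = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
print(routes)
```

The token's MLP output is then the gate-weighted sum of the selected experts' outputs; the other experts do no work for this token.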

5.4 Non-transformer contenders (know them, don't always use them)

Mamba / SSM hybrids (Jamba, Zamba) — linear-time long context, but ecosystem still thin.
RWKV / RetNet — RNN-flavored with parallel training; niche but useful for ultra-long sequences.
Hybrid attention+SSM blocks — best of both, increasingly mainstream in 2026.

6. Scaling Laws & Compute Budgets

6.1 The Chinchilla rule of thumb

Given a fixed compute budget C (FLOPs), loss is minimized when the number of training tokens D and the number of parameters N are scaled in roughly equal proportion, with D ≈ 20 × N. The familiar 6× factor relates them to FLOPs:

C ≈ 6 · N · D

For inference-heavy products, push D/N far above 20 (Llama-3-8B used ~1900). You pay more to train, but every served token is cheaper.
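The 6·N·D rule turns directly into a wall-clock estimate. The sketch below assumes an H100-class peak of about 990 dense BF16 TFLOPS and 40% MFU; both are adjustable assumptions, and the function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: C ≈ 6 · N · D."""
    return 6.0 * n_params * n_tokens

def chinchilla_tokens(n_params, ratio=20.0):
    """Compute-optimal token count under the D ≈ 20 × N rule of thumb."""
    return ratio * n_params

def gpu_days(flops, n_gpus, peak_tflops=990.0, mfu=0.40):
    """Wall-clock days on `n_gpus` accelerators at the given utilization."""
    flops_per_second = n_gpus * peak_tflops * 1e12 * mfu
    return flops / flops_per_second / 86_400

c = training_flops(7e9, 1e12)   # 7B parameters on 1T tokens
days = gpu_days(c, n_gpus=256)
print(f"{c:.2e} FLOPs, {days:.1f} days on 256 GPUs")
```

Real runs take longer than this idealized figure once restarts, evals, and checkpointing are included, so treat it as a lower bound.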

Figure 4. Order-of-magnitude compute budget by model size. MFU (Model FLOPs Utilization) of 35–55% is realistic on a healthy cluster.

6.2 Inference-aware scaling

Llama 3 deliberately broke Chinchilla, training its 8B and 70B models on 15T+ tokens, and DeepSeek-V3 trained a model with 37B active parameters on roughly 14.8T tokens. The reasoning: lifetime inference dwarfs training, so a slightly under-parameterized but over-trained model wins on total cost of ownership.

Pick D/N by where the dollars live

If you serve more than ~10²² inference FLOPs (large product), over-train. If the model is research-only or short-lived, stick near Chinchilla optimum.

7. Pre-training: Objectives & Loop

7.1 The objective

Canonical pre-training loss is causal language modeling — predict the next token given all previous tokens, averaged over the corpus. Sequences are packed (concatenated with separators) to keep GPUs busy.

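# Conceptual training step (PyTorch flavor); `model` and `batch` are assumed to exist
import torch.nn.functional as F

inputs = batch["input_ids"]              # [B, T]
labels = inputs.clone()
labels[:, :-1] = inputs[:, 1:]           # shift left: position t predicts token t+1
labels[:, -1]  = -100                    # last position has no target, so ignore it

logits = model(inputs)                   # [B, T, V]
V = logits.size(-1)                      # vocabulary size
loss = F.cross_entropy(
    logits.view(-1, V).float(),          # upcast so the softmax runs in FP32
    labels.view(-1),
    ignore_index=-100,
)
loss.backward()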

7.2 Document packing & attention masking

Pack multiple short documents into one fixed-length sequence with <|endoftext|> separators.
Use document-level attention masking (sometimes called "intra-document masking") so attention doesn't bleed across documents — this matters for clean long-context training.
FlashAttention 2/3 supports variable-length packed sequences via cu_seqlens.
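The boundaries that variable-length attention kernels consume, and the equivalent dense mask, can be derived from the packed token stream. This is a pure-Python sketch with hypothetical helper names; it assumes each document ends with the EOS token and that the EOS belongs to the document it closes.

```python
def cu_seqlens_from_packed(tokens, eos_id):
    """Cumulative document boundaries [0, end_0, end_1, ..., len(tokens)]."""
    bounds = [0]
    for i, tok in enumerate(tokens):
        if tok == eos_id:
            bounds.append(i + 1)
    if bounds[-1] != len(tokens):        # trailing document cut off by the sequence length
        bounds.append(len(tokens))
    return bounds

def document_causal_mask(cu_seqlens):
    """mask[i][j] is True when position i may attend to j: same document and j <= i."""
    total = cu_seqlens[-1]
    doc_id = [0] * total
    for d, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for i in range(start, end):
            doc_id[i] = d
    return [[j <= i and doc_id[i] == doc_id[j] for j in range(total)] for i in range(total)]

packed = [5, 6, 0, 7, 8, 9, 0, 4]        # three documents, EOS id 0
bounds = cu_seqlens_from_packed(packed, eos_id=0)
mask = document_causal_mask(bounds)
print(bounds)
```

In practice you pass the boundaries to the kernel and never materialize the dense mask, which grows quadratically with sequence length.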

7.3 The training loop, end-to-end

Figure 5. Pre-training loop. Loss spikes are normal; well-engineered runs rewind by ~100 steps and skip the offending batch.

7.4 Batch size: tokens, not examples

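Think in tokens per step, not sequences. Frontier runs use 4M–16M tokens per global batch. Why so large? The critical batch size (set by the gradient noise scale) grows as the loss falls, and a large global batch is what gives each of thousands of data-parallel GPUs a useful micro-batch. Batches that are too small waste hardware and produce noisier, less stable updates.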

8. Distributed Training (DP, TP, PP, FSDP)

A 7B model in BF16 is ~14 GB of weights, but training needs ~8× that for weights, gradients, and Adam state alone (16 bytes per parameter, see §10.2), plus activations. For anything above 1B, you split the model across devices.

8.1 The four kinds of parallelism

Data Parallel (DDP)

Replicate, split batch

Each GPU has a full model copy and processes a slice of the batch. Simplest. Caps at a few billion parameters on 80 GB GPUs, because every rank holds the full weights, gradients, and optimizer state.

Tensor Parallel (TP)

Split each matmul

Megatron-style: split weight matrices across GPUs intra-layer. Heavy all-reduce traffic — only across NVLink/NVSwitch in one node.

Pipeline Parallel (PP)

Split layer stack

Different GPUs hold different layers. Use 1F1B / interleaved schedules to minimize bubbles.

Sequence / Context Parallel

Split the sequence

Ring attention spreads long-context attention across GPUs. Essential for 128K+ training.

8.2 ZeRO / FSDP — sharded data parallel

PyTorch FSDP (and DeepSpeed ZeRO) shards optimizer state, gradients, and parameters across data-parallel ranks. It is the default for < 70B models because it's almost as simple as DDP but scales to 100B+.

Stage	Shards	Memory savings	Comms cost
ZeRO-1	Optimizer state	~4×	Low
ZeRO-2 / FSDP `SHARD_GRAD_OP`	+ Gradients	~8×	Medium
ZeRO-3 / FSDP `FULL_SHARD`	+ Parameters	~N× (N = world size)	High (all-gather per layer)

8.3 3D parallelism — the frontier recipe

Figure 6. 3D parallelism on a 64-GPU pod. Real frontier runs nest this further with sequence parallel and expert parallel.

8.4 Picking a strategy by model size

Model	Recommended
<1B	Plain DDP
1B – 13B	FSDP (ZeRO-3) with selective activation checkpointing
13B – 70B	FSDP + tensor parallel within a node (TP=2/4/8)
70B – 400B dense	3D parallelism (TP + PP + DP) + sequence parallel for long context
MoE at any scale	Add expert parallel; tune all-to-all communication

8.5 Reference: launching FSDP on 8 GPUs

# torchrun handles process group setup
torchrun --standalone --nproc_per_node=8 train.py \
    --model_size 7B \
    --fsdp full_shard \
    --activation_checkpointing \
    --bf16 \
    --seq_len 8192 \
    --micro_batch 2 \
    --grad_accum 8

9. Hyperparameters & Optimizer

9.1 Sensible defaults that almost always work

Knob	Default	Notes
Optimizer	AdamW (β₁=0.9, β₂=0.95, ε=1e-8)	β₂=0.95 (not 0.999) is the LLM-specific tweak
Weight decay	0.1	Apply to weights only, never to norms or biases
Peak LR (dense, <10B)	3e-4	Scale roughly as 1/√width as model grows
Peak LR (70B+)	1.5e-4 → 6e-5	Bigger models tolerate less LR
Warmup	~2000 steps or 1% of total	Critical for stability
Schedule	Cosine decay to ~10% of peak	WSD schedule (warmup-stable-decay) increasingly popular
Grad clip	1.0 (global L2)	Drop offending batch if spikes persist
Init	Normal(0, 0.02) scaled by 1/√(2·N_layers) on residual projections	Maximal Update Parametrization (μP) lets you tune at small scale and transfer
Dropout	0.0 in pre-training	Use only in fine-tune if overfitting
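The warmup and schedule rows combine into a single function of the step count. The sketch below uses linear warmup followed by cosine decay to 10% of peak; the function name and the exact warmup form are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000, min_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

for s in (0, 1999, 2000, 51_000, 100_000):
    print(s, lr_at(s, total_steps=100_000))
```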

9.2 The optimizer of 2026

AdamW remains the default, but watch:

Lion / Sophia / Shampoo — sometimes faster, less validated.
Distributed Shampoo — used in production at Google; 30–40% step reduction at scale.
Adafactor — memory-light, popular for very large MoE.

9.3 μP — tune small, train big

Maximal Update Parametrization re-scales initialization, LR, and projection so that the same hyperparameters transfer from a tiny proxy model up to the full-scale run. It collapses the hyperparameter search from "burn 1% of your training budget" to "burn 0.01%".

10. Stability, Mixed Precision, Memory

10.1 Precision recipes

Format	Use for	Notes
BF16	All forward / backward activations & weights	Wider exponent than FP16; no loss-scaling needed
FP32	Optimizer master copy of weights, loss reduction	Avoids drift over millions of steps
FP8 (E4M3 / E5M2)	H100/H200/B200 matmuls, with per-tensor scaling	~2× throughput; needs careful scaling, used by DeepSeek-V3
INT8 / INT4	Post-training quantization for inference	Not for training

10.2 The memory budget per GPU (BF16 training)

# Per-parameter cost
weight (BF16)            : 2 bytes
grad   (BF16)            : 2 bytes
adam   (FP32 m, v, mast) : 12 bytes
                          ────
                           16 bytes / param

# 7B model on one GPU:  7e9 * 16 = 112 GB  (won't fit on H100)
# With FSDP-FullShard over 8 GPUs: 14 GB / GPU for state
# Activations dominate the rest — use activation checkpointing
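The same arithmetic extends to each sharding stage. This sketch assumes the 2 + 2 + 12 bytes-per-parameter split above and ignores activations, buffers, and fragmentation; the function name is hypothetical.

```python
def state_gb_per_gpu(n_params, world_size, stage):
    """Model-state memory per GPU in GB for ZeRO stage 0 (plain DDP) through 3."""
    weights, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= world_size                  # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= world_size                  # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= world_size                # ZeRO-3 / FULL_SHARD: also shard parameters
    return n_params * (weights + grads + optim) / 1e9

for stage in range(4):
    print(stage, state_gb_per_gpu(7e9, world_size=8, stage=stage))
```

On 8 GPUs a 7B model drops from 112 GB per GPU unsharded to 14 GB fully sharded, leaving the rest of the device for activations.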

10.3 Loss spike playbook

When the loss explodes at step 47,328

Check grad norm — if it spiked, the ratio to the preceding few norms tells you how bad it is.
Rewind to the last clean checkpoint (~100 steps back).
Skip the data shard / batch that caused it.
If recurring: reduce peak LR by 20% and resume.
If still happening: switch one layer of attention to FP32 softmax, or check for bad init in a specific layer.

10.4 Activation checkpointing

Recompute activations during backward instead of storing them. Costs ~30% extra compute, saves ~10× activation memory. Use selective checkpointing (only the attention and MLP outputs) to balance speed and memory.

10.5 Other essentials

FlashAttention 2/3 — fused, IO-aware attention. Non-negotiable.
RoPE in FP32 for the sin/cos table — small fix that prevents long-context drift.
Z-loss (auxiliary loss on logsumexp) — stabilizes the LM head; used by PaLM and many frontier runs.
QK-norm — RMSNorm on Q and K before attention; tames softmax outliers.

11. Mid-training & Continued Pre-training

Between base pre-training and post-training there is now a distinct phase — variously called annealing, mid-training, or continued pre-training — where the model is exposed to higher-quality data with a lower learning rate.

11.1 What goes in

High-quality reasoning corpora (textbooks, curated arXiv, math proofs).
Code repositories with passing tests.
Long documents for context extension (concatenated books, long codebases).
Multilingual upsampling.
Tool-use traces (function calls, JSON outputs).

11.2 Long-context extension

To extend RoPE-based attention from 8K to 128K/1M:

Increase RoPE base (the θ parameter) — e.g. 10,000 → 500,000.
Apply YaRN or NTK-aware scaling at the same time for better extrapolation.
Continue training on long documents for ~1–10B tokens.
Evaluate on needle-in-a-haystack, RULER, LongBench v2 to confirm.
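One way to see why raising the RoPE base helps is to look at rotation wavelengths. The sketch assumes the standard RoPE frequencies θ_i = base^(−2i/d) and a head dimension of 128; the function name is hypothetical.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary pair: 2π · base^(2i / head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

longest_10k = rope_wavelengths(128, 10_000)[-1]
longest_500k = rope_wavelengths(128, 500_000)[-1]
print(round(longest_10k), round(longest_500k))
```

With base 10,000 the slowest pair completes a rotation in roughly 54K tokens; with base 500,000 that stretches to millions of tokens, so positions across a 128K window remain distinguishable in the low-frequency dimensions.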

11.3 Continued pre-training on your domain (CPT)

If you have a corpus that is mostly out-of-distribution for the base model (legal, biomedical, niche programming language), CPT can lift downstream performance more cheaply than re-pretraining. Rules:

Keep some % (10–30%) of the original mix to prevent catastrophic forgetting.
Use 0.1× the original peak LR.
Train for tokens roughly equal to 0.1–1× the new-domain corpus size.
Evaluate on both old and new domain to catch regression.

12. Post-training: SFT, DPO, RLHF, RLVR

12.1 The pipeline

Figure 7. Post-training pipeline. Many teams stop at SFT + DPO; RL adds another quality jump but is expensive and finicky.

12.2 SFT (Supervised Fine-Tuning) — turn a base into a chat model

SFT teaches format and basic instruction-following. Quality of examples >> quantity. 5K LIMA-style hand-curated examples can beat 500K low-quality ones.

# Conversation template (ChatML-flavored)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain RoPE in one paragraph.<|im_end|>
<|im_start|>assistant
Rotary Position Embedding rotates the query and key vectors ...<|im_end|>

# Loss masking: compute loss ONLY on assistant tokens
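Loss masking amounts to replacing every non-assistant label with the ignore index. A minimal sketch, assuming you already know which role each token belongs to; the helper name and the per-token role list are illustrative assumptions, since real pipelines usually derive the spans from the chat template.

```python
IGNORE_INDEX = -100  # the value cross-entropy is told to skip

def mask_non_assistant(token_ids, roles):
    """Keep labels only for tokens inside assistant turns.

    roles[i] names the turn that token i belongs to ("system", "user", "assistant").
    """
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

tokens = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]
labels = mask_non_assistant(tokens, roles)
print(labels)
```

The usual one-position shift for next-token prediction is applied after this masking step.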

12.3 DPO — preference learning without a reward model

Direct Preference Optimization re-parametrizes the RLHF objective so you can train directly on (chosen, rejected) pairs against a frozen reference model. No reward model, no PPO, no rollouts. The closed-form loss:

L_DPO = -log σ( β · [ log π(y_w|x)/π_ref(y_w|x) − log π(y_l|x)/π_ref(y_l|x) ] )
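For a single preference pair the loss is a few lines of arithmetic. The sketch below takes summed sequence log-probabilities as plain floats; the function name and β = 0.1 are assumptions for illustration, and a real implementation would work on batched tensors.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))])."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log σ(x) = log(1 + exp(-x)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)       # policy favors the chosen answer more
print(at_reference, improved)
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.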

Variants:

IPO — fixes the over-fitting failure mode of DPO.
KTO — uses single thumbs-up/down rather than pairs.
ORPO — combines SFT + preference in one stage.
SimPO — reference-free, simpler scaling.

12.4 RLHF — PPO with a learned reward model

Classical recipe (InstructGPT, GPT-3.5): train a reward model on preference data, then optimize the policy with PPO against the RM, with a KL penalty to a reference policy.

Four models live in GPU memory at once (policy, reference, reward, critic).
Reward hacking is the primary failure mode — log everything, watch KL.
Largely replaced by DPO + RLVR in 2026, but still used at frontier labs for general chat.

12.5 RLAIF — let a model play the labeler

Constitutional AI / RLAIF uses a (typically larger) model to generate or rank responses. Useful when human labelers are slow or expensive; quality ceiling = quality of the judge.

13. Reasoning & Long-Horizon Training

The 2024–2026 leap in reasoning (OpenAI o1, DeepSeek-R1, Qwen-QwQ) came from RL on verifiable rewards (RLVR). The model generates long chains of thought, a verifier checks the final answer, and policy gradient pushes the model toward CoTs that lead to correct answers.

13.1 GRPO — Group Relative Policy Optimization

GRPO (introduced in DeepSeekMath, used in R1) samples G rollouts per prompt, computes a group-baseline advantage, and updates with a PPO-style clipped objective — but without a critic, which halves memory.

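# Pseudocode (one GRPO update; G rollouts per prompt)
for prompt in batch:
    rollouts = [policy_old.generate(prompt) for _ in range(G)]
    rewards  = array([verifier(r) for r in rollouts])                  # shape [G]
    adv      = (rewards - mean(rewards)) / (std(rewards) + 1e-8)       # group-relative advantage
    ratio    = exp(logp(policy, rollouts) - logp(policy_old, rollouts))  # importance ratio
    loss     = -mean( min(ratio * adv, clip(ratio, 1-ε, 1+ε) * adv) )
                + β * KL(policy || ref)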
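The two pieces that distinguish GRPO, the group-relative advantage and the clipped surrogate, can be run on plain numbers. This sketch uses the population standard deviation and a clip range of 0.2, both assumptions; the function names are hypothetical.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_term(ratio, adv, clip_eps=0.2):
    """PPO-style clipped surrogate for one token: min(r·A, clip(r)·A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped_ratio * adv)

adv = group_advantages([1.0, 0.0, 0.0, 1.0])   # two correct, two wrong rollouts
print(adv, clipped_term(1.5, 1.0), clipped_term(0.5, -1.0))
```

If every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is why prompts that are too easy or too hard are often filtered out.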

13.2 Where verifiable rewards come from

Domain	Verifier
Math	SymPy / numeric equivalence to the gold answer
Code	Unit tests pass / fail
Format / JSON	Schema validator
Multi-step tool use	End-state in a simulator
Translation / summarization	BLEU / BERTScore / LLM-judge (less crisp)
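A rule-based math verifier can be very small. The sketch below checks exact numeric equivalence using the standard library's `Fraction` rather than SymPy, which is a deliberate simplification; it assumes the final answer has already been extracted from the response, and the function names are hypothetical.

```python
from fractions import Fraction

def parse_number(text):
    """Parse '0.5', '1/2', '1,000' or '1e3' into an exact Fraction; None if not a number."""
    cleaned = text.strip().replace(",", "").rstrip(".")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(model_answer, gold_answer):
    """1.0 when both answers parse and are numerically equal, else 0.0."""
    a, b = parse_number(model_answer), parse_number(gold_answer)
    return 1.0 if a is not None and a == b else 0.0

print(math_reward("1/2", "0.5"), math_reward("42", "41"))
```

Symbolic answers (expressions, intervals, sets) need a real CAS; this only covers the numeric case.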

13.3 Practical lessons from R1-style training

Start from a strong SFT model; RLVR on a weak base diverges.
Initial responses should already produce some CoT — bootstrap with synthetic CoT SFT if needed.
Long-CoT RL gradually lengthens responses — budget for 8K–32K output tokens.
Track diversity collapse — entropy bonuses or temperature scheduling help.
Use a smaller policy + larger verifier (or rules) — verifier need not be a neural net.

13.4 Test-time compute as a separate axis

Trained reasoners exchange compute at inference for accuracy: search, self-consistency, MCTS over thought paths, best-of-N with a verifier. Treat inference latency as a tunable, not a fixed budget.

14. Fine-tuning Adapters (LoRA, QLoRA)

Most teams will never touch full-parameter fine-tuning. LoRA trains a low-rank delta (A·B, rank 8–64) injected into chosen weight matrices, freezing the base. QLoRA quantizes the base to 4-bit NF4 first, letting you fine-tune a 70B model on a single 80 GB GPU.

Figure 8. LoRA: low-rank decomposition of the weight update. Inference-time, you can merge A·B back into W for zero overhead.

14.1 LoRA recipe that just works

Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all linear layers in attention + MLP).
Rank: 8–16 for style/persona, 32–64 for knowledge injection, 128+ rarely helps.
Alpha = 2 × rank.
Dropout = 0.05.
LR = 1e-4 to 3e-4 (LoRA tolerates much higher LR than full FT).
1–3 epochs on 1K–100K examples.
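It helps to know how many parameters a given recipe actually trains. Each adapted matrix of shape (d_in, d_out) adds rank × (d_in + d_out) parameters. The sketch below applies that to the 7B reference shape from §5.2 with all seven linear layers targeted; the helper name is hypothetical.

```python
def lora_trainable_params(layer_shapes, rank, n_layers):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return n_layers * per_layer

# (d_in, d_out) for one block of the 7B reference shape (GQA: K/V project to 1024)
shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336), "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
trainable = lora_trainable_params(shapes, rank=16, n_layers=32)
print(trainable, f"{trainable / 7.5e9:.2%}")
```

At rank 16 this comes to about 42M parameters, a little over half a percent of the base model.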

14.2 QLoRA in practice

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base; matmuls still compute in BF16
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B",
                                            quantization_config=bnb)

# LoRA adapters on every linear layer; only these weights receive gradients
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                 target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(base, cfg)     # ~0.5% params trainable

14.3 When to NOT use LoRA

You need to extend context length significantly — base attention needs full training.
You're changing tokenizer or vocab (requires embedding/lm-head training).
You're doing pre-training-scale knowledge injection (use CPT instead).
Multiple adapters need to compose at inference (consider DARE, TIES merging).

15. Evaluation & Benchmarks

15.1 The eval pyramid

During training

Loss + cheap probes

Validation perplexity (held-out shard), a handful of fast capability probes (HellaSwag, ARC) every 1K steps.

After training

Standard benchmarks

MMLU-Pro, GPQA, HumanEval+, MBPP+, GSM8K, MATH, BBH, IFEval, MT-Bench, AlpacaEval 2.

In production

Task & vibes

Domain-specific eval set, LLM-as-judge, human ratings, online A/B.

15.2 The 2026 benchmark shortlist

What it measures	Use
General knowledge	MMLU-Pro (10-option, harder than MMLU)
Reasoning (graduate science)	GPQA Diamond
Math	MATH-500, AIME, OlympiadBench
Code	HumanEval+, MBPP+, LiveCodeBench (rolling)
Long context	RULER, NIAH, LongBench v2
Instruction following	IFEval, MT-Bench, Arena-Hard
Agentic / tool use	τ-Bench, SWE-Bench Verified, GAIA
Multilingual	MGSM, MMLU-translate, FLORES
Safety	HarmBench, ToxicChat, XSTest

15.3 LLM-as-judge — useful but biased

Known biases

Position bias — first response wins more often. Randomize order, or compute both orderings.
Verbosity bias — longer answers win even when wrong.
Self-preference — models prefer their own style. Always use a different judge family than the model under test.
Format bias — bulleted, formatted answers are preferred over equally correct prose.
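Position bias in particular is cheap to control for by asking the judge twice with the order swapped. In this sketch `judge` is a hypothetical callable that returns "first" or "second"; only verdicts that survive the swap count as wins.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Query the judge in both orders; return "A", "B", or "tie" if the verdicts disagree."""
    verdict_ab = judge(prompt, answer_a, answer_b)   # A shown first
    verdict_ba = judge(prompt, answer_b, answer_a)   # B shown first
    a_wins = (verdict_ab == "first") + (verdict_ba == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"

def always_first(prompt, first, second):
    """Stub judge with pure position bias, used here only to show the effect."""
    return "first"

print(debiased_preference(always_first, "q", "answer one", "answer two"))
```

A judge that always picks the first slot produces a tie here instead of a spurious win, and the tie rate itself is a useful measure of how position-sensitive your judge is.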

15.4 Eval rigor for training

Always report variance over seeds — single-run differences of ~1 point on small benchmarks are noise.
Hold out an internal eval set that is never seen by anyone training.
Run contamination scans on every benchmark you report.
Track regressions on prior capabilities — fine-tuning often gains on the target task and loses elsewhere.

16. Safety, Alignment, Red-Teaming

16.1 Layered defense

Figure 9. Safety is multi-layer; no single layer is sufficient.

16.2 Constitutional AI in one sentence

Use the model itself (or another model) to critique its own responses against a written list of principles, then fine-tune on the revisions. Effective for refusal training without large human-feedback budgets.

16.3 Red-teaming

Internal team + external (HackerOne-style) bounty.
Maintain a growing private dataset of jailbreaks; re-DPO on them.
Test for multi-turn attacks — many jailbreaks unfold over 5–10 turns.
Probe for capability uplift on dual-use domains (cyber, bio).

16.4 Regulation watchlist (2026)

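EU AI Act — general-purpose AI model obligations; models trained with more than 10²⁵ FLOPs are presumed to carry systemic risk and face additional obligations, including notifying the Commission.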
US NIST AI RMF + Executive Orders requiring model card disclosures and safety evals.
UK AISI / US AISI evals for frontier models.
Copyright lawsuits (NYT v OpenAI, Getty Images v Stability AI) — keep dataset provenance.

17. Infrastructure & Hardware

17.1 Hardware lineup (2026)

Accelerator	HBM	BF16 TFLOPS	Notes
NVIDIA H100 SXM	80 GB	~990	Workhorse; still widely available
NVIDIA H200	141 GB	~990	Same compute, more HBM
NVIDIA B200 (Blackwell)	192 GB	~2,250	FP8/FP4 native; 2026 default
NVIDIA GB200 NVL72	13.5 TB pooled	~160,000 (rack; 72 × ~2,250)	72 Blackwell + Grace CPUs, rack-scale
AMD MI300X / MI325X	192 / 256 GB	~1,300 / ~1,300	Strong for inference; software gap narrowing
Google TPU v5p / v6e	95 / 32 GB	~459 / ~918	For JAX/XLA stacks
AWS Trainium 2	96 GB	~650	NeuronSDK; cost competitive on AWS

17.2 Networking is the bottleneck

Compute is rarely what stalls a training run — the interconnect is. Required topology:

Intra-node: NVLink/NVSwitch (900 GB/s between H100s in an 8-GPU node).
Inter-node: InfiniBand NDR (400 Gb/s/port × 8 = 3.2 Tb/s, about 400 GB/s, per node) or RoCEv2.
Topology: rail-optimized fat tree; oversubscription kills MFU.
NCCL tuning: pin to NIC, set NCCL_IB_HCA, enable SHARP, enable adaptive routing.

17.3 Storage tiers

Tier	Purpose	Tools
Object store	Raw & tokenized data, checkpoints	S3 / GCS / R2 / MinIO
Parallel FS	Hot training data, fast restore	Lustre, WekaFS, JuiceFS, GPFS
Local NVMe	Per-node cache, scratch	NVMe RAID0
Streaming	Avoids landing full corpus	WebDataset, MosaicML StreamingDataset

17.4 Cluster engineering reality

What kills frontier runs

Silent data corruption on HBM — periodic ECC scrubs, end-to-end checksums on activations.
Stragglers — one slow GPU drags the whole synchronous step. Detect with per-rank step timing.
Network flaps — NCCL hangs eat hours. Use timeouts and watchdog kill+restart.
Hot spots in the cluster cause throttling — monitor with DCGM.

18. Inference Optimization & Serving

Inference economics push back into training decisions — vocab size, KV cache shape, model width all matter once you ship.

18.1 The serving stack

Engine	Best for
vLLM	General OSS, PagedAttention, continuous batching
SGLang	Structured generation, fast prefix caching
TensorRT-LLM	Peak throughput on NVIDIA; FP8/INT4 kernels
llama.cpp	CPU/edge, GGUF quantized models
TGI (HF)	HF ecosystem integration

18.2 The optimizations that matter

Continuous batching (iteration-level) — 5–10× throughput vs static batching.
PagedAttention — KV cache in pages, no fragmentation.
Quantization — INT8 (free), INT4 AWQ/GPTQ, FP8 (B200) — 2–4× speedup at <1 pt quality drop.
Speculative decoding — a small draft model proposes tokens and the big model verifies them; 2–3× lower latency.
Prefix caching — reuse KV cache for shared system prompts.
Disaggregated prefill/decode — prefill is compute-bound, decode is memory-bound — run them on separate fleets.

18.3 Distillation — bake a big model into a small one

Generate (prompt, response, top-k log-probs) tuples from the big teacher; train a small student to match the distribution. Used widely now (for example, the Gemma 2 2B and 9B models were distilled from a larger teacher, and Llama 3.2 1B/3B used Llama 3.1 8B and 70B logits as targets).

KL on full logits > SFT on samples > SFT on argmax.
Match temperature and vocab — distillation across tokenizers is hard.
Combine with on-policy RL for the final 1–2 points.

19. Cost & Economics

19.1 What things actually cost (2026 cloud, list prices)

Activity	Cost	Notes
1B-param scratch pre-train, 20B tokens	$2K–$8K	32× H100 for ~1 day
7B scratch pre-train, 1T tokens	$200K–$500K	256× H100 for ~2 weeks
70B scratch pre-train, 2T tokens	$5M–$15M	1024× H100 for ~1 month
Frontier pre-train (400B dense / MoE 600B)	$100M–$500M+	10K+ H100 for months
SFT on 7B (100K examples, 3 epochs)	$50–$300	8× H100 for hours
QLoRA on 70B (50K examples)	$50–$200	1× H100 for hours
DPO on 7B (100K pairs)	$100–$500	8× H100 for hours
RLVR on 7B (long CoT, GRPO)	$5K–$50K	Days to weeks of rollouts

19.2 Cost levers (in order of impact)

Don't pre-train from scratch. Almost always start from an open base.
Pick the smallest model that meets the bar after distillation/quantization.
Spot/preemptible instances for fine-tuning and most evals — 60–80% off.
Reserved capacity for sustained training — 30–50% off.
Right-size context — 8K vs 128K can be 3× the GPU memory.
Cache aggressively at inference (prefix, response).
Cross-cloud arbitrage — H100 pricing varies 40% between hyperscalers and neoclouds.

Order of operations

Always: cheap base → SFT → DPO → eval. Add RL only if eval bar isn't met and the cost is justified. Most teams discover SFT + good data gets them 80% there.

20. Strategy & 90-Day Roadmap

20.1 First decision: build, post-train, or buy

Figure 10. Build/post-train/buy decision. The default is "don't train" — earn the right to train by exhausting cheaper options first.

20.2 The 90-day plan (for a team starting from open-base fine-tuning)

Days 1–30

Foundations

Pick base model (capability vs license vs cost).
Stand up eval harness (lm-eval-harness, internal eval).
Build data pipeline (ingest, dedupe, filter, decontaminate).
Get a baseline: zero-shot & few-shot on internal tasks.
QLoRA on 1K hand-curated examples — sanity check.

Days 31–60

Iterate

Scale SFT to 10K–100K curated examples.
Add preference data (DPO/IPO).
Build safety eval; add refusal & jailbreak fine-tune.
Distill or quantize for inference cost.
Set up online logging & thumbs-up/down for next round.

Days 61–90

Ship & harden

Canary the model behind a feature flag.
Add input/output safety classifiers.
Track regression on standard benchmarks weekly.
Establish a re-training cadence and a rollback path.
Document model card & data card.
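The weekly regression tracking and rollback path in the Days 61–90 list can be reduced to a simple gate. The sketch below is one possible shape, not a prescribed process: the function names, the 1-point tolerance, and the benchmark scores in the example are all hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=1.0):
    """Return {benchmark: drop} for scores that fell by more than `tolerance`.

    A benchmark missing from `candidate` is reported with drop=None so it
    cannot silently vanish from the weekly report.
    """
    flagged = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            flagged[name] = None
            continue
        drop = base_score - candidate[name]
        if drop > tolerance:
            flagged[name] = round(drop, 2)
    return flagged


def should_roll_back(baseline, candidate, tolerance=1.0):
    """True if any tracked benchmark regressed or went missing."""
    return bool(find_regressions(baseline, candidate, tolerance))


if __name__ == "__main__":
    baseline = {"mmlu": 68.0, "gsm8k": 75.0, "ifeval": 80.0}   # hypothetical scores
    candidate = {"mmlu": 67.5, "gsm8k": 72.0}                  # ifeval run missing
    print(find_regressions(baseline, candidate))
    print(should_roll_back(baseline, candidate))
```

Given the one-seed noise discussed under Pitfalls, the tolerance is best set from the observed seed-to-seed spread of each benchmark rather than a fixed constant.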

20.3 Org & team shape

Role	Why
Data engineer (×2)	The training run is only as good as the corpus pipeline.
ML researcher	Owns recipe, hyperparameters, ablations.
Distributed-systems engineer	Cluster, NCCL, checkpointing, recovery.
Eval & alignment lead	Owns benchmarks, red-team, judge models.
Inference engineer	vLLM/TRT-LLM, quantization, latency budgets.
Product owner	Defines the bar; vetoes premature training.

21. Pitfalls & Anti-Patterns

Data

Training on the eval set

The classic. Always n-gram decontaminate the training corpus against your eval sets, and re-check overlap before reporting numbers.
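A minimal sketch of n-gram decontamination, assuming whitespace-and-punctuation tokenization and exact n-gram overlap. The helper names are hypothetical, and the default window of 13 tokens is an assumption borrowed from common practice; production pipelines usually add normalization and scale out with hashed n-grams or Bloom filters.

```python
import re


def ngrams(text, n=13):
    """Set of lowercase word n-grams in `text` (empty if the text is shorter than n)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_texts, n=13):
    """Union of all n-grams that appear in any eval example."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(train_docs, eval_texts, n=13):
    """Drop every training document that shares at least one n-gram with the eval set."""
    index = build_eval_index(eval_texts, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(index)]


if __name__ == "__main__":
    eval_set = ["What is the capital of France? Paris."]
    train = [
        "Trivia dump: what is the capital of france? paris. Next question.",
        "The weather in Hanoi is humid today.",
    ]
    print(decontaminate(train, eval_set, n=5))  # small n only to fit the toy example
```

Dropping whole documents is the bluntest option. Removing only the overlapping span is an alternative when the surrounding text is valuable.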

Data

"More data is always better"

Low-quality web slop can degrade the model. Cleaning > collecting.

Architecture

Changing tokenizer late

Tokenizer is frozen at pre-train. New special tokens require embedding resize and warm-up.

Training

Tuning on validation perplexity alone

Perplexity ≠ quality. Run capability evals too.

Training

No checkpoint hygiene

Lose one node, lose two weeks. Async checkpoint, multi-cloud copy.

Post-train

Catastrophic forgetting

SFT can erase pre-trained capabilities. Mix general data; cap LR; LoRA helps.

Post-train

Reward hacking in RL

Model finds a way to satisfy the reward without doing the task. Add KL penalty, watch behavior, not just reward.
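The KL penalty mentioned above can be sketched as a shaped reward: the task reward minus a coefficient times an estimate of how far the policy has drifted from the frozen reference model. This sketch uses the sampled-token log-ratio as the KL estimate. The function name and the beta value are assumptions, and real RL stacks typically apply the penalty per token inside the advantage computation.

```python
def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.05):
    """Return (shaped_reward, kl_estimate) for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the sampled
    tokens under the current policy and the frozen reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Sum of log(pi / pi_ref) over sampled tokens: a single-sample KL estimate.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_estimate, kl_estimate


if __name__ == "__main__":
    shaped, kl = kl_shaped_reward(
        task_reward=1.0,
        policy_logprobs=[-0.1, -0.2],   # hypothetical values
        ref_logprobs=[-0.5, -0.7],
    )
    print(f"kl={kl:.3f} shaped_reward={shaped:.3f}")
```

Logging the KL estimate next to the raw reward is one way to "watch behavior, not just reward": a rising reward paired with a fast-growing KL is a common early sign of reward hacking.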

Evals

One-seed comparisons

~1 point on small benchmarks is noise. Report mean ± std over 3+ seeds.
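A small sketch of reporting mean ± std over seeds and checking whether a gap between two runs is larger than the seed noise. The scores are hypothetical, the function names are illustrative, and the "gap exceeds combined std" check is a crude screen rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation; needs at least two seeds."""
    return mean(scores), stdev(scores)


def compare(a_scores, b_scores):
    """Compare two runs evaluated over several seeds each."""
    mean_a, std_a = summarize(a_scores)
    mean_b, std_b = summarize(b_scores)
    gap = mean_b - mean_a
    noise = (std_a ** 2 + std_b ** 2) ** 0.5   # combined seed-to-seed spread
    return {
        "a": (mean_a, std_a),
        "b": (mean_b, std_b),
        "gap": gap,
        "noise": noise,
        "gap_exceeds_noise": abs(gap) > noise,
    }


if __name__ == "__main__":
    baseline = [70.1, 71.0, 69.8]    # hypothetical accuracy per seed
    candidate = [70.9, 70.2, 71.3]
    result = compare(baseline, candidate)
    print(f"baseline  {result['a'][0]:.1f} ± {result['a'][1]:.1f}")
    print(f"candidate {result['b'][0]:.1f} ± {result['b'][1]:.1f}")
    print("gap exceeds noise:", result["gap_exceeds_noise"])
```

In this toy example the half-point gap sits inside the seed spread, which is the situation the card warns about.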

Evals

Trusting LLM-judge wholesale

Triangulate with human review and rule-based checks.

Infra

Single point of failure

Network switches, head nodes, shared FS — all need redundancy.

Cost

Over-training before product fit

Burn capital on training a model nobody uses. Validate prompt + RAG first.

Safety

Relying on the model alone for safety

Always add an input/output classifier and audit log.

22. Tooling Stack & Resources

22.1 Open-source training frameworks

Tool	For
Hugging Face Transformers + Accelerate + TRL	Most fine-tuning, DPO, PPO; the lingua franca
PyTorch FSDP / torchtitan	Reference FSDP for serious pre-training
Megatron-LM / Megatron-Core	3D parallelism, frontier pre-training
NVIDIA NeMo	End-to-end on NVIDIA stack
DeepSpeed	ZeRO, MoE, large-scale training
Lit-GPT / Levanter / MaxText	Cleaner JAX/PyTorch reference implementations
Axolotl / Unsloth / LLaMA-Factory	Fast LoRA/QLoRA fine-tuning recipes
OpenRLHF / TRL / verl	PPO, GRPO, RLVR pipelines

22.2 Data tools

datatrove, text-dedup, dolma, nemo-curator — extract, dedupe, filter at scale.
FineWeb / FineWeb-Edu / DCLM / RedPajama-v2 — open pre-training corpora.
tokenizers (Hugging Face) — Rust-fast BPE/Unigram.
StreamingDataset (MosaicML), WebDataset — high-throughput streaming.

22.3 Eval & observability

lm-evaluation-harness, OpenCompass, OLMES — standard benchmark runners.
Weights & Biases / MLflow / Comet / TensorBoard — experiment tracking.
Inspect AI (UK AISI) — agent & safety evals.
HELM, Chatbot Arena — public leaderboards.

22.4 Reference open base models (May 2026)

Family	Sizes	License
Llama 3 / 4	1B, 3B, 8B, 70B, 405B, Maverick-MoE	Llama Community License
Qwen 3	0.6B–235B (dense + MoE)	Apache 2.0
DeepSeek V3 / R1	671B MoE (37B active); R1 distills 1.5B–70B dense	MIT-style
Mistral / Mixtral	7B, 8×7B, 8×22B, Large	Mixed (Apache for some)
Gemma 2 / 3	2B, 9B, 27B (Gemma 2); 1B, 4B, 12B, 27B (Gemma 3)	Gemma terms
Phi-4	14B (synthetic-heavy)	MIT

22.5 Reading list

Short list for onboarding; the full annotated bibliography is in §23 References & sources.

"Attention Is All You Need" — Vaswani et al., 2017.
"Scaling Laws for Neural LMs" — Kaplan et al., 2020; "Chinchilla" — Hoffmann et al., 2022.
"Llama 3 Herd of Models" — Meta, 2024.
"DeepSeek-V3" and "DeepSeek-R1" technical reports — 2024–25.
"Direct Preference Optimization" — Rafailov et al., 2023.
"GRPO / DeepSeekMath" — Shao et al., 2024.
"FlashAttention 1/2/3" — Dao et al.
"YaRN" — Peng et al., for long context.
"Constitutional AI" — Anthropic, 2022.
The Megatron-LM and FSDP source code — best documentation that exists.

23. References & sources

Annotated bibliography behind every major claim in this note—scaling laws, architecture choices, data pipelines, distributed training, alignment methods, benchmarks, inference stack, and cost bands. Section tags (e.g. §6) point to where the idea is used. Prefer primary papers and official model cards for canonical numbers; leaderboard scores and license terms change weekly.

Scope. Synthesis of peer-reviewed work, technical reports, and widely used open tooling (May 2026). Diagrams and workflow ordering are original unless noted. Dollar ranges and KPI strips are planning heuristics, not industry averages. Not legal, safety-certification, or compliance advice—confirm regulatory obligations with qualified counsel.

How to read: citations are numbered continuously [1]–[n] within this section. Cross-check arXiv versions against conference proceedings when you cite formally.

Landscape, scaling & open base models (§1, §6, §22)

Kaplan et al., “Scaling Laws for Neural Language Models.” arXiv preprint, 2020. Foundational compute–loss–size relationships. arxiv.org/abs/2001.08361 — §6.
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla). NeurIPS 2022. ~20× tokens/param rule of thumb; optimal N and D under fixed compute. arxiv.org/abs/2203.15556 — §1, §6–7.
Yang et al., “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer” (μP). 2022. Hyperparameter transfer from small to large widths for stable large-model training. arxiv.org/abs/2203.03466 — §9.
Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Scale and in-context learning baseline. arxiv.org/abs/2005.14165 — §1.
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights pre-training recipe. arxiv.org/abs/2302.13971 — §1, §22.
Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023. arxiv.org/abs/2307.09288 — §1.
Meta AI, “The Llama 3 Herd of Models.” 2024. Annealing, data mix, long-context and post-training details. arxiv.org/abs/2407.21783 — §1, §3, §11, §22.
DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” 2024. MLA attention; MoE routing. arxiv.org/abs/2405.04434 — §5.
DeepSeek-AI, “DeepSeek-V3 Technical Report.” 2024. Large MoE pre-train; FP8 training notes. arxiv.org/abs/2412.19437 — §1, §5, §10, §22.
DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025. RLVR / long-CoT reasoning. arxiv.org/abs/2501.12948 — §1, §13.
Jiang et al., “Mistral 7B.” 2023. Sliding-window attention; efficient dense baseline. arxiv.org/abs/2310.06825 — §1, §22.
Jiang et al., “Mixtral of Experts.” 2024. Sparse MoE at 8×7B scale. arxiv.org/abs/2401.04088 — §1, §5, §22.
Qwen Team, “Qwen2 Technical Report.” 2024. Multilingual + code scaling. arxiv.org/abs/2407.10671 — §1, §4, §22.
Team Gemma et al., “Gemma 2: Improving Open Language Models at a Practical Size.” 2024. arxiv.org/abs/2408.00118 — §4, §22.
Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” 2024. Synthetic-heavy small-model training. arxiv.org/abs/2404.14219 — §3, §22.
Epoch AI, “Compute Trends Across AI.” Living database of training FLOPs and cluster scale. epoch.ai/data — §1 KPIs, §19.

Architecture, attention & tokenization (§4–5)

Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer baseline. arxiv.org/abs/1706.03762 — §5.
Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE).” 2021. arxiv.org/abs/2104.09864 — §5, §11.
Zhang & Sennrich, “Root Mean Square Layer Normalization.” 2019. RMSNorm used in Llama-class stacks. arxiv.org/abs/1910.07467 — §5.
Shazeer, “GLU Variants Improve Transformer.” 2020. SwiGLU activation family. arxiv.org/abs/2002.05202 — §5.
Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need” (MQA). 2019. arxiv.org/abs/1911.02150 — §5.
Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” 2023. arxiv.org/abs/2305.13245 — §5.
Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” NeurIPS 2022; FlashAttention-2/3 are follow-ons in the same line of work. arxiv.org/abs/2205.14135 — §5, §8, §10.
Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR 2022. MoE routing at scale. arxiv.org/abs/2101.03961 — §1, §5.
Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” 2020. Expert parallelism patterns. arxiv.org/abs/2006.16668 — §5, §8.
Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (BPE). 2016. Byte-level BPE lineage. arxiv.org/abs/1508.07909 — §4.
Kudo & Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” 2018. Unigram/BPE tooling. arxiv.org/abs/1808.06226 — §4.
Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models.” 2023. Long-context RoPE scaling. arxiv.org/abs/2309.00071 — §1, §11.
Liu et al., “Ring Attention with Blockwise Transformers for Near-Infinite Context.” 2023. Sequence/context parallel for 128K+ training. arxiv.org/abs/2310.01889 — §8.

Data sourcing, cleaning & mixture design (§3)

Penedo et al., “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.” 2024. Open pre-training corpus; cleaning ratios cited in §3 funnel. arxiv.org/abs/2406.17557 — §3, §22.
Li et al., “DataComp-LM: In search of the next generation of training sets for language models” (DCLM). 2024. arxiv.org/abs/2406.11794 — §3, §22.
Soldaini et al., “Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” 2024. arxiv.org/abs/2402.00159 — §22.
Together AI, “RedPajama: An Open Source Recipe to Reproduce LLaMA Training Dataset.” 2023. github.com/togethercomputer/RedPajama-Data — §3, §22.
Penedo et al., “The RefinedWeb Dataset for Falcon LLM.” 2023. Large-scale web filter pipeline. arxiv.org/abs/2306.01116 — §3.
Xie et al., “DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.” NeurIPS 2023. Learned domain weights. arxiv.org/abs/2305.10429 — §3.
Lee et al., “Deduplicating Training Data Makes Language Models Better.” ACL 2022. Near-dup removal motivation. arxiv.org/abs/2107.06499 — §3.
Brown et al., “Don’t Count on Luck: Mitigating Benchmark Contamination in LLM Evaluation.” 2024. Contamination index mindset in §3. arxiv.org/abs/2406.11252 — §3, §15.
Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2023. Synthetic-data / model-collapse risks in §3. arxiv.org/abs/2305.17493 — §3.
Common Crawl Foundation. Web crawl snapshots (raw input to most open corpora). commoncrawl.org — §3.

Pre-training loop, optimization & distributed systems (§7–10)

Loshchilov & Hutter, “Decoupled Weight Decay Regularization” (AdamW). ICLR 2019. Default LLM optimizer. arxiv.org/abs/1711.05101 — §7, §9.
Chen et al., “Training Deep Nets with Sublinear Memory Cost” (activation checkpointing). 2016. arxiv.org/abs/1604.06174 — §8, §10.
Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC 2020. ZeRO-1/2/3 sharding. arxiv.org/abs/1910.02054 — §8.
Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” 2021. Tensor + pipeline parallel. arxiv.org/abs/2104.04473 — §8, §22.
Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” 2019. arxiv.org/abs/1909.08053 — §8.
Zhao et al., “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.” 2023. arxiv.org/abs/2304.11277 — §8, §22.
Microsoft DeepSpeed Team. ZeRO, MoE, and large-cluster training docs. deepspeed.ai — §8, §22.
PyTorch Team, “torchtitan” reference pre-training stack. github.com/pytorch/torchtitan — §22.
Micikevicius et al., “FP8 Formats for Deep Learning.” 2022. FP8 training background in §10. arxiv.org/abs/2209.05433 — §10.
NVIDIA Corporation. H100/B200 datasheets and NeMo/Megatron-Core documentation for cluster sizing in §17. nvidia.com/data-center — §17, §19.

Mid-training, post-training & alignment (§11–13)

Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT / RLHF). NeurIPS 2022. arxiv.org/abs/2203.02155 — §12.
Schulman et al., “Proximal Policy Optimization Algorithms.” 2017. PPO baseline for RLHF. arxiv.org/abs/1707.06347 — §12.
Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. arxiv.org/abs/2305.18290 — §1, §12.
Azar et al., “A General Theoretical Paradigm to Understand Learning from Human Preferences” (IPO). 2024. DPO overfitting fix. arxiv.org/abs/2310.12036 — §12.
Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback without pairs. arxiv.org/abs/2402.01306 — §12.
Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF lineage. arxiv.org/abs/2212.08073 — §12, §16.
Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality SFT sample efficiency. arxiv.org/abs/2305.11206 — §12.
Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (GRPO). 2024. Group-relative RL without critic. arxiv.org/abs/2402.03300 — §1, §13.
OpenAI, “Learning to Reason with LLMs” (o1 announcement post). 2024. Test-time compute / long CoT product framing. openai.com — §13.
McKinney et al., “Avoiding Catastrophic Forgetting in Continual Learning of New Languages.” 2022. Mixing general data during CPT (§11). arxiv.org/abs/2205.09357 — §11, §21.

Parameter-efficient fine-tuning (§14)

Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. arxiv.org/abs/2106.09685 — §14.
Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. NF4 + LoRA recipe in §14 code block. arxiv.org/abs/2305.14314 — §1 KPI, §14.
Yadav et al., “TIES-Merging: Resolving Interference When Merging Models.” 2023. Adapter merge context in §14. arxiv.org/abs/2306.01708 — §14.

Evaluation, benchmarks & LLM-as-judge (§15)

Hendrycks et al., “Measuring Massive Multitask Language Understanding” (MMLU). ICLR 2021. arxiv.org/abs/2009.03300 — §15.
Wang et al., “MMLU-Pro: A More Robust and Challenging Multi-Task Language Model Benchmark.” 2024. arxiv.org/abs/2406.01574 — §15.
Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” 2023. arxiv.org/abs/2311.12022 — §15.
Chen et al., “Evaluating Large Language Models Trained on Code” (HumanEval). 2021. arxiv.org/abs/2107.03374 — §15.
Cobbe et al., “Training Verifiers to Solve Math Word Problems” (GSM8K). 2021. arxiv.org/abs/2110.14168 — §15.
Hendrycks et al., “Measuring Mathematical Problem Solving With the MATH Dataset.” NeurIPS 2021. arxiv.org/abs/2103.03874 — §15.
Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them” (BBH). 2022. arxiv.org/abs/2210.09261 — §15.
Zhou et al., “Instruction-Following Evaluation for Large Language Models” (IFEval). 2023. arxiv.org/abs/2311.07911 — §15.
Liang et al., “Holistic Evaluation of Language Models (HELM).” 2022. arxiv.org/abs/2211.09110 — §22.
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. Position/verbosity bias notes in §15.3. arxiv.org/abs/2306.05685 — §15.
Gao et al., “A framework for few-shot language model evaluation” (lm-evaluation-harness). EleutherAI. github.com/EleutherAI/lm-evaluation-harness — §15, §22.
OpenCompass Contributors. Open benchmark runner used alongside harness. github.com/open-compass/opencompass — §22.
Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?” 2024. arxiv.org/abs/2404.06654 — §15.

Safety, red-teaming & governance (§16)

Mazeika et al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.” 2024. arxiv.org/abs/2402.04249 — §15–16.
Anthropic, “Responsible Scaling Policy” and model cards. Frontier safety governance framing. anthropic.com/news — §16.
UK AI Safety Institute, “Inspect” evaluation framework. Agent and capability evals in §22. github.com/UKGovernmentBEIS/inspect_ai — §22.
NIST, AI Risk Management Framework (AI RMF 1.0). 2023. nist.gov/ai-rmf — §16.
European Union, Regulation (EU) 2024/1689 (Artificial Intelligence Act). EUR-Lex — §16.

Inference, quantization & distillation (§18)

Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. arxiv.org/abs/2309.06180 — §18, §22.
Leviathan et al., “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. arxiv.org/abs/2211.17192 — §18.
Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” 2023. arxiv.org/abs/2210.17323 — §18.
Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” 2023. arxiv.org/abs/2306.00978 — §18.
Hinton et al., “Distilling the Knowledge in a Neural Network.” NIPS 2014 Deep Learning Workshop (arXiv 2015). Teacher–student framing in §18. arxiv.org/abs/1503.02531 — §18.
Sanh et al., “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” 2019. Classic distillation baseline. arxiv.org/abs/1910.01108 — §18.
ggml / llama.cpp project. CPU/edge GGUF inference in §18, §22. github.com/ggerganov/llama.cpp — §22.

Tooling, frameworks & datasets (§22)

Hugging Face, Transformers, Accelerate, TRL, PEFT, and tokenizers documentation. Default fine-tuning and alignment stack. huggingface.co/docs/transformers — §14, §22.
Axolotl, Unsloth, LLaMA-Factory. Community LoRA/QLoRA recipes referenced in §22.1. github.com/OpenAccess-AI-Collective/axolotl — §22.
OpenRLHF, verl. PPO/GRPO/RLVR pipelines in §22.1. github.com/volcengine/verl — §13, §22.
MosaicML Streaming, WebDataset. High-throughput sharded data loading. github.com/webdataset/webdataset — §22.
Allen AI, OLMo / Dolma tooling; used alongside datatrove (Hugging Face) and text-dedup. github.com/allenai/OLMo — §22.
Weights & Biases, MLflow, TensorBoard. Experiment tracking named in §22.3. Vendor docs at respective sites — §22.

KPI strip & cost tables (§1, §19)

Figures such as 15T+ frontier tokens, ~20× Chinchilla ratio, $10M–$500M pre-train spend, and <$100 LoRA-on-7B are synthesis bands drawn from the Chinchilla, DeepSeek-V3, Epoch AI, and QLoRA sources above plus operator reports—not one published table. Re-derive any budget from your own FLOPs accounting and current GPU list pricing before you quote externally.