Training a Large Language Model (LLM) — end-to-end
How I think about building and shipping an LLM in 2026: data pipelines, transformer internals, distributed training (FSDP / 3D parallelism), post-training (SFT, DPO, RLHF, RLVR), evals, infra, and the cost math that kills projects when you ignore it.
In 2026, I treat training an LLM less as a single act and more as a pipeline of three pipelines: a pre-training run that turns raw web-scale text into a base model, a mid-training stage that injects domain and reasoning data, and a post-training stage where the model learns instructions, preferences, and verifiable reasoning. Most teams will not pre-train from scratch — they will fine-tune or post-train an open base (Llama, Mistral, Qwen, DeepSeek, Gemma). The math, the data, and the engineering, however, are the same.
15T+
Pre-training tokens (frontier)
~20×
Chinchilla tokens/param ratio
$10M–$500M
Frontier pre-train spend
<$100
LoRA SFT on a 7B (cloud)
Frontier spend and LoRA cost bands are directional—see §23 and primary sources before you quote them.
What changed since 2023
Scale
Data > params
Compute-optimal trends past Chinchilla: train smaller models on far more tokens for cheaper inference.
Before tearing into any single stage, see the whole flow. Every box below has its own section.
Figure 1. The canonical LLM training pipeline from raw data to served model. Stage 2 is where 95% of the FLOPs go; stage 4 is where 95% of the perceived quality is decided.
Scope
I focus on training—data, model, optimizer, loss, infra, alignment, evals. Inference appears only where it constrains training (vocab, context, quantization, distillation). RAG and agents are out of scope here.
3. Data: Sourcing, Cleaning, Mixing
Data is the most under-appreciated lever in LLM training. A 2024 result still holds in 2026: better data beats more parameters, often by a wide margin. Most of the gap between an open base and a frontier closed model is data quality, not compute.
Math & STEM (OpenMathInstruct, ProofPile, FineMath)
2–5%
Boosts reasoning
Hard to dedupe vs benchmarks (contamination)
Multilingual (MADLAD-400, CulturaX)
10–30%
Cross-lingual transfer
Quality varies wildly by language
Synthetic (model-generated, distilled)
5–30%
Controllable, fills gaps
Mode collapse, model-collapse if recursive
3.2 The cleaning pipeline
Figure 2. Cleaning funnel for a single CommonCrawl dump. Numbers are order-of-magnitude; FineWeb-Edu, DCLM and RedPajama-v2 publish the actual ratios.
3.3 Decontamination — the silent killer
Decontaminate before you train, not after
Strip n-gram overlap (13-gram is the common cutoff) against every benchmark you plan to report — MMLU, HumanEval, GSM8K, MATH, BBH, IFEval, AGIEval, the lot. Contamination is the most common reason a model "looks SOTA" in the paper and falls over in production. Maintain a frozen contamination index in the data team's repo.
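As a minimal sketch of the 13-gram overlap check described above, the snippet below builds a set of benchmark n-grams and flags any training document that shares one. Whitespace tokenization, lowercasing, and the helper names (`ngrams`, `build_index`, `is_contaminated`) are illustrative assumptions; production pipelines typically normalize text more aggressively and use hashed or Bloom-filter indexes.

```python
from typing import Iterable, Set, Tuple

NGRAM = 13  # the cutoff named in the text; other window sizes are also used

def ngrams(text: str, n: int = NGRAM) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a string (empty if too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(benchmark_items: Iterable[str], n: int = NGRAM) -> Set[Tuple[str, ...]]:
    """Union of all n-grams across benchmark prompts and answers."""
    index: Set[Tuple[str, ...]] = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(doc: str, index: Set[Tuple[str, ...]], n: int = NGRAM) -> bool:
    """True if the document shares at least one n-gram with the benchmark index."""
    return not ngrams(doc, n).isdisjoint(index)
```

Freezing the output of `build_index` alongside the benchmark versions it was built from is one way to keep the contamination index reproducible.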
3.4 Data mixing & curriculum
The mixture weight per source is itself a hyperparameter. Two practical recipes:
DoReMi (Domain Reweighting with Minimax Optimization): train a small reference model, then train a small proxy model that learns per-domain weights minimizing the worst-domain excess loss relative to that reference, then train the big model with those weights.
Curriculum/annealing — start with diverse web data, gradually shift toward high-quality math/code/reasoning in the last 10–20% of tokens (the "annealing" phase popularized by Llama 3 and DeepSeek). LR is decayed in parallel.
4. Tokenization & Vocab
The tokenizer freezes before pre-training and is effectively unchangeable. A bad tokenizer leaves performance on the table forever — especially for non-English, code, and math.
4.1 Algorithm choice
Algorithm | Used by | Notes
Byte-Pair Encoding (BPE) | GPT-2/3/4, Llama, Mistral | Default. Use byte-level BPE (no UNK).
SentencePiece Unigram | T5, Gemma, mT5 | Probabilistic; good for many languages
WordPiece | BERT | Legacy; rarely used for new LLMs
Tiktoken (cl100k/o200k) | OpenAI models | Highly tuned BPE; ~100K–200K vocab
4.2 Vocab size trade-offs
Small (32K)
Llama 2 era
Cheaper embed/lm-head, more tokens per word, slower for non-English.
Mid (128K)
Llama 3 / Qwen
Sweet spot for multilingual + code. ~15% fewer tokens per doc vs 32K.
Large (200K+)
GPT-4o / Gemini
Excellent compression, but embedding table eats memory and may bloat small models.
4.3 Critical details devs miss
Digit splitting: split each digit into its own token (Llama-style) — dramatically helps arithmetic.
Pre-tokenization regex: the GPT-4 split pattern (contractions, whitespace, numbers) is widely copied; use it.
Byte fallback: guarantees no UNK tokens; mandatory for code and arbitrary Unicode.
Special tokens: reserve <|im_start|>, <|im_end|>, tool tags, thinking tags now — adding them later requires embedding surgery.
Train the tokenizer on the data mix you will train on — including code and target languages, not just English web.
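To make the digit-splitting point concrete, here is a small pre-tokenization sketch that isolates every digit before any BPE merges are learned. The regex is a deliberately simplified, hypothetical pattern and not the GPT-4 split pattern; it only illustrates the idea that digits never merge into multi-digit tokens while the split stays lossless.

```python
import re

# Hypothetical simplified pattern: single digits, letter runs, whitespace runs,
# punctuation runs, and underscore runs. Every character matches exactly one branch.
PATTERN = re.compile(r"\d|[^\W\d_]+|\s+|[^\w\s]+|_+")

def pre_tokenize(text: str) -> list:
    """Split text into chunks; BPE merges would then be learned within chunks only."""
    return PATTERN.findall(text)

pieces = pre_tokenize("x = 1234")
# Each digit is its own chunk, so "1234" can never become a single token.
```

Because the chunks concatenate back to the input, the scheme stays compatible with byte-level fallback.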
Almost every modern LLM is a decoder-only transformer with a handful of upgrades. Below is the canonical block, then the modifications that actually matter in 2026.
Figure 3. Modern decoder-only transformer block. Llama-3, Qwen-3, and Mistral are all variations of this skeleton.
5.3 Mixture of Experts (MoE) — when to reach for it
MoE replaces the MLP in each block with E experts and a router that activates top-k per token (typically k=1 or 2). Active parameters per token are far below total parameters, so you train a "huge" model at the FLOPs of a small one.
Pro: 4–10× more parameters at the same training cost; better quality/FLOP.
Con: hard to fine-tune; needs expert-parallel comms; load-balancing loss; memory-heavy at inference.
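The routing step can be sketched in a few lines. This is a toy, single-token illustration in pure Python: experts are plain callables, and renormalizing the gate weights over the selected top-k experts is one common convention that is assumed here rather than taken from the text. Load-balancing losses and expert-parallel communication are left out.

```python
import math
from typing import Callable, List, Tuple

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x: float, router_logits: List[float],
                experts: List[Callable[[float], float]], k: int = 2) -> float:
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))
```

Only k of the E experts run per token, which is where the gap between active and total parameters comes from.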
5.4 Non-transformer contenders (know them, don't always use them)
Mamba / SSM hybrids (Jamba, Zamba) — linear-time long context, but ecosystem still thin.
RWKV / RetNet — RNN-flavored with parallel training; niche but useful for ultra-long sequences.
Hybrid attention+SSM blocks — best of both, increasingly mainstream in 2026.
6. Scaling Laws & Compute Budgets
6.1 The Chinchilla rule of thumb
Given a fixed compute budget C (FLOPs), loss is minimized when the number of training tokens D and the number of parameters N are scaled in roughly equal proportion, with D ≈ 20 × N. The familiar 6× factor relates them to FLOPs:
C ≈ 6 · N · D
For inference-heavy products, push D/N far above 20 (Llama-3-8B used ~1900). You pay more to train, but every served token is cheaper.
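A quick worked version of this arithmetic, as a sketch: substituting D = r·N into C ≈ 6·N·D gives N = sqrt(C / (6r)). The function names are illustrative, and the default r = 20 is the rule of thumb quoted above, not a fitted constant.

```python
import math

def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*D with D = r*N for parameters N and tokens D."""
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute from the 6*N*D rule."""
    return 6.0 * n_params * n_tokens

# Budget that a 7B model at 20 tokens/param would consume
n, d = compute_optimal(train_flops(7e9, 140e9))

# Over-trained example from the text: 8B parameters on 15T tokens
ratio = 15e12 / 8e9          # 1875 tokens per parameter
flops = train_flops(8e9, 15e12)
```

Raising `tokens_per_param` for a fixed budget shrinks N, which is the lever that trades training cost for cheaper serving.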
Figure 4. Order-of-magnitude compute budget by model size. MFU (Model FLOPs Utilization) of 35–55% is realistic on a healthy cluster.
6.2 Inference-aware scaling
Llama 3 and DeepSeek both deliberately went past the Chinchilla optimum: Llama 3 trained 8B and 70B models on 15T+ tokens, and DeepSeek-V3 trained an MoE with 37B active parameters on 14.8T tokens. The reasoning: lifetime inference dwarfs training, so a slightly under-parameterized but over-trained model wins on total cost of ownership.
Pick D/N by where the dollars live
If you serve more than ~10²² inference FLOPs (large product), over-train. If the model is research-only or short-lived, stick near Chinchilla optimum.
7. Pre-training: Objectives & Loop
7.1 The objective
Canonical pre-training loss is causal language modeling — predict the next token given all previous tokens, averaged over the corpus. Sequences are packed (concatenated with separators) to keep GPUs busy.
Pack multiple short documents into one fixed-length sequence with <|endoftext|> separators.
Use document-level attention masking (sometimes called "intra-document masking") so attention doesn't bleed across documents — this matters for clean long-context training.
FlashAttention 2/3 supports variable-length packed sequences via cu_seqlens.
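The packing and masking steps above can be sketched as follows. This is a plain-Python illustration with assumed names (`pack`, `doc_mask`) and a hypothetical separator id; it builds the cumulative-offset list that `cu_seqlens` refers to and the equivalent dense block-diagonal causal mask. Real loaders also split long documents and avoid materializing the dense mask.

```python
from typing import List, Tuple

EOT = 0  # hypothetical token id standing in for <|endoftext|>

def pack(docs: List[List[int]], seq_len: int) -> Tuple[List[int], List[int]]:
    """Greedily pack whole documents (each followed by EOT) into one sequence.

    Returns the packed tokens and cumulative document boundaries [0, end_1, end_2, ...].
    """
    tokens: List[int] = []
    cu_seqlens = [0]
    for doc in docs:
        piece = doc + [EOT]
        if len(tokens) + len(piece) > seq_len:
            break  # this sketch simply stops; a real packer would start a new row
        tokens.extend(piece)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens

def doc_mask(cu_seqlens: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j: same document and j <= i."""
    total = cu_seqlens[-1]
    doc_id = [0] * total
    for d, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for t in range(start, end):
            doc_id[t] = d
    return [[doc_id[i] == doc_id[j] and j <= i for j in range(total)]
            for i in range(total)]
```

Variable-length attention kernels consume the boundary list directly, so the quadratic mask here is only for checking the logic.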
7.3 The training loop, end-to-end
Figure 5. Pre-training loop. Loss spikes are normal; well-engineered runs rewind by ~100 steps and skip the offending batch.
7.4 Batch size: tokens, not examples
Think in tokens per step, not sequences. Frontier runs use 4M–16M tokens per global batch. Why so large? Adam's noise scale and tensor parallel efficiency both prefer big batches; small batches stall training and destabilize loss.
8. Distributed Training (DP, TP, PP, FSDP)
A 7B model in BF16 is ~14 GB of weights, but training needs at least ~8× that: 16 bytes per parameter for weights, gradients, and FP32 Adam state (see 10.2), with activations on top. For anything above 1B, you split the model across devices.
8.1 The four kinds of parallelism
Data Parallel (DDP)
Replicate, split batch
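Each GPU has a full model copy and processes a slice of the batch. Simplest. Hits the memory wall at a few billion parameters on an 80 GB GPU, because every rank holds the full 16 bytes/param of training state.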
Tensor Parallel (TP)
Split each matmul
Megatron-style: split weight matrices across GPUs intra-layer. Heavy all-reduce traffic — only across NVLink/NVSwitch in one node.
Pipeline Parallel (PP)
Split layer stack
Different GPUs hold different layers. Use 1F1B / interleaved schedules to minimize bubbles.
Sequence / Context Parallel
Split the sequence
Ring attention spreads long-context attention across GPUs. Essential for 128K+ training.
8.2 ZeRO / FSDP — sharded data parallel
PyTorch FSDP (and DeepSpeed ZeRO) shards optimizer state, gradients, and parameters across data-parallel ranks. It is the default for < 70B models because it's almost as simple as DDP but scales to 100B+.
Stage | Shards | Memory savings | Comms cost
ZeRO-1 | Optimizer state | ~4× | Low
ZeRO-2 / FSDP SHARD_GRAD_OP | + Gradients | ~8× | Medium
ZeRO-3 / FSDP FULL_SHARD | + Parameters | ~N× (N = world size) | High (all-gather per layer)
8.3 3D parallelism — the frontier recipe
Figure 6. 3D parallelism on a 64-GPU pod. Real frontier runs nest this further with sequence parallel and expert parallel.
8.4 Picking a strategy by model size
Model | Recommended
<1B | Plain DDP
1B – 13B | FSDP (ZeRO-3) with selective activation checkpointing
13B – 70B | FSDP + tensor parallel within a node (TP=2/4/8)
70B – 400B dense | 3D parallelism (TP + PP + DP) + sequence parallel for long context
MoE at any scale | Add expert parallel; tune all-to-all communication
WSD schedule (warmup-stable-decay) increasingly popular
Grad clip
1.0 (global L2)
Drop offending batch if spikes persist
Init
Normal(0, 0.02) scaled by 1/√(2·N_layers) on residual projections
Mu-Parametrization (μP) lets you tune at small scale and transfer
Dropout
0.0 in pre-training
Use only in fine-tune if overfitting
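The warmup-stable-decay (WSD) schedule mentioned above can be sketched as a pure function of the step. The warmup and decay fractions and the linear decay shape are illustrative assumptions; published recipes vary the decay curve and the fraction of steps it covers.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.10,
           min_lr: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup, flat plateau, linear decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_start = total_steps - int(total_steps * decay_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress
```

One practical appeal of the flat plateau is that a checkpoint taken anywhere in the stable phase can be branched into its own decay run.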
9.2 The optimizer of 2026
AdamW remains the default, but watch:
Lion / Sophia / Shampoo — sometimes faster, less validated.
Distributed Shampoo — used in production at Google; 30–40% step reduction at scale.
Adafactor — memory-light, popular for very large MoE.
9.3 μP — tune small, train big
Maximal Update Parametrization re-scales initialization, LR, and projection so that the same hyperparameters transfer from a tiny proxy model up to the full-scale run. It collapses the hyperparameter search from "burn 1% of your training budget" to "burn 0.01%".
10. Stability, Mixed Precision, Memory
10.1 Precision recipes
Format | Use for | Notes
BF16 | All forward / backward activations & weights | Wider exponent than FP16; no loss-scaling needed
FP32 | Optimizer master copy of weights, loss reduction | Avoids drift over millions of steps
FP8 (E4M3 / E5M2) | H100/H200/B200 matmuls, with per-tensor scaling | ~2× throughput; needs careful scaling, used by DeepSeek-V3
INT8 / INT4 | Post-training quantization for inference | Not for training
10.2 The memory budget per GPU (BF16 training)
# Per-parameter cost
weight (BF16)               :  2 bytes
grad   (BF16)               :  2 bytes
adam   (FP32 m, v, master)  : 12 bytes
                              ────────
                              16 bytes / param
# 7B model on one GPU: 7e9 * 16 = 112 GB (won't fit on H100)
# With FSDP-FullShard over 8 GPUs: 14 GB / GPU for state
# Activations dominate the rest: use activation checkpointing
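The same budget as a small calculator, extended to the ZeRO stages from 8.2. It is a sketch under the 2 + 2 + 12 bytes/param split shown above and counts model state only; activations, buffers, and fragmentation are ignored, and the function name is illustrative.

```python
def state_gb_per_gpu(n_params: float, world_size: int, zero_stage: int) -> float:
    """Model-state memory per GPU in GB for mixed-precision Adam training.

    zero_stage 0 is plain data parallel; 1 shards optimizer state,
    2 also shards gradients, 3 also shards the BF16 weights.
    """
    weights = 2.0 * n_params   # BF16 weights
    grads = 2.0 * n_params     # BF16 gradients
    optim = 12.0 * n_params    # FP32 master weights + Adam m and v
    if zero_stage >= 1:
        optim /= world_size
    if zero_stage >= 2:
        grads /= world_size
    if zero_stage >= 3:
        weights /= world_size
    return (weights + grads + optim) / 1e9

for stage in range(4):
    print(stage, state_gb_per_gpu(7e9, world_size=8, zero_stage=stage))
```

For 7B parameters on 8 GPUs this reproduces the 112 GB and 14 GB endpoints quoted in the block above.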
10.3 Loss spike playbook
When the loss explodes at step 47,328
Check grad norm — if it spiked, ratio of last few norms tells you how bad.
Rewind to the last clean checkpoint (~100 steps back).
Skip the data shard / batch that caused it.
If recurring: reduce peak LR by 20% and resume.
If still happening: switch one layer of attention to FP32 softmax, or check for bad init in a specific layer.
10.4 Activation checkpointing
Recompute activations during backward instead of storing them. Costs ~30% extra compute, saves ~10× activation memory. Use selective checkpointing (only the attention and MLP outputs) to balance speed and memory.
RoPE in FP32 for the sin/cos table — small fix that prevents long-context drift.
Z-loss (auxiliary loss on logsumexp) — stabilizes the LM head; used by PaLM and many frontier runs.
QK-norm — RMSNorm on Q and K before attention; tames softmax outliers.
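As an illustration of the z-loss item, the sketch below adds a penalty on the squared log-partition function to a single-token cross-entropy. The 1e-4 coefficient is the value reported for PaLM; the function names are assumptions, and a real implementation would operate on batched tensors.

```python
import math
from typing import List

def logsumexp(logits: List[float]) -> float:
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def ce_with_z_loss(logits: List[float], target: int, z_coeff: float = 1e-4) -> float:
    """Cross-entropy for one token plus z_coeff * (log Z)^2.

    The extra term pulls log Z toward 0, discouraging logit drift in the LM head.
    """
    log_z = logsumexp(logits)
    cross_entropy = log_z - logits[target]
    return cross_entropy + z_coeff * log_z ** 2
```

Shifting all logits by a constant leaves the cross-entropy unchanged but raises the penalty, which is exactly the drift the term is meant to suppress.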
11. Mid-training & Continued Pre-training
Between base pre-training and post-training there is now a distinct phase — variously called annealing, mid-training, or continued pre-training — where the model is exposed to higher-quality data with a lower learning rate.
11.1 What goes in
High-quality reasoning corpora (textbooks, curated arXiv, math proofs).
Code repositories with passing tests.
Long documents for context extension (concatenated books, long codebases).
Multilingual upsampling.
Tool-use traces (function calls, JSON outputs).
11.2 Long-context extension
To extend RoPE-based attention from 8K to 128K/1M:
Increase RoPE base (the θ parameter) — e.g. 10,000 → 500,000.
Apply YaRN or NTK-aware scaling at the same time for better extrapolation.
Continue training on long documents for ~1–10B tokens.
Evaluate on needle-in-a-haystack, RULER, LongBench v2 to confirm.
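To see why raising the RoPE base helps, the sketch below computes the standard RoPE inverse frequencies and the wavelength of the slowest-rotating dimension pair. The head dimension of 128 is an assumption for illustration; YaRN and NTK-aware scaling modify these frequencies further and are not shown.

```python
import math
from typing import List

def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Inverse frequencies base^(-2i/d) for each of the d/2 rotated pairs."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim: int, base: float) -> float:
    """Number of positions before the slowest pair completes one full rotation."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]

short = longest_wavelength(128, 10_000.0)    # tens of thousands of positions
long = longest_wavelength(128, 500_000.0)    # millions of positions
```

With the larger base, the slowest pairs no longer wrap around inside a 128K window, so distant positions stay distinguishable.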
11.3 Continued pre-training on your domain (CPT)
If you have a corpus that is mostly out-of-distribution for the base model (legal, biomedical, niche programming language), CPT can lift downstream performance more cheaply than re-pretraining. Rules:
Keep some % (10–30%) of the original mix to prevent catastrophic forgetting.
Use 0.1× the original peak LR.
Train for tokens roughly equal to 0.1–1× the new-domain corpus size.
Evaluate on both old and new domain to catch regression.
12. Post-training: SFT, DPO, RLHF, RLVR
12.1 The pipeline
Figure 7. Post-training pipeline. Many teams stop at SFT + DPO; RL adds another quality jump but is expensive and finicky.
12.2 SFT (Supervised Fine-Tuning) — turn a base into a chat model
SFT teaches format and basic instruction-following. Quality of examples >> quantity. 5K LIMA-style hand-curated examples can beat 500K low-quality ones.
# Conversation template (ChatML-flavored)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain RoPE in one paragraph.<|im_end|>
<|im_start|>assistant
Rotary Position Embedding rotates the query and key vectors ...<|im_end|>
# Loss masking: compute loss ONLY on assistant tokens
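A minimal sketch of that loss masking, assuming each token already carries the role of the turn it belongs to. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; the helper name and the toy ids are illustrative.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def build_labels(token_ids: List[int], roles: List[str]) -> List[int]:
    """Copy token ids as labels for assistant tokens and mask everything else."""
    return [tid if role == "assistant" else IGNORE_INDEX
            for tid, role in zip(token_ids, roles)]

# Toy example: two system tokens, two user tokens, three assistant tokens
ids = [11, 12, 21, 22, 31, 32, 33]
roles = ["system", "system", "user", "user", "assistant", "assistant", "assistant"]
labels = build_labels(ids, roles)
```

Whether the template tokens that open and close the assistant turn count as assistant tokens is a per-recipe choice; including the closing tag teaches the model to stop.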
12.3 DPO — preference learning without a reward model
Direct Preference Optimization re-parametrizes the RLHF objective so you can train directly on (chosen, rejected) pairs against a frozen reference model. No reward model, no PPO, no rollouts. The closed-form loss:
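For reference, the loss as given in Rafailov et al. (2023), where π_θ is the policy being trained, π_ref the frozen reference, y_w and y_l the chosen and rejected responses to prompt x, σ the logistic function, and β the strength of the implicit KL constraint:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Common variants of the same idea: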
KTO — uses single thumbs-up/down rather than pairs.
ORPO — combines SFT + preference in one stage.
SimPO — reference-free, simpler scaling.
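The DPO loss from 12.3 for a single preference pair can be sketched numerically as below. Inputs are summed sequence log-probabilities under the policy and the frozen reference; the argument names and the β = 0.1 default are illustrative assumptions, not values from the text.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: margin is 0 and the loss is log(2)
start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy favors the chosen answer more than the reference does: loss drops below log(2)
better = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Only log-probabilities are needed, which is why no reward model or rollouts appear anywhere in the computation.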
12.4 RLHF — PPO with a learned reward model
Classical recipe (InstructGPT, GPT-3.5): train a reward model on preference data, then optimize the policy with PPO against the RM, with a KL penalty to a reference policy.
Four models live in GPU memory at once (policy, reference, reward, critic).
Reward hacking is the primary failure mode — log everything, watch KL.
Largely replaced by DPO + RLVR in 2026, but still used at frontier labs for general chat.
12.5 RLAIF: give a model the role of the labeler
Constitutional AI / RLAIF uses a (typically larger) model to generate or rank responses. Useful when human labelers are slow or expensive; quality ceiling = quality of the judge.
13. Reasoning & Long-Horizon Training
The 2024–2026 leap in reasoning (OpenAI o1, DeepSeek-R1, Qwen-QwQ) came from RL on verifiable rewards (RLVR). The model generates long chains of thought, a verifier checks the final answer, and policy gradient pushes the model toward CoTs that lead to correct answers.
13.1 GRPO — Group Relative Policy Optimization
GRPO (introduced in DeepSeekMath, used in R1) samples G rollouts per prompt, computes a group-baseline advantage, and updates with a PPO-style clipped objective — but without a critic, which halves memory.
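# Pseudocode (one GRPO update)
for prompt in batch:
    rollouts = [policy.generate(prompt) for _ in range(G)]
    rewards = array([verifier(r) for r in rollouts])
    adv = (rewards - mean(rewards)) / (std(rewards) + 1e-8)    # group-relative advantage
    ratio = exp(logp_policy(rollouts) - logp_old(rollouts))    # importance ratio per rollout
    loss = -mean(min(ratio * adv, clip(ratio, 1 - ε, 1 + ε) * adv)) \
           + β * KL(policy || ref)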
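A runnable toy version of the same update, as a sketch: it works on one group of rollouts with sequence-level log-probabilities, whereas the published method applies the ratio and clipping per token. The KL term is passed in as a precomputed number, and the `clip_eps` and `beta` defaults are illustrative assumptions.

```python
import math
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one prompt's group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(ratio: float, adv: float, clip_eps: float = 0.2) -> float:
    """PPO-style pessimistic bound: min of unclipped and clipped surrogate."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped_ratio * adv)

def grpo_loss(logp_new: List[float], logp_old: List[float], rewards: List[float],
              kl_to_ref: float = 0.0, clip_eps: float = 0.2, beta: float = 0.04) -> float:
    """Negative mean clipped surrogate over the group plus a KL penalty."""
    advs = group_advantages(rewards)
    terms = [clipped_objective(math.exp(new - old), adv, clip_eps)
             for new, old, adv in zip(logp_new, logp_old, advs)]
    return -mean(terms) + beta * kl_to_ref
```

If every rollout in a group gets the same reward, all advantages are zero and the group contributes no policy gradient, which is why prompts that are always solved or never solved carry little signal.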
13.2 Where verifiable rewards come from
Domain | Verifier
Math | SymPy / numeric equivalence to the gold answer
Code | Unit tests pass / fail
Format / JSON | Schema validator
Multi-step tool use | End-state in a simulator
Translation / summarization | BLEU / BERTScore / LLM-judge (less crisp)
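As an example of the simplest row, here is a rule-based math verifier sketch that checks exact numeric equivalence with the standard library. The `\boxed{...}` final-answer convention and the function names are assumptions for illustration; symbolic answers would need SymPy or a similar tool instead.

```python
import re
from fractions import Fraction
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the completion, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def parse_number(text: str) -> Optional[Fraction]:
    """Parse integers, decimals, and a/b fractions exactly; None if unparseable."""
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(completion: str, gold: str) -> float:
    """1.0 if the boxed answer equals the gold answer numerically, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    predicted, expected = parse_number(answer), parse_number(gold)
    return 1.0 if predicted is not None and predicted == expected else 0.0
```

Returning 0.0 for a missing or malformed box doubles as a format reward, since the policy only scores when it follows the answer convention.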
13.3 Practical lessons from R1-style training
Start from a strong SFT model; RLVR on a weak base diverges.
Initial responses should already produce some CoT — bootstrap with synthetic CoT SFT if needed.
Track diversity collapse — entropy bonuses or temperature scheduling help.
Use a smaller policy + larger verifier (or rules) — verifier need not be a neural net.
13.4 Test-time compute as a separate axis
Trained reasoners exchange compute at inference for accuracy: search, self-consistency, MCTS over thought paths, best-of-N with a verifier. Treat inference latency as a tunable, not a fixed budget.
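Two of these strategies fit in a few lines. The sketch below assumes answers have already been extracted from N sampled completions and that `score` is any verifier or reward function supplied by the caller; both helper names are illustrative.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Self-consistency: return the most frequent non-empty answer across samples."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the verifier scores highest."""
    return max(candidates, key=score)
```

Majority voting needs no verifier at all, while best-of-N is only as good as the scorer, so the two are often used at different points on the cost curve.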
14. Fine-tuning Adapters (LoRA, QLoRA)
Most teams will never touch full-parameter fine-tuning. LoRA trains a low-rank delta (A·B, rank 8–64) injected into chosen weight matrices, freezing the base. QLoRA quantizes the base to 4-bit NF4 first, letting you fine-tune a 70B model on a single 80 GB GPU.
Figure 8. LoRA: low-rank decomposition of the weight update. Inference-time, you can merge A·B back into W for zero overhead.
14.1 LoRA recipe that just works
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all linear layers in attention + MLP).
Rank: 8–16 for style/persona, 32–64 for knowledge injection, 128+ rarely helps.
Alpha = 2 × rank.
Dropout = 0.05.
LR = 1e-4 to 3e-4 (LoRA tolerates much higher LR than full FT).
1–3 epochs on 1K–100K examples.
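To show what rank and alpha do mechanically, the sketch below counts LoRA's trainable parameters for one weight matrix and merges the adapter back into the base weight. It uses the B·A ordering and the alpha / r scaling from the LoRA paper on tiny list-of-lists matrices; the helper names are illustrative.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)

def merge_lora(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the merged weight used at inference."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# A 4096 x 4096 projection at rank 16 trains 131,072 of 16,777,216 weights (~0.8%)
share = lora_param_count(4096, 4096, 16) / (4096 * 4096)
```

With alpha fixed at 2 × rank as in the recipe, the scale stays at 2 regardless of rank, so changing rank changes capacity without also changing the effective step size.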
14.2 QLoRA in practice
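import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base; matmuls run in BF16
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
# Gated repo: requires accepting the Llama 3 license on the Hugging Face Hub
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B",
                                            quantization_config=bnb,
                                            device_map="auto")
base = prepare_model_for_kbit_training(base)  # readies the quantized model for training
# LoRA adapters on every linear layer; the base weights stay frozen
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                 target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(base, cfg)  # ~0.5% params trainable
model.print_trainable_parameters()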
14.3 When to NOT use LoRA
You need to extend context length significantly — base attention needs full training.
You're changing tokenizer or vocab (requires embedding/lm-head training).
Domain-specific eval set, LLM-as-judge, human ratings, online A/B.
15.2 The 2026 benchmark shortlist
What it measures | Use
General knowledge | MMLU-Pro (10-option, harder than MMLU)
Reasoning (graduate science) | GPQA Diamond
Math | MATH-500, AIME, OlympiadBench
Code | HumanEval+, MBPP+, LiveCodeBench (rolling)
Long context | RULER, NIAH, LongBench v2
Instruction following | IFEval, MT-Bench, Arena-Hard
Agentic / tool use | τ-Bench, SWE-Bench Verified, GAIA
Multilingual | MGSM, MMLU-translate, FLORES
Safety | HarmBench, ToxicChat, XSTest
15.3 LLM-as-judge — useful but biased
Known biases
Position bias — first response wins more often. Randomize order, or compute both orderings.
Verbosity bias — longer answers win even when wrong.
Self-preference — models prefer their own style. Always use a different judge family than the model under test.
Format bias — bulleted, formatted answers are preferred over equally correct prose.
15.4 Eval rigor for training
Always report variance over seeds — single-run differences of ~1 point on small benchmarks are noise.
Hold out an internal eval set that is never seen by anyone training.
Run contamination scans on every benchmark you report.
Track regressions on prior capabilities — fine-tuning often gains on the target task and loses elsewhere.
16. Safety, Alignment, Red-Teaming
16.1 Layered defense
Figure 9. Safety is multi-layer; no single layer is sufficient.
16.2 Constitutional AI in one sentence
Use the model itself (or another model) to critique its own responses against a written list of principles, then fine-tune on the revisions. Effective for refusal training without large human-feedback budgets.
16.3 Red-teaming
Internal team + external (HackerOne-style) bounty.
Maintain a growing private dataset of jailbreaks; re-DPO on them.
Test for multi-turn attacks — many jailbreaks unfold over 5–10 turns.
Probe for capability uplift on dual-use domains (cyber, bio).
16.4 Regulation watchlist (2026)
EU AI Act — general-purpose AI model obligations; report training compute > 10²⁵ FLOPs.
US NIST AI RMF + Executive Orders requiring model card disclosures and safety evals.
UK AISI / US AISI evals for frontier models.
Copyright lawsuits (NYT v. OpenAI, Getty Images v. Stability AI): keep dataset provenance.
17. Infrastructure & Hardware
17.1 Hardware lineup (2026)
Accelerator | HBM | BF16 TFLOPS (dense) | Notes
NVIDIA H100 SXM | 80 GB | ~990 | Workhorse; still widely available
NVIDIA H200 | 141 GB | ~990 | Same compute, more HBM
NVIDIA B200 (Blackwell) | 192 GB | ~2,250 | FP8/FP4 native; 2026 default
NVIDIA GB200 NVL72 | 13.5 TB pooled | ~160,000 (rack; 72 × ~2,250) | 72 Blackwell GPUs + Grace CPUs, rack-scale
AMD MI300X / MI325X | 192 / 256 GB | ~1,300 / ~1,300 | Strong for inference; software gap narrowing
Google TPU v5p / v6e | 95 / 32 GB | ~459 / ~918 | For JAX/XLA stacks
AWS Trainium 2 | 96 GB | ~650 | NeuronSDK; cost competitive on AWS
17.2 Networking is the bottleneck
Compute is rarely what stalls a training run — the interconnect is. Required topology:
Intra-node: NVLink/NVSwitch (900 GB/s between H100s in an 8-GPU node).
Inter-node: InfiniBand NDR (400 Gb/s/port × 8 = 3.2 Tb/s, i.e. ~400 GB/s per node) or RoCEv2.
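A back-of-the-envelope sketch of why these numbers matter: converting link rates from bits to bytes, then estimating a ring all-reduce of one full set of BF16 gradients. The model is bandwidth-only (no latency, no overlap with compute) and the function names are illustrative assumptions.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Network rates are quoted in bits per second; memory sizes are in bytes."""
    return gbit_per_s / 8.0

def ring_allreduce_seconds(payload_bytes: float, n_ranks: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth term of a ring all-reduce: each rank moves 2*(n-1)/n of the payload."""
    return 2.0 * (n_ranks - 1) / n_ranks * payload_bytes / bandwidth_bytes_per_s

node_gbytes = gbit_to_gbyte(400.0 * 8)  # eight 400 Gb/s ports -> 400 GB/s per node
# 7B parameters of BF16 gradients (14 GB) across 8 ranks at 400 GB/s each
seconds = ring_allreduce_seconds(14e9, 8, node_gbytes * 1e9)
```

At multi-million-token batches a step lasts seconds, so a transfer of this size can hide behind compute; on slower links it cannot.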
Speculative decoding: a small draft model proposes tokens and the big model verifies them; 2–3× lower latency.
Prefix caching — reuse KV cache for shared system prompts.
Disaggregated prefill/decode — prefill is compute-bound, decode is memory-bound — run them on separate fleets.
18.3 Distillation — bake a big model into a small one
Generate (prompt, response, top-k log-probs) tuples from the big teacher; train a small student to match the distribution. Used everywhere now (Gemma 2's 2B and 9B models were distilled from a larger teacher; Llama 3.2 1B/3B used logits from Llama 3.1 8B and 70B).
KL on full logits > SFT on samples > SFT on argmax.
Match temperature and vocab — distillation across tokenizers is hard.
Combine with on-policy RL for the final 1–2 points.
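The "KL on full logits" option can be sketched for a single token position as below. It computes forward KL from teacher to student at a shared temperature over a shared vocabulary; the function names are illustrative, and the extra T² factor some recipes apply to the loss is omitted.

```python
import math
from typing import List

def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def kl_distill_loss(teacher_logits: List[float], student_logits: List[float],
                    temperature: float = 1.0) -> float:
    """KL(teacher || student) over the vocabulary at one position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

Both logit vectors must index the same vocabulary, which is the practical reason cross-tokenizer distillation is hard.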
19. Cost & Economics
19.1 What things actually cost (2026 cloud, list prices)
Activity | Cost | Notes
1B-param scratch pre-train, 20B tokens | $2K–$8K | 32× H100 for ~1 day
7B scratch pre-train, 1T tokens | $200K–$500K | 256× H100 for ~2 weeks
70B scratch pre-train, 2T tokens | $5M–$15M | 1024× H100 for ~1 month
Frontier pre-train (400B dense / MoE 600B) | $100M–$500M+ | 10K+ H100 for months
SFT on 7B (100K examples, 3 epochs) | $50–$300 | 8× H100 for hours
QLoRA on 70B (50K examples) | $50–$200 | 1× H100 for hours
DPO on 7B (100K pairs) | $100–$500 | 8× H100 for hours
RLVR on 7B (long CoT, GRPO) | $5K–$50K | Days to weeks of rollouts
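For a first-principles cross-check on pre-training rows like these, the sketch below turns C ≈ 6·N·D into GPU-hours and dollars. The peak throughput (H100 BF16 dense), 40% MFU, and $2.50 per GPU-hour are stated assumptions; the result covers the FLOPs of one uninterrupted run only, so it should be read as a floor rather than a budget.

```python
def gpu_hours(n_params: float, n_tokens: float,
              peak_tflops: float = 989.0, mfu: float = 0.40) -> float:
    """GPU-hours for one clean pre-training run under the 6*N*D approximation."""
    total_flops = 6.0 * n_params * n_tokens
    sustained_flops_per_s = peak_tflops * 1e12 * mfu
    return total_flops / sustained_flops_per_s / 3600.0

def run_cost_usd(n_params: float, n_tokens: float,
                 usd_per_gpu_hour: float = 2.50, **kwargs) -> float:
    """Dollar cost at an assumed hourly GPU price."""
    return gpu_hours(n_params, n_tokens, **kwargs) * usd_per_gpu_hour

hours_7b = gpu_hours(7e9, 1e12)       # roughly 29,500 H100-hours at 40% MFU
cost_7b = run_cost_usd(7e9, 1e12)
```

Restarts, ablations, evaluation, storage, and idle reserved capacity all sit on top of this figure, and lower MFU scales it up proportionally.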
19.2 Cost levers (in order of impact)
Don't pre-train from scratch. Almost always start from an open base.
Pick the smallest model that meets the bar after distillation/quantization.
Spot/preemptible instances for fine-tuning and most evals — 60–80% off.
Reserved capacity for sustained training — 30–50% off.
Right-size context — 8K vs 128K can be 3× the GPU memory.
Cache aggressively at inference (prefix, response).
Cross-cloud arbitrage — H100 pricing varies 40% between hyperscalers and neoclouds.
Order of operations
Always: cheap base → SFT → DPO → eval. Add RL only if eval bar isn't met and the cost is justified. Most teams discover SFT + good data gets them 80% there.
20. Strategy & 90-Day Roadmap
20.1 First decision: build, post-train, or buy
Figure 10. Build/post-train/buy decision. The default is "don't train" — earn the right to train by exhausting cheaper options first.
20.2 The 90-day plan (for a team starting from open-base fine-tuning)
Days 1–30
Foundations
Pick base model (capability vs license vs cost).
Stand up eval harness (lm-eval-harness, internal eval).
Build data pipeline (ingest, dedupe, filter, decontaminate).
Get a baseline: zero-shot & few-shot on internal tasks.
"Attention Is All You Need" — Vaswani et al., 2017.
"Scaling Laws for Neural LMs" — Kaplan et al., 2020; "Chinchilla" — Hoffmann et al., 2022.
"Llama 3 Herd of Models" — Meta, 2024.
"DeepSeek-V3" and "DeepSeek-R1" technical reports — 2024–25.
"Direct Preference Optimization" — Rafailov et al., 2023.
"GRPO / DeepSeekMath" — Shao et al., 2024.
"FlashAttention 1/2/3" — Dao et al.
"YaRN" — Peng et al., for long context.
"Constitutional AI" — Anthropic, 2022.
The Megatron-LM and FSDP source code — best documentation that exists.
23. References & sources
Annotated bibliography behind every major claim in this note—scaling laws, architecture choices, data pipelines, distributed training, alignment methods, benchmarks, inference stack, and cost bands. Section tags (e.g. §6) point to where the idea is used. Prefer primary papers and official model cards for canonical numbers; leaderboard scores and license terms change weekly.
Scope. Synthesis of peer-reviewed work, technical reports, and widely used open tooling (May 2026). Diagrams and workflow ordering are original unless noted. Dollar ranges and KPI strips are planning heuristics, not industry averages. Not legal, safety-certification, or compliance advice—confirm regulatory obligations with qualified counsel.
How to read: citations are numbered continuously [1]–[n] within this section. Cross-check arXiv versions against conference proceedings when you cite formally.
Landscape, scaling & open base models (§1, §6, §22)
Kaplan et al., “Scaling Laws for Neural Language Models.” arXiv preprint, 2020. Foundational compute–loss–size relationships. arxiv.org/abs/2001.08361 — §6.
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla). NeurIPS 2022. ~20× tokens/param rule of thumb; optimal N and D under fixed compute. arxiv.org/abs/2203.15556 — §1, §6–7.
Yang et al., “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer” (μP). 2022. Width/depth transfer for stable large-model training. arxiv.org/abs/2203.03466 — §9.
Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Scale and in-context learning baseline. arxiv.org/abs/2005.14165 — §1.
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights pre-training recipe. arxiv.org/abs/2302.13971 — §1, §22.
Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023. arxiv.org/abs/2307.09288 — §1.
Meta AI, “The Llama 3 Herd of Models.” 2024. Annealing, data mix, long-context and post-training details. arxiv.org/abs/2407.21783 — §1, §3, §11, §22.
DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” 2024. MLA attention; MoE routing. arxiv.org/abs/2405.04434 — §5.
DeepSeek-AI, “DeepSeek-V3 Technical Report.” 2024. Large MoE pre-train; FP8 training notes. arxiv.org/abs/2412.19437 — §1, §5, §10, §22.
DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025. RLVR / long-CoT reasoning. arxiv.org/abs/2501.12948 — §1, §13.
Team Gemma et al., “Gemma 2: Improving Open Language Models at a Practical Size.” 2024. arxiv.org/abs/2408.00118 — §4, §22.
Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” 2024. Synthetic-heavy small-model training. arxiv.org/abs/2404.14219 — §3, §22.
Epoch AI, “Compute Trends Across AI.” Living database of training FLOPs and cluster scale. epoch.ai/data — §1 KPIs, §19.
Architecture, attention & tokenization (§4–5)
Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer baseline. arxiv.org/abs/1706.03762 — §5.
Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE).” 2021. arxiv.org/abs/2104.09864 — §5, §11.
Zhang & Sennrich, “Root Mean Square Layer Normalization.” 2019. RMSNorm used in Llama-class stacks. arxiv.org/abs/1910.07467 — §5.
Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need” (MQA). 2019. arxiv.org/abs/1911.02150 — §5.
Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.” 2023. arxiv.org/abs/2305.13245 — §5.
Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention.” NeurIPS 2022; FlashAttention-2/3 follow-ons on same line of work. arxiv.org/abs/2205.14135 — §5, §8, §10.
Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR 2022. MoE routing at scale. arxiv.org/abs/2101.03961 — §1, §5.
Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” 2020. Expert parallelism patterns. arxiv.org/abs/2006.16668 — §5, §8.
Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (BPE). 2016. Byte-level BPE lineage. arxiv.org/abs/1508.07909 — §4.
Kudo & Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” 2018. Unigram/BPE tooling. arxiv.org/abs/1808.06226 — §4.
Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models.” 2023. Long-context RoPE scaling. arxiv.org/abs/2309.00071 — §1, §11.
Liu et al., “Ring Attention with Blockwise Transformers for Near-Infinite Context.” 2023. Sequence/context parallel for 128K+ training. arxiv.org/abs/2310.01889 — §8.
Data sourcing, cleaning & mixture design (§3)
Penedo et al., “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.” 2024. Open pre-training corpus; cleaning ratios cited in §3 funnel. arxiv.org/abs/2406.17557 — §3, §22.
Li et al., “DataComp-LM: In search of the next generation of training sets for language models” (DCLM). 2024. arxiv.org/abs/2406.11794 — §3, §22.
Soldaini et al., “Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” 2024. arxiv.org/abs/2402.00159 — §22.
Penedo et al., “The RefinedWeb Dataset for Falcon LLM.” 2023. Large-scale web filter pipeline. arxiv.org/abs/2306.01116 — §3.
Xie et al., “DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.” NeurIPS 2023. Learned domain weights. arxiv.org/abs/2305.10429 — §3.
Lee et al., “Deduplicating Training Data Makes Language Models Better.” ACL 2022. Near-dup removal motivation. arxiv.org/abs/2107.06499 — §3.
Brown et al., “Don’t Count on Luck: Mitigating Benchmark Contamination in LLM Evaluation.” 2024. Contamination index mindset in §3. arxiv.org/abs/2406.11252 — §3, §15.
Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2024. Synthetic-data / model-collapse risks in §3. arxiv.org/abs/2305.17493 — §3.
Common Crawl Foundation. Web crawl snapshots (raw input to most open corpora). commoncrawl.org — §3.
Pre-training loop, optimization & distributed systems (§7–10)
Chen et al., “Training Deep Nets with Sublinear Memory Cost” (activation checkpointing). 2016. arxiv.org/abs/1604.06174 — §8, §10.
Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC 2020. ZeRO-1/2/3 sharding. arxiv.org/abs/1910.02054 — §8.
Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” 2021. Tensor + pipeline parallel. arxiv.org/abs/2104.04473 — §8, §22.
Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” 2019. arxiv.org/abs/1909.08053 — §8.
Zhao et al., “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.” 2023. arxiv.org/abs/2304.11277 — §8, §22.
Microsoft DeepSpeed Team. ZeRO, MoE, and large-cluster training docs. deepspeed.ai — §8, §22.
Micikevicius et al., “FP8 Formats for Deep Learning.” 2022. FP8 training background in §10. arxiv.org/abs/2209.05433 — §10.
NVIDIA Corporation. H100/B200 datasheets and NeMo/Megatron-Core documentation for cluster sizing in §17. nvidia.com/data-center — §17, §19.
Mid-training, post-training & alignment (§11–13)
Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT / RLHF). NeurIPS 2022. arxiv.org/abs/2203.02155 — §12.
Schulman et al., “Proximal Policy Optimization Algorithms.” 2017. PPO baseline for RLHF. arxiv.org/abs/1707.06347 — §12.
Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. arxiv.org/abs/2305.18290 — §1, §12.
Azar et al., “A General Theoretical Paradigm to Understand Learning from Human Preferences” (IPO). 2024. DPO overfitting fix. arxiv.org/abs/2310.12036 — §12.
Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback without pairs. arxiv.org/abs/2402.01306 — §12.
Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF lineage. arxiv.org/abs/2212.08073 — §12, §16.
Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality SFT sample efficiency. arxiv.org/abs/2305.11206 — §12.
Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (GRPO). 2024. Group-relative RL without critic. arxiv.org/abs/2402.03300 — §1, §13.
OpenAI, “Learning to Reason with LLMs” (o1 release blog post). 2024. Test-time compute / long CoT product framing. openai.com — §13.
McKinney et al., “Avoiding Catastrophic Forgetting in Continual Learning of New Languages.” 2022. Mixing general data during CPT (§11). arxiv.org/abs/2205.09357 — §11, §21.
Parameter-efficient fine-tuning (§14)
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. arxiv.org/abs/2106.09685 — §14.
Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. NF4 + LoRA recipe in §14 code block. arxiv.org/abs/2305.14314 — §1 KPI, §14.
Yadav et al., “TIES-Merging: Resolving Interference When Merging Models.” 2023. Adapter merge context in §14. arxiv.org/abs/2306.01708 — §14.
Evaluation, benchmarks & LLM-as-judge (§15)
Hendrycks et al., “Measuring Massive Multitask Language Understanding” (MMLU). ICLR 2021. arxiv.org/abs/2009.03300 — §15.
Wang et al., “MMLU-Pro: A More Robust and Challenging Multi-Task Language Model Benchmark.” 2024. arxiv.org/abs/2406.01574 — §15.
Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” 2023. arxiv.org/abs/2311.12022 — §15.
Chen et al., “Evaluating Large Language Models Trained on Code” (HumanEval). 2021. arxiv.org/abs/2107.03374 — §15.
Cobbe et al., “Training Verifiers to Solve Math Word Problems” (GSM8K). 2021. arxiv.org/abs/2110.14168 — §15.
Hendrycks et al., “Measuring Mathematical Problem Solving With the MATH Dataset.” NeurIPS 2021. arxiv.org/abs/2103.03874 — §15.
Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them” (BBH). 2022. arxiv.org/abs/2210.09261 — §15.
Zhou et al., “Instruction-Following Evaluation for Large Language Models” (IFEval). 2023. arxiv.org/abs/2311.07911 — §15.
Liang et al., “Holistic Evaluation of Language Models (HELM).” 2022. arxiv.org/abs/2211.09110 — §22.
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. Position/verbosity bias notes in §15.3. arxiv.org/abs/2306.05685 — §15.
European Union, Regulation (EU) 2024/1689 (Artificial Intelligence Act). EUR-Lex — §16.
Inference, quantization & distillation (§18)
Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. arxiv.org/abs/2309.06180 — §18, §22.
Leviathan et al., “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. arxiv.org/abs/2211.17192 — §18.
Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” 2023. arxiv.org/abs/2210.17323 — §18.
Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” 2023. arxiv.org/abs/2306.00978 — §18.
Hinton et al., “Distilling the Knowledge in a Neural Network.” NIPS 2014 Deep Learning Workshop (arXiv 2015). Teacher–student framing in §18. arxiv.org/abs/1503.02531 — §18.
Sanh et al., “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” 2019. Classic distillation baseline. arxiv.org/abs/1910.01108 — §18.
Weights & Biases, MLflow, TensorBoard. Experiment tracking named in §22.3. Vendor docs at respective sites — §22.
KPI strip & cost tables (§1, §19)
Figures such as 15T+ frontier tokens, ~20× Chinchilla ratio, $10M–$500M pre-train spend, and <$100 LoRA-on-7B are synthesis bands drawn from the Chinchilla, DeepSeek-V3, Epoch AI, and QLoRA sources above plus operator reports—not one published table. Re-derive any budget from your own FLOPs accounting and current GPU list pricing before you quote externally.