Linh Truong · LLM training · May 2026

Training a Large Language Model (LLM) — end-to-end

How I think about building and shipping an LLM in 2026: data pipelines, transformer internals, distributed training (FSDP / 3D parallelism), post-training (SFT, DPO, RLHF, RLVR), evals, infra, and the cost math that kills projects when you ignore it.

Last revised: May 2026
For: ML engineers, infra engineers, applied researchers
Topics: Pre-training · Tokenization · Transformer · FSDP · 3D Parallelism · MoE · RLHF · DPO · RLVR · LoRA · Evals · Inference

1. The 2026 LLM Training Landscape

In 2026, I treat training an LLM less as a single act and more as a pipeline of three pipelines: a pre-training run that turns raw web-scale text into a base model, a mid-training stage that injects domain and reasoning data, and a post-training stage where the model learns instructions, preferences, and verifiable reasoning. Most teams will not pre-train from scratch — they will fine-tune or post-train an open base (Llama, Mistral, Qwen, DeepSeek, Gemma). The math, the data, and the engineering, however, are the same.

Pre-training tokens (frontier): 15T+
Chinchilla tokens/param ratio: ~20×
Frontier pre-train spend: $10M–$500M
LoRA SFT on a 7B (cloud): <$100

Frontier spend and LoRA cost bands are directional—see §23 and primary sources before you quote them.

What changed since 2023

Scale

Data > params

Practice has moved past the Chinchilla optimum: teams train smaller models on far more tokens so that inference is cheaper.

Architecture

MoE is default at scale

Sparse Mixture-of-Experts (DeepSeek-V3, Mixtral, GPT-class) gives 4–10× active-param efficiency.

Alignment

RLVR & preference learning

DPO and verifier-based RL (GRPO, RLVR) largely replaced classic PPO-RLHF for most teams.

Context

1M+ tokens routine

YaRN, RoPE scaling, and ring/flash attention make 128K–1M practical without quality collapse.

Reasoning

Test-time compute

o1/R1-style models trained with long chains-of-thought via RL on verifiable rewards.

Open weights

Strong open bases

Llama 3/4, Qwen 2.5/3, DeepSeek-V3, Mistral, Gemma 2/3 — most apps start here.

2. The End-to-End Training Pipeline

Before tearing into any single stage, see the whole flow. Every box below has its own section.

Stage 1 · Data Engineering
  • Crawl & Source: CommonCrawl, code, books
  • Clean & Dedupe: MinHash, fastText filter
  • Mix & Curate: domain weighting
  • Tokenize: BPE / Unigram
  • Shard & Stream: WebDataset / Mosaic
Stage 2 · Pre-training (compute-heavy)
  • Causal LM loss on trillions of tokens · FSDP / 3D parallelism · BF16 + AdamW
  • Compute budget → Chinchilla-optimal N (params) and D (tokens) · Warmup → cosine LR → cooldown
  • Checkpoint every N steps · monitor loss spikes · resume on hardware failure
  • Output: base model (raw next-token predictor, not yet helpful)
Stage 3 · Mid-training
  • Domain Continued Pre-training · Long-context Extension · High-quality Annealing
  • Inject math, code, reasoning traces · extend RoPE base · LR cooldown on premium data
Stage 4 · Post-training (alignment)
  • SFT: instruction tuning, persona
  • Preference (DPO / IPO): pairwise chosen/rejected
  • RLHF / RLAIF: PPO with reward model
  • RLVR / GRPO: verifiable rewards
Stage 5 · Eval, Red-team, Safety filters
Stage 6 · Quantize · Distill · Serve (vLLM/TRT-LLM)
Figure 1. The canonical LLM training pipeline from raw data to served model. Stage 2 is where 95% of the FLOPs go; stage 4 is where 95% of the perceived quality is decided.

Scope

I focus on training—data, model, optimizer, loss, infra, alignment, evals. Inference appears only where it constrains training (vocab, context, quantization, distillation). RAG and agents are out of scope here.

3. Data: Sourcing, Cleaning, Mixing

Data is the most under-appreciated lever in LLM training. A 2024 result still holds in 2026: better data beats more parameters, often by a wide margin. Most of the gap between an open base and a frontier closed model is data quality, not compute.

3.1 Sources (and their gotchas)

Source | Typical share | Pros | Watch out for
Web crawl (CommonCrawl, FineWeb, RedPajama, DCLM) | 50–70% | Massive, diverse | Toxicity, PII, boilerplate, near-duplicates, machine-generated slop
Code (The Stack v2, GitHub mirror, StackExchange) | 10–20% | Improves reasoning, structure | License hygiene (GPL/AGPL), exfiltrated secrets
Books & papers (Books3-like, arXiv, PubMed, Project Gutenberg) | 5–10% | Long-form coherence | Copyright exposure, OCR artifacts
Wikipedia & reference | 2–5% | High signal, factual | Small absolute volume; often upsampled
Math & STEM (OpenMathInstruct, ProofPile, FineMath) | 2–5% | Boosts reasoning | Hard to dedupe vs benchmarks (contamination)
Multilingual (MADLAD-400, CulturaX) | 10–30% | Cross-lingual transfer | Quality varies wildly by language
Synthetic (model-generated, distilled) | 5–30% | Controllable, fills gaps | Mode collapse, model collapse if recursive

3.2 The cleaning pipeline

  1. Extract: WARC → text (trafilatura, resiliparse)
  2. Language ID: fastText / CLD3, drop < threshold
  3. Quality filter: heuristics + classifier (DCLM)
  4. Dedupe: MinHash-LSH, exact + fuzzy
  5. PII / Toxicity: regex + NER + classifier, redact or drop
  6. Decontaminate: remove eval-set n-gram overlap
Typical funnel (raw → final): ~250 TB raw HTML (one CommonCrawl dump) → ~30 TB text after extract + language ID → ~5 TB after quality filter + dedupe → ~1.5 TB tokenized (~400B tokens) → mix & shard to training cluster
Figure 2. Cleaning funnel for a single CommonCrawl dump. Numbers are order-of-magnitude; FineWeb-Edu, DCLM and RedPajama-v2 publish the actual ratios.

3.3 Decontamination — the silent killer

Decontaminate before you train, not after

Strip n-gram overlap (13-gram is the common cutoff) against every benchmark you plan to report — MMLU, HumanEval, GSM8K, MATH, BBH, IFEval, AGIEval, the lot. Contamination is the most common reason a model "looks SOTA" in the paper and falls over in production. Maintain a frozen contamination index in the data team's repo.
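A minimal sketch of the 13-gram check described above, assuming whitespace tokenization and lower-casing (real pipelines usually normalize punctuation and use the model tokenizer). The function names `build_contamination_index` and `is_contaminated` are illustrative, not from any particular library.

```python
from typing import Iterable, Set, Tuple

NGRAM = 13  # the common cutoff named in the text


def ngrams(text: str, n: int = NGRAM) -> Set[Tuple[str, ...]]:
    """Return the set of lower-cased, whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_contamination_index(benchmark_texts: Iterable[str],
                              n: int = NGRAM) -> Set[Tuple[str, ...]]:
    """Union of all n-grams across the benchmark items you plan to report."""
    index: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(doc: str, index: Set[Tuple[str, ...]],
                    n: int = NGRAM) -> bool:
    """True if the document shares at least one n-gram with the index."""
    return not ngrams(doc, n).isdisjoint(index)
```

Freezing the index (for example as a hashed set checked into the data repo) keeps the filter reproducible across data refreshes.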

3.4 Data mixing & curriculum

The mixture weight per source is itself a hyperparameter. Two practical recipes:

4. Tokenization & Vocab

The tokenizer freezes before pre-training and is effectively unchangeable. A bad tokenizer leaves performance on the table forever — especially for non-English, code, and math.

4.1 Algorithm choice

Algorithm | Used by | Notes
Byte-Pair Encoding (BPE) | GPT-2/3/4, Llama, Mistral | Default. Use byte-level BPE (no UNK).
SentencePiece Unigram | T5, Gemma, mT5 | Probabilistic; good for many languages
WordPiece | BERT | Legacy; rarely chosen for new LLMs
Tiktoken (cl100k/o200k) | OpenAI models | Highly tuned BPE; ~100K–200K vocab

4.2 Vocab size trade-offs

Small (32K)

Llama 2 era

Cheaper embed/lm-head, more tokens per word, slower for non-English.

Mid (128K)

Llama 3 / Qwen

Sweet spot for multilingual + code. ~15% fewer tokens per doc vs 32K.

Large (200K+)

GPT-4o / Gemini

Excellent compression, but embedding table eats memory and may bloat small models.

4.3 Critical details devs miss

4.4 Reference snippet — train a BPE tokenizer

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE: every byte is in the base alphabet, so no UNK token is needed
tok = Tokenizer(models.BPE(unk_token=None))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder       = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=128_000,
    min_frequency=2,                                         # ignore pairs seen only once
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),    # all 256 byte symbols
    special_tokens=["<|endoftext|>", "<|im_start|>", "<|im_end|>"],
)
files = ["shard_0.txt", "shard_1.txt"]   # ... plus the rest of your text shards
tok.train(files=files, trainer=trainer)
tok.save("tokenizer.json")

5. Architecture: Transformer & Beyond

Almost every modern LLM is a decoder-only transformer with a handful of upgrades. Below is the canonical block, then the modifications that actually matter in 2026.

Tokens + RoPE positions
Transformer block × N (e.g. 32 for 7B, 80 for 70B):
  1. RMSNorm
  2. Grouped-Query Attention (Q heads ≫ K/V heads · FlashAttention)
  3. + Residual
  4. RMSNorm
  5. SwiGLU MLP (or MoE), d_ff ≈ 3 × d_model
  6. + Residual
Final RMSNorm → LM head (tied)
Loss: next-token cross-entropy (causal mask, shifted labels)
Figure 3. Modern decoder-only transformer block. Llama-3, Qwen-3, and Mistral are all variations of this skeleton.

5.1 Component-by-component cheatsheet

Component | 2020 default | 2026 default | Why
Position encoding | Learned absolute | RoPE (+ YaRN/NTK scaling) | Relative, extrapolates, length-generalizes
Normalization | LayerNorm post | RMSNorm pre-norm | Stable training, fewer params
Activation | GELU | SwiGLU | ~1% perplexity gain
Attention | MHA | GQA (or MLA in DeepSeek) | Cuts KV cache 4–8×
Bias terms | Everywhere | None (or QK only) | Cleaner, marginally faster
Tied embeddings | Sometimes | Often tied (small models) | Saves vocab×d_model params
MoE | Rare | Mixture of Experts at scale | Sparse compute, top-k routing

5.2 The "shape" math (sanity-check your config)

# Llama-3-style reference shape (~7.5B with tied embeddings, ~8B untied)
n_layers   = 32
d_model    = 4096
n_heads    = 32
n_kv_heads = 8          # GQA: 4 Q heads share one KV head
d_ff       = 14336      # = 3.5 × d_model (SwiGLU uses 3 matrices: gate, up, down)
vocab      = 128_000

# Parameter count (rough):
# embeddings  : vocab * d_model                        = 524M
# per layer   : 2.5 * d_model^2  (attn proj with GQA:
#               Q and O full width, K and V at 8/32)   ≈ 42M
#             + 3 * d_model * d_ff  (SwiGLU MLP)       ≈ 176M  → ≈ 218M per layer
# total       : 524M + 32 * 218M  ≈ 7.5B  (add another 524M if the LM head is untied)
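The same sanity check can be scripted so it runs against any config. This is a rough sketch: it counts K/V projections at their reduced GQA width, includes the RMSNorm gains, and ignores biases. The function name and the `tied_embeddings` flag are illustrative.

```python
def decoder_param_count(n_layers: int, d_model: int, n_heads: int,
                        n_kv_heads: int, d_ff: int, vocab: int,
                        tied_embeddings: bool = True) -> int:
    """Approximate parameter count of a decoder-only transformer (GQA + SwiGLU)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim                 # K/V width under GQA
    attn = 2 * d_model * d_model                   # Q and O projections
    attn += 2 * d_model * kv_dim                   # K and V projections
    mlp = 3 * d_model * d_ff                       # gate, up, down matrices
    norms = 2 * d_model                            # two RMSNorm gains per block
    per_layer = attn + mlp + norms
    embed = vocab * d_model
    lm_head = 0 if tied_embeddings else vocab * d_model
    final_norm = d_model
    return embed + n_layers * per_layer + final_norm + lm_head


if __name__ == "__main__":
    shape = dict(n_layers=32, d_model=4096, n_heads=32,
                 n_kv_heads=8, d_ff=14336, vocab=128_000)
    print(f"tied:   {decoder_param_count(**shape) / 1e9:.2f}B")
    print(f"untied: {decoder_param_count(**shape, tied_embeddings=False) / 1e9:.2f}B")
```

Running it on a config before launch catches the usual mistakes, such as a d_ff that silently pushes a "7B" model past the memory plan.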

5.3 Mixture of Experts (MoE) — when to reach for it

MoE replaces the MLP in each block with E experts and a router that activates top-k per token (typically k=1 or 2). Active parameters per token are far below total parameters, so you train a "huge" model at the FLOPs of a small one.
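To make the routing step concrete, here is a toy, scalar sketch of top-k routing for a single token. It assumes the router weights are a softmax over the selected experts only (one common choice; other designs normalize over all experts first), and the "experts" are plain Python callables standing in for MLPs.

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_route(router_logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in order)          # for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in order]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(order, exps)]


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]], k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(weight * experts[e](x) for e, weight in top_k_route(router_logits, k))
```

Only k of the E experts execute per token, which is where the gap between active and total parameters comes from.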

5.4 Non-transformer contenders (know them, don't always use them)

6. Scaling Laws & Compute Budgets

6.1 The Chinchilla rule of thumb

Given a fixed compute budget C (FLOPs), loss is minimized when the number of training tokens D and the number of parameters N are scaled in roughly equal proportion, with D ≈ 20 × N. The familiar 6× factor relates them to FLOPs:

C ≈ 6 · N · D

For inference-heavy products, push D/N far above 20 (Llama-3-8B used ~1900). You pay more to train, but every served token is cheaper.
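The two relations above (C ≈ 6·N·D and D ≈ 20·N) are enough for a back-of-envelope planner. The sketch below assumes a dense model and takes peak throughput and MFU as inputs; the ~989 TFLOPS figure in the usage line is the dense BF16 peak of an H100 SXM and is an assumption you should replace with your own hardware numbers.

```python
import math
from typing import Tuple


def training_flops(n_params: float, n_tokens: float) -> float:
    """C ≈ 6 · N · D for a dense transformer."""
    return 6.0 * n_params * n_tokens


def chinchilla_optimal(compute_flops: float,
                       tokens_per_param: float = 20.0) -> Tuple[float, float]:
    """Solve C = 6·N·D with D = r·N, giving N = sqrt(C / (6r)) and D = r·N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def gpu_days(flops: float, peak_flops_per_s: float, mfu: float) -> float:
    """GPU-days needed at a given peak throughput and model FLOPs utilization."""
    return flops / (peak_flops_per_s * mfu * 86_400)


if __name__ == "__main__":
    c = training_flops(7e9, 140e9)
    n, d = chinchilla_optimal(c)
    print(f"C = {c:.2e} FLOPs, N = {n:.2e}, D = {d:.2e}")
    print(f"H100-days at 50% MFU: {gpu_days(c, 989e12, 0.5):.0f}")
```

Raising `tokens_per_param` well above 20 models the over-training regime: the same budget buys a smaller N trained on more tokens.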

Params (N) | Chinchilla tokens (≈20N) | FLOPs (6·N·D) | H100-days @ 50% MFU | Approx cloud $
1.3B | 26B | 2.0e20 | ~5 | $350–$800
7B | 140B | 5.9e21 | ~140 | $10K–$23K
13B | 260B | 2.0e22 | ~475 | $35K–$80K
70B | 1.4T | 5.9e23 | ~14,000 | $1M–$2.3M
405B (frontier dense) | 15T (over-trained) | 3.6e25 | ~850,000 | $60M–$145M
H100-days assume ~989 TFLOPS dense BF16 peak at 50% MFU; dollar bands assume roughly $3–7 per H100-hour and cover the final run only (no ablations, failed runs, storage, or staff). Numbers are illustrative; actual costs depend on MFU, hardware mix, network, and storage.
Figure 4. Order-of-magnitude compute budget by model size. MFU (Model FLOPs Utilization) of 35–55% is realistic on a healthy cluster.

6.2 Inference-aware scaling

Llama-3 and DeepSeek both deliberately broke Chinchilla: Llama-3 trained its 8B model on ~15T tokens, and DeepSeek-V3 (671B MoE, 37B active parameters) on ~14.8T. The reasoning: lifetime inference dwarfs training, so a slightly under-parameterized but over-trained model wins on total cost of ownership.

Pick D/N by where the dollars live

If you serve more than ~10²² inference FLOPs (large product), over-train. If the model is research-only or short-lived, stick near Chinchilla optimum.

7. Pre-training: Objectives & Loop

7.1 The objective

Canonical pre-training loss is causal language modeling — predict the next token given all previous tokens, averaged over the corpus. Sequences are packed (concatenated with separators) to keep GPUs busy.

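# Conceptual training step (PyTorch flavor)
import torch.nn.functional as F

inputs = batch["input_ids"]          # [B, T] token ids
labels = inputs.clone()
labels[:, :-1] = inputs[:, 1:]       # shift left: position t predicts token t+1
labels[:, -1] = -100                 # last position has no target, so ignore it

logits = model(inputs)               # [B, T, V]; assumes the model returns raw logits
V = logits.size(-1)                  # vocabulary size
loss = F.cross_entropy(
    logits.view(-1, V).float(),      # upcast to FP32 for a stable loss reduction
    labels.view(-1),
    ignore_index=-100,
)
loss.backward()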

7.2 Document packing & attention masking

7.3 The training loop, end-to-end

Per step:
  1. Streaming loader: shards → batches
  2. Forward: BF16, activation checkpointing
  3. Backward: FSDP / ZeRO-3
  4. Optimizer step: AdamW (FP32 master)
  5. LR + grad clip: warmup → cosine → cooldown
  6. Log: loss, grad norm
Periodic and on-event actions:
  • Every ~1K steps: eval perplexity, downstream
  • Every ~5K steps: checkpoint to object store
  • On loss spike: rewind & skip batch
  • On node fail: resume from checkpoint
LR schedule: warmup → cosine decay → cooldown / anneal
Figure 5. Pre-training loop. Loss spikes are normal; well-engineered runs rewind by ~100 steps and skip the offending batch.

7.4 Batch size: tokens, not examples

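Think in tokens per step, not sequences. Frontier runs use 4M–16M tokens per global batch. Why so large? The critical batch size (set by gradient noise) grows as training progresses, and spreading one step across thousands of GPUs only pays off if each device still gets enough tokens to keep its matmuls efficient. Batches that are too small give noisy gradients, poor utilization, and a less stable loss.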

8. Distributed Training (DP, TP, PP, FSDP)

A 7B model in BF16 is ~14 GB of weights, but the training state needs ~8× that (16 bytes per parameter for weights, gradients, FP32 master weights, and Adam moments; see §10.2), before counting activations. For anything above 1B, you split the model across devices.

8.1 The four kinds of parallelism

Data Parallel (DDP)

Replicate, split batch

Each GPU has a full model copy and processes a slice of the batch. Simplest. Hits the memory wall at a few billion parameters on 80 GB GPUs, because every rank holds the full 16 bytes/param of training state.

Tensor Parallel (TP)

Split each matmul

Megatron-style: split weight matrices across GPUs intra-layer. Heavy all-reduce traffic — only across NVLink/NVSwitch in one node.

Pipeline Parallel (PP)

Split layer stack

Different GPUs hold different layers. Use 1F1B / interleaved schedules to minimize bubbles.

Sequence / Context Parallel

Split the sequence

Ring attention spreads long-context attention across GPUs. Essential for 128K+ training.

8.2 ZeRO / FSDP — sharded data parallel

PyTorch FSDP (and DeepSpeed ZeRO) shards optimizer state, gradients, and parameters across data-parallel ranks. It is the default for < 70B models because it's almost as simple as DDP but scales to 100B+.

Stage | Shards | Memory savings | Comms cost
ZeRO-1 | Optimizer state | ~4× | Low
ZeRO-2 / FSDP SHARD_GRAD_OP | + Gradients | ~8× | Medium
ZeRO-3 / FSDP FULL_SHARD | + Parameters | ~N× (N = world size) | High (all-gather per layer)

8.3 3D parallelism — the frontier recipe

3D Parallelism on a 64-GPU pod (DP=4, PP=4, TP=4)
  • Pipeline stage 1 (layers 1–8): 4 TP × 4 DP = 16 GPUs
  • Stage 2 (layers 9–16): 16 GPUs
  • Stage 3 (layers 17–24): 16 GPUs
  • Stage 4 (layers 25–32): 16 GPUs
  • TP → across NVLink (intra-node)
  • PP → across InfiniBand (inter-node)
  • DP → across replicas (everything)
Figure 6. 3D parallelism on a 64-GPU pod. Real frontier runs nest this further with sequence parallel and expert parallel.

8.4 Picking a strategy by model size

Model | Recommended
<1B | Plain DDP
1B – 13B | FSDP (ZeRO-3) with selective activation checkpointing
13B – 70B | FSDP + tensor parallel within a node (TP=2/4/8)
70B – 400B dense | 3D parallelism (TP + PP + DP) + sequence parallel for long context
MoE at any scale | Add expert parallel; tune all-to-all communication

8.5 Reference: launching FSDP on 8 GPUs

# torchrun handles process group setup
torchrun --standalone --nproc_per_node=8 train.py \
    --model_size 7B \
    --fsdp full_shard \
    --activation_checkpointing \
    --bf16 \
    --seq_len 8192 \
    --micro_batch 2 \
    --grad_accum 8
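Since batch size is best reasoned about in tokens (§7.4), it helps to translate launch flags into tokens per optimizer step. The sketch below assumes `--micro_batch` is per GPU and that data-parallel size is the GPU count divided by TP × PP; both are assumptions about how a given `train.py` interprets its flags.

```python
def tokens_per_step(n_gpus: int, micro_batch: int, grad_accum: int,
                    seq_len: int, tp: int = 1, pp: int = 1) -> int:
    """Tokens consumed per optimizer step for a given parallel layout."""
    model_parallel = tp * pp
    if n_gpus % model_parallel != 0:
        raise ValueError("n_gpus must be divisible by tp * pp")
    dp = n_gpus // model_parallel        # number of data-parallel replicas
    return dp * micro_batch * grad_accum * seq_len


if __name__ == "__main__":
    # The 8-GPU FSDP launch above: 8 × 2 × 8 × 8192 tokens per step
    print(f"{tokens_per_step(8, 2, 8, 8192):,} tokens per step")
```

For that launch the result is about 1M tokens per step, below the 4M–16M range quoted for frontier runs but reasonable for a single-node experiment.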

9. Hyperparameters & Optimizer

9.1 Sensible defaults that almost always work

Knob | Default | Notes
Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8) | β₂=0.95 (not 0.999) is the LLM-specific tweak
Weight decay | 0.1 | Apply to weights only, never to norms or biases
Peak LR (dense, <10B) | 3e-4 | Scale roughly as 1/√width as model grows
Peak LR (70B+) | 1.5e-4 → 6e-5 | Bigger models tolerate less LR
Warmup | ~2000 steps or 1% of total | Critical for stability
Schedule | Cosine → 10% of peak, then constant cooldown | WSD schedule (warmup-stable-decay) increasingly popular
Grad clip | 1.0 (global L2) | Drop offending batch if spikes persist
Init | Normal(0, 0.02) scaled by 1/√(2·N_layers) on residual projections | Mu-Parametrization (μP) lets you tune at small scale and transfer
Dropout | 0.0 in pre-training | Use only in fine-tune if overfitting
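The warmup and schedule rows translate into a few lines of code. This is a minimal sketch of linear warmup followed by cosine decay to 10% of peak, using the table's defaults; the function name `lr_at` and the choice of linear warmup are assumptions, and a WSD schedule would replace the cosine segment with a flat phase plus a short decay.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          warmup_steps: int = 2000, min_ratio: float = 0.1) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to min_ratio * peak."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 → 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

Logging `lr_at(step, total_steps)` next to loss and grad norm makes it easy to see whether a spike lines up with the end of warmup.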

9.2 The optimizer of 2026

AdamW remains the default, but watch:

9.3 μP — tune small, train big

Maximal Update Parametrization re-scales per-layer initialization, learning rates, and output multipliers as a function of width, so that the same hyperparameters transfer from a tiny proxy model up to the full-scale run. It collapses the hyperparameter search from "burn 1% of your training budget" to "burn 0.01%".

10. Stability, Mixed Precision, Memory

10.1 Precision recipes

Format | Use for | Notes
BF16 | All forward / backward activations & weights | Wider exponent than FP16; no loss-scaling needed
FP32 | Optimizer master copy of weights, loss reduction | Avoids drift over millions of steps
FP8 (E4M3 / E5M2) | H100/H200/B200 matmuls, with per-tensor scaling | ~2× throughput; needs careful scaling, used by DeepSeek-V3
INT8 / INT4 | Post-training quantization for inference | Not for training

10.2 The memory budget per GPU (BF16 training)

# Per-parameter cost
weight (BF16)            : 2 bytes
grad   (BF16)            : 2 bytes
adam   (FP32 m, v, mast) : 12 bytes
                          ────
                           16 bytes / param

# 7B model on one GPU:  7e9 * 16 = 112 GB  (won't fit on H100)
# With FSDP-FullShard over 8 GPUs: 14 GB / GPU for state
# Activations dominate the rest — use activation checkpointing
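The per-parameter budget above combines with the ZeRO stages from §8.2 into a quick estimator. This sketch covers training state only (weights, gradients, optimizer) and deliberately ignores activations, buffers, and fragmentation; the stage numbering follows ZeRO, with stage 0 meaning plain DDP.

```python
def state_gb_per_gpu(n_params: float, world_size: int, stage: int) -> float:
    """Training-state memory per GPU in GB for BF16 training with AdamW.

    stage 0: plain DDP, nothing sharded
    stage 1: optimizer state sharded
    stage 2: + gradients sharded
    stage 3: + parameters sharded (FSDP full shard)
    """
    weights, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= world_size
    if stage >= 2:
        grads /= world_size
    if stage >= 3:
        weights /= world_size
    return n_params * (weights + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print(f"ZeRO-{stage}: {state_gb_per_gpu(7e9, 8, stage):6.2f} GB per GPU")
```

Whatever is left of the GPU after this number is the budget for activations, which is what sequence length, micro-batch size, and checkpointing trade against.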

10.3 Loss spike playbook

When the loss explodes at step 47,328

  1. Check grad norm — if it spiked, ratio of last few norms tells you how bad.
  2. Rewind to the last clean checkpoint (~100 steps back).
  3. Skip the data shard / batch that caused it.
  4. If recurring: reduce peak LR by 20% and resume.
  5. If still happening: switch one layer of attention to FP32 softmax, or check for bad init in a specific layer.

10.4 Activation checkpointing

Recompute activations during backward instead of storing them. Costs ~30% extra compute, saves ~10× activation memory. Use selective checkpointing (only the attention and MLP outputs) to balance speed and memory.

10.5 Other essentials

11. Mid-training & Continued Pre-training

Between base pre-training and post-training there is now a distinct phase — variously called annealing, mid-training, or continued pre-training — where the model is exposed to higher-quality data with a lower learning rate.

11.1 What goes in

11.2 Long-context extension

To extend RoPE-based attention from 8K to 128K/1M:

  1. Increase RoPE base (the θ parameter) — e.g. 10,000 → 500,000.
  2. Apply YaRN or NTK-aware scaling at the same time for better extrapolation.
  3. Continue training on long documents for ~1–10B tokens.
  4. Evaluate on needle-in-a-haystack, RULER, LongBench v2 to confirm.
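Step 1 is easier to reason about with the numbers in front of you. The sketch below computes RoPE's per-pair rotation frequencies and the longest wavelength for a given base, assuming a head dimension of 128 (an assumption; use your model's value). It shows only the base change, not YaRN or NTK-aware interpolation.

```python
import math
from typing import List, Tuple


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Inverse frequency for each 2-D pair: base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2.0 * math.pi / min(rope_inv_freq(head_dim, base))


def rotate_pair(x0: float, x1: float, pos: int, inv_freq: float) -> Tuple[float, float]:
    """Apply the RoPE rotation for one (x0, x1) pair at a given position."""
    angle = pos * inv_freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


if __name__ == "__main__":
    for base in (10_000.0, 500_000.0):
        print(f"base={base:>9,.0f}  longest wavelength ≈ {longest_wavelength(128, base):,.0f} positions")
```

A larger base slows the lowest frequencies, so positions far beyond the original context still map to distinct rotation angles; the continued training in step 3 is what teaches the model to use them.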

11.3 Continued pre-training on your domain (CPT)

If you have a corpus that is mostly out-of-distribution for the base model (legal, biomedical, niche programming language), CPT can lift downstream performance more cheaply than re-pretraining. Rules:

12. Post-training: SFT, DPO, RLHF, RLVR

12.1 The pipeline

  1. Base model: next-token predictor, no chat ability yet
  2. Supervised Fine-Tuning: instruction · response pairs; teaches format & persona
  3. Preference Optimization: DPO / IPO / KTO on chosen vs rejected pairs
  4. RL (RLHF / RLVR): PPO with RM, or GRPO with verifier
Order of magnitude (typical for a serious post-train):
  • SFT: 50K–1M examples · 1–3 epochs · LR 1e-5 to 5e-5 · cost: hours on a node
  • DPO: 10K–200K pairs · 1 epoch · β = 0.1–0.5 · cost: hours
  • RLHF / GRPO: 50K–500K prompts · many rollouts each · cost: days to weeks
Figure 7. Post-training pipeline. Many teams stop at SFT + DPO; RL adds another quality jump but is expensive and finicky.

12.2 SFT (Supervised Fine-Tuning) — turn a base into a chat model

SFT teaches format and basic instruction-following. Quality of examples >> quantity. 5K LIMA-style hand-curated examples can beat 500K low-quality ones.

# Conversation template (ChatML-flavored)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain RoPE in one paragraph.<|im_end|>
<|im_start|>assistant
Rotary Position Embedding rotates the query and key vectors ...<|im_end|>

# Loss masking: compute loss ONLY on assistant tokens
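Loss masking usually comes down to building a `labels` list that mirrors `input_ids` but holds the ignore index everywhere outside assistant turns. A minimal sketch, assuming the conversation is already tokenized into (role, token ids) segments; the token ids in the usage example are made up.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # matches the ignore_index used in the cross-entropy call in §7.1


def build_sft_example(segments: Sequence[Tuple[str, Sequence[int]]]
                      ) -> Tuple[List[int], List[int]]:
    """Concatenate segments; keep labels only for assistant tokens."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                          # supervised tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))    # system/user: no loss
    return input_ids, labels


if __name__ == "__main__":
    ids, labels = build_sft_example([
        ("system", [1, 2]),
        ("user", [3, 4, 5]),
        ("assistant", [6, 7, 8]),   # include <|im_end|> here so the model learns to stop
    ])
    print(ids)
    print(labels)
```

These labels are unshifted; the one-position shift happens in the loss computation, as in the training step in §7.1.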

12.3 DPO — preference learning without a reward model

Direct Preference Optimization re-parametrizes the RLHF objective so you can train directly on (chosen, rejected) pairs against a frozen reference model. No reward model, no PPO, no rollouts. The closed-form loss:

L_DPO = -log σ( β · [ log π(y_w|x)/π_ref(y_w|x) − log π(y_l|x)/π_ref(y_l|x) ] )
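For a single preference pair the loss above reduces to a few scalar operations. The sketch below takes summed sequence log-probabilities under the policy and the frozen reference as inputs (computing those is the expensive part and is omitted here) and uses the identity −log σ(m) = softplus(−m) for numerical stability.

```python
import math


def dpo_loss(pi_logp_w: float, pi_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen=w, rejected=l) pair from sequence log-probs."""
    # beta * [ log(pi/ref)(y_w) - log(pi/ref)(y_l) ]
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one, with β controlling how sharply.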

Variants:

12.4 RLHF — PPO with a learned reward model

Classical recipe (InstructGPT, GPT-3.5): train a reward model on preference data, then optimize the policy with PPO against the RM, with a KL penalty to a reference policy.

12.5 RLAIF — feed your own model the role of the labeler

Constitutional AI / RLAIF uses a (typically larger) model to generate or rank responses. Useful when human labelers are slow or expensive; quality ceiling = quality of the judge.

13. Reasoning & Long-Horizon Training

The 2024–2026 leap in reasoning (OpenAI o1, DeepSeek-R1, Qwen-QwQ) came from RL on verifiable rewards (RLVR). The model generates long chains of thought, a verifier checks the final answer, and policy gradient pushes the model toward CoTs that lead to correct answers.

13.1 GRPO — Group Relative Policy Optimization

GRPO (introduced in DeepSeekMath, used in R1) samples G rollouts per prompt, computes a group-baseline advantage, and updates with a PPO-style clipped objective — but without a critic, which halves memory.

# Pseudocode
for prompt in batch:
    rollouts = [policy.generate(prompt) for _ in range(G)]
    rewards  = [verifier(r) for r in rollouts]
    adv      = (rewards - mean(rewards)) / (std(rewards) + 1e-8)   # one advantage per rollout
    ratio    = π_new(rollout) / π_old(rollout)                      # per-token probability ratio
    loss     = -mean( min(ratio * adv, clip(ratio, 1-ε, 1+ε) * adv) ) \
               + β * KL(policy || ref)
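The two pieces that distinguish GRPO from PPO, the group-relative advantage and the clipped surrogate, fit in a few lines of plain Python. This is a scalar sketch: it assumes population standard deviation for the group baseline (implementations differ on this) and leaves out generation, per-token ratios, and the KL term.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std over the G rollouts
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

If every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is why prompts that are always solved or never solved are often filtered out.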

13.2 Where verifiable rewards come from

Domain | Verifier
Math | SymPy / numeric equivalence to the gold answer
Code | Unit tests pass / fail
Format / JSON | Schema validator
Multi-step tool use | End-state in a simulator
Translation / summarization | BLEU / BERTScore / LLM-judge (less crisp)

13.3 Practical lessons from R1-style training

13.4 Test-time compute as a separate axis

Trained reasoners exchange compute at inference for accuracy: search, self-consistency, MCTS over thought paths, best-of-N with a verifier. Treat inference latency as a tunable, not a fixed budget.

14. Fine-tuning Adapters (LoRA, QLoRA)

Most teams will never touch full-parameter fine-tuning. LoRA trains a low-rank delta (B·A, rank 8–64) injected into chosen weight matrices, freezing the base. QLoRA quantizes the base to 4-bit NF4 first, letting you fine-tune a 70B model on a single 80 GB GPU.

W (frozen): d × d, e.g. 4096 × 4096 = 16M params
+ B (d × r) · A (r × d) → trainable: 2 · d · r, e.g. r=16 → 131K params (0.8%)
Base weight stays frozen and shared across many adapters
Figure 8. LoRA: low-rank decomposition of the weight update. Inference-time, you can merge B·A back into W for zero overhead.
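The figure's arithmetic and the merge claim can both be checked with a toy example. The sketch below uses plain Python lists as matrices and the conventional alpha/r scaling; the helper names are illustrative, and real implementations do this with tensors.

```python
from typing import List

Matrix = List[List[float]]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in A (r × d_in) plus B (d_out × r)."""
    return r * (d_in + d_out)


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def matmul(p: Matrix, q: Matrix) -> Matrix:
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def lora_forward(x: List[float], w: Matrix, a: Matrix, b: Matrix,
                 alpha: float, r: int) -> List[float]:
    """y = W x + (alpha / r) * B (A x), with W frozen."""
    scale = alpha / r
    base, delta = matvec(w, x), matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


def merge(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    ba = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, ba)]
```

Because `matvec(merge(...), x)` equals `lora_forward(...)`, a merged adapter adds no inference cost, while unmerged adapters let many fine-tunes share one copy of the base weights.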

14.1 LoRA recipe that just works

14.2 QLoRA in practice

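import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base; matmuls run in BF16
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

# Hub id as published by Meta; confirm against the model card before running
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B",
                                            quantization_config=bnb)
base = prepare_model_for_kbit_training(base)   # casts norms, enables input grads

cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                 target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(base, cfg)
model.print_trainable_parameters()    # roughly 0.3% of params trainable at r=16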

14.3 When to NOT use LoRA

15. Evaluation & Benchmarks

15.1 The eval pyramid

During training

Loss + cheap probes

Validation perplexity (held-out shard), a handful of fast capability probes (HellaSwag, ARC) every 1K steps.

After training

Standard benchmarks

MMLU-Pro, GPQA, HumanEval+, MBPP+, GSM8K, MATH, BBH, IFEval, MT-Bench, AlpacaEval 2.

In production

Task & vibes

Domain-specific eval set, LLM-as-judge, human ratings, online A/B.

15.2 The 2026 benchmark shortlist

What it measures | Use
General knowledge | MMLU-Pro (10-option, harder than MMLU)
Reasoning (graduate science) | GPQA Diamond
Math | MATH-500, AIME, OlympiadBench
Code | HumanEval+, MBPP+, LiveCodeBench (rolling)
Long context | RULER, NIAH, LongBench v2
Instruction following | IFEval, MT-Bench, Arena-Hard
Agentic / tool use | τ-Bench, SWE-Bench Verified, GAIA
Multilingual | MGSM, MMLU-translate, FLORES
Safety | HarmBench, ToxicChat, XSTest

15.3 LLM-as-judge — useful but biased

Known biases

  • Position bias — first response wins more often. Randomize order, or compute both orderings.
  • Verbosity bias — longer answers win even when wrong.
  • Self-preference — models prefer their own style. Always use a different judge family than the model under test.
  • Format bias — bulleted, formatted answers are preferred over equally correct prose.
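The position-bias mitigation (compute both orderings) is simple to wire in. In this sketch `judge` is a hypothetical callable that returns "first" or "second" for a pair of answers; how it calls a model is left out. A verdict counts only when it survives swapping the order.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str,
                        answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    a_wins_when_first = judge(prompt, answer_a, answer_b) == "first"
    a_wins_when_second = judge(prompt, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"   # verdict flipped with the order, so treat it as position bias
```

The share of "tie" outcomes is itself a useful diagnostic: a high rate means the judge is mostly responding to position rather than content.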

15.4 Eval rigor for training

16. Safety, Alignment, Red-Teaming

16.1 Layered defense

  1. Pre-train: PII redact · Toxic filter · Data licensing · CSAM removal
  2. Post-train: Refusal training · Constitutional AI · Harmfulness DPO · Jailbreak hardening
  3. Inference: Input classifier · Output classifier · System prompt · Rate limits
  4. Application: Tool sandboxing · Permission scopes · Human review · Audit log
  5. Ops: Red-team · Bug bounty · Incident · Disclosure
Figure 9. Safety is multi-layer; no single layer is sufficient.

16.2 Constitutional AI in one sentence

Use the model itself (or another model) to critique its own responses against a written list of principles, then fine-tune on the revisions. Effective for refusal training without large human-feedback budgets.

16.3 Red-teaming

16.4 Regulation watchlist (2026)

17. Infrastructure & Hardware

17.1 Hardware lineup (2026)

Accelerator | HBM | BF16 TFLOPS (dense) | Notes
NVIDIA H100 SXM | 80 GB | ~990 | Workhorse; still widely available
NVIDIA H200 | 141 GB | ~990 | Same compute, more HBM
NVIDIA B200 (Blackwell) | 192 GB | ~2,250 | FP8/FP4 native; 2026 default
NVIDIA GB200 NVL72 | 13.5 TB pooled | ~160,000–180,000 (rack) | 72 Blackwell GPUs + 36 Grace CPUs, rack-scale
AMD MI300X / MI325X | 192 / 256 GB | ~1,300 / ~1,300 | Strong for inference; software gap narrowing
Google TPU v5p / v6e | 95 / 32 GB | ~459 / ~918 | For JAX/XLA stacks
AWS Trainium 2 | 96 GB | ~650 | NeuronSDK; cost competitive on AWS

17.2 Networking is the bottleneck

Compute is rarely what stalls a training run — the interconnect is. Required topology:

17.3 Storage tiers

Tier | Purpose | Tools
Object store | Raw & tokenized data, checkpoints | S3 / GCS / R2 / MinIO
Parallel FS | Hot training data, fast restore | Lustre, WekaFS, JuiceFS, GPFS
Local NVMe | Per-node cache, scratch | NVMe RAID0
Streaming | Avoids landing full corpus | WebDataset, MosaicML StreamingDataset

17.4 Cluster engineering reality

What kills frontier runs

  • Silent data corruption on HBM — periodic ECC scrubs, end-to-end checksums on activations.
  • Stragglers — one slow GPU drags the whole synchronous step. Detect with per-rank step timing.
  • Network flaps — NCCL hangs eat hours. Use timeouts and watchdog kill+restart.
  • Hot spots in the cluster cause throttling — monitor with DCGM.

18. Inference Optimization & Serving

Inference economics push back into training decisions — vocab size, KV cache shape, model width all matter once you ship.

18.1 The serving stack

Engine | Best for
vLLM | General OSS, PagedAttention, continuous batching
SGLang | Structured generation, fast prefix caching
TensorRT-LLM | Peak throughput on NVIDIA; FP8/INT4 kernels
llama.cpp | CPU/edge, GGUF quantized models
TGI (HF) | HF ecosystem integration

18.2 The optimizations that matter

18.3 Distillation — bake a big model into a small one

Generate (prompt, response, top-k log-probs) tuples from the big teacher; train a small student to match the distribution. Used everywhere now (the smaller Gemma 2 models were distilled from a larger teacher; Llama 3.2 1B/3B used Llama 3.1 8B and 70B as teachers).

19. Cost & Economics

19.1 What things actually cost (2026 cloud, list prices)

Activity | Cost | Notes
1B-param scratch pre-train, 20B tokens | $2K–$8K | 32× H100 for ~1 day
7B scratch pre-train, 1T tokens | $200K–$500K | 256× H100 for ~2 weeks
70B scratch pre-train, 2T tokens | $5M–$15M | 1024× H100 for ~1 month
Frontier pre-train (400B dense / MoE 600B) | $100M–$500M+ | 10K+ H100 for months
SFT on 7B (100K examples, 3 epochs) | $50–$300 | 8× H100 for hours
QLoRA on 70B (50K examples) | $50–$200 | 1× H100 for hours
DPO on 7B (100K pairs) | $100–$500 | 8× H100 for hours
RLVR on 7B (long CoT, GRPO) | $5K–$50K | Days to weeks of rollouts

19.2 Cost levers (in order of impact)

  1. Don't pre-train from scratch. Almost always start from an open base.
  2. Pick the smallest model that meets the bar after distillation/quantization.
  3. Spot/preemptible instances for fine-tuning and most evals — 60–80% off.
  4. Reserved capacity for sustained training — 30–50% off.
  5. Right-size context — 8K vs 128K can be 3× the GPU memory.
  6. Cache aggressively at inference (prefix, response).
  7. Cross-cloud arbitrage — H100 pricing varies 40% between hyperscalers and neoclouds.

Order of operations

Always: cheap base → SFT → DPO → eval. Add RL only if eval bar isn't met and the cost is justified. Most teams discover SFT + good data gets them 80% there.

20. Strategy & 90-Day Roadmap

20.1 First decision: build, post-train, or buy

Do you need a custom LLM?
  • No, just better prompting → Use API + RAG (90% of products start here)
  • Style / format / tone → LoRA / QLoRA on open base (domain style, 1–10K examples)
  • Behavior change → Full SFT + DPO on open base (substantial domain shift)
  • Novel modality / data → Pre-train (only with $5M+)
Heuristic
  • If a frontier API + good prompting + retrieval can't reach the quality bar after 2 weeks of iteration → consider fine-tuning.
  • If fine-tuning a strong open base + good SFT data can't reach it → consider RLVR or scaling the base.
  • If all of the above can't reach it AND you have unique data + budget → consider continued pre-training.
  • Pre-training from scratch is rare.
Figure 10. Build/post-train/buy decision. The default is "don't train" — earn the right to train by exhausting cheaper options first.

20.2 The 90-day plan (for a team starting from open-base fine-tuning)

Days 1–30

Foundations

  • Pick base model (capability vs license vs cost).
  • Stand up eval harness (lm-eval-harness, internal eval).
  • Build data pipeline (ingest, dedupe, filter, decontaminate).
  • Get a baseline: zero-shot & few-shot on internal tasks.
  • QLoRA on 1K hand-curated examples — sanity check.
Days 31–60

Iterate

  • Scale SFT to 10K–100K curated examples.
  • Add preference data (DPO/IPO).
  • Build safety eval; add refusal & jailbreak fine-tune.
  • Distill or quantize for inference cost.
  • Set up online logging & thumbs-up/down for next round.
Days 61–90

Ship & harden

  • Canary the model behind a feature flag.
  • Add input/output safety classifiers.
  • Track regression on standard benchmarks weekly.
  • Establish a re-training cadence and a rollback path.
  • Document model card & data card.

20.3 Org & team shape

Role | Why
Data engineer (×2) | The training run is only as good as the corpus pipeline.
ML researcher | Owns recipe, hyperparameters, ablations.
Distributed-systems engineer | Cluster, NCCL, checkpointing, recovery.
Eval & alignment lead | Owns benchmarks, red-team, judge models.
Inference engineer | vLLM/TRT-LLM, quantization, latency budgets.
Product owner | Defines the bar; vetoes premature training.

21. Pitfalls & Anti-Patterns

Data

Training on the eval set

The classic. Always n-gram decontaminate before reporting numbers.

Data

"More data is always better"

Low-quality web slop can degrade the model. Cleaning > collecting.

Architecture

Changing tokenizer late

Tokenizer is frozen at pre-train. New special tokens require embedding resize and warm-up.

Training

Tuning on validation perplexity alone

Perplexity ≠ quality. Run capability evals too.

Training

No checkpoint hygiene

Lose one node, lose two weeks. Async checkpoint, multi-cloud copy.

Post-train

Catastrophic forgetting

SFT can erase pre-trained capabilities. Mix general data; cap LR; LoRA helps.

Post-train

Reward hacking in RL

Model finds a way to satisfy the reward without doing the task. Add KL penalty, watch behavior, not just reward.

Evals

One-seed comparisons

~1 point on small benchmarks is noise. Report mean ± std over 3+ seeds.

Evals

Trusting LLM-judge wholesale

Triangulate with human review and rule-based checks.

Infra

Single point of failure

Network switches, head nodes, shared FS — all need redundancy.

Cost

Over-training before product fit

Training a model nobody uses burns capital. Validate the use case with prompting + RAG first.

Safety

Relying on the model alone for safety

Always add an input/output classifier and audit log.
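
The classifier-plus-audit-log pattern can be sketched as a thin wrapper around the generation call. This is a minimal sketch: `generate`, `is_unsafe_input`, and `is_unsafe_output` are hypothetical callables you would supply, and the in-memory list stands in for a real append-only log store.

```python
import json
import time
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_unsafe_input: Callable[[str], bool],
                     is_unsafe_output: Callable[[str], bool],
                     audit_log: list[str]) -> str:
    """Check the prompt, call the model, check the response, and log every decision."""
    record = {"ts": time.time(), "prompt": prompt, "blocked": None}
    if is_unsafe_input(prompt):
        record["blocked"] = "input"      # the model is never called
        response = REFUSAL
    else:
        response = generate(prompt)
        if is_unsafe_output(response):
            record["blocked"] = "output"  # the raw response is discarded
            response = REFUSAL
    record["response"] = response
    audit_log.append(json.dumps(record))
    return response
```

Every request produces one log line whether or not it was blocked, which is what makes later audits and threshold tuning possible.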

22. Tooling Stack & Resources

22.1 Open-source training frameworks

Tool | For
Hugging Face Transformers + Accelerate + TRL | Most fine-tuning, DPO, PPO; the lingua franca
PyTorch FSDP / torchtitan | Reference FSDP for serious pre-training
Megatron-LM / Megatron-Core | 3D parallelism, frontier pre-training
NVIDIA NeMo | End-to-end on NVIDIA stack
DeepSpeed | ZeRO, MoE, large-scale training
Lit-GPT / Levanter / MaxText | Cleaner JAX/PyTorch reference implementations
Axolotl / Unsloth / LLaMA-Factory | Fast LoRA/QLoRA fine-tuning recipes
OpenRLHF / TRL / verl | PPO, GRPO, RLVR pipelines

22.2 Data tools

22.3 Eval & observability

22.4 Reference open base models (May 2026)

Family | Sizes | License
Llama 3 / 4 | 1B, 3B, 8B, 70B, 405B, Maverick-MoE | Llama Community License
Qwen 3 | 0.5B–235B (dense + MoE) | Apache 2.0
DeepSeek V3 / R1 | 67B dense, 671B MoE | MIT-style
Mistral / Mixtral | 7B, 8×7B, 8×22B, Large | Mixed (Apache for some)
Gemma 2 / 3 | 2B, 9B, 27B | Gemma terms
Phi-4 | 14B (synthetic-heavy) | MIT

22.5 Reading list

Short list for onboarding; the full annotated bibliography is in §23 References & sources.

23. References & sources

Annotated bibliography behind every major claim in this note—scaling laws, architecture choices, data pipelines, distributed training, alignment methods, benchmarks, inference stack, and cost bands. Section tags (e.g. §6) point to where the idea is used. Prefer primary papers and official model cards for canonical numbers; leaderboard scores and license terms change weekly.

Scope. Synthesis of peer-reviewed work, technical reports, and widely used open tooling (May 2026). Diagrams and workflow ordering are original unless noted. Dollar ranges and KPI strips are planning heuristics, not industry averages. Not legal, safety-certification, or compliance advice—confirm regulatory obligations with qualified counsel.

How to read: citations are numbered continuously [1]–[n] within this section. Cross-check arXiv versions against conference proceedings when you cite formally.

Landscape, scaling & open base models (§1, §6, §22)

  1. Kaplan et al., “Scaling Laws for Neural Language Models.” 2020 (arXiv preprint). Foundational compute–loss–size relationships. arxiv.org/abs/2001.08361 — §6.
  2. Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla). NeurIPS 2022. ~20× tokens/param rule of thumb; optimal N and D under fixed compute. arxiv.org/abs/2203.15556 — §1, §6–7.
  3. Yang et al., “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer” (μP). 2022. Hyperparameter transfer from small to large widths for stable large-model training. arxiv.org/abs/2203.03466 — §9.
  4. Brown et al., “Language Models are Few-Shot Learners” (GPT-3). NeurIPS 2020. Scale and in-context learning baseline. arxiv.org/abs/2005.14165 — §1.
  5. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models.” 2023. Open-weights pre-training recipe. arxiv.org/abs/2302.13971 — §1, §22.
  6. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models.” 2023. arxiv.org/abs/2307.09288 — §1.
  7. Meta AI, “The Llama 3 Herd of Models.” 2024. Annealing, data mix, long-context and post-training details. arxiv.org/abs/2407.21783 — §1, §3, §11, §22.
  8. DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.” 2024. MLA attention; MoE routing. arxiv.org/abs/2405.04434 — §5.
  9. DeepSeek-AI, “DeepSeek-V3 Technical Report.” 2024. Large MoE pre-train; FP8 training notes. arxiv.org/abs/2412.19437 — §1, §5, §10, §22.
  10. DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025. RLVR / long-CoT reasoning. arxiv.org/abs/2501.12948 — §1, §13.
  11. Jiang et al., “Mistral 7B.” 2023. Sliding-window attention; efficient dense baseline. arxiv.org/abs/2310.06825 — §1, §22.
  12. Jiang et al., “Mixtral of Experts.” 2024. Sparse MoE at 8×7B scale. arxiv.org/abs/2401.04088 — §1, §5, §22.
  13. Qwen Team, “Qwen2 Technical Report.” 2024. Multilingual + code scaling. arxiv.org/abs/2407.10671 — §1, §4, §22.
  14. Team Gemma et al., “Gemma 2: Improving Open Language Models at a Practical Size.” 2024. arxiv.org/abs/2408.00118 — §4, §22.
  15. Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.” 2024. Synthetic-heavy small-model training. arxiv.org/abs/2404.14219 — §3, §22.
  16. Epoch AI, “Compute Trends Across AI.” Living database of training FLOPs and cluster scale. epoch.ai/data — §1 KPIs, §19.

Architecture, attention & tokenization (§4–5)

  17. Vaswani et al., “Attention Is All You Need.” NeurIPS 2017. Transformer baseline. arxiv.org/abs/1706.03762 — §5.
  18. Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (RoPE). 2021. arxiv.org/abs/2104.09864 — §5, §11.
  19. Zhang & Sennrich, “Root Mean Square Layer Normalization.” 2019. RMSNorm used in Llama-class stacks. arxiv.org/abs/1910.07467 — §5.
  20. Shazeer, “GLU Variants Improve Transformer.” 2020. SwiGLU activation family. arxiv.org/abs/2002.05202 — §5.
  21. Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need” (MQA). 2019. arxiv.org/abs/1911.02150 — §5.
  22. Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models.” 2023. arxiv.org/abs/2305.13245 — §5.
  23. Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention.” NeurIPS 2022; FlashAttention-2/3 follow-ons on same line of work. arxiv.org/abs/2205.14135 — §5, §8, §10.
  24. Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models.” JMLR 2022. MoE routing at scale. arxiv.org/abs/2101.03961 — §1, §5.
  25. Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.” 2020. Expert parallelism patterns. arxiv.org/abs/2006.16668 — §5, §8.
  26. Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (BPE). 2016. Byte-level BPE lineage. arxiv.org/abs/1508.07909 — §4.
  27. Kudo & Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.” 2018. Unigram/BPE tooling. arxiv.org/abs/1808.06226 — §4.
  28. Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models.” 2023. Long-context RoPE scaling. arxiv.org/abs/2309.00071 — §1, §11.
  29. Liu et al., “Ring Attention with Blockwise Transformers for Near-Infinite Context.” 2023. Sequence/context parallel for 128K+ training. arxiv.org/abs/2310.01889 — §8.

Data sourcing, cleaning & mixture design (§3)

  30. Penedo et al., “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.” 2024. Open pre-training corpus; cleaning ratios cited in §3 funnel. arxiv.org/abs/2406.17557 — §3, §22.
  31. Li et al., “DataComp-LM: In search of the next generation of training sets for language models” (DCLM). 2024. arxiv.org/abs/2406.11794 — §3, §22.
  32. Soldaini et al., “Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.” 2024. arxiv.org/abs/2402.00159 — §22.
  33. Together AI, “RedPajama: An Open Source Recipe to Reproduce LLaMA Training Dataset.” 2023. github.com/togethercomputer/RedPajama-Data — §3, §22.
  34. Penedo et al., “The RefinedWeb Dataset for Falcon LLM.” 2023. Large-scale web filter pipeline. arxiv.org/abs/2306.01116 — §3.
  35. Xie et al., “DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining.” NeurIPS 2023. Learned domain weights. arxiv.org/abs/2305.10429 — §3.
  36. Lee et al., “Deduplicating Training Data Makes Language Models Better.” ACL 2022. Near-dup removal motivation. arxiv.org/abs/2107.06499 — §3.
  37. Brown et al., “Don’t Count on Luck: Mitigating Benchmark Contamination in LLM Evaluation.” 2024. Contamination index mindset in §3. arxiv.org/abs/2406.11252 — §3, §15.
  38. Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget.” 2023. Synthetic-data / model-collapse risks in §3. arxiv.org/abs/2305.17493 — §3.
  39. Common Crawl Foundation. Web crawl snapshots (raw input to most open corpora). commoncrawl.org — §3.

Pre-training loop, optimization & distributed systems (§7–10)

  40. Loshchilov & Hutter, “Decoupled Weight Decay Regularization” (AdamW). ICLR 2019. Default LLM optimizer. arxiv.org/abs/1711.05101 — §7, §9.
  41. Chen et al., “Training Deep Nets with Sublinear Memory Cost” (activation checkpointing). 2016. arxiv.org/abs/1604.06174 — §8, §10.
  42. Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC 2020. ZeRO-1/2/3 sharding. arxiv.org/abs/1910.02054 — §8.
  43. Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM.” 2021. Tensor + pipeline parallel. arxiv.org/abs/2104.04473 — §8, §22.
  44. Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” 2019. arxiv.org/abs/1909.08053 — §8.
  45. Zhao et al., “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.” 2023. arxiv.org/abs/2304.11277 — §8, §22.
  46. Microsoft DeepSpeed Team. ZeRO, MoE, and large-cluster training docs. deepspeed.ai — §8, §22.
  47. PyTorch Team, “torchtitan” reference pre-training stack. github.com/pytorch/torchtitan — §22.
  48. Micikevicius et al., “FP8 Formats for Deep Learning.” 2022. FP8 training background in §10. arxiv.org/abs/2209.05433 — §10.
  49. NVIDIA Corporation. H100/B200 datasheets and NeMo/Megatron-Core documentation for cluster sizing in §17. nvidia.com/data-center — §17, §19.

Mid-training, post-training & alignment (§11–13)

  50. Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT / RLHF). NeurIPS 2022. arxiv.org/abs/2203.02155 — §12.
  51. Schulman et al., “Proximal Policy Optimization Algorithms.” 2017. PPO baseline for RLHF. arxiv.org/abs/1707.06347 — §12.
  52. Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. arxiv.org/abs/2305.18290 — §1, §12.
  53. Azar et al., “A General Theoretical Paradigm to Understand Learning from Human Preferences” (IPO). 2024. DPO overfitting fix. arxiv.org/abs/2310.12036 — §12.
  54. Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization.” 2024. Binary feedback without pairs. arxiv.org/abs/2402.01306 — §12.
  55. Bai et al., “Constitutional AI: Harmlessness from AI Feedback.” 2022. RLAIF lineage. arxiv.org/abs/2212.08073 — §12, §16.
  56. Zhou et al., “LIMA: Less Is More for Alignment.” 2023. High-quality SFT sample efficiency. arxiv.org/abs/2305.11206 — §12.
  57. Shao et al., “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (GRPO). 2024. Group-relative RL without critic. arxiv.org/abs/2402.03300 — §1, §13.
  58. OpenAI, “Learning to Reason with LLMs” (o1 announcement post). 2024. Test-time compute / long CoT product framing. openai.com — §13.
  59. McKinney et al., “Avoiding Catastrophic Forgetting in Continual Learning of New Languages.” 2022. Mixing general data during CPT (§11). arxiv.org/abs/2205.09357 — §11, §21.

Parameter-efficient fine-tuning (§14)

  60. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. arxiv.org/abs/2106.09685 — §14.
  61. Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs.” NeurIPS 2023. NF4 + LoRA recipe in §14 code block. arxiv.org/abs/2305.14314 — §1 KPI, §14.
  62. Yadav et al., “TIES-Merging: Resolving Interference When Merging Models.” 2023. Adapter merge context in §14. arxiv.org/abs/2306.01708 — §14.

Evaluation, benchmarks & LLM-as-judge (§15)

  63. Hendrycks et al., “Measuring Massive Multitask Language Understanding” (MMLU). ICLR 2021. arxiv.org/abs/2009.03300 — §15.
  64. Wang et al., “MMLU-Pro: A More Robust and Challenging Multi-Task Language Model Benchmark.” 2024. arxiv.org/abs/2406.01574 — §15.
  65. Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” 2023. arxiv.org/abs/2311.12022 — §15.
  66. Chen et al., “Evaluating Large Language Models Trained on Code” (HumanEval). 2021. arxiv.org/abs/2107.03374 — §15.
  67. Cobbe et al., “Training Verifiers to Solve Math Word Problems” (GSM8K). 2021. arxiv.org/abs/2110.14168 — §15.
  68. Hendrycks et al., “Measuring Mathematical Problem Solving With the MATH Dataset.” NeurIPS 2021. arxiv.org/abs/2103.03874 — §15.
  69. Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them” (BBH). 2022. arxiv.org/abs/2210.09261 — §15.
  70. Zhou et al., “Instruction-Following Evaluation for Large Language Models” (IFEval). 2023. arxiv.org/abs/2311.07911 — §15.
  71. Liang et al., “Holistic Evaluation of Language Models (HELM).” 2022. arxiv.org/abs/2211.09110 — §22.
  72. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” 2023. Position/verbosity bias notes in §15.3. arxiv.org/abs/2306.05685 — §15.
  73. Gao et al., “A framework for few-shot language model evaluation” (lm-evaluation-harness). EleutherAI. github.com/EleutherAI/lm-evaluation-harness — §15, §22.
  74. OpenCompass Contributors. Open benchmark runner used alongside harness. github.com/open-compass/opencompass — §22.
  75. Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?” 2024. arxiv.org/abs/2404.06654 — §15.

Safety, red-teaming & governance (§16)

  76. Mazeika et al., “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming.” 2024. arxiv.org/abs/2402.04249 — §15–16.
  77. Anthropic, “Responsible Scaling Policy” and model cards. Frontier safety governance framing. anthropic.com/news — §16.
  78. UK AI Safety Institute, “Inspect” evaluation framework. Agent and capability evals in §22. github.com/UKGovernmentBEIS/inspect_ai — §22.
  79. NIST, AI Risk Management Framework (AI RMF 1.0). 2023. nist.gov/ai-rmf — §16.
  80. European Union, Regulation (EU) 2024/1689 (Artificial Intelligence Act). EUR-Lex — §16.

Inference, quantization & distillation (§18)

  81. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM). SOSP 2023. arxiv.org/abs/2309.06180 — §18, §22.
  82. Leviathan et al., “Fast Inference from Transformers via Speculative Decoding.” ICML 2023. arxiv.org/abs/2211.17192 — §18.
  83. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” 2023. arxiv.org/abs/2210.17323 — §18.
  84. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” 2023. arxiv.org/abs/2306.00978 — §18.
  85. Hinton et al., “Distilling the Knowledge in a Neural Network.” NIPS 2014 Deep Learning Workshop (arXiv 2015). Teacher–student framing in §18. arxiv.org/abs/1503.02531 — §18.
  86. Sanh et al., “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” 2019. Classic distillation baseline. arxiv.org/abs/1910.01108 — §18.
  87. ggml / llama.cpp project. CPU/edge GGUF inference in §18, §22. github.com/ggerganov/llama.cpp — §22.

Tooling, frameworks & datasets (§22)

  88. Hugging Face, Transformers, Accelerate, TRL, PEFT, and tokenizers documentation. Default fine-tuning and alignment stack. huggingface.co/docs/transformers — §14, §22.
  89. Axolotl, Unsloth, LLaMA-Factory. Community LoRA/QLoRA recipes referenced in §22.1. github.com/OpenAccess-AI-Collective/axolotl — §22.
  90. OpenRLHF, verl. PPO/GRPO/RLVR pipelines in §22.1. github.com/volcengine/verl — §13, §22.
  91. MosaicML Streaming, WebDataset. High-throughput sharded data loading. github.com/webdataset/webdataset — §22.
  92. Allen AI, OLMo / Dolma tooling (datatrove, text-dedup lineage). github.com/allenai/OLMo — §22.
  93. Weights & Biases, MLflow, TensorBoard. Experiment tracking named in §22.3. Vendor docs at respective sites — §22.

KPI strip & cost tables (§1, §19)

Figures such as 15T+ frontier tokens, ~20× Chinchilla ratio, $10M–$500M pre-train spend, and <$100 LoRA-on-7B are synthesis bands drawn from the Chinchilla, DeepSeek-V3, Epoch AI, and QLoRA sources above plus operator reports—not one published table. Re-derive any budget from your own FLOPs accounting and current GPU list pricing before you quote externally.
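
Re-deriving a budget from FLOPs accounting can be sketched with the common approximation that dense-transformer training costs about 6 × parameters × tokens FLOPs. This is a minimal sketch: the function name is hypothetical, and the peak throughput, utilization (MFU), and hourly price in the usage note are placeholder inputs, not quotes for any specific GPU or cloud.

```python
def pretrain_cost_usd(params: float, tokens: float,
                      peak_flops_per_gpu: float, mfu: float,
                      usd_per_gpu_hour: float) -> dict:
    """Rough pre-training budget from the ~6 * N * D training-FLOPs approximation."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    total_flops = 6.0 * params * tokens                   # forward + backward, dense model
    gpu_seconds = total_flops / (peak_flops_per_gpu * mfu)  # achieved, not peak, throughput
    gpu_hours = gpu_seconds / 3600.0
    return {
        "total_flops": total_flops,
        "gpu_hours": gpu_hours,
        "usd": gpu_hours * usd_per_gpu_hour,
    }


# Placeholder inputs: 7B params at ~20 tokens/param, 1e15 peak FLOP/s, 40% MFU, $2/GPU-hour.
estimate = pretrain_cost_usd(7e9, 1.4e11, 1e15, 0.40, 2.0)
```

The result covers a single clean run only. Failed runs, ablations, evals, storage, and staff time sit on top of it, and MoE models need the active parameter count rather than the total.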