Personal notes · May 2026

Vision-Language Models (VLM)

My working notes on multimodal systems — how VLMs are built, where they break in production, and how I choose between frontier APIs and open weights when latency, cost, or privacy matter. I wrote this for myself while shipping document AI and agent-style workloads; treat the model names and benchmarks as snapshots, not gospel.

Author: Linh Truong, MA (Harvard), MBA · Source: LinhTruong.com · Email: Linh@Alumni.Harvard.edu

The question I keep returning to: where does the model actually fail on my images — preprocessing, token budget, or reasoning — and what would I measure before betting a product on it?

Covers: Architecture · Training · Inference · Eval · Production

01 Where Things Stand (May 2026)

Vision-Language Models (VLMs) — I still hear Multimodal Large Language Models (MLLMs) in papers — jointly read pixels and text. Between 2023 and 2026 they went from lab demos (BLIP-2, MiniGPT-4) to something I reach for in production: document extraction, GUI agents, robotics, everyday chat. The pattern I see winning is native multimodal pretraining: one transformer ingests interleaved image, video, and text tokens, with a vision encoder (or pixel patchifier) feeding the same decoder that writes language.

My read: As of May 2026, frontier VLMs (GPT-5, Claude Opus 4.x, Gemini 2.5 Pro) are strong enough on static benchmarks that I have stopped arguing about "can the model see?" and started arguing about cost, latency, and failure modes. The interesting frontier is long video, fine-grained spatial grounding, GUI/computer-use agents, and vision-language-action (VLA) stacks for robotics. My default split: frontier API when reasoning is the bottleneck; open weights (Qwen3-VL, Llama 4 Vision, InternVL3, Pixtral, Molmo) when I need privacy, fine-tuning, or better unit economics.

Architecture — what I optimize first

The connector between vision encoder and LLM is the highest-leverage design choice. MLP projectors win on simplicity and quality; cross-attention (Flamingo, Llama 3.2 Vision) wins on token economy; native pixel-into-decoder (Fuyu, Chameleon) wins on flexibility.

Training — what actually moves the needle

Data quality dominates model size. Synthetic instruction data (LLaVA-Instruct, Cambrian, ShareGPT4V) and visual-RL/DPO are now table stakes.

Inference — where the bill shows up

Visual tokens dominate prefill cost. Resolution tiling, token pruning (FastV, VisionZip), and prefix caching of system images are where real money is saved.

Production — what breaks before the model does

Most VLM incidents I have seen are not "the model is dumb" — they are preprocessing, prompt formatting, and missing OCR fallback bugs. I build an eval harness before I pick a model.

02 What Is a Vision-Language Model

Working definition: A Vision-Language Model (VLM) is a neural network that conditions language generation on visual inputs (images, video frames, or screenshots) and optionally produces grounded outputs (bounding boxes, masks, points, actions). Modern VLMs share weights between vision and language reasoning rather than treating vision as a separate captioning step.

Capabilities I treat as baseline in 2026

Visual question answering — open-ended Q&A about an image.
Document understanding — read PDFs, invoices, forms, and slides without an OCR pre-step.
Chart, diagram, table reading — numerical extraction with reasoning.
OCR + handwriting — generally on par with dedicated OCR for printed text; weaker on stylized handwriting.
Visual grounding — output coordinates / bounding boxes for referred objects (frontier & many open models).
Multi-image & in-context vision — compare, diff, follow visual instructions.
Video understanding — up to minutes of video on frontier; seconds-to-minutes on open weights.
GUI / computer use — observe screenshots and emit click/type/scroll actions.
Embodied / robotics (VLA) — emit motor or end-effector actions from vision + instructions.

Capabilities Still Fragile

Counting many small objects (> 20) reliably.
Precise pixel-accurate localization without specialized heads.
Long video (> 1 hour) without sampling artifacts.
Reasoning about novel diagrams that require domain symbol knowledge.
Subtle factual visual hallucinations (color, count, relative position).

03 Anatomy & Architecture

At its core, every modern VLM is a four-stage pipeline:

Figure 1 — Canonical VLM Pipeline

3.1 The Vision Encoder

The encoder maps an image into a sequence of embeddings, one per spatial patch. Four families dominate:

Family	Examples	Why it wins	Where it loses
CLIP-style	OpenAI CLIP, EVA-CLIP, OpenCLIP	Strong zero-shot alignment from contrastive pretraining at scale.	Lower-resolution defaults; weaker on dense OCR.
SigLIP / SigLIP 2	Google SigLIP-SO400M	Sigmoid loss → better stability, smaller batches, strong fine-grained.	Slightly worse on retrieval-style tasks vs CLIP at parity.
Native ViT-only	InternViT-6B, AIM, DINOv2, DINOv3	Self-supervised features generalize; great for grounding & segmentation.	Requires its own alignment step into the LLM.
Encoder-free / Pixel-in	Fuyu-8B, Chameleon	Arbitrary resolution, simpler stack, true native multimodal.	Higher token cost; needs more pretraining data.

3.2 The Language Model

Almost every open VLM in 2026 is built on top of an existing LLM backbone — typically a Llama, Qwen, Mistral, Gemma, or DeepSeek variant. The choice of backbone constrains: context length, tool-use formatting, system-prompt behavior, license, and quantization support. A common mistake is assuming the VLM inherits the base LLM's full text quality; instruction-tuning on multimodal data often regresses pure-text performance by 3–8 % on hard reasoning benchmarks.

3.3 The Connector (a.k.a. Projector, Adapter, Resampler)

The connector compresses or projects vision-encoder outputs into the LLM's embedding space. This is where most engineering tradeoffs live; see Section 4.

04 Connector Patterns

Figure 2 — Four Connector Patterns at a Glance

4.1 When to pick which

You care most about	Pick	Notes
Best general quality on a fixed budget	MLP projector + dynamic-tiling encoder	The LLaVA-NeXT/Qwen2-VL/InternVL recipe. Default for 2026.
Many images or long video	Q-Former or cross-attention	Constant token budget per frame; Gemini-style architectures lean here.
Arbitrary resolution, simplicity	Encoder-free / pixel-in	Fuyu, Chameleon; Pixtral gets arbitrary resolution with a from-scratch ViT instead. Requires more pretraining compute.
Tight retrofit onto a strong text-LLM	Cross-attention	Lets you freeze the base LLM and add capability. Llama 3.2 Vision uses this.
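
To make the "constant token budget per frame" note concrete, here is a back-of-the-envelope sketch. The numbers are assumptions chosen for illustration: a 336-pixel input with 14-pixel patches (576 tokens, the LLaVA-1.5 setting) for the MLP projector, and 32 learned queries (the BLIP-2 Q-Former setting) for the fixed-budget connector. The function names are hypothetical.

```python
def mlp_projector_tokens(image_side: int = 336, patch: int = 14) -> int:
    """An MLP projector keeps one LLM token per ViT patch."""
    per_side = image_side // patch
    return per_side * per_side


def fixed_query_tokens(num_queries: int = 32) -> int:
    """A Q-Former-style resampler emits a fixed number of tokens per image."""
    return num_queries


def video_token_budget(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the LLM must prefill for a clip."""
    return num_frames * tokens_per_frame


frames = 64
print(video_token_budget(frames, mlp_projector_tokens()))  # 64 frames * 576 tokens
print(video_token_budget(frames, fixed_query_tokens()))    # 64 frames * 32 tokens
```

With these assumed settings the gap is 18× per frame, which is why fixed-budget connectors look attractive once frame counts grow.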

05 Training Lifecycle

Figure 3 — Modern VLM Training Stages

5.1 Data sources that matter

Web-scale image-caption pairs — LAION-5B, COYO-700M, DataComp; the workhorses for alignment.
Interleaved document corpora — MMC4, OBELICS; these teach multi-image reasoning.
Synthetic instruction — LLaVA-Instruct, ShareGPT4V, Cambrian-7M, Cauldron; GPT-4V-generated dialogues remain the single highest-leverage SFT data.
OCR & document — IDL, PDFA, DocVQA, ChartQA, InfographicVQA.
Grounding — RefCOCO, GRIT, Visual Genome with normalized box format.
Video — VideoChatGPT, NeXT-QA, ActivityNet, Ego4D, LLaVA-Video-178K.
GUI — SeeClick, OS-Atlas, Mind2Web, AndroidControl, Computer-Use traces.

Tip: When fine-tuning a VLM on your domain, mix 10–20 % general-domain instruction data back in. Pure-domain SFT reliably breaks instruction-following and OCR on out-of-domain images.
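
The mixing ratio can be applied mechanically when building the SFT set. A minimal sketch, assuming both datasets fit in memory as plain lists; `mix_sft_data` and its parameters are hypothetical names, and the 15 % default is simply the midpoint of the range above.

```python
import random


def mix_sft_data(domain, general, general_fraction=0.15, seed=0):
    """Return a shuffled SFT set in which `general_fraction` of the examples
    are general-domain. Every domain example is kept."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve g / (d + g) = f for g, the number of general examples to add.
    wanted = round(len(domain) * general_fraction / (1.0 - general_fraction))
    if wanted > len(general):
        raise ValueError("not enough general-domain examples for this ratio")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = list(domain) + rng.sample(list(general), wanted)
    rng.shuffle(mixed)
    return mixed
```

For example, 850 domain examples at a 15 % general fraction pull in 150 general examples for a 1,000-example set.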

5.2 LoRA, QLoRA, and parameter-efficient fine-tuning

For most production use cases, fine-tune only the connector + LoRA adapters on the LLM. Rule of thumb:

Goal	What to unfreeze	Typical recipe
Style / format only	LLM LoRA (r=16–32)	1–2 epochs, 5k–50k examples
New visual domain (medical, satellite)	Connector full + LLM LoRA + last 2 ViT layers	3 epochs, 100k+ examples, mix general data
New output modality (boxes, masks)	Connector + LLM LoRA + format tokens	Add ~5% synthetic grounding examples
Latency-critical domain model	Distill from frontier API → small open VLM	2-stage: caption distillation, then task SFT
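
For a sense of how small the LoRA ranks in the table are, the adapter size follows from the standard LoRA construction: each adapted weight of shape d_out × d_in gains two low-rank matrices, A (r × d_in) and B (d_out × r). A minimal sketch; the 4096-wide, 32-layer model and the four square attention projections per layer are assumptions for illustration (models with grouped-query attention have smaller k/v projections).

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


def total_lora_params(hidden: int, layers: int, rank: int, matrices_per_layer: int = 4) -> int:
    """Assumes `matrices_per_layer` square hidden x hidden projections per layer."""
    return layers * matrices_per_layer * lora_params(hidden, hidden, rank)


for rank in (16, 32):
    print(rank, total_lora_params(hidden=4096, layers=32, rank=rank))
```

Under these assumptions r=16 trains about 16.8M parameters and r=32 about 33.6M, a fraction of a percent of a 7B-class backbone.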

06 2026 Model Landscape

I bucket the market into frontier closed, frontier open, and specialist models. The table below is my shortlist when I am choosing what to pilot — May 2026 snapshot, re-check release notes before you commit.

6.1 Frontier closed (API-only)

Model	Provider	Strengths	Watch-outs
GPT-5 / GPT-5 mini	OpenAI	Strongest general multimodal reasoning; native voice + vision; computer-use mode.	Token costs; image rate limits; latency on heavy reasoning.
Claude Opus 4.x / Sonnet 4.x	Anthropic	Best-in-class document, chart, and PDF understanding; computer-use API; long-context (1M+).	No native video at parity yet; image input billed in tokens.
Gemini 2.5 Pro / Flash	Google	Long video (1–2h with caching), 2M-token context, native audio & video.	Behavior varies across regions; thinking-budget knob requires care.
Grok 3 / 4 Vision	xAI	Strong on real-world image reasoning; fast inference.	Newer ecosystem; fewer integrations.

6.2 Frontier open-weight (self-hosting viable)

Model	Sizes	Strengths	License
Qwen3-VL / Qwen2.5-VL	2B → 72B (dense & MoE)	Best open OCR & document; strong grounding; dynamic resolution.	Tongyi Qianwen (commercial OK with conditions)
Llama 4 Vision	Scout · Maverick · Behemoth class	Native MoE; long context; ecosystem support; strong English.	Llama Community
InternVL 3 / 2.5	1B → 78B	Best-in-class multilingual OCR; matches GPT-4o on MMMU at the 78B tier.	MIT
Pixtral 12B / Large	12B, 124B	Vision encoder trained from scratch for native arbitrary resolution; clean format.	Apache 2.0 (12B)
Molmo	1B / 7B / 72B	Open data + open weights; strong pointing/grounding via PixMo.	Apache 2.0
DeepSeek-VL2	3B / 16B / 27B MoE	Efficient MoE; strong reasoning per active param.	DeepSeek License
Phi-4-Multimodal	~5–6B	Small-but-mighty; audio + vision + text; edge-friendly.	MIT
PaliGemma 2	3B / 10B / 28B	Task-specialist; best for fine-tuning to narrow vision tasks.	Gemma
NVLM / NVILA	7B – 72B	NVIDIA recipes; strong out-of-box on document & chart.	NVIDIA OneWay (research-leaning)
Aria	~25B MoE (3.9B active)	Long video & document MoE.	Apache 2.0

6.3 Specialists worth knowing

Document & OCR

GOT-OCR 2.0, Nougat, MinerU, SmolDocling, olmOCR — purpose-built for layout-aware extraction. Use as fallback or preprocessor when a generalist VLM under-performs.

GUI / Computer Use

UI-TARS, OS-Atlas, ShowUI, SeeClick, Claude Computer Use, GPT-5 Operator. These output click/keystroke actions from screenshots.

Vision-Language-Action (Robotics)

RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, Helix. Emit end-effector or motor commands.

Embedding / Retrieval

SigLIP 2, Jina CLIP v2, NV-Embed-MM, ColPali, VisRAG — image+text embeddings for multimodal RAG.

Edge / Mobile

MobileVLM v2, MiniCPM-V 2.6 / 4.0, SmolVLM 2, Phi-4 mini-multimodal. Sub-3B models that run on phones, Jetsons, and laptops.

Open Foundational

Chameleon, Emu3, Janus-Pro — early/native-token models that can also generate images. Useful for understanding-generation unification research.

07 Image Tokenization & Resolution Strategies

The single most expensive variable at inference time is the number of vision tokens per image. A 4096×4096 image cut into 14×14-pixel ViT patches with no merging becomes ~85k tokens (a 3840×2160 screenshot is still ~42k), exhausting context windows and budgets. Modern VLMs use one of three strategies:

Figure 4 — Resolution Strategies
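
The patch arithmetic behind naive tokenization is easy to check. A minimal sketch, assuming one token per 14×14-pixel patch with no merging or pooling; `patch_tokens` is a hypothetical helper, not any model's actual preprocessing.

```python
import math


def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of ViT patches when every patch becomes one token (no merging)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


print(patch_tokens(4096, 4096))  # 85849: a square 4096-pixel image
print(patch_tokens(3840, 2160))  # 42625: a 4K UHD screenshot
print(patch_tokens(336, 336))    # 576: a typical fixed-resolution encoder input
```

The count grows with pixel area, so halving each side cuts the token bill by roughly four.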

7.1 Token-budget rules of thumb

For chat-style use, cap per-image tokens at 1500–2000. Quality plateaus above that for everything except small-text OCR.
For document Q&A, use per-page tiling and pass pages serially in a structured prompt — never one giant image.
For multi-image input, prefer uniform sizing across turns; mixed sizes break some caching layers.
Token-dropping (FastV, VisionZip, PruMerge, SparseVLM) typically saves 40–70 % of vision tokens at a 0–2 % quality loss on standard benchmarks; test on your data.
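
The per-image cap in the first rule can be enforced before the image ever reaches the model. A minimal sketch, assuming one token per 14×14-pixel patch and no downstream token merging; `fit_to_token_budget` is a hypothetical helper, and real preprocessors (tiling, 2×2 merging) will count tokens differently.

```python
import math


def fit_to_token_budget(width, height, max_tokens=2000, patch=14):
    """Return a patch-aligned (width, height) whose patch count is <= max_tokens.

    Over-budget images are scaled down with aspect ratio roughly preserved;
    images already under budget are only rounded up to the next patch multiple.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_tokens:
        scale = math.sqrt(max_tokens / (cols * rows))
        # Flooring both sides guarantees cols * rows stays within the budget.
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return cols * patch, rows * patch


print(fit_to_token_budget(3840, 2160))  # target size for a 4K screenshot
```

Resize to the returned size with whatever image library you already use, and keep the cap as a per-task setting so small-text OCR can opt into a larger budget.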

08 Beyond Single Images — Video, Documents, Charts, UI

8.1 Video

Three practical approaches:

Uniform frame sampling (1–2 fps) — simplest; works to ~5 minutes on most open VLMs.
Keyframe / scene-cut sampling — pre-process with a shot detector; pass 16–64 frames.
Native video tokens (Qwen3-VL, Gemini 2.5, GPT-5) — model expects a video tensor or URI and handles temporal compression internally.

For long video, combine retrieval (Video-RAG) with frame sampling — embed clips, retrieve relevant windows, then pass only those to the VLM. M-RoPE / temporal positional embeddings let modern VLMs maintain absolute-time awareness across long sequences.
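
Uniform sampling with a frame cap can be sketched in a few lines. This assumes you decode frames by timestamp with your own video library; `sample_timestamps` and its defaults are hypothetical, with the 64-frame cap taken from the upper end of the keyframe range above.

```python
def sample_timestamps(duration_s: float, fps: float = 1.0, max_frames: int = 64):
    """Uniformly spaced timestamps in seconds.

    Samples at `fps` while that stays within `max_frames`; for longer clips it
    spreads `max_frames` evenly over the whole duration instead.
    """
    if duration_s <= 0 or fps <= 0 or max_frames <= 0:
        return []
    n = min(max_frames, max(1, int(duration_s * fps)))
    step = duration_s / n
    # Use the midpoint of each segment so no sample sits exactly at 0 or the end.
    return [round((i + 0.5) * step, 3) for i in range(n)]


print(sample_timestamps(10, fps=1.0))        # ten frames, one per second
print(len(sample_timestamps(600, fps=2.0)))  # capped at 64 frames
```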

8.2 Documents and PDFs

The document pipeline I use when accuracy matters more than demo speed:

Render each page to image at 150–200 DPI (higher only for fine-print).
Tile aggressively for dense pages (Qwen3-VL or InternVL handle this natively).
Pair with layout tokens from a lightweight layout model (e.g., DocLayout-YOLO) for structured extraction.
Use a generalist VLM (Claude, GPT-5, Qwen3-VL-72B) for end-to-end Q&A and citation; reserve specialist OCR (GOT-OCR 2.0, MinerU) for archival extraction at scale.

8.3 Charts, tables, math

Charts: ChartQA, ChartGemma, ChartLlama; emit table + answer to reduce hallucination.
Tables: render → VLM → JSON schema, validated with Pydantic.
Math/diagrams: MathVista, MathVerse evaluation; chain-of-thought prompting helps; for diagrammatic geometry, code-execution agents outperform pure VLMs.

8.4 GUI and screenshots

For agents that drive UIs, the screenshot resolution policy is critical: too small → can't read buttons, too large → cost explodes. Use a two-pass recipe — low-res for planning, high-res crops for clicking. Specialist GUI models (UI-TARS, OS-Atlas) ship with click-coordinate output formats; align your harness to that schema.
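
In the two-pass recipe, a click predicted on the high-res crop has to be mapped back to full-screenshot pixels before it is executed. A minimal sketch, assuming the model returns pixel coordinates in the space of the image it was sent; `crop_to_screen` is a hypothetical helper, and models that emit normalized coordinates need an extra scaling step first.

```python
def crop_to_screen(click_xy, crop_box, crop_render_size):
    """Map a click predicted on a rendered crop back to screenshot pixels.

    click_xy         -- (x, y) in the pixel space of the image sent to the model
    crop_box         -- (left, top, right, bottom) of the crop in screenshot pixels
    crop_render_size -- (width, height) the crop was resized to before sending
    """
    x, y = click_xy
    left, top, right, bottom = crop_box
    render_w, render_h = crop_render_size
    scale_x = (right - left) / render_w
    scale_y = (bottom - top) / render_h
    return round(left + x * scale_x), round(top + y * scale_y)


# A 400x200 region upscaled 2x to 800x400; the model clicks the crop's centre.
print(crop_to_screen((400, 200), (1000, 500, 1400, 700), (800, 400)))
```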

09 Inference & Serving

Figure 5 — Inference Path & Optimization Surface

9.1 Recommended serving stacks (2026)

Stack	Best for	Notes
vLLM (≥0.7)	General OSS VLM serving	First-class multimodal: Qwen2/3-VL, Llama 3.2/4 Vision, Pixtral, InternVL, MiniCPM, Phi-4. Prefix caching for images.
SGLang	Complex multi-turn / structured-output	RadixAttention for prefix caching across turns and images; great with tool use.
TensorRT-LLM	NVIDIA-optimized production	Best raw throughput on H100/H200/B100; longer integration tail per new architecture.
TGI (HF)	HuggingFace-native pipelines	Easy to deploy; slightly behind on multimodal feature parity.
llama.cpp / Ollama / MLX	Edge, laptop, on-prem dev	GGUF-quantized Qwen2-VL, MiniCPM-V, SmolVLM, LLaVA — viable on a MacBook.
Triton + custom	Hyperscale custom	Lean here only when single-team ownership justifies the operational cost.

9.2 Prefix and image caching

Tip: If your application repeatedly sends the same system image (logo template, base UI), serve it with a server that supports prefix caching (vLLM, SGLang, Anthropic prompt cache, Gemini context cache, OpenAI prompt caching). It is typically the single biggest cost win, often 10× on heavy-image workloads.

9.3 Batching strategy

VLM batching is dominated by prefill, not decode. Use continuous batching with chunked prefill so large vision-token prompts don't starve other requests.
For mixed text-only + image traffic, run two pools or use a router — image traffic will otherwise create head-of-line blocking.
Track P50/P95 vision tokens per request as a first-class metric.
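
Tracking vision-token percentiles needs nothing more than the per-request counts you already log. A minimal sketch using the nearest-rank definition; `percentile` is a hypothetical helper that takes an integer percentile, and a metrics backend will usually do this for you.

```python
def percentile(values, p: int):
    """Nearest-rank percentile: the smallest value with at least p% of the
    samples at or below it. `p` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[rank - 1]


vision_tokens = [1200, 300, 800, 4000, 1500]
print(percentile(vision_tokens, 50), percentile(vision_tokens, 95))
```

A P95 far above P50, as in this toy sample, is the signature of a few oversized images that will cause head-of-line blocking.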

10 Quantization & Optimization

Technique	Quality impact	When to use
BF16 / FP16	Baseline	Default for training and high-throughput H100 inference.
FP8 (E4M3 / E5M2)	≤1 % drop on most benchmarks	Hopper/Blackwell production inference. TensorRT-LLM & vLLM both support it.
INT8 W8A8	~1–2 % drop	A100/L4/L40S deployments.
INT4 (AWQ, GPTQ, GGUF Q4)	1–4 % drop; OCR/chart can regress more	Memory-constrained or edge. Test on your visual tasks before shipping.
FP4 / NVFP4	~1 % with calibration	Blackwell-only; emerging for VLMs.
Vision-token pruning	0–2 % drop	FastV, VisionZip, SparseVLM — drop after layer 2–4. Free throughput.
Speculative decoding	None (lossless)	Draft-model based; great for long answers.

Warning: Always benchmark quantized VLMs on your OCR / document / chart tasks. Quantization affects fine-grained visual reasoning more than general VQA — a model that scores within 1 % on MMBench can drop 10 % on a small-text invoice task.

11 Evaluation & Benchmarks

11.1 Public benchmarks worth tracking

Benchmark	What it measures	Signal quality
MMMU / MMMU-Pro	College-level multimodal reasoning across 30 subjects in 6 disciplines	High; scores still separate top models in 2026
MathVista, MathVerse	Math + diagrams	High; frontier-only
MMBench / MMBench-v1.1	General VQA across abilities	Medium; saturated on frontier
MMStar	VQA with leak-resistant questions	High
DocVQA, ChartQA, InfographicVQA	Document & chart reading	High for vertical apps
OCRBench / OCRBench v2	OCR accuracy across languages, fonts	High; pair with your own samples
RealWorldQA, BLINK, MMVP	Perceptual edge cases & pairs	High; cheap to run
RefCOCO/+/g, ReasonSeg	Visual grounding	High for agents
VideoMME, MVBench, LongVideoBench	Video understanding short→long	High; gold standard for video VLMs
POPE, HallusionBench, MMHal	Hallucination rate	Critical for production decisions
ScreenSpot, ScreenSpot-Pro, OSWorld	GUI grounding & computer use	High for agent products
VLMEvalKit / lmms-eval	Harness running 50+ benchmarks reproducibly	Use this; do not roll your own

11.2 Build your own eval set

Public benchmarks tell you whether a model is generally capable. They do not tell you whether it will work on your images. Every team shipping VLMs should build:

200–500 golden examples drawn from real production traffic (with consent & redaction).
Pairwise human or LLM-judge scoring with explicit rubrics (grounding, hallucination, format).
Adversarial slices: low-light, occluded, multi-lingual, low-DPI scans, rotated.
Tracked over time in a dashboard with diff-vs-baseline.
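
The per-slice tracking and diff-vs-baseline step can start as a few lines before it becomes a dashboard. A minimal sketch, assuming each graded example reduces to a (slice name, correct?) pair; the function names and the 2-point tolerance are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs -> {slice: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        hits[slice_name] += int(bool(is_correct))
    return {name: hits[name] / totals[name] for name in totals}


def regressions(baseline, candidate, tolerance=0.02):
    """Slices where the candidate is worse than the baseline by more than
    `tolerance`, mapped to the (negative) accuracy change."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }
```

Comparing per slice rather than overall is the point: a candidate can match the baseline on average while losing badly on low-DPI scans.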

Rule: Eval first, model second. By my rough count, a team without an eval harness picks the wrong model about 60 % of the time and never finds out.

12 Hallucinations & Safety

12.1 Common failure modes

Object hallucination — describing objects that aren't there (especially in low-confidence regions or after long context).
Color/count/spatial errors — left/right confusion, miscount of similar objects, color drift.
Hyper-confident OCR — fabricating plausible numbers when the image is unreadable.
Prompt-induced confirmation bias — leading questions ("Is the cat on the table?") nudge the model to confirm.
Cross-image leakage — info from earlier turns bleeds into the current answer.

12.2 Mitigations

Use neutral framing in prompts; avoid presupposing answers.
Ask for citations to image regions (bounding boxes or "according to the top-right panel").
Run a self-consistency check: one sample at temperature 0 plus one high-temperature sample; flag any disagreement for human review.
Cascade: VLM → verifier. A second model (or a rule) checks the answer's compatibility with the image.
Use models trained with visual RL/DPO for hallucination (POVID, RLHF-V, VLFeedback).
For numbers, force a show-your-work output (extract the cell, then compute).
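
The self-consistency mitigation reduces to comparing a greedy answer against one or more sampled answers. A minimal sketch that assumes you already have the answer strings from your own model calls; `normalize` and `needs_review` are hypothetical helpers, and the normalization is deliberately crude.

```python
def normalize(answer: str) -> str:
    """Cheap canonical form so trivial formatting differences
    (case, thousands separators, extra spaces) do not count as disagreement."""
    return " ".join(answer.lower().replace(",", "").split())


def needs_review(greedy_answer: str, sampled_answers) -> bool:
    """True when any higher-temperature sample disagrees with the
    temperature-0 answer after normalization."""
    reference = normalize(greedy_answer)
    return any(normalize(sample) != reference for sample in sampled_answers)


print(needs_review("Total: 1,250.00", ["total: 1250.00"]))  # formatting only
print(needs_review("1250.00", ["1280.00"]))                 # real disagreement
```

For structured outputs, compare field by field instead of whole strings so a single unstable value does not hide behind matching ones.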

12.3 Safety surface specific to VLMs

Image-as-jailbreak — text rendered inside an image bypasses some text safety filters. Defense: run prompt-injection detection on extracted image text.
PII leakage — VLMs willingly transcribe IDs, license plates, faces. Filter or redact before display.
Bias in perception — counts, ages, attributes can be systematically biased; audit per-cohort.
CSAM / illegal content — gateway your image input with hash + classifier checks; this is non-negotiable.

13 Agents, Computer Use & Vision-Language-Action

Figure 6 — Agentic VLM Loop

13.1 Computer Use

Claude Computer Use, GPT-5 Operator, and open-weights stacks like UI-TARS, OS-Atlas, and ShowUI let a VLM drive a desktop or browser. What I have learned in practice:

Click accuracy gates success — use a specialist grounding model if your generalist's coordinates jitter.
Add a visual verifier that confirms the post-action screenshot matches intent before continuing.
Cap step counts and add a circuit breaker — runaway loops are the #1 incident class.
Sandbox by default. The model will follow instructions written in webpages and emails.
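
The step cap and circuit breaker fit in a small loop wrapper. A minimal sketch under a hypothetical interface: `next_action(history)` asks the model for the next action (returning None when done) and `execute(action)` performs it. The visual verifier and sandboxing from the list above are left out to keep it short.

```python
def run_agent(next_action, execute, max_steps=25, max_repeats=3):
    """Drive an agent loop with two breakers: a hard step cap and a
    repeated-action detector. Returns (status, executed_actions)."""
    history = []
    repeats = 0
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:
            return "done", history
        # Count consecutive identical actions; a stuck agent repeats itself.
        repeats = repeats + 1 if history and history[-1] == action else 1
        if repeats >= max_repeats:
            return "aborted: repeated action", history
        execute(action)
        history.append(action)
    return "aborted: step cap", history


status, actions = run_agent(lambda history: "click(10, 10)", lambda action: None)
print(status, len(actions))
```

The breaker trips before the third identical action is executed, so a stuck agent costs at most two redundant clicks rather than a full step budget.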

13.2 Vision-Language-Action (VLA) for robotics

VLA models output low-level robot commands instead of text. RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, and Figure's Helix dominate 2026 demos. The architectural twist: action tokens are interleaved with language tokens in the same decoder. When I wire up a VLA stack, I plan for:

Discrete action vocabularies (binned end-effector deltas) — easy to log and replay.
50–200 Hz action loops; the VLM runs at a lower rate and hands off to a fast low-level controller.
Safety bounds enforced outside the model (workspace limits, force ceilings).
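
Binned action vocabularies and outside-the-model bounds can both be shown with one dimension of an end-effector delta. A minimal sketch; the 256-bin vocabulary and the [-1, 1] range are assumptions for illustration, and the helper names are hypothetical.

```python
def clamp(value, low, high):
    """Safety bound applied outside the model, whatever the policy emitted."""
    return max(low, min(high, value))


def to_bin(delta, low=-1.0, high=1.0, bins=256):
    """Map a continuous delta to an integer action token in [0, bins - 1]."""
    fraction = (clamp(delta, low, high) - low) / (high - low)
    return min(bins - 1, int(fraction * bins))


def from_bin(index, low=-1.0, high=1.0, bins=256):
    """Decode an action token back to the centre of its interval."""
    return low + (index + 0.5) * (high - low) / bins


token = to_bin(0.3)
print(token, from_bin(token))  # the round trip is accurate to half a bin width
```

Because tokens are plain integers, a logged episode can be replayed exactly, which is the main practical argument for discrete vocabularies.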

14 Shipping VLMs in Production

14.1 Pattern library

Visual Q&A

Difficulty: simple. Single API call. Structured output schema. Prompt-cache the system image if reusable.

Visual RAG

Difficulty: medium. CLIP/SigLIP/ColPali embeddings → retrieve top-k images → VLM answers with citations. Best for knowledge bases that are images (slides, catalogs, manuals).

Doc Extraction Pipeline

Difficulty: medium. Layout model → VLM per region → JSON schema → validator → human-in-loop for low confidence. Always pair with a deterministic format check.

Vision Agent

Difficulty: hard. Loop: observe → plan → act → verify. Specialist GUI model + generalist planner is the strongest combo today.

Edge VLM

Difficulty: simple to medium. MiniCPM-V 4.0 / SmolVLM 2 / Phi-4 mini on device for latency, privacy, or offline use. Bigger model in the cloud as fallback.

Distillation Pipeline

Difficulty: hard. Frontier API labels → small open VLM SFT → in-house deployment. ROI shows up at >1M monthly requests.

14.2 Prompting patterns I use in production

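Put the task before the image. Many models do better when the question precedes the image, but provider guidance differs (Anthropic recommends image first for Claude), so verify the order on your eval set. Either way, a consistent order reduces cache misses.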
Be explicit about output format. Ask for JSON or XML and validate. Free-form prose invites hallucination.
Force grounding with phrases like "Quote the exact text from the image" or "Give the bounding box [x1, y1, x2, y2] for each item."
Restrict the answer space. "Choose from A/B/C/D/Unknown" reduces wandering and produces measurable accuracy.
Allow refusal. Include "If you cannot see X, reply 'NOT_VISIBLE'." This dramatically reduces fabrication.

// Production prompt skeleton
SYSTEM: You are a document understanding assistant.
        Read the image carefully. If a value is illegible,
        return "ILLEGIBLE" — never guess.

USER:   [IMAGE]
        Extract the following fields as JSON:
        - invoice_number (string)
        - total_amount (number, no currency symbol)
        - line_items (array of {description, quantity, unit_price})
        For each value, also provide bbox: [x1,y1,x2,y2] in 0–1000 coords.
        Output ONLY valid JSON. No prose, no markdown.
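
The skeleton only pays off if the reply is checked before anything downstream trusts it. A minimal sketch using only the standard library (in place of a schema library such as Pydantic). The prompt does not fix where the boxes live, so the top-level "bbox" object mapping field name to box is my assumption, and both function names are hypothetical.

```python
import json


def valid_bbox(box) -> bool:
    """True for [x1, y1, x2, y2] with numbers in 0..1000, x1 <= x2, y1 <= y2."""
    if not isinstance(box, list) or len(box) != 4:
        return False
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in box):
        return False
    x1, y1, x2, y2 = box
    return all(0 <= v <= 1000 for v in box) and x1 <= x2 and y1 <= y2


def validate_invoice(raw: str):
    """Return (data, errors); an empty error list means the payload is usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]
    errors = []
    if not isinstance(data.get("invoice_number"), str):
        errors.append("invoice_number must be a string")
    total = data.get("total_amount")
    is_number = isinstance(total, (int, float)) and not isinstance(total, bool)
    # "ILLEGIBLE" is the sentinel the system prompt allows instead of a guess.
    if not is_number and total != "ILLEGIBLE":
        errors.append("total_amount must be a number or ILLEGIBLE")
    if not isinstance(data.get("line_items"), list):
        errors.append("line_items must be an array")
    # Assumed layout: a top-level "bbox" object mapping field name -> box.
    boxes = data.get("bbox")
    if not isinstance(boxes, dict):
        errors.append("bbox must be an object")
    else:
        errors.extend(f"bbox for {name} is invalid"
                      for name, box in boxes.items() if not valid_bbox(box))
    return data, errors
```

Route any non-empty error list to a retry or to human review rather than repairing the payload silently.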

14.3 Observability

Log: model id, image hash, vision-token count, latency (TTFT, TPS), output schema validity, refusal flag.
Sample 1–5 % of requests for human review.
Track drift weekly: same eval set, same model, watch for regressions tied to provider updates.
Break dashboards down by image type (PDF page, screenshot, photo); failure modes differ wildly.
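
The fields in the first bullet map directly onto one structured log record per request. A minimal sketch; the class and field names are hypothetical, and the image is stored only as a SHA-256 hash so the log never contains pixels.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class VlmRequestLog:
    model_id: str
    image_sha256: str
    vision_tokens: int
    ttft_ms: float        # time to first token
    tokens_per_s: float   # decode throughput
    schema_valid: bool
    refused: bool
    image_type: str       # e.g. "pdf_page", "screenshot", "photo"


def image_hash(image_bytes: bytes) -> str:
    """Hash the raw bytes so identical images can be grouped without storing them."""
    return hashlib.sha256(image_bytes).hexdigest()


def to_log_line(record: VlmRequestLog) -> str:
    """One JSON object per line, with stable key order for easy diffing."""
    return json.dumps(asdict(record), sort_keys=True)


record = VlmRequestLog("example-vlm", image_hash(b"raw image bytes"), 1200,
                       310.0, 42.5, True, False, "pdf_page")
print(to_log_line(record))
```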

15 Cost & Throughput Engineering

15.1 Cost levers ranked by impact

Prefix cache reusable images (logo, base template, system shot) — typical 5–15× cost reduction.
Right-size resolution per task — most chat-style use needs ≤1024×1024.
Cascade: a small open VLM screens requests; the frontier model handles only the ambiguous 5–20 %.
Vision-token pruning (FastV, VisionZip) on self-hosted — 40–70 % prefill saving.
FP8 / INT4 weights — 2–4× throughput on the same GPU.
Speculative decoding — 1.5–3× decode speedup, lossless.
Batched async — never block on a single request when your traffic is mixed.

15.2 GPU memory budgeting (rough)

Model class	BF16 weights	FP8 weights	Min GPU for serving
~2–4B (Phi-4, SmolVLM, MiniCPM)	~6–9 GB	~3–5 GB	RTX 4060 / Jetson Orin
~7–8B (LLaVA-NeXT, Qwen2-VL-7B)	~16 GB	~9 GB	L4 / 24 GB
~12–14B (Pixtral 12B, Phi-4 14B)	~26 GB	~14 GB	L40S (BF16); A10G or single 4090 only with 8-bit weights
~30–40B	~70 GB	~38 GB	A100 80 GB or H100
~70–78B (Qwen3-VL-72B, InternVL-78B)	~150 GB	~80 GB	2× H100 (BF16) / 1× H200 (FP8) / 1× B100

Add ~10–25% headroom for KV cache and activations; more if you accept many concurrent long-context image requests.
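
The table rows follow from parameter count times bytes per parameter, plus the headroom above. A minimal sketch, counting only LLM weights (the vision encoder adds a little more) and using 1 GB = 1e9 bytes; `serving_memory_gb` is a hypothetical helper.

```python
def serving_memory_gb(params_billion: float, bytes_per_param: float,
                      headroom: float = 0.2) -> float:
    """Weights-only footprint in GB plus a fractional allowance for KV cache
    and activations. BF16 is 2 bytes per parameter, FP8 is 1, INT4 is 0.5."""
    if params_billion <= 0 or bytes_per_param <= 0 or headroom < 0:
        raise ValueError("sizes must be positive and headroom non-negative")
    return params_billion * bytes_per_param * (1.0 + headroom)


print(serving_memory_gb(8, 2, headroom=0.0))    # 8B in BF16, weights only
print(serving_memory_gb(72, 1, headroom=0.25))  # 72B in FP8 with 25 % headroom
```

Treat the result as a floor: long-context image requests at high concurrency can push KV cache well past a fixed percentage.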

16 What I Am Watching Next

What is shipping in 2026

Native any-modality decoders — one model emits text, audio, and image tokens (GPT-5, Gemini 2.5, Chameleon/Janus lineage).
Long video — 1+ hour reasoning becoming mainstream via temporal compression + caching.
Test-time scaling for vision — visual chain-of-thought, search over visual hypotheses (o-series, R1-Vision style).
Generalist VLA — robotics policies that transfer across embodiments.
Multimodal RAG with grounded citations — answer + bounding-box source by default.

Still open

Reliable counting and spatial reasoning at high object density.
Compositional generalization on unusual object–attribute combinations.
Continuous video memory beyond a few hours without summarization tricks.
Faithfulness under adversarial images (typographic, OCR-injection attacks).
Efficient on-device inference with full quality preserved for OCR.
Standardized grounding output formats across providers — every model still emits a different schema.

17 How I Choose a Model

Figure 7 — Model Selection Decision Tree

17.1 Mistakes I have seen (or made)

Picking a model without an eval set. I have watched teams learn this in production — painfully.
Sending raw 4K images. I cap resolution early; extra pixels rarely buy accuracy on chat-style tasks.
Ignoring prefix caching. Most repeating-image workloads are 5–15× over budget without it.
Free-form output. Always demand JSON / XML with a validator.
One-shot agent loops. Add a verifier and a circuit breaker; runaway loops are the most common incident.
Trusting general benchmarks. MMMU does not predict your invoice-reading accuracy.
Skipping quantization checks on visual tasks. INT4 can silently degrade OCR by 10 %+.

18 Glossary & References

18.1 Glossary

VLM / MLLM	Vision-Language Model / Multimodal LLM
ViT	Vision Transformer — patch-based image encoder.
CLIP / SigLIP	Contrastive image-text encoders; SigLIP uses sigmoid loss.
Connector / Projector	Module bridging vision encoder to LLM embedding space.
Q-Former	Querying transformer (BLIP-2) producing fixed-size visual token budget.
Cross-attention adapter	Gated attention layers inside LLM that read vision K/V (Flamingo).
AnyRes / Dynamic tiling	Variable-resolution image input via tile + thumbnail strategy.
M-RoPE	Multimodal Rotary Position Embedding — extends RoPE over (t, h, w).
Vision-token pruning	Dropping less-informative visual tokens at intermediate LLM layers.
Visual instruction tuning	SFT on (image, instruction, response) triples; LLaVA-style.
VLA	Vision-Language-Action — VLMs that emit motor commands.
Computer Use	VLM-driven control of a screen via screenshots + click/keystroke actions.
Multimodal RAG	Retrieval-Augmented Generation where retrieved items are images or mixed.
Prefix cache	Reusing KV-cache for repeated prompt prefixes (often a system image).
Hallucination	Output that contradicts, or is not supported by, the input image; the dominant model-side failure in production.

18.2 Reference papers and projects

Radford et al., CLIP (2021) — contrastive image-text pretraining.
Alayrac et al., Flamingo (2022) — gated cross-attention, in-context multimodal.
Li et al., BLIP-2 (2023) — Q-Former alignment.
Liu et al., LLaVA & LLaVA-NeXT (2023–24) — open visual instruction tuning recipe.
Zhai et al., SigLIP / SigLIP 2 (2023–25) — sigmoid loss image encoders.
Bai et al., Qwen-VL · Qwen2-VL · Qwen3-VL (2023–26) — dynamic-resolution VLMs.
Chen et al., InternVL · InternVL 2.5 · InternVL 3 (2024–26).
Mistral AI, Pixtral 12B · Pixtral Large (2024–25) — variable-resolution vision encoder trained from scratch.
Allen AI, Molmo & PixMo (2024) — open data, open weights, pointing.
Microsoft, Phi-3.5 Vision · Phi-4 Multimodal (2024–25) — small-but-mighty.
Anthropic, Claude 3.5/4 with vision, Computer Use API.
OpenAI, GPT-4V / GPT-4o / GPT-5 system cards.
Google DeepMind, Gemini 1.5/2.0/2.5 technical reports.
Brohan et al., RT-2 (2023); Kim et al., OpenVLA (2024); Physical Intelligence π0/π0.5 (2024–25) — VLA models.
Chen et al., FastV (2024); Yang et al., VisionZip (2024) — visual-token pruning.
VLMEvalKit, lmms-eval — community evaluation harnesses.

How I maintain this: I refresh these notes after major release cycles — roughly quarterly. Public leaderboards lag what I see in production by a few months, so I keep a private eval set and re-run it when a provider ships a new vision model.