My working notes on multimodal systems — how VLMs are built, where they break in production,
and how I choose between frontier APIs and open weights when latency, cost, or privacy matter.
I wrote this for myself while shipping document AI and agent-style workloads; treat the model
names and benchmarks as snapshots, not gospel.
The question I keep returning to: where does the model actually fail on my images — preprocessing, token budget, or reasoning — and what would I measure before betting a product on it?
Covers: Architecture · Training · Inference · Eval · Production
Written: May 2026
By: Linh Truong
01 Where Things Stand (May 2026)
Vision-Language Models (VLMs) — I still hear Multimodal Large Language Models (MLLMs) in papers —
jointly read pixels and text. Between 2023 and 2026 they went from lab demos (BLIP-2, MiniGPT-4) to
something I reach for in production: document extraction, GUI agents, robotics, everyday chat.
The pattern I see winning is native multimodal pretraining: one transformer ingests interleaved
image, video, and text tokens, with a vision encoder (or pixel patchifier) feeding the same decoder
that writes language.
My read
As of May 2026, frontier VLMs (GPT-5, Claude Opus 4.x, Gemini 2.5 Pro) are strong enough on static
benchmarks that I stop arguing about "can the model see?" and start arguing about cost, latency, and
failure modes. The interesting edge is long video, fine-grained spatial grounding,
GUI/computer-use agents, and vision-language-action (VLA) stacks for robotics.
My default split: frontier API when reasoning is the bottleneck; open weights
(Qwen3-VL, Llama 4 Vision, InternVL3, Pixtral, Molmo) when I need privacy, fine-tuning, or unit economics.
Architecture — what I optimize first
The connector between vision encoder and LLM is the highest-leverage design choice.
MLP projectors win on simplicity and quality; cross-attention (Flamingo, Llama 3.2 Vision) wins on
token economy; native pixel-into-decoder (Fuyu, Chameleon, Pixtral) wins on flexibility.
Training — what actually moves the needle
Data quality dominates model size. Synthetic instruction data
(LLaVA-NeXT, Cambrian, ShareGPT4V) and visual-RL/DPO are now table stakes.
Inference — where the bill shows up
Visual tokens dominate prefill cost. Resolution tiling, token pruning (FastV, VisionZip),
and prefix caching of system images are where real money is saved.
Production — what breaks before the model does
Most VLM incidents I have seen are not "the model is dumb"; they are bugs in
preprocessing or prompt formatting, or a missing OCR fallback.
I build an eval harness before I pick a model.
02 What Is a Vision-Language Model
Working definition
A Vision-Language Model (VLM) is a neural network that conditions language generation
on visual inputs (images, video frames, or screenshots) and optionally produces grounded
outputs (bounding boxes, masks, points, actions). Modern VLMs share weights between vision
and language reasoning rather than treating vision as a separate captioning step.
Capabilities I treat as baseline in 2026
Visual question answering — open-ended Q&A about an image.
Almost every open VLM in 2026 is built on top of an existing LLM backbone — typically a
Llama, Qwen, Mistral, Gemma, or DeepSeek variant. The choice of backbone constrains:
context length, tool-use formatting, system-prompt behavior, license, and quantization support.
A common mistake is assuming the VLM inherits the base LLM's full text quality; instruction-tuning
on multimodal data often regresses pure-text performance by 3–8 % on hard reasoning benchmarks.
3.3 The Connector (a.k.a. Projector, Adapter, Resampler)
The connector compresses or projects vision-encoder outputs into the LLM's embedding space.
This is where most engineering tradeoffs live; see Section 4.
04 Connector Patterns
Figure 2 — Four Connector Patterns at a Glance
4.1 When to pick which
| You care most about | Pick | Notes |
| --- | --- | --- |
| Best general quality on a fixed budget | MLP projector + dynamic-tiling encoder | The LLaVA-NeXT/Qwen2-VL/InternVL recipe. Default for 2026. |
| Many images or long video | Q-Former or cross-attention | Constant token budget per frame; Gemini-style architectures lean here. |
| Arbitrary resolution, simplicity | Encoder-free / pixel-in | Pixtral, Fuyu, Chameleon. Requires more pretraining compute. |
| Tight retrofit onto a strong text-LLM | Cross-attention | Lets you freeze the base LLM and add capability. Llama 3.2 Vision uses this. |
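To make the token-economy column concrete, here is a rough sketch of how many positions one image takes up in the decoder sequence under each pattern. The function name and the default of 64 queries are my own illustrative choices, and real models add pooling, separators, and tile markers that this ignores.

```python
def visual_tokens(connector: str, num_patches: int, num_queries: int = 64) -> int:
    """Decoder-sequence positions used by one image (simplified, illustrative)."""
    if connector in ("mlp", "pixel_in"):
        # One LLM token per patch embedding: cost grows with resolution.
        return num_patches
    if connector == "qformer":
        # A resampler emits a fixed number of learned queries per image.
        return num_queries
    if connector == "cross_attention":
        # Image features are read through cross-attention layers, so they
        # do not occupy positions in the decoder's own token sequence.
        return 0
    raise ValueError(f"unknown connector: {connector}")


if __name__ == "__main__":
    # 576 patches = a 336x336 image at patch size 14 (24 x 24).
    for name in ("mlp", "qformer", "cross_attention", "pixel_in"):
        print(name, visual_tokens(name, num_patches=576))
```

Cross-attention still pays compute for the image features; it just keeps them out of the context window, which is why it suits many-image inputs.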
05 Training Lifecycle
Figure 3 — Modern VLM Training Stages
5.1 Data sources that matter
Image-caption webscale — LAION-5B, COYO-700M, DataComp; the workhorses for alignment.
Tip
When fine-tuning a VLM on your domain, mix 10–20 % general-domain instruction data
back in. Pure-domain SFT reliably breaks instruction-following and OCR on out-of-domain images.
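A minimal sketch of that mixing step, assuming examples are held in plain Python lists. `mix_with_general` and its 15% default are hypothetical; the only real logic is solving for how many general examples give the target share of the final set.

```python
import random


def mix_with_general(domain, general, general_fraction=0.15, seed=0):
    """Return a shuffled SFT set where `general_fraction` of items are general-domain."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general.
    n_general = round(len(domain) * general_fraction / (1.0 - general_fraction))
    if n_general > len(general):
        raise ValueError("not enough general-domain examples to reach the target share")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible across runs
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    domain = [("domain", i) for i in range(850)]
    general = [("general", i) for i in range(5000)]
    mixed = mix_with_general(domain, general, general_fraction=0.15)
    share = sum(1 for tag, _ in mixed if tag == "general") / len(mixed)
    print(len(mixed), round(share, 3))
```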
5.2 LoRA, QLoRA, and parameter-efficient fine-tuning
For most production use cases, fine-tune only the connector + LoRA adapters on the LLM. Rule of thumb:
| Goal | What to unfreeze | Typical recipe |
| --- | --- | --- |
| Style / format only | LLM LoRA (r=16–32) | 1–2 epochs, 5k–50k examples |
| New visual domain (medical, satellite) | Connector full + LLM LoRA + last 2 ViT layers | 3 epochs, 100k+ examples, mix general data |
| New output modality (boxes, masks) | Connector + LLM LoRA + format tokens | Add ~5% synthetic grounding examples |
| Latency-critical domain model | Distill from frontier API → small open VLM | 2-stage: caption distillation, then task SFT |
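For a sense of why LoRA at r=16–32 is cheap: a rank-r adapter on a weight matrix of shape d_out × d_in adds two small matrices with r·(d_in + d_out) parameters in total. The sketch below just does that arithmetic; the 4096-wide layer is an example size I picked, not a property of any specific model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen d_out x d_in weight it sits on."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    for r in (16, 32):
        n = lora_params(4096, 4096, r)
        print(f"r={r}: {n:,} params, {lora_fraction(4096, 4096, r):.2%} of the full matrix")
```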
06 2026 Model Landscape
I bucket the market into frontier closed, frontier open, and specialist models. The table below is my shortlist when I am choosing what to pilot — May 2026 snapshot, re-check release notes before you commit.
6.1 Frontier closed (API-only)
| Model | Provider | Strengths | Watch-outs |
| --- | --- | --- | --- |
| GPT-5 / GPT-5 mini | OpenAI | Strongest general multimodal reasoning; native voice + vision; computer-use mode. | Token costs; image rate limits; latency on heavy reasoning. |
| Claude Opus 4.x / Sonnet 4.x | Anthropic | Best-in-class document, chart, and PDF understanding; computer-use API; long-context (1M+). | No native video at parity yet; image input billed in tokens. |
| Gemini 2.5 Pro / Flash | Google | Long video (1–2h with caching), 2M-token context, native audio & video. | Behavior varies across regions; thinking-budget knob requires care. |
| Grok 3 / 4 Vision | xAI | Strong on real-world image reasoning; fast inference. | Newer ecosystem; fewer integrations. |
6.2 Frontier open-weight (self-hosting viable)
| Model | Sizes | Strengths | License |
| --- | --- | --- | --- |
| Qwen3-VL / Qwen2.5-VL | 2B → 72B (dense & MoE) | Best open OCR & document; strong grounding; dynamic resolution. | Tongyi Qianwen (commercial OK with conditions) |
| Llama 4 Vision | Scout · Maverick · Behemoth class | Native MoE; long context; ecosystem support; strong English. | Llama Community |
| InternVL 3 / 2.5 | 1B → 78B | Best-in-class multilingual OCR; matches GPT-4o on MMMU at the 78B tier. | … |
| … | … | Task-specialist; best for fine-tuning to narrow vision tasks. | Gemma |
| NVLM / NVILA | 7B – 72B | NVIDIA recipes; strong out-of-box on document & chart. | NVIDIA OneWay (research-leaning) |
| Aria | ~25B MoE (3.9B active) | Long video & document MoE. | Apache 2.0 |
6.3 Specialists worth knowing
Document & OCR
GOT-OCR 2.0, Nougat, MinerU, SmolDocling, olmOCR — purpose-built for layout-aware extraction. Use as fallback or preprocessor when a generalist VLM under-performs.
GUI / Computer Use
UI-TARS, OS-Atlas, ShowUI, SeeClick, Claude Computer Use, GPT-5 Operator. These output click/keystroke actions from screenshots.
Vision-Language-Action (Robotics)
RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, Helix. Emit end-effector or motor commands.
Embedding / Retrieval
SigLIP 2, Jina CLIP v2, NV-Embed-MM, ColPali, VisRAG — image+text embeddings for multimodal RAG.
Edge / Mobile
MobileVLM v2, MiniCPM-V 2.6 / 4.0, SmolVLM 2, Phi-4 mini-multimodal. Sub-3B models that run on phones, Jetsons, and laptops.
Open Foundational
Chameleon, Emu3, Janus-Pro — early/native-token models that can also generate images. Useful for understanding-generation unification research.
07 Image Tokenization & Resolution Strategies
The single most expensive variable at inference time is the number of vision tokens per image. A 4K (3840×2160) screenshot cut into 14×14-pixel patches by a naive ViT becomes ~42k tokens (about 274 × 154 patches), exhausting context windows and budgets. Modern VLMs use one of three strategies:
Figure 4 — Resolution Strategies
7.1 Token-budget rules of thumb
For chat-style use, cap per-image tokens at 1500–2000. Quality plateaus above that for everything except small-text OCR.
For document Q&A, use per-page tiling and pass pages serially in a structured prompt — never one giant image.
For multi-image input, prefer uniform sizing across turns; mixed sizes break some caching layers.
Token-dropping (FastV, VisionZip, PruMerge, SparseVLM) reliably saves 40–70 % of vision tokens with <1 % quality loss on standard benchmarks; test on your data.
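To sanity-check a token budget before sending anything, I count patches and downscale until the count fits. This is a sketch under simple assumptions: one token per patch, no pooling or tile markers, and a hypothetical `fit_to_budget` helper that shrinks the image while keeping its aspect ratio. Real preprocessors differ per model.

```python
import math


def patch_tokens(width: int, height: int, patch: int = 14) -> int:
    """Tokens for a ViT that emits one token per patch x patch block (no pooling)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_budget(width: int, height: int, max_tokens: int, patch: int = 14):
    """Largest aspect-preserving size (found by shrinking) whose patch count fits max_tokens."""
    scale = min(1.0, math.sqrt(max_tokens / patch_tokens(width, height, patch)))
    while True:
        w = max(patch, int(width * scale))
        h = max(patch, int(height * scale))
        at_floor = w == patch and h == patch
        if patch_tokens(w, h, patch) <= max_tokens or at_floor:
            return w, h
        scale *= 0.98  # rounding up to whole patches can overshoot; shrink a little more


if __name__ == "__main__":
    print(patch_tokens(3840, 2160))  # full-resolution patch count
    w, h = fit_to_budget(3840, 2160, max_tokens=2000)
    print((w, h), patch_tokens(w, h))
```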
08 Beyond Single Images — Video, Documents, Charts, UI
8.1 Video
Three practical approaches:
Uniform frame sampling (1–2 fps) — simplest; works to ~5 minutes on most open VLMs.
Keyframe / scene-cut sampling — pre-process with a shot detector; pass 16–64 frames.
Native video tokens (Qwen3-VL, Gemini 2.5, GPT-5) — model expects a video tensor or URI and handles temporal compression internally.
For long video, combine retrieval (Video-RAG) with frame sampling — embed clips, retrieve relevant windows, then pass only those to the VLM. M-RoPE / temporal positional embeddings let modern VLMs maintain absolute-time awareness across long sequences.
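Uniform sampling with a frame cap is the part I end up rewriting most often, so here is a small sketch. The function name and defaults (1 fps target, 64-frame cap) are assumptions for illustration; it only picks frame indices and leaves decoding to whatever video library you use.

```python
def sample_frame_indices(duration_s: float, native_fps: float,
                         target_fps: float = 1.0, max_frames: int = 64):
    """Evenly spaced frame indices at roughly target_fps, capped at max_frames."""
    total = int(duration_s * native_fps)
    if total <= 0:
        return []
    wanted = max(1, int(duration_s * target_fps))
    n = min(wanted, max_frames, total)
    # Take the centre of each of n equal segments so samples cover the whole clip.
    return [int((i + 0.5) * total / n) for i in range(n)]


if __name__ == "__main__":
    print(sample_frame_indices(10, 30))            # short clip: one frame per second
    print(len(sample_frame_indices(600, 30, 2.0)))  # long clip: the cap takes over
```

Once the cap takes over, the effective sampling rate drops with clip length, which is the point where scene-cut sampling or retrieval starts to pay off.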
8.2 Documents and PDFs
The document pipeline I use when accuracy matters more than demo speed:
Render each page to image at 150–200 DPI (higher only for fine-print).
Tile aggressively for dense pages (Qwen3-VL or InternVL handle this natively).
Pair with layout tokens from a lightweight layout model (e.g., DocLayout-YOLO) for structured extraction.
Use a generalist VLM (Claude, GPT-5, Qwen3-VL-72B) for end-to-end Q&A and citation; reserve specialist OCR (GOT-OCR 2.0, MinerU) for archival extraction at scale.
Tables: render → VLM → JSON schema, validated with Pydantic.
Math/diagrams: MathVista, MathVerse evaluation; chain-of-thought prompting helps; for diagrammatic geometry, code-execution agents outperform pure VLMs.
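For the render step, the pixel size of a page follows directly from its physical size and the DPI, and that in turn fixes how many tiles a tiling encoder will produce. A sketch, assuming US Letter pages and a 448-pixel tile edge (the tile size is my assumption; check your model's preprocessor):

```python
import math


def page_pixels(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def tiles_needed(width_px: int, height_px: int, tile: int = 448) -> int:
    """Number of tile x tile crops needed to cover the page without downscaling."""
    return math.ceil(width_px / tile) * math.ceil(height_px / tile)


if __name__ == "__main__":
    for dpi in (150, 200):
        w, h = page_pixels(8.5, 11, dpi)  # US Letter
        print(dpi, (w, h), tiles_needed(w, h))
```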
8.4 GUI and screenshots
For agents that drive UIs, the screenshot resolution policy is critical: too small → can't read buttons, too large → cost explodes. Use a two-pass recipe — low-res for planning, high-res crops for clicking. Specialist GUI models (UI-TARS, OS-Atlas) ship with click-coordinate output formats; align your harness to that schema.
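The two-pass recipe mostly comes down to coordinate bookkeeping: map the coarse point from the low-res planning image to full resolution, crop around it, then translate the click found in the crop back to screen coordinates. A sketch with hypothetical helper names and a 512-pixel crop I chose for illustration:

```python
def scale_point(x, y, from_size, to_size):
    """Map a point from the low-res planning image to full-res pixels."""
    return round(x * to_size[0] / from_size[0]), round(y * to_size[1] / from_size[1])


def crop_box(cx, cy, full_w, full_h, crop=512):
    """(left, top, right, bottom) window centred on (cx, cy), clamped to the screenshot."""
    w, h = min(crop, full_w), min(crop, full_h)
    left = min(max(cx - w // 2, 0), full_w - w)
    top = min(max(cy - h // 2, 0), full_h - h)
    return left, top, left + w, top + h


def to_full_coords(x_in_crop, y_in_crop, box):
    """Translate a click predicted inside the crop back to full-screenshot pixels."""
    return box[0] + x_in_crop, box[1] + y_in_crop


if __name__ == "__main__":
    coarse = scale_point(100, 50, from_size=(960, 540), to_size=(3840, 2160))
    box = crop_box(*coarse, full_w=3840, full_h=2160)
    print(coarse, box, to_full_coords(10, 20, box))
```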
| … | … | RadixAttention for prefix caching across turns and images; great with tool use. |
| TensorRT-LLM | NVIDIA-optimized production | Best raw throughput on H100/H200/B100; longer integration tail per new architecture. |
| TGI (HF) | HuggingFace-native pipelines | Easy to deploy; slightly behind on multimodal feature parity. |
| llama.cpp / Ollama / MLX | Edge, laptop, on-prem dev | GGUF-quantized Qwen2-VL, MiniCPM-V, SmolVLM, LLaVA; viable on a MacBook. |
| Triton + custom | Hyperscale custom | Lean here only when single-team ownership justifies the operational cost. |
9.2 Prefix and image caching
Tip
If your application repeatedly sends the same system image (logo template, base UI),
serve it with a server that supports prefix caching (vLLM, SGLang, Anthropic prompt cache,
Gemini context cache, OpenAI prompt caching). It is typically the single biggest cost win,
often 10× on heavy-image workloads.
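A back-of-envelope way to see where a figure like 10× comes from: compare prefill tokens with and without a reusable prefix. The sketch idealizes caching as free after the first request, which is an assumption; hosted APIs generally bill cached reads at a discount rather than zero, and caches expire. The token counts in the example are made up.

```python
def prefill_tokens(n_requests: int, shared_prefix_tokens: int,
                   unique_tokens: int, cached: bool) -> int:
    """Total prefill tokens computed across n_requests (idealized cache model)."""
    if cached:
        # The shared prefix is computed once and reused by every later request.
        return shared_prefix_tokens + n_requests * unique_tokens
    return n_requests * (shared_prefix_tokens + unique_tokens)


if __name__ == "__main__":
    # Example: an 1800-token system image plus 200 unique tokens per request.
    cold = prefill_tokens(1000, 1800, 200, cached=False)
    warm = prefill_tokens(1000, 1800, 200, cached=True)
    print(cold, warm, round(cold / warm, 1))
```

The ratio is driven by how much of each request is shared prefix, so the win shrinks quickly when the per-request image varies.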
9.3 Batching strategy
VLM batching is dominated by prefill, not decode. Use continuous batching with chunked prefill so large vision-token prompts don't starve other requests.
For mixed text-only + image traffic, run two pools or use a router — image traffic will otherwise create head-of-line blocking.
Track P50/P95 vision tokens per request as a first-class metric.
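Tracking P50/P95 vision tokens and splitting traffic into pools both need very little code. This sketch uses a nearest-rank percentile and a hypothetical two-pool router keyed on whether a request carries any vision tokens; a real router would also consider queue depth.

```python
import math


def percentile(values, pct: float):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pick_pool(vision_tokens: int) -> str:
    """Send image-bearing requests to their own pool to avoid head-of-line blocking."""
    return "image" if vision_tokens > 0 else "text"


if __name__ == "__main__":
    per_request = [0, 0, 576, 1152, 2304]
    print("P50:", percentile(per_request, 50), "P95:", percentile(per_request, 95))
    print([pick_pool(t) for t in per_request])
```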
10 Quantization & Optimization
| Technique | Quality impact | When to use |
| --- | --- | --- |
| BF16 / FP16 | Baseline | Default for training and high-throughput H100 inference. |
| FP8 (E4M3 / E5M2) | ≤1 % drop on most benchmarks | Hopper/Blackwell production inference. TensorRT-LLM & vLLM both support it. |
| INT8 W8A8 | ~1–2 % drop | A100/L4/L40S deployments. |
| INT4 (AWQ, GPTQ, GGUF Q4) | 1–4 % drop; OCR/chart can regress more | Memory-constrained or edge. Test on your visual tasks before shipping. |
| FP4 / NVFP4 | ~1 % with calibration | Blackwell-only; emerging for VLMs. |
| Vision-token pruning | 0–2 % drop | FastV, VisionZip, SparseVLM; drop after layer 2–4. Free throughput. |
| Speculative decoding | None (lossless) | Draft-model based; great for long answers. |
Warn
Always benchmark quantized VLMs on your OCR / document / chart tasks. Quantization affects
fine-grained visual reasoning more than general VQA — a model that scores within 1 % on MMBench
can drop 10 % on a small-text invoice task.
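The memory side of the tradeoff is simple arithmetic: weight footprint is parameter count times bytes per parameter. The sketch below covers weights only and ignores KV cache, activations, and the small overhead of quantization scales, so treat the numbers as lower bounds.

```python
# Bytes per parameter for the weight formats in the table above.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1e9 bytes) for a model of the given size."""
    # 1e9 params * bytes-per-param / 1e9 bytes-per-GB cancels to a plain product.
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in ("bf16", "fp8", "int4"):
        print(f"72B @ {precision}: {weight_gb(72, precision):.0f} GB")
```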
11 Evaluation & Benchmarks
11.1 Public benchmarks worth tracking
| Benchmark | What it measures | Signal quality |
| --- | --- | --- |
| MMMU / MMMU-Pro | College-level multimodal reasoning across 30+ disciplines | High; top public score still discriminating in 2026 |
| MathVista, MathVerse | Math + diagrams | High; frontier-only |
| MMBench / MMBench-v1.1 | General VQA across abilities | Medium; saturated on frontier |
| MMStar | VQA with leak-resistant questions | High |
| DocVQA, ChartQA, InfographicVQA | Document & chart reading | High for vertical apps |
| OCRBench / OCRBench v2 | OCR accuracy across languages, fonts | High; pair with your own samples |
| RealWorldQA, BLINK, MMVP | Perceptual edge cases & pairs | High; cheap to run |
| RefCOCO/+/g, ReasonSeg | Visual grounding | High for agents |
| VideoMME, MVBench, LongVideoBench | Video understanding short→long | High; gold standard for video VLMs |
| POPE, HallusionBench, MMHal | Hallucination rate | Critical for production decisions |
| ScreenSpot, ScreenSpot-Pro, OSWorld | GUI grounding & computer use | High for agent products |
| VLMEvalKit / lmms-eval | Harness running 50+ benchmarks reproducibly | Use this; do not roll your own |
11.2 Build your own eval set
Public benchmarks tell you whether a model is generally capable. They do not tell you whether it will work on your images. Every team shipping VLMs should build:
200–500 golden examples drawn from real production traffic (with consent & redaction).
Pairwise human or LLM-judge scoring with explicit rubrics (grounding, hallucination, format).
Tracked over time in a dashboard with diff-vs-baseline.
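The diff-vs-baseline step can start as a single function that flags any rubric metric that dropped by more than a tolerance. The metric names and the 0.01 tolerance below are placeholders I made up; the scores are assumed to be fractions in [0, 1] where higher is better.

```python
def diff_vs_baseline(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Metrics where the candidate is missing or fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base in baseline.items():
        cand = candidate.get(name)
        if cand is None or base - cand > tolerance:
            regressions[name] = (base, cand)
    return regressions


if __name__ == "__main__":
    baseline = {"grounding": 0.90, "format": 0.99, "hallucination_free": 0.95}
    candidate = {"grounding": 0.91, "format": 0.95, "hallucination_free": 0.945}
    print(diff_vs_baseline(baseline, candidate))
```

I would run this per rubric dimension rather than on one blended score, since a format regression can hide behind a grounding gain.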
Rule
Eval first, model second. A team without an eval harness picks the wrong model 60 % of the time and never finds out.
12 Hallucinations & Safety
12.1 Common failure modes
Object hallucination — describing objects that aren't there (especially in low-confidence regions or after long context).
Color/count/spatial errors — left/right confusion, miscount of similar objects, color drift.
Hyper-confident OCR — fabricating plausible numbers when the image is unreadable.
Prompt-induced confirmation bias — leading questions ("Is the cat on the table?") nudge the model to confirm.
Cross-image leakage — info from earlier turns bleeds into the current answer.
12.2 Mitigations
Use neutral framing in prompts; avoid presupposing answers.
Ask for citations to image regions (bounding boxes or "according to the top-right panel").
Run a self-consistency check: one sample at temperature 0 plus one high-temperature sample; flag any disagreement for human review.
Cascade: VLM → verifier. A second model (or a rule) checks the answer's compatibility with the image.
Use models trained with visual RL/DPO for hallucination (POVID, RLHF-V, VLFeedback).
For numbers, force a show-your-work output (extract the cell, then compute).
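The self-consistency mitigation is easy to wire around any client. In this sketch `generate` is a hypothetical callable you supply (it wraps whichever API or local model you use and accepts a `temperature` keyword); the comparison is a plain normalized string match, which suits short structured answers but not free-form prose.

```python
def needs_review(generate, prompt: str, high_temperature: float = 1.0):
    """Return (flag, greedy_answer); flag is True when the two samples disagree."""
    greedy = generate(prompt, temperature=0.0)
    sampled = generate(prompt, temperature=high_temperature)

    def normalise(text: str) -> str:
        # Ignore case and whitespace differences so only real disagreements count.
        return " ".join(text.lower().split())

    return normalise(greedy) != normalise(sampled), greedy


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the control flow.
    def fake_generate(prompt, temperature):
        return "Total: 42" if temperature == 0.0 else "Total: 47"

    print(needs_review(fake_generate, "What is the invoice total?"))
```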
12.3 Safety surface specific to VLMs
Image-as-jailbreak — text rendered inside an image bypasses some text safety filters. Defense: run prompt-injection detection on extracted image text.
PII leakage — VLMs willingly transcribe IDs, license plates, faces. Filter or redact before display.
Bias in perception — counts, ages, attributes can be systematically biased; audit per-cohort.
CSAM / illegal content — gateway your image input with hash + classifier checks; this is non-negotiable.
13 Agents, Computer Use & Vision-Language-Action
Figure 6 — Agentic VLM Loop
13.1 Computer Use
Claude Computer Use, GPT-5 Operator, and open-weights stacks like UI-TARS, OS-Atlas, and ShowUI let a VLM drive a desktop or browser. What I have learned in practice:
Click accuracy gates success — use a specialist grounding model if your generalist's coordinates jitter.
Add a visual verifier that confirms the post-action screenshot matches intent before continuing.
Cap step counts and add a circuit breaker — runaway loops are the #1 incident class.
Sandbox by default. The model will follow instructions written in webpages and emails.
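The step cap and circuit breaker can live in the outer loop, independent of the model. A sketch, assuming a hypothetical `step()` callable that runs one observe-plan-act cycle and returns `(action, done)`; the breaker here trips on the same action repeated several times in a row, which is only one of several possible loop signals.

```python
def run_agent(step, max_steps: int = 20, max_repeats: int = 3):
    """Run step() until done, the step cap, or `max_repeats` identical actions in a row."""
    last_action, repeats = None, 0
    for i in range(1, max_steps + 1):
        action, done = step()
        if done:
            return "done", i
        # Count consecutive identical actions; a stuck agent tends to repeat itself.
        repeats = repeats + 1 if action == last_action else 1
        last_action = action
        if repeats >= max_repeats:
            return "circuit_breaker", i
    return "step_cap", max_steps


if __name__ == "__main__":
    # A stuck agent that keeps clicking the same spot trips the breaker.
    print(run_agent(lambda: ("click 10,10", False)))
```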
13.2 Vision-Language-Action (VLA) for robotics
VLA models output low-level robot commands instead of text. RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, and Figure's Helix dominate 2026 demos. The architectural twist: action tokens are interleaved with language tokens in the same decoder. When I wire up a VLA stack, I plan for:
Discrete action vocabularies (binned end-effector deltas) — easy to log and replay.
50–200 Hz action loops; the VLM runs at a lower rate and hands targets to a fast low-level controller.
Safety bounds enforced outside the model (workspace limits, force ceilings).
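Binned action vocabularies and external safety bounds are both a few lines each. This sketch assumes 256 uniform bins per dimension and axis-aligned workspace limits; both are illustrative choices on my part, and real stacks also bound velocity and force.

```python
def to_bin(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous value in [low, high] to a discrete action token index."""
    value = min(max(value, low), high)
    return min(n_bins - 1, int((value - low) / (high - low) * n_bins))


def from_bin(index: int, low: float, high: float, n_bins: int = 256) -> float:
    """Decode a bin index back to the centre of its interval."""
    return low + (index + 0.5) * (high - low) / n_bins


def clamp_to_workspace(xyz, lower, upper):
    """Enforce workspace limits outside the model, whatever it predicted."""
    return tuple(min(max(v, lo), hi) for v, lo, hi in zip(xyz, lower, upper))


if __name__ == "__main__":
    token = to_bin(0.3, -1.0, 1.0)
    print(token, from_bin(token, -1.0, 1.0))
    print(clamp_to_workspace((0.9, -0.2, 1.5), (0.0, 0.0, 0.0), (0.8, 0.8, 1.0)))
```

The round trip loses at most half a bin width, which is what makes the logged tokens replayable to within a known error.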
14 Shipping VLMs in Production
14.1 Pattern library
Visual Q&A (simple): Single API call. Structured output schema. Prompt cache the system image if reusable.
Visual RAG (medium): CLIP/SigLIP/ColPali embeddings → retrieve top-k images → VLM answers with citations. Best for knowledge bases that are images (slides, catalogs, manuals).
Doc Extraction Pipeline (medium): Layout model → VLM per region → JSON schema → validator → human-in-loop for low confidence. Always pair with a deterministic format check.
Vision Agent (hard): Loop: observe → plan → act → verify. Specialist GUI model + generalist planner is the strongest combo today.
Edge VLM (simple to medium): MiniCPM-V 4.0 / SmolVLM 2 / Phi-4 mini on device for latency, privacy, or offline. Bigger model in cloud for fallback.
Distillation Pipeline (hard): Frontier API labels → small open VLM SFT → in-house deployment. ROI shows up at >1M monthly requests.
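The retrieval step of the Visual RAG pattern reduces to nearest-neighbour search over embeddings. The sketch below uses brute-force cosine similarity over toy 2-D vectors that I invented; in practice the vectors come from an embedding model and the search from a vector index.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, index: dict, k: int = 3):
    """Names of the k indexed images most similar to the query embedding."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    index = {"slide_1": [1.0, 0.0], "slide_2": [0.0, 1.0], "slide_3": [0.7, 0.7]}
    print(top_k([1.0, 0.1], index, k=2))
```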
14.2 Prompting patterns I use in production
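Fix the position of the task relative to the image and keep it consistent. Many models do better when the question precedes the image, but this is model-dependent (Anthropic's vision docs, for example, recommend image-then-text for Claude), so test both orders on your eval set; whichever you pick, consistency reduces cache misses.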
Be explicit about output format. Ask for JSON or XML and validate. Free-form prose invites hallucination.
Force grounding with phrases like "Quote the exact text from the image" or "Give the bounding box [x1, y1, x2, y2] for each item."
Restrict the answer space. "Choose from A/B/C/D/Unknown" reduces wandering and produces measurable accuracy.
Allow refusal. Include "If you cannot see X, reply 'NOT_VISIBLE'." This dramatically reduces fabrication.
// Production prompt skeleton
SYSTEM: You are a document understanding assistant.
Read the image carefully. If a value is illegible,
return "ILLEGIBLE" — never guess.
USER: [IMAGE]
Extract the following fields as JSON:
- invoice_number (string)
- total_amount (number, no currency symbol)
- line_items (array of {description, quantity, unit_price})
For each value, also provide bbox: [x1,y1,x2,y2] in 0–1000 coords.
Output ONLY valid JSON. No prose, no markdown.
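A deterministic check on the model's reply pairs naturally with the skeleton above. This is a stdlib-only sketch (a Pydantic model would do the same job); it assumes each scalar field comes back as an object with `value` and `bbox` keys, which is my reading of the prompt and not something the prompt guarantees. It does not inspect individual line items.

```python
import json

SCALAR_FIELDS = {"invoice_number": str, "total_amount": (int, float)}


def _bbox_ok(bbox) -> bool:
    """True for [x1, y1, x2, y2] with numbers in 0..1000 and x1 <= x2, y1 <= y2."""
    return (isinstance(bbox, list) and len(bbox) == 4
            and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                    and 0 <= v <= 1000 for v in bbox)
            and bbox[0] <= bbox[2] and bbox[1] <= bbox[3])


def validate_invoice(raw: str):
    """Return a list of problems; an empty list means the reply passed every check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    problems = []
    for name, types in SCALAR_FIELDS.items():
        field = data.get(name)
        if not isinstance(field, dict):
            problems.append(f"{name}: missing or not an object")
            continue
        value = field.get("value")
        # "ILLEGIBLE" is the sanctioned refusal value from the system prompt.
        if value != "ILLEGIBLE" and (isinstance(value, bool) or not isinstance(value, types)):
            problems.append(f"{name}: wrong type")
        if not _bbox_ok(field.get("bbox")):
            problems.append(f"{name}: bad bbox")
    if not isinstance(data.get("line_items"), list):
        problems.append("line_items: missing or not an array")
    return problems


if __name__ == "__main__":
    print(validate_invoice("Sure, here is the JSON you asked for"))
```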
Anthropic, Claude 3.5/4 with vision, Computer Use API.
OpenAI, GPT-4V / GPT-4o / GPT-5 system cards.
Google DeepMind, Gemini 1.5/2.0/2.5 technical reports.
Brohan et al., RT-2 (2023); Kim et al., OpenVLA (2024); Physical Intelligence π0/π0.5 (2024–25) — VLA models.
Chen et al., FastV (2024); Yang et al., VisionZip (2024) — visual-token pruning.
VLMEvalKit, lmms-eval — community evaluation harnesses.
How I maintain this
I refresh these notes after major release cycles — roughly quarterly. Public leaderboards lag what I see in
production by a few months, so I keep a private eval set and re-run it when a provider ships a new vision model.