Personal notes · May 2026

Vision-Language Models (VLM)

My working notes on multimodal systems — how VLMs are built, where they break in production, and how I choose between frontier APIs and open weights when latency, cost, or privacy matter. I wrote this for myself while shipping document AI and agent-style workloads; treat the model names and benchmarks as snapshots, not gospel.

The question I keep returning to: where does the model actually fail on my images — preprocessing, token budget, or reasoning — and what would I measure before betting a product on it?
Covers: Architecture · Training · Inference · Eval · Production
Written: May 2026
By: Linh Truong

01 Where Things Stand (May 2026)

Vision-Language Models (VLMs) — I still hear Multimodal Large Language Models (MLLMs) in papers — jointly read pixels and text. Between 2023 and 2026 they went from lab demos (BLIP-2, MiniGPT-4) to something I reach for in production: document extraction, GUI agents, robotics, everyday chat. The pattern I see winning is native multimodal pretraining: one transformer ingests interleaved image, video, and text tokens, with a vision encoder (or pixel patchifier) feeding the same decoder that writes language.

My read: As of May 2026, frontier VLMs (GPT-5, Claude Opus 4.x, Gemini 2.5 Pro) are strong enough on static benchmarks that I stop arguing about "can the model see?" and start arguing about cost, latency, and failure modes. The interesting edge is long video, fine-grained spatial grounding, GUI/computer-use agents, and vision-language-action (VLA) stacks for robotics. My default split: frontier API when reasoning is the bottleneck; open weights (Qwen3-VL, Llama 4 Vision, InternVL3, Pixtral, Molmo) when I need privacy, fine-tuning, or unit economics.

Architecture — what I optimize first

The connector between vision encoder and LLM is the highest-leverage design choice. MLP projectors win on simplicity and quality; cross-attention (Flamingo, Llama 3.2 Vision) wins on token economy; native pixel-into-decoder (Fuyu, Chameleon) wins on flexibility.

Training — what actually moves the needle

Data quality dominates model size. Synthetic instruction data (LLaVA-NeXT, Cambrian, ShareGPT4V) and visual-RL/DPO are now table stakes.

Inference — where the bill shows up

Visual tokens dominate prefill cost. Resolution tiling, token pruning (FastV, VisionZip), and prefix caching of system images are where real money is saved.

Production — what breaks before the model does

Most VLM incidents I have seen are not "the model is dumb". They are bugs in preprocessing or prompt formatting, or a missing OCR fallback. I build an eval harness before I pick a model.

02 What Is a Vision-Language Model

Working definition: A Vision-Language Model (VLM) is a neural network that conditions language generation on visual inputs (images, video frames, or screenshots) and optionally produces grounded outputs (bounding boxes, masks, points, actions). Modern VLMs share weights between vision and language reasoning rather than treating vision as a separate captioning step.

Capabilities I treat as baseline in 2026

Capabilities Still Fragile

03 Anatomy & Architecture

At its core, every modern VLM is a four-stage pipeline:

Figure 1 — Canonical VLM Pipeline
Image + Text Prompt → (pixels) → Vision Encoder → (v-tokens) → Connector → (aligned tokens) → LLM Decoder → Output
Vision Encoder: ViT-L / SigLIP / InternViT; patch → embeddings (or pure pixel patchifier)
Connector: MLP / Q-Former / Cross-Attn / Resampler; aligns to LLM space
LLM Decoder: Llama / Qwen / Mistral / Gemma / proprietary; treats vision tokens like text tokens; causal attention + KV cache
Output: text · boxes · actions · code
Data flow: pixels → patches → visual embeddings → projected to LLM space → interleaved with text → autoregressive generation

3.1 The Vision Encoder

The encoder maps an image into a sequence of embeddings, one per spatial patch. Four families dominate:

Family | Examples | Why it wins | Where it loses
CLIP-style | OpenAI CLIP, EVA-CLIP, OpenCLIP | Strong zero-shot alignment from contrastive pretraining at scale. | Lower-resolution defaults; weaker on dense OCR.
SigLIP / SigLIP 2 | Google SigLIP-SO400M | Sigmoid loss → better stability, smaller batches, strong fine-grained features. | Slightly worse on retrieval-style tasks vs CLIP at parity.
Native ViT-only | InternViT-6B, AIM, DINOv2, DINOv3 | Self-supervised features generalize; great for grounding & segmentation. | Requires its own alignment step into the LLM.
Encoder-free / Pixel-in | Fuyu-8B, Chameleon | Arbitrary resolution, simpler stack, true native multimodal. | Higher token cost; needs more pretraining data.

3.2 The Language Model

Almost every open VLM in 2026 is built on top of an existing LLM backbone — typically a Llama, Qwen, Mistral, Gemma, or DeepSeek variant. The choice of backbone constrains: context length, tool-use formatting, system-prompt behavior, license, and quantization support. A common mistake is assuming the VLM inherits the base LLM's full text quality; instruction-tuning on multimodal data often regresses pure-text performance by 3–8 % on hard reasoning benchmarks.

3.3 The Connector (a.k.a. Projector, Adapter, Resampler)

The connector compresses or projects vision-encoder outputs into the LLM's embedding space. This is where most engineering tradeoffs live; see Section 4.

04 Connector Patterns

Figure 2 — Four Connector Patterns at a Glance
1. Linear / MLP Projection: ViT → MLP → LLM. LLaVA-1.5/NeXT, Cambrian, most open VLMs.
2. Q-Former / Resampler: ViT → N learned queries → LLM. BLIP-2, MiniGPT-4, Qwen-VL, InstructBLIP; fixed budget.
3. Cross-Attention Layers: ViT → LLM with gated x-attn into vision keys/values. Flamingo, IDEFICS, Llama 3.2 Vision (11B/90B).
4. Encoder-Free / Native: Patchify + Linear → decoder treats patches as tokens directly. Fuyu-8B, Chameleon, frontier models.
Tradeoff dimensions
Token cost: MLP > Pixel-in > X-attn ≈ Q-Former (lowest fixed budget)
Quality ceiling: MLP and Pixel-in scale best with data; Q-Former saturates earlier
Training simplicity: MLP is two-stage and easy; X-attn requires interleaved gating; Pixel-in needs largest pretrain budget
Long-context video: Q-Former and X-attn shine because they cap per-frame token count
Grounding / OCR: MLP + high-res tiling currently leads on documents and small text

4.1 When to pick which

You care most about | Pick | Notes
Best general quality on a fixed budget | MLP projector + dynamic-tiling encoder | The LLaVA-NeXT/Qwen2-VL/InternVL recipe. Default for 2026.
Many images or long video | Q-Former or cross-attention | Constant token budget per frame; my guess is that Gemini-style architectures lean here.
Arbitrary resolution, simplicity | Encoder-free / pixel-in | Fuyu, Chameleon. Requires more pretraining compute.
Tight retrofit onto a strong text-LLM | Cross-attention | Lets you freeze the base LLM and add capability. Llama 3.2 Vision uses this.

05 Training Lifecycle

Figure 3 — Modern VLM Training Stages
Stage 1: Vision-language alignment (freeze ViT & LLM, train connector)
Stage 2: Multimodal pretraining (interleaved web data, unfreeze most weights)
Stage 3: Visual instruction tuning, SFT (curated VQA, OCR, grounding, reasoning)
Stage 4: Preference / RL alignment (DPO, RLHF, RLAIF, hallucination-RL)
Stage 5: Domain SFT / LoRA (customer fine-tune: docs, charts, GUI)
Frontier "native multimodal" models collapse stages 1–2 into a single from-scratch pretrain on interleaved tokens.

5.1 Data sources that matter

Tip: When fine-tuning a VLM on your domain, mix 10–20 % general-domain instruction data back in. Pure-domain SFT reliably breaks instruction-following and OCR on out-of-domain images.
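
A minimal sketch of that mixing step, assuming both datasets are already loaded as Python lists. The function name `mix_with_general` and its parameters are hypothetical, not part of any training framework; the 15 % default simply sits in the middle of the 10–20 % range above.

```python
import random

def mix_with_general(domain, general, general_fraction=0.15, seed=0):
    """Return a shuffled list in which roughly `general_fraction` of the
    examples come from `general` and the rest from `domain`."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (len(domain) + n_general) = general_fraction.
    n_general = round(len(domain) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than we have
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed
```

The fraction is measured against the final mixed set, so 850 domain examples at 15 % pull in 150 general ones for a total of 1,000.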

5.2 LoRA, QLoRA, and parameter-efficient fine-tuning

For most production use cases, fine-tune only the connector + LoRA adapters on the LLM. Rule of thumb:

Goal | What to unfreeze | Typical recipe
Style / format only | LLM LoRA (r=16–32) | 1–2 epochs, 5k–50k examples
New visual domain (medical, satellite) | Connector full + LLM LoRA + last 2 ViT layers | 3 epochs, 100k+ examples, mix general data
New output modality (boxes, masks) | Connector + LLM LoRA + format tokens | Add ~5% synthetic grounding examples
Latency-critical domain model | Distill from frontier API → small open VLM | 2-stage: caption distillation, then task SFT
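
To get a feel for what r=16–32 means in trainable parameters, here is a rough sketch of the LoRA arithmetic. It assumes a hypothetical 7B-class decoder (hidden size 4096, 32 layers) with four square hidden × hidden projections adapted per layer; real models with grouped-query attention have smaller K/V projections, so treat the totals as an upper-bound estimate.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer:
    matrix A is (rank x d_in), matrix B is (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_total(hidden, n_layers, rank, targets_per_layer=4):
    """Total adapter parameters when `targets_per_layer` square
    hidden x hidden projections are adapted in every layer (an assumption)."""
    return n_layers * targets_per_layer * lora_params(hidden, hidden, rank)

if __name__ == "__main__":
    # Hypothetical 7B-class decoder: hidden size 4096, 32 layers.
    for r in (16, 32):
        print(f"r={r}: {lora_total(4096, 32, r) / 1e6:.1f}M adapter params")
```

Adapter size scales linearly with rank, so doubling r doubles the trainable parameters while the frozen base stays untouched.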

06 2026 Model Landscape

I bucket the market into frontier closed, frontier open, and specialist models. The table below is my shortlist when I am choosing what to pilot — May 2026 snapshot, re-check release notes before you commit.

6.1 Frontier closed (API-only)

Model | Provider | Strengths | Watch-outs
GPT-5 / GPT-5 mini | OpenAI | Strongest general multimodal reasoning; native voice + vision; computer-use mode. | Token costs; image rate limits; latency on heavy reasoning.
Claude Opus 4.x / Sonnet 4.x | Anthropic | Best-in-class document, chart, and PDF understanding; computer-use API; long-context (1M+). | No native video at parity yet; image input billed in tokens.
Gemini 2.5 Pro / Flash | Google | Long video (1–2h with caching), 2M-token context, native audio & video. | Behavior varies across regions; thinking-budget knob requires care.
Grok 3 / 4 Vision | xAI | Strong on real-world image reasoning; fast inference. | Newer ecosystem; fewer integrations.

6.2 Frontier open-weight (self-hosting viable)

Model | Sizes | Strengths | License
Qwen3-VL / Qwen2.5-VL | 2B → 72B (dense & MoE) | Best open OCR & document; strong grounding; dynamic resolution. | Tongyi Qianwen (commercial OK with conditions)
Llama 4 Vision | Scout · Maverick · Behemoth class | Native MoE; long context; ecosystem support; strong English. | Llama Community
InternVL 3 / 2.5 | 1B → 78B | Best-in-class multilingual OCR; matches GPT-4o on MMMU at the 78B tier. | MIT
Pixtral 12B / Large | 12B, 124B | Vision encoder trained from scratch for native arbitrary resolution; clean format. | Apache 2.0 (12B)
Molmo | 1B / 7B / 72B | Open data + open weights; strong pointing/grounding via PixMo. | Apache 2.0
DeepSeek-VL2 | 3B / 16B / 27B MoE | Efficient MoE; strong reasoning per active param. | DeepSeek License
Phi-4-Multimodal | ~5–6B | Small-but-mighty; audio + vision + text; edge-friendly. | MIT
PaliGemma 2 | 3B / 10B / 28B | Task-specialist; best for fine-tuning to narrow vision tasks. | Gemma
NVLM / NVILA | 7B – 72B | NVIDIA recipes; strong out-of-box on document & chart. | NVIDIA OneWay (research-leaning)
Aria | ~25B MoE (3.9B active) | Long video & document MoE. | Apache 2.0

6.3 Specialists worth knowing

Document & OCR

GOT-OCR 2.0, Nougat, MinerU, SmolDocling, olmOCR — purpose-built for layout-aware extraction. Use as fallback or preprocessor when a generalist VLM under-performs.

GUI / Computer Use

UI-TARS, OS-Atlas, ShowUI, SeeClick, Claude Computer Use, GPT-5 Operator. These output click/keystroke actions from screenshots.

Vision-Language-Action (Robotics)

RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, Helix. Emit end-effector or motor commands.

Embedding / Retrieval

SigLIP 2, Jina CLIP v2, NV-Embed-MM, ColPali, VisRAG — image+text embeddings for multimodal RAG.

Edge / Mobile

MobileVLM v2, MiniCPM-V 2.6 / 4.0, SmolVLM 2, Phi-4 mini-multimodal. Small models (roughly 1B to 8B) that run on phones, Jetsons, and laptops.

Open Foundational

Chameleon, Emu3, Janus-Pro — early/native-token models that can also generate images. Useful for understanding-generation unification research.

07 Image Tokenization & Resolution Strategies

The single most expensive variable at inference time is the number of vision tokens per image. A 4K screenshot (3840 × 2160) fed to a naive ViT with 14 × 14-pixel patches becomes ~42k tokens (274 × 154 patches), exhausting context windows and budgets. Modern VLMs use one of three strategies:

Figure 4 — Resolution Strategies
A. Fixed Low-Res: 336 × 336 or 448 × 448, ~576 vision tokens. Cheap, fast, but loses text and small details. LLaVA-1.5, MiniGPT-4.
B. AnyRes / Tiling: split into N × M tiles + 1 global thumbnail. LLaVA-NeXT, InternVL, GPT-4o, Claude.
C. Native Dynamic Res: aspect ratio preserved, arbitrary H × W; 2D RoPE / M-RoPE over patch grid. Qwen2-VL, Qwen3-VL, Pixtral, Fuyu.
Engineer cheat-sheet: typical vision-token counts per image
Phone snap (1024²) → 256–1024 tokens after compression
PDF page (A4 200dpi) → 1.5k–4k tokens depending on tiling
4K desktop screenshot → 4k–16k tokens; cap with a downscale-to-2K policy unless OCR-critical
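
A back-of-envelope estimator helps me sanity-check token budgets before touching a model. This is a sketch under stated assumptions: a 14-pixel patch, an optional spatial merge factor (merge=2 folds each 2 × 2 patch group into one token, as some native-resolution models do), and an optional downscale cap. The function and parameter names are mine; real tokenizers add tile separators and thumbnails that this ignores.

```python
import math

def vision_tokens(width, height, patch=14, merge=1, max_side=None):
    """Estimate vision tokens for an image encoded at native resolution.

    patch    -- patch edge in pixels (14 is an assumed ViT default)
    merge    -- spatial merge factor; merge=2 folds each 2x2 patch group into one token
    max_side -- optional cap on the longer side, aspect ratio preserved
    """
    if max_side is not None and max(width, height) > max_side:
        scale = max_side / max(width, height)
        width, height = round(width * scale), round(height * scale)
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return math.ceil(cols / merge) * math.ceil(rows / merge)

if __name__ == "__main__":
    print(vision_tokens(336, 336))                              # fixed low-res input
    print(vision_tokens(3840, 2160))                            # raw 4K screenshot
    print(vision_tokens(3840, 2160, merge=2))                   # with 2x2 token merging
    print(vision_tokens(3840, 2160, merge=2, max_side=2048))    # downscale-to-2K policy
```

Running the four cases side by side shows how much of the budget comes from resolution policy rather than from the model itself.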

7.1 Token-budget rules of thumb

08 Beyond Single Images: Video, Documents, Charts, UI

8.1 Video

Three practical approaches:

  1. Uniform frame sampling (1–2 fps) — simplest; works to ~5 minutes on most open VLMs.
  2. Keyframe / scene-cut sampling — pre-process with a shot detector; pass 16–64 frames.
  3. Native video tokens (Qwen3-VL, Gemini 2.5, GPT-5) — model expects a video tensor or URI and handles temporal compression internally.
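
For the first two approaches, the frame count drives the token bill directly. The sketch below computes uniform sample timestamps and the resulting vision-token total; `tokens_per_frame` is an assumption you would measure for your own model and resolution, and both function names are hypothetical.

```python
def sample_frame_times(duration_s, fps=1.0, max_frames=None):
    """Timestamps (seconds) for uniform sampling at `fps`; if that would
    exceed `max_frames`, spread exactly `max_frames` evenly instead."""
    if duration_s <= 0 or fps <= 0:
        return []
    n = max(1, int(duration_s * fps))
    if max_frames is not None and n > max_frames:
        n = max_frames
    step = duration_s / n
    # Sample the middle of each interval so the first and last frames
    # are not pinned to fade-ins or end cards.
    return [round((i + 0.5) * step, 3) for i in range(n)]

def video_token_budget(duration_s, tokens_per_frame, fps=1.0, max_frames=None):
    """Total vision tokens the sampled frames would add to the prompt."""
    return len(sample_frame_times(duration_s, fps, max_frames)) * tokens_per_frame
```

A five-minute clip at 1 fps with 256 tokens per frame already costs 76,800 vision tokens, which is why a `max_frames` cap or keyframe selection matters well before the hour mark.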

For long video, combine retrieval (Video-RAG) with frame sampling — embed clips, retrieve relevant windows, then pass only those to the VLM. M-RoPE / temporal positional embeddings let modern VLMs maintain absolute-time awareness across long sequences.

8.2 Documents and PDFs

The document pipeline I use when accuracy matters more than demo speed:

  1. Render each page to image at 150–200 DPI (higher only for fine-print).
  2. Tile aggressively for dense pages (Qwen3-VL or InternVL handle this natively).
  3. Pair with layout tokens from a lightweight layout model (e.g., DocLayout-YOLO) for structured extraction.
  4. Use a generalist VLM (Claude, GPT-5, Qwen3-VL-72B) for end-to-end Q&A and citation; reserve specialist OCR (GOT-OCR 2.0, MinerU) for archival extraction at scale.

8.3 Charts, tables, math

8.4 GUI and screenshots

For agents that drive UIs, the screenshot resolution policy is critical: too small → can't read buttons, too large → cost explodes. Use a two-pass recipe — low-res for planning, high-res crops for clicking. Specialist GUI models (UI-TARS, OS-Atlas) ship with click-coordinate output formats; align your harness to that schema.
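
The two-pass recipe needs some coordinate bookkeeping: a coarse target found on the downscaled screenshot has to be scaled up, cropped around, and the click predicted inside the crop mapped back to screen pixels. A minimal sketch, with hypothetical helper names and pixel coordinates assumed throughout (specialist GUI models may use their own normalized schema):

```python
def lowres_to_screen(x, y, scale):
    """Scale a point found on the downscaled planning screenshot
    (downscaled by `scale`, e.g. 0.25) back to full-resolution pixels."""
    return round(x / scale), round(y / scale)

def crop_box(center_x, center_y, crop_w, crop_h, screen_w, screen_h):
    """Crop rectangle (left, top, right, bottom) around a coarse target,
    clamped so it stays inside the screen."""
    crop_w, crop_h = min(crop_w, screen_w), min(crop_h, screen_h)
    left = min(max(center_x - crop_w // 2, 0), screen_w - crop_w)
    top = min(max(center_y - crop_h // 2, 0), screen_h - crop_h)
    return left, top, left + crop_w, top + crop_h

def to_screen_coords(x_in_crop, y_in_crop, box):
    """Map a click predicted inside the crop back to full-screen pixels."""
    left, top, right, bottom = box
    if not (0 <= x_in_crop < right - left and 0 <= y_in_crop < bottom - top):
        raise ValueError("click lies outside the crop")
    return left + x_in_crop, top + y_in_crop
```

Clamping the crop to the screen keeps its size constant near edges, so the high-res pass always sees the same input dimensions.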

09 Inference & Serving

Figure 5 — Inference Path & Optimization Surface
Image Preproc (resize · tile · normalize; CPU / GPU kernel) → ViT Forward (batched patch encode; FlashAttention, fp16/bf16) → Connector / Project (to LLM embed space; usually negligible cost) → LLM Prefill (vision + text tokens; dominant cost step) → LLM Decode (autoregressive; memory-bound)
Where the time and money go, and what to tune:
Preproc: do it on GPU (torchvision.transforms.v2, nvImageCodec). On large images CPU JPEG decode is a real bottleneck.
ViT: bf16/fp8, FlashAttention-3, CUDA Graphs. For batches, prefer many small images over one giant tiled one.
Prefill: this is where vision tokens hurt. Use prefix caching, token pruning, chunked prefill (vLLM, SGLang).
Decode: speculative decoding, MQA/GQA, INT4/FP8 weights, PagedAttention. Same playbook as text-only LLMs.
Routing: cascade. A small VLM screens; the large VLM runs only when the small one is uncertain or asked for grounding.

9.1 Recommended serving stacks (2026)

Stack | Best for | Notes
vLLM (≥0.7) | General OSS VLM serving | First-class multimodal: Qwen2/3-VL, Llama 3.2/4 Vision, Pixtral, InternVL, MiniCPM, Phi-4. Prefix caching for images.
SGLang | Complex multi-turn / structured-output | RadixAttention for prefix caching across turns and images; great with tool use.
TensorRT-LLM | NVIDIA-optimized production | Best raw throughput on H100/H200/B100; longer integration tail per new architecture.
TGI (HF) | HuggingFace-native pipelines | Easy to deploy; slightly behind on multimodal feature parity.
llama.cpp / Ollama / MLX | Edge, laptop, on-prem dev | GGUF-quantized Qwen2-VL, MiniCPM-V, SmolVLM, LLaVA; viable on a MacBook.
Triton + custom | Hyperscale custom | Lean here only when single-team ownership justifies the operational cost.

9.2 Prefix and image caching

Tip: If your application repeatedly sends the same system image (logo template, base UI), serve it with a server that supports prefix caching (vLLM, SGLang, Anthropic prompt cache, Gemini context cache, OpenAI prompt caching). It is typically the single biggest cost win, often 10× on heavy-image workloads.
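
A toy cost model makes the size of that win concrete. Everything here is an assumption for illustration: a flat per-token price, a `cached_discount` fraction saved on cached prefix tokens (providers and self-hosted servers differ), and a first request that pays full price to populate the cache.

```python
def prefill_cost(prefix_tokens, unique_tokens, n_requests,
                 price_per_token=1.0, cached_discount=0.9):
    """Return (uncached_cost, cached_cost) for n_requests that share a prefix.

    cached_discount is the assumed fraction saved on cached prefix tokens
    (0.9 means a cache hit costs 10% of a normal input token).
    """
    if n_requests <= 0:
        return 0.0, 0.0
    full = (prefix_tokens + unique_tokens) * price_per_token
    uncached = n_requests * full
    hit = (prefix_tokens * (1.0 - cached_discount) + unique_tokens) * price_per_token
    # The first request populates the cache; the rest pay the discounted rate.
    cached = full + (n_requests - 1) * hit
    return uncached, cached

if __name__ == "__main__":
    uncached, cached = prefill_cost(prefix_tokens=4000, unique_tokens=200, n_requests=1000)
    print(f"saving: {uncached / cached:.1f}x")
```

The ratio only approaches large multiples when the shared image dominates the prompt and the per-request unique tokens stay small, so measure the prefix share of your real traffic first.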

9.3 Batching strategy

10 Quantization & Optimization

Technique | Quality impact | When to use
BF16 / FP16 | Baseline | Default for training and high-throughput H100 inference.
FP8 (E4M3 / E5M2) | ≤1 % drop on most benchmarks | Hopper/Blackwell production inference. TensorRT-LLM & vLLM both support it.
INT8 W8A8 | ~1–2 % drop | A100/L4/L40S deployments.
INT4 (AWQ, GPTQ, GGUF Q4) | 1–4 % drop; OCR/chart can regress more | Memory-constrained or edge. Test on your visual tasks before shipping.
FP4 / NVFP4 | ~1 % with calibration | Blackwell-only; emerging for VLMs.
Vision-token pruning | 0–2 % drop | FastV, VisionZip, SparseVLM; drop after layer 2–4. Free throughput.
Speculative decoding | None (lossless) | Draft-model based; great for long answers.
Warning: Always benchmark quantized VLMs on your OCR / document / chart tasks. Quantization affects fine-grained visual reasoning more than general VQA: a model that scores within 1 % on MMBench can drop 10 % on a small-text invoice task.
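
The vision-token pruning row is easier to reason about with the selection step written out. This is a sketch in the spirit of attention-based pruning such as FastV, not that project's actual implementation: it assumes you already have one attention score per token position at the pruning layer and a mask marking which positions are visual.

```python
def prune_visual_tokens(attn_scores, is_visual, keep_ratio=0.5):
    """Return sorted indices of tokens to keep.

    attn_scores -- attention received per position at the pruning layer
                   (for example averaged over heads), one float each
    is_visual   -- parallel list of bools marking visual-token positions
    keep_ratio  -- fraction of visual tokens to retain
    """
    if len(attn_scores) != len(is_visual):
        raise ValueError("attn_scores and is_visual must be the same length")
    visual = [i for i, v in enumerate(is_visual) if v]
    n_keep = max(1, round(len(visual) * keep_ratio)) if visual else 0
    # Highest-attention visual tokens survive; ties break toward earlier positions.
    top_visual = sorted(visual, key=lambda i: (-attn_scores[i], i))[:n_keep]
    text = [i for i, v in enumerate(is_visual) if not v]
    # Text tokens are never dropped; original order is preserved.
    return sorted(text + top_visual)
```

Because small text often draws little attention in early layers, this kind of pruning is exactly where OCR-heavy tasks can regress, which is another reason to test on your own documents.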

11 Evaluation & Benchmarks

11.1 Public benchmarks worth tracking

Benchmark | What it measures | Signal quality
MMMU / MMMU-Pro | College-level multimodal reasoning across 30+ disciplines | High; top public score still discriminating in 2026
MathVista, MathVerse | Math + diagrams | High; frontier-only
MMBench / MMBench-v1.1 | General VQA across abilities | Medium; saturated on frontier
MMStar | VQA with leak-resistant questions | High
DocVQA, ChartQA, InfographicVQA | Document & chart reading | High for vertical apps
OCRBench / OCRBench v2 | OCR accuracy across languages, fonts | High; pair with your own samples
RealWorldQA, BLINK, MMVP | Perceptual edge cases & pairs | High; cheap to run
RefCOCO/+/g, ReasonSeg | Visual grounding | High for agents
VideoMME, MVBench, LongVideoBench | Video understanding short→long | High; gold standard for video VLMs
POPE, HallusionBench, MMHal | Hallucination rate | Critical for production decisions
ScreenSpot, ScreenSpot-Pro, OSWorld | GUI grounding & computer use | High for agent products
VLMEvalKit / lmms-eval | Harness running 50+ benchmarks reproducibly | Use this; do not roll your own

11.2 Build your own eval set

Public benchmarks tell you whether a model is generally capable. They do not tell you whether it will work on your images. Every team shipping VLMs should build:

  1. 200–500 golden examples drawn from real production traffic (with consent & redaction).
  2. Pairwise human or LLM-judge scoring with explicit rubrics (grounding, hallucination, format).
  3. Adversarial slices: low-light, occluded, multi-lingual, low-DPI scans, rotated.
  4. Tracked over time in a dashboard with diff-vs-baseline.
Rule: Eval first, model second. In my experience, a team without an eval harness picks the wrong model more often than not and never finds out.
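
The diff-vs-baseline step can start very small. Below is a sketch that scores results per slice and flags regressions; it assumes each result has already been judged correct or not (by a human or an LLM judge) and tagged with a slice name. The function names and the 2-point tolerance are illustrative choices.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: iterable of (slice_name, is_correct) pairs -> {slice: accuracy}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for slice_name, is_correct in results:
        totals[slice_name] += 1
        hits[slice_name] += bool(is_correct)
    return {name: hits[name] / totals[name] for name in totals}

def regressions(baseline, candidate, tolerance=0.02):
    """Slices where the candidate is worse than the baseline by more than `tolerance`."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }
```

Reporting per slice rather than one aggregate number is the point: a candidate can match the baseline overall while losing badly on rotated or low-DPI scans.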

12 Hallucinations & Safety

12.1 Common failure modes

12.2 Mitigations

12.3 Safety surface specific to VLMs

13 Agents, Computer Use & Vision-Language-Action

Figure 6 — Agentic VLM Loop
Environment (desktop · browser · robot · OS) → Observation (screenshot · camera) → VLM Reasoner (plan · ground · decide; CoT, ReAct, reflection) → Action (click · type · grasp · API) → Verifier / Critic (did the action work? screenshot diff, success rule)
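
The loop in Figure 6 can be sketched as a small driver. The four callables are hypothetical placeholders that the caller supplies (a screenshot grabber, a VLM planner, an input controller, a success check); nothing here corresponds to a specific vendor API.

```python
def run_agent(observe, plan, act, verify, max_steps=10):
    """Drive an observe -> plan -> act -> verify loop.

    All four arguments are caller-supplied callables (hypothetical here):
      observe() -> observation                 e.g. a screenshot
      plan(observation, history) -> action, or None when the task is done
      act(action) -> None                      performs the click / keystroke
      verify(before, after, action) -> bool    did the action take effect?
    Returns the list of (action, succeeded) pairs.
    """
    history = []
    for _ in range(max_steps):
        before = observe()
        action = plan(before, history)
        if action is None:  # planner signals completion
            break
        act(action)
        after = observe()
        history.append((action, verify(before, after, action)))
    return history
```

Passing the verified history back into `plan` lets the planner notice failed actions and retry, and `max_steps` bounds cost when it does not.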

13.1 Computer Use

Claude Computer Use, GPT-5 Operator, and open-weights stacks like UI-TARS, OS-Atlas, and ShowUI let a VLM drive a desktop or browser. What I have learned in practice:

13.2 Vision-Language-Action (VLA) for robotics

VLA models output low-level robot commands instead of text. RT-2, OpenVLA, π0 / π0.5, Gemini Robotics, and Figure's Helix dominate 2026 demos. The architectural twist: action tokens are interleaved with language tokens in the same decoder. When I wire up a VLA stack, I plan for:

14 Shipping VLMs in Production

14.1 Pattern library

Visual Q&A

Difficulty: simple. Single API call. Structured output schema. Prompt cache the system image if reusable.

Visual RAG

Difficulty: medium. CLIP/SigLIP/ColPali embeddings → retrieve top-k images → VLM answers with citations. Best for knowledge bases that are images (slides, catalogs, manuals).
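
The retrieval step reduces to nearest-neighbour search over embeddings. This sketch assumes single-vector embeddings in the CLIP/SigLIP style and a small in-memory index; ColPali scores with multi-vector late interaction, so it would need a different similarity function. The names `cosine` and `top_k_images` are mine.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_images(query_vec, index, k=3):
    """index: list of (image_id, embedding). Returns the k best (image_id, score)."""
    scored = [(image_id, cosine(query_vec, emb)) for image_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Returning the image ids alongside scores is what makes citations possible later: the VLM answer can point back to the exact retrieved pages.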

Doc Extraction Pipeline

Difficulty: medium. Layout model → VLM per region → JSON schema → validator → human-in-loop for low confidence. Always pair with a deterministic format check.

Vision Agent

Difficulty: hard. Loop: observe → plan → act → verify. Specialist GUI model + generalist planner is the strongest combo today.

Edge VLM

Difficulty: simple→medium. MiniCPM-V 4.0 / SmolVLM 2 / Phi-4 mini on device for latency, privacy, or offline. Bigger model in cloud for fallback.

Distillation Pipeline

Difficulty: hard. Frontier API labels → small open VLM SFT → in-house deployment. ROI shows up at >1M monthly requests.

14.2 Prompting patterns I use in production

// Production prompt skeleton
SYSTEM: You are a document understanding assistant.
        Read the image carefully. If a value is illegible,
        return "ILLEGIBLE" — never guess.

USER:   [IMAGE]
        Extract the following fields as JSON:
        - invoice_number (string)
        - total_amount (number, no currency symbol)
        - line_items (array of {description, quantity, unit_price})
        For each value, also provide bbox: [x1,y1,x2,y2] in 0–1000 coords.
        Output ONLY valid JSON. No prose, no markdown.
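
A deterministic check on the model's reply catches most format failures before a human ever sees them. The prompt above does not pin down where each bbox lives, so this sketch assumes every scalar field comes back as an object of the form {"value": ..., "bbox": [x1, y1, x2, y2]}; that shape, and the helper names, are my assumptions and should be adapted to whatever schema you actually enforce.

```python
import json

ILLEGIBLE = "ILLEGIBLE"

def valid_bbox(box):
    """True for [x1, y1, x2, y2] with 0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000."""
    if not (isinstance(box, list) and len(box) == 4):
        return False
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
        return False
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000

def check_field(data, name, expected_type, errors):
    """Assumed shape: data[name] == {"value": ..., "bbox": [x1, y1, x2, y2]}."""
    field = data.get(name)
    if not isinstance(field, dict) or "value" not in field:
        errors.append(f"{name}: missing or not an object with 'value'")
        return
    value = field["value"]
    if value == ILLEGIBLE:
        return  # the prompt allows this sentinel; no type or bbox check applies
    if isinstance(value, bool) or not isinstance(value, expected_type):
        errors.append(f"{name}: unexpected type {type(value).__name__}")
    if not valid_bbox(field.get("bbox")):
        errors.append(f"{name}: bbox missing or outside 0-1000")

def validate_invoice(raw):
    """Return a list of problems found in the model's raw output; empty means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    errors = []
    check_field(data, "invoice_number", str, errors)
    check_field(data, "total_amount", (int, float), errors)
    if not isinstance(data.get("line_items"), list):
        errors.append("line_items: must be an array")
    return errors
```

Any non-empty error list is a natural trigger for a retry or for routing the page to human review.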

14.3 Observability

15 Cost & Throughput Engineering

15.1 Cost levers ranked by impact

  1. Prefix cache reusable images (logo, base template, system shot) — typical 5–15× cost reduction.
  2. Right-size resolution per task — most chat-style use needs ≤1024×1024.
  3. Cascade: small open VLM screens, frontier handles only ambiguous 5–20 %.
  4. Vision-token pruning (FastV, VisionZip) on self-hosted — 40–70 % prefill saving.
  5. FP8 / INT4 weights — 2–4× throughput on the same GPU.
  6. Speculative decoding — 1.5–3× decode speedup, lossless.
  7. Batched async — never block on a single request when your traffic is mixed.
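
The cascade lever is simple enough to model in a few lines. This sketch assumes every request first hits the small model and a fraction `escalation_rate` is re-run on the large one; the per-request costs are placeholders in whatever unit you bill in.

```python
def cascade_cost(n_requests, small_cost, large_cost, escalation_rate):
    """Expected total cost when every request hits the small model and a
    fraction `escalation_rate` is re-run on the large one."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return n_requests * (small_cost + escalation_rate * large_cost)

def cascade_saving(small_cost, large_cost, escalation_rate):
    """Fraction saved versus sending everything to the large model."""
    return 1.0 - (small_cost + escalation_rate * large_cost) / large_cost
```

With a small model at 1 unit, a frontier model at 20 units, and 10 % escalation, the cascade costs 3 units per request instead of 20. The saving shrinks quickly as the escalation rate climbs, so the small model's uncertainty signal deserves its own eval.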

15.2 GPU memory budgeting (rough)

Model class | BF16 weights | FP8 weights | Min GPU for serving
~2–4B (Phi-4, SmolVLM, MiniCPM) | ~6–9 GB | ~3–5 GB | RTX 4060 / Jetson Orin
~7–8B (LLaVA-NeXT, Qwen2-VL-7B) | ~16 GB | ~9 GB | L4 / 24 GB
~12–14B (Pixtral 12B, Phi-4 14B) | ~26 GB | ~14 GB | A10G / L40S / single 4090
~30–40B | ~70 GB | ~38 GB | A100 80 GB or H100
~70–78B (Qwen3-VL-72B, InternVL-78B) | ~150 GB | ~80 GB | 2× H100 / 1× H200 / 1× B100

Add ~10–25% headroom for KV cache and activations; more if you accept many concurrent long-context image requests.
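
The table and the headroom rule combine into a one-line estimate. This sketch counts only weight bytes plus a flat headroom fraction; it ignores the vision encoder's own footprint and per-request KV growth, so treat the result as a floor rather than a sizing guarantee. The dtype table and function name are illustrative.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def serving_memory_gb(params_billion, dtype="bf16", headroom=0.2):
    """Weights plus a flat headroom fraction for KV cache and activations.
    Uses 1 GB = 1e9 bytes, so 1B params at 1 byte each is 1 GB."""
    weights_gb = params_billion * BYTES_PER_PARAM[dtype]
    return weights_gb * (1.0 + headroom)

if __name__ == "__main__":
    for size in (8, 32, 72):
        print(size, "B:", serving_memory_gb(size, "bf16"), "GB bf16,",
              serving_memory_gb(size, "fp8"), "GB fp8")
```

If the estimate lands within a few gigabytes of a card's capacity, assume it will not fit once concurrent long-context image requests arrive.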

16 What I Am Watching Next

What is shipping in 2026

  • Native any-modality decoders — one model emits text, audio, and image tokens (GPT-5o, Gemini 2.5, Chameleon/Janus lineage).
  • Long video — 1+ hour reasoning becoming mainstream via temporal compression + caching.
  • Test-time scaling for vision — visual chain-of-thought, search over visual hypotheses (o-series, R1-Vision style).
  • Generalist VLA — robotics policies that transfer across embodiments.
  • Multimodal RAG with grounded citations — answer + bounding-box source by default.

Still open

  • Reliable counting and spatial reasoning at high object density.
  • Compositional generalization on unusual object–attribute combinations.
  • Continuous video memory beyond a few hours without summarization tricks.
  • Faithfulness under adversarial images (typographic, OCR-injection attacks).
  • Efficient on-device inference with full quality preserved for OCR.
  • Standardized grounding output formats across providers — every model still emits a different schema.

17 How I Choose a Model

Figure 7 — Model Selection Decision Tree
Need a VLM (define task, eval, budget)
  Sensitive data / on-prem? (privacy, compliance, regulator)
    YES → Self-host open weights (Qwen3-VL · InternVL · Pixtral)
      Latency < 300 ms required?
        YES → Edge VLM (MiniCPM-V · SmolVLM)
        NO → Server-side OSS VLM (Qwen3-VL-7B/72B)
    NO → Frontier API OK (GPT-5 · Claude 4.x · Gemini 2.5)
      Complex reasoning / agent?
        YES → Frontier flagship (Opus / GPT-5)
        NO → Frontier mini (Sonnet / Flash / 5-mini)
My rule: eval set first. Cascade small → large when I can. Distill from the API once traffic justifies the GPU bill.

17.1 Mistakes I have seen (or made)

18 Glossary & References

18.1 Glossary

VLM / MLLM: Vision-Language Model / Multimodal LLM.
ViT: Vision Transformer, a patch-based image encoder.
CLIP / SigLIP: Contrastive image-text encoders; SigLIP uses sigmoid loss.
Connector / Projector: Module bridging vision encoder to LLM embedding space.
Q-Former: Querying transformer (BLIP-2) producing fixed-size visual token budget.
Cross-attention adapter: Gated attention layers inside LLM that read vision K/V (Flamingo).
AnyRes / Dynamic tiling: Variable-resolution image input via tile + thumbnail strategy.
M-RoPE: Multimodal Rotary Position Embedding; extends RoPE over (t, h, w).
Vision-token pruning: Dropping less-informative visual tokens at intermediate LLM layers.
Visual instruction tuning: SFT on (image, instruction, response) triples; LLaVA-style.
VLA: Vision-Language-Action; VLMs that emit motor commands.
Computer Use: VLM-driven control of a screen via screenshots + click/keystroke actions.
Multimodal RAG: Retrieval-Augmented Generation where retrieved items are images or mixed.
Prefix cache: Reusing KV-cache for repeated prompt prefixes (often a system image).
Hallucination: Output that contradicts the input image; the dominant production failure.

18.2 Reference papers and projects

How I maintain this: I refresh these notes after major release cycles, roughly quarterly. Public leaderboards lag what I see in production by a few months, so I keep a private eval set and re-run it when a provider ships a new vision model.