Best AI model for Image Captioning (2026)
Accessible alt text and detailed image descriptions. Ranked from 346 live models on the OpenRouter catalog, weighted for vision input and low latency.
| # | Model | Model ID | Score | Input / 1M tokens | Output / 1M tokens | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | MoonshotAI: Kimi K2.6 | `moonshotai/kimi-k2.6` | 119 | $0.80 | $3.50 | 262,144 |
| 2 | Google: Gemma 4 26B A4B (free) | `google/gemma-4-26b-a4b-it:free` | 119 | Free | Free | 262,144 |
| 3 | Google: Gemma 4 26B A4B | `google/gemma-4-26b-a4b-it` | 119 | $0.07 | $0.35 | 262,144 |
| 4 | Google: Gemma 4 31B (free) | `google/gemma-4-31b-it:free` | 119 | Free | Free | 262,144 |
| 5 | Google: Gemma 4 31B | `google/gemma-4-31b-it` | 119 | $0.13 | $0.38 | 262,144 |
| 6 | Qwen: Qwen3.6 Plus | `qwen/qwen3.6-plus` | 119 | $0.33 | $1.95 | 1,000,000 |
| 7 | Xiaomi: MiMo-V2-Omni | `xiaomi/mimo-v2-omni` | 119 | $0.40 | $2.00 | 262,144 |
| 8 | OpenAI: GPT-5.4 Nano | `openai/gpt-5.4-nano` | 119 | $0.20 | $1.25 | 400,000 |
| 9 | OpenAI: GPT-5.4 Mini | `openai/gpt-5.4-mini` | 119 | $0.75 | $4.50 | 400,000 |
| 10 | Mistral: Mistral Small 4 | `mistralai/mistral-small-2603` | 119 | $0.15 | $0.60 | 262,144 |
| 11 | ByteDance Seed: Seed-2.0-Lite | `bytedance-seed/seed-2.0-lite` | 119 | $0.25 | $2.00 | 262,144 |
| 12 | Qwen: Qwen3.5-9B | `qwen/qwen3.5-9b` | 119 | $0.10 | $0.15 | 262,144 |
| 13 | Google: Gemini 3.1 Flash Lite Preview | `google/gemini-3.1-flash-lite-preview` | 119 | $0.25 | $1.50 | 1,048,576 |
| 14 | ByteDance Seed: Seed-2.0-Mini | `bytedance-seed/seed-2.0-mini` | 119 | $0.10 | $0.40 | 262,144 |
| 15 | Qwen: Qwen3.5-35B-A3B | `qwen/qwen3.5-35b-a3b` | 119 | $0.16 | $1.30 | 262,144 |
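Since every model in the table has the same score, price may be the deciding factor for a large captioning job. The sketch below estimates the cost of a batch from the per-million-token input and output prices listed in the table. The function name `caption_cost_usd` and the token counts (1,000 input tokens per image, 60 output tokens per caption) are illustrative assumptions, not figures from this page; actual image token counts vary by model and image size.

```python
def caption_cost_usd(n_images, in_tokens_per_image, out_tokens_per_caption,
                     in_price_per_m, out_price_per_m):
    """Estimate the USD cost of captioning n_images.

    Prices are USD per 1,000,000 tokens, as listed in the table.
    Token counts per image and per caption are caller-supplied assumptions.
    """
    input_cost = n_images * in_tokens_per_image * in_price_per_m / 1_000_000
    output_cost = n_images * out_tokens_per_caption * out_price_per_m / 1_000_000
    return input_cost + output_cost


# Table prices: (input $/1M, output $/1M)
PRICES = {
    "google/gemma-4-26b-a4b-it": (0.07, 0.35),
    "moonshotai/kimi-k2.6": (0.80, 3.50),
}

# Assumed workload: 10,000 images, 1,000 input tokens each, 60-token captions.
for model_id, (in_price, out_price) in PRICES.items():
    cost = caption_cost_usd(10_000, 1_000, 60, in_price, out_price)
    print(f"{model_id}: ${cost:.2f}")
```

Under these assumed token counts, the job comes to about $0.91 on Gemma 4 26B A4B and about $10.10 on Kimi K2.6, so the gap between equally scored models can be an order of magnitude.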
How we ranked these
For Image Captioning, we weight models on vision input and low latency. A higher score is better. Scores combine OpenRouter's model metadata (context length, modality support, tool calling, structured output, reasoning capability) with public pricing.
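The exact formula is not given here, so the following is only a rough sketch of what a weighted, metadata-based ranking of this kind could look like. The weights, the feature names, the `example/...` model entries, and the price tie-break are all hypothetical and are not the site's actual method.

```python
# Hypothetical weights: task-critical features (vision input, low latency)
# count for more than general capabilities.
WEIGHTS = {
    "vision_input": 50,
    "low_latency": 30,
    "structured_output": 10,
    "tool_calling": 10,
}


def score(model):
    """Sum the weights of every feature the model supports."""
    return sum(w for feature, w in WEIGHTS.items() if feature in model["features"])


def rank(models):
    """Sort by score (highest first); break ties by input price (assumption)."""
    return sorted(models, key=lambda m: (-score(m), m["in_price_per_m"]))


# Made-up catalog entries for illustration only.
catalog = [
    {"id": "example/text-only", "features": {"tool_calling"}, "in_price_per_m": 0.05},
    {"id": "example/vision-fast", "features": {"vision_input", "low_latency"}, "in_price_per_m": 0.20},
    {"id": "example/vision-slow", "features": {"vision_input", "tool_calling"}, "in_price_per_m": 0.10},
]

for m in rank(catalog):
    print(m["id"], score(m))
```

A scheme like this produces many ties when models share the same feature set, which is one plausible reason a ranked list could show identical scores across its top entries.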
Related tasks
Best for Image Generation: Models that produce images, not just read them.
Best for Diagram Extraction: Reading flowcharts, org charts, architecture diagrams.
Best for Screenshot Debugging: Diagnosing UI bugs from a screenshot.
Best for Chart & Graph Reading: Pulling numbers off charts in research papers and reports.