Vision · best for

Top picks for Image Captioning (2026)

Accessible alt text and detailed image descriptions. Ranked from 333 live models on the OpenRouter catalog, weighted for vision input, low latency.

What this is Ranked by capability match + real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index) + live pricing. Models need the right specs for Image Captioning, then benchmark performance refines the order. Full methodology →

#	Model	Score	In / 1M	Out / 1M	Context
1	MiniMax: MiniMax M3minimax/minimax-m3	121	$0.30	$1.20	1,048,576	Details →
2	MoonshotAI: Kimi K2.6moonshotai/kimi-k2.6	121	$0.68	$3.42	262,144	Details →
3	MoonshotAI: Kimi K2.7 Codemoonshotai/kimi-k2.7-code	121	$0.82	$3.75	262,144	Details →
4	Qwen: Qwen3.6 Plusqwen/qwen3.6-plus	121	$0.33	$1.95	1,000,000	Details →
5	OpenAI: GPT-5.4 Miniopenai/gpt-5.4-mini	121	$0.75	$4.50	400,000	Details →
6	Qwen: Qwen3.7 Plusqwen/qwen3.7-plus	120	$0.32	$1.28	1,000,000	Details →
7	OpenAI: GPT-5.4 Nanoopenai/gpt-5.4-nano	120	$0.20	$1.25	400,000	Details →
8	MoonshotAI: Kimi K2.5moonshotai/kimi-k2.5	120	$0.57	$2.85	262,144	Details →
9	Qwen: Qwen3.6 27Bqwen/qwen3.6-27b	120	$0.45	$2.70	262,144	Details →
10	Qwen: Qwen3.5-27Bqwen/qwen3.5-27b	120	$0.26	$2.60	262,144	Details →
11	Qwen: Qwen3.5 397B A17Bqwen/qwen3.5-397b-a17b	120	$0.39	$2.34	262,144	Details →
12	Qwen: Qwen3.6 35B A3Bqwen/qwen3.6-35b-a3b	120	$0.14	$1.00	262,144	Details →
13	Qwen: Qwen3.5-122B-A10Bqwen/qwen3.5-122b-a10b	120	$0.26	$2.08	262,144	Details →
14	StepFun: Step 3.7 Flashstepfun/step-3.7-flash	120	$0.20	$1.15	256,000	Details →
15	Google: Gemma 4 31Bgoogle/gemma-4-31b-it	120	$0.12	$0.37	262,144	Details →

AI Video PixVerse Generate production-quality video from text or images.

Try free →

Affiliate link. PicksByModel may earn a commission at no extra cost to you.

How we ranked these

For Image Captioning, we weight models on vision input, low latency. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores, Artificial Analysis intelligence/coding/agentic indices) and live pricing. See full methodology →

About Image Captioning

Image captioning is the task of generating natural language descriptions for images, producing text that conveys visual content accurately and contextually. Use this when you need accessible alt text for web content, searchable descriptions for image archives, or automated tagging for large visual datasets. Good models balance accuracy with brevity, describing objects and relationships without hallucinating details that aren't present, while poor ones produce generic or misleading text. The critical trade-off: vision-language models like BLIP or LLaVA generate more natural captions than older CNN-based approaches but require significantly more computational resources, typically 2-4x slower inference time depending on model size.

When to use: Use this when you need to automatically generate text descriptions for images so they're readable by screen readers, searchable in databases, or accessible to people who can't see them.

Common questions

Which AI model produces the most accurate image captions today?

BLIP-2 and LLaVA represent the current best-in-class for caption quality, with LLaVA-1.6 offering particularly strong reasoning about image relationships. If you need faster inference, BLIP (the original) still delivers solid accuracy at half the computational cost. For production use, your choice depends on whether you prioritize caption quality or response latency.

How much does it cost to caption thousands of images at scale?

Running open-source models like LLaVA yourself costs roughly $0.0001-0.0005 per image on cloud compute, while API services like Google Vision or AWS Rekognition charge $0.0015-0.004 per image. For 10,000 images, self-hosted models save 50-70% but require infrastructure setup, whereas APIs eliminate operational overhead.

Related tasks

Vision

Top picks for Image Captioning (2026)

How we ranked these

About Image Captioning

Common questions

Which AI model produces the most accurate image captions today?

How much does it cost to caption thousands of images at scale?

Related tasks

Best for Image Generation

Best for Diagram Extraction

Best for Screenshot Debugging

Best for Chart & Graph Reading