Top picks for Video Auto-Tagging (2026)

Bulk video metadata generation. Ranked from 337 live models on the OpenRouter catalog. Video input is required, and models are weighted for video input support and low latency.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first meet the spec requirements for Video Auto-Tagging; benchmark performance then refines the order.

#	Model	Model ID	Score	In / 1M tokens	Out / 1M tokens	Context (tokens)
1	Google: Gemini 3.5 Flash Lite	google/gemini-3.5-flash-lite	123	$0.30	$2.50	1,048,576
2	MiniMax: MiniMax M3	minimax/minimax-m3	123	$0.30	$1.20	1,048,576
3	StepFun: Step 3.7 Flash	stepfun/step-3.7-flash	123	$0.20	$1.15	262,144
4	Google: Gemini 3.1 Flash Lite	google/gemini-3.1-flash-lite	123	$0.25	$1.50	1,048,576
5	NVIDIA: Nemotron 3 Nano Omni (free)	nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free	123	Free	Free	256,000
6	Qwen: Qwen3.5 Plus 2026-04-20	qwen/qwen3.5-plus-20260420	123	$0.30	$1.80	1,000,000
7	Qwen: Qwen3.6 Flash	qwen/qwen3.6-flash	123	$0.19	$1.12	1,000,000
8	Qwen: Qwen3.6 35B A3B	qwen/qwen3.6-35b-a3b	123	$0.14	$1.00	262,144
9	Qwen: Qwen3.6 27B	qwen/qwen3.6-27b	123	$0.60	$3.60	262,144
10	Xiaomi: MiMo-V2.5	xiaomi/mimo-v2.5	123	$0.14	$0.28	1,050,000
11	Google: Gemma 4 26B A4B	google/gemma-4-26b-a4b-it	123	$0.12	$0.35	262,144
12	Google: Gemma 4 26B A4B (free)	google/gemma-4-26b-a4b-it:free	123	Free	Free	262,144
13	Google: Gemma 4 31B	google/gemma-4-31b-it	123	$0.12	$0.35	262,144
14	Google: Gemma 4 31B (free)	google/gemma-4-31b-it:free	123	Free	Free	262,144
15	Qwen: Qwen3.6 Plus	qwen/qwen3.6-plus	123	$0.33	$1.95	1,000,000

How we ranked these

For Video Auto-Tagging, we require video input and weight models on video input support and low latency. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Video Auto-Tagging

Video auto-tagging is the process of automatically generating metadata labels, categories, and descriptions for video files at scale. You need it when you have dozens to thousands of videos and tagging each one by hand would take weeks of labor or be simply impractical. A good model identifies objects, actions, scenes, text overlays, and audio cues with few false positives, then structures its output as machine-readable tags or descriptions. A bad model misses context, hallucinates tags unrelated to the actual content, or fails on lower-quality video formats. Speed matters here: processing 100 hours of video with a slow model can easily cost 10x more than with a fast one, so batch inference efficiency and codec support directly affect your per-video cost.
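As a rough illustration of "machine-readable tags" with fewer false positives, the sketch below validates a model's JSON reply before it reaches a search index. The schema (a `tags` list with `label`, `kind`, and `confidence` fields), the `ALLOWED_KINDS` set, and the `parse_tags` helper are all hypothetical names chosen for this example, not part of any model's API, and the confidence value is whatever the model reports about itself.

```python
import json
from dataclasses import dataclass

# Hypothetical tag categories, mirroring the cue types named in the text.
ALLOWED_KINDS = {"object", "action", "scene", "text_overlay", "audio"}


@dataclass(frozen=True)
class Tag:
    label: str
    kind: str          # one of ALLOWED_KINDS
    confidence: float  # model-reported score, assumed to lie in 0.0..1.0


def parse_tags(raw_json: str, min_confidence: float = 0.5) -> list[Tag]:
    """Parse a model reply into validated, de-duplicated tags.

    Malformed replies yield an empty list instead of raising, so one bad
    video does not stop a bulk run.
    """
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    items = payload.get("tags", [])
    if not isinstance(items, list):
        return []

    seen = set()
    tags = []
    for item in items:
        if not isinstance(item, dict):
            continue
        label = str(item.get("label", "")).strip().lower()
        kind = item.get("kind")
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        # Drop empty labels and kinds outside the agreed schema.
        if not label or not isinstance(kind, str) or kind not in ALLOWED_KINDS:
            continue
        # Drop low-confidence tags and repeats of an already-kept tag.
        if confidence < min_confidence or (label, kind) in seen:
            continue
        seen.add((label, kind))
        tags.append(Tag(label, kind, confidence))
    return tags
```

Rejecting anything outside a fixed set of kinds is one simple way to catch hallucinated categories; the right confidence threshold depends on how costly a wrong tag is for your catalog.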

When to use: You have a library of videos that needs searchable metadata but no budget or bandwidth for manual tagging. Common cases include e-commerce product videos, video archives, user-generated content platforms, and media asset management systems.

Common questions

What is the difference between video auto-tagging and video understanding?

Video auto-tagging produces structured metadata (tags, labels, categories) optimized for search and filtering. Video understanding is broader and may include generating captions, summaries, or answers to questions about the content. The same video-capable models can often do both; for bulk metadata generation, a fast, low-cost model with native video input (such as the Flash-class models in the ranking above) prompted for short structured tags is typically faster and cheaper than open-ended understanding.

How much does it cost to auto-tag 1,000 videos?

Cost depends on video length, resolution, and model choice. Cloud vision APIs typically charge $1 to $4 per video for moderate lengths, or roughly $1,000 to $4,000 for 1,000 videos. Batch processing and open-source models such as CLIP-based taggers can reduce the cost to under $0.10 per video (under $100 for 1,000 videos) if you have GPU infrastructure. Expect trade-offs: cheaper models produce fewer or less precise tags.
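For the token-priced models in the ranking, a back-of-the-envelope estimate can be sketched as below. It assumes the "In / 1M" and "Out / 1M" prices are per million tokens; the token counts per video are placeholders, since how many tokens a video consumes varies by model, length, and resolution and is not stated on this page. `estimate_cost_usd` is a hypothetical helper, not a catalog API.

```python
def estimate_cost_usd(
    num_videos: int,
    input_tokens_per_video: int,
    output_tokens_per_video: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate total spend for a batch under per-million-token pricing."""
    input_cost = (
        num_videos * input_tokens_per_video / 1_000_000 * input_price_per_million
    )
    output_cost = (
        num_videos * output_tokens_per_video / 1_000_000 * output_price_per_million
    )
    return input_cost + output_cost


# Assumed workload: 1,000 videos, 50,000 input tokens and 300 output tokens
# per video (placeholder figures), priced at $0.30 in / $2.50 out per 1M tokens.
total = estimate_cost_usd(1_000, 50_000, 300, 0.30, 2.50)
print(f"${total:.2f}")  # $15.75 under these assumed token counts
```

Input tokens dominate in this example, which is why video length and resolution tend to matter more for the bill than how many tags the model writes.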

Related tasks

Video

Best for Video Summarization

Extracting key moments and summaries from video.