
Top picks for Video Auto-Tagging (2026)

Bulk video metadata generation. Ranked from 337 live models on the OpenRouter catalog, weighted for video input and low latency. Video input support is required.

What this is: a ranking by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first have the right specs for Video Auto-Tagging; benchmark performance then refines the order.
| # | Model | ID | Score | In / 1M | Out / 1M | Context |
|---|-------|----|-------|---------|----------|---------|
| 1 | MiniMax: MiniMax M3 | minimax/minimax-m3 | 123 | $0.30 | $1.20 | 1,048,576 |
| 2 | StepFun: Step 3.7 Flash | stepfun/step-3.7-flash | 123 | $0.20 | $1.15 | 256,000 |
| 3 | Google: Gemini 3.1 Flash Lite | google/gemini-3.1-flash-lite | 123 | $0.25 | $1.50 | 1,048,576 |
| 4 | NVIDIA: Nemotron 3 Nano Omni (free) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 123 | Free | Free | 256,000 |
| 5 | Qwen: Qwen3.5 Plus 2026-04-20 | qwen/qwen3.5-plus-20260420 | 123 | $0.30 | $1.80 | 1,000,000 |
| 6 | Qwen: Qwen3.6 Flash | qwen/qwen3.6-flash | 123 | $0.19 | $1.12 | 1,000,000 |
| 7 | Qwen: Qwen3.6 35B A3B | qwen/qwen3.6-35b-a3b | 123 | $0.14 | $1.00 | 262,144 |
| 8 | Qwen: Qwen3.6 27B | qwen/qwen3.6-27b | 123 | $0.29 | $2.40 | 262,144 |
| 9 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 123 | $0.14 | $0.28 | 1,048,576 |
| 10 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 123 | Free | Free | 262,144 |
| 11 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 123 | $0.06 | $0.33 | 262,144 |
| 12 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 123 | Free | Free | 262,144 |
| 13 | Google: Gemma 4 31B | google/gemma-4-31b-it | 123 | $0.12 | $0.36 | 262,144 |
| 14 | Qwen: Qwen3.6 Plus | qwen/qwen3.6-plus | 123 | $0.33 | $1.95 | 1,000,000 |
| 15 | ByteDance Seed: Seed-2.0-Lite | bytedance-seed/seed-2.0-lite | 123 | $0.25 | $2.00 | 262,144 |

How we ranked these

For Video Auto-Tagging, we weight models on video input and low latency, and we require video input support. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.
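
The "filter on specs, then order by benchmarks and price" idea can be sketched roughly as below. This is only an illustration, not the site's actual scoring code: the `Model` fields, the capability names, the example slugs, and the tie-break on output price are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    slug: str                 # hypothetical model ID
    capabilities: frozenset   # e.g. {"video_input", "low_latency"} (assumed names)
    benchmark: float          # hypothetical benchmark score, higher is better
    out_price: float          # USD per 1M output tokens


def rank(models, required):
    """Keep models that have every required capability, then order them by
    benchmark (highest first), using output price (lowest first) as a tie-break."""
    eligible = [m for m in models if required <= m.capabilities]
    return sorted(eligible, key=lambda m: (-m.benchmark, m.out_price))


# Made-up catalog entries, for illustration only.
models = [
    Model("example/video-fast", frozenset({"video_input", "low_latency"}), 60.0, 1.20),
    Model("example/video-cheap", frozenset({"video_input", "low_latency"}), 60.0, 0.28),
    Model("example/text-only", frozenset({"low_latency"}), 90.0, 0.50),
]

ranked = rank(models, frozenset({"video_input"}))
print([m.slug for m in ranked])
# ['example/video-cheap', 'example/video-fast']
```

The text-only model is dropped despite its higher benchmark score, which mirrors the point that a model needs the right specs before benchmarks matter.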

About Video Auto-Tagging

Video auto-tagging is the process of automatically generating metadata labels, categories, and descriptions for video files at scale. You need this when you have dozens to thousands of videos and tagging each one manually would take weeks of labor or be simply impractical. A good model identifies objects, actions, scenes, text overlays, and audio cues with minimal false positives, then structures its output as machine-readable tags or descriptions. Bad models miss context, hallucinate tags unrelated to the actual content, or fail on lower-quality video formats. Speed matters here: processing 100 hours of video with a slow model can easily cost 10x more than with a fast one, so batch inference efficiency and codec support directly affect your per-video cost.
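
To make "machine-readable tags" concrete, here is a minimal sketch of cleaning up a model's reply before it goes into a search index. It assumes a hypothetical reply shape, `{"tags": [{"label": ..., "confidence": ...}]}`, which you would have to request in your own prompt; no model in the list is known to return this shape by default, and the `parse_tags` helper and its threshold are illustrative.

```python
import json


def parse_tags(raw, min_confidence=0.5):
    """Turn a model's JSON reply into a sorted list of unique, lowercase tags.

    Assumes a hypothetical reply shape:
        {"tags": [{"label": "...", "confidence": 0.0-1.0}, ...]}
    Returns an empty list if the reply is not valid JSON or lacks a tag list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    items = data.get("tags") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []

    tags = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        label = str(item.get("label", "")).strip().lower()
        confidence = item.get("confidence", 0.0)
        # Drop empty labels and low-confidence tags to limit false positives.
        if label and isinstance(confidence, (int, float)) and confidence >= min_confidence:
            tags.add(label)
    return sorted(tags)


reply = (
    '{"tags": ['
    '{"label": "Sneakers", "confidence": 0.93}, '
    '{"label": "sneakers ", "confidence": 0.81}, '
    '{"label": "unboxing", "confidence": 0.32}, '
    '{"label": "Outdoor", "confidence": 0.7}]}'
)
print(parse_tags(reply))
# ['outdoor', 'sneakers']
```

Normalizing case and whitespace keeps near-duplicate tags from splitting search results, and the confidence cutoff is one simple way to trade recall for fewer false positives.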

When to use: Use this when you have a library of videos that need searchable metadata but no budget or bandwidth for manual tagging. Common cases include e-commerce product videos, video archives, user-generated content platforms, or media asset management systems.

Common questions

What is the difference between video auto-tagging and video understanding?

Video auto-tagging produces structured metadata (tags, labels, categories) optimized for search and filtering. Video understanding is broader and may include generating captions, summaries, or answers to questions about the content. For bulk metadata generation, the narrower tagging task is usually faster and cheaper: a video-capable model asked to return only a short, fixed set of tags produces far less output than one asked for open-ended descriptions.

How much does it cost to auto-tag 1,000 videos?

Cost depends on video length, resolution, and model choice. Cloud vision APIs typically charge $1 to $4 per video for moderate lengths. Batch processing and open-source models like CLIP-based taggers can reduce cost to under $0.10 per video if you have GPU infrastructure. Expect trade-offs: cheaper models produce fewer or less precise tags.
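
For token-priced models like those in the ranking, a back-of-the-envelope estimate is simple arithmetic. The sketch below uses the listed xiaomi/mimo-v2.5 prices ($0.14 in, $0.28 out per 1M tokens); the token counts per video are placeholder assumptions, since how many tokens a video consumes depends on its length, its resolution, and how each provider samples frames.

```python
def cost_per_video(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Return the USD cost of one request, given token counts and per-1M-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000


# Assumed token usage per video (placeholders; measure on your own footage).
IN_TOKENS = 20_000   # video frames plus prompt
OUT_TOKENS = 300     # short structured tag list

per_video = cost_per_video(IN_TOKENS, OUT_TOKENS, in_price_per_m=0.14, out_price_per_m=0.28)
total = per_video * 1_000

print(f"1,000 videos: ${total:.2f}")
# 1,000 videos: $2.88
```

Input tokens dominate under these assumptions, so sampling fewer frames or lowering resolution is likely to move the total more than trimming the tag list.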
