
Top picks for Transcription (2026)

Speech-to-text accuracy and speed. Ranked from 352 live models in the OpenRouter catalog, weighted for audio input and low latency.

What this is: a capability-matched shortlist, not a benchmark-tested winner. Models are scored by how well their declared specs (structured output, reasoning, context, modality, price) fit Transcription. Pair this list with benchmark sources like Artificial Analysis or LMSys Arena before you ship.
| # | Model | Model ID | Score | In / 1M tokens | Out / 1M tokens | Context (tokens) |
|---|-------|----------|-------|----------------|-----------------|------------------|
| 1 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 123 | $0.40 | $2.00 | 1,048,576 |
| 2 | Xiaomi: MiMo-V2-Omni | xiaomi/mimo-v2-omni | 123 | $0.40 | $2.00 | 262,144 |
| 3 | Google: Gemini 3.1 Flash Lite Preview | google/gemini-3.1-flash-lite-preview | 123 | $0.25 | $1.50 | 1,048,576 |
| 4 | Google: Gemini 3 Flash Preview | google/gemini-3-flash-preview | 123 | $0.50 | $3.00 | 1,048,576 |
| 5 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 123 | $0.10 | $0.40 | 1,048,576 |
| 6 | Google: Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 123 | $0.10 | $0.40 | 1,048,576 |
| 7 | Google: Gemini 2.5 Flash | google/gemini-2.5-flash | 123 | $0.30 | $2.50 | 1,048,576 |
| 8 | Google: Gemini 2.0 Flash Lite | google/gemini-2.0-flash-lite-001 | 123 | $0.07 | $0.30 | 1,048,576 |
| 9 | Google: Gemini 2.0 Flash | google/gemini-2.0-flash-001 | 123 | $0.10 | $0.40 | 1,000,000 |
| 10 | Google: Gemini 3.1 Pro Preview Custom Tools | google/gemini-3.1-pro-preview-customtools | 115 | $2.00 | $12.00 | 1,048,576 |
| 11 | Google: Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | 115 | $2.00 | $12.00 | 1,048,576 |
| 12 | Google: Gemini 2.5 Pro | google/gemini-2.5-pro | 115 | $1.25 | $10.00 | 1,048,576 |
| 13 | Google: Gemini 2.5 Pro Preview 06-05 | google/gemini-2.5-pro-preview | 115 | $1.25 | $10.00 | 1,048,576 |
| 14 | Google: Gemini 2.5 Pro Preview 05-06 | google/gemini-2.5-pro-preview-05-06 | 115 | $1.25 | $10.00 | 1,048,576 |
| 15 | OpenAI: GPT Audio Mini | openai/gpt-audio-mini | 112 | $0.60 | $2.40 | 128,000 |
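To compare entries on price, the input and output rates listed above can be turned into a per-request estimate. The following is a minimal sketch, assuming the rates are USD per 1M tokens; the function name `estimate_cost` and the token counts are hypothetical, and how a provider converts audio into input tokens is not stated here, so the counts must come from your own measurements.

```python
from decimal import Decimal

# Rates copied from the list above, assumed to be USD per 1M tokens:
# model ID -> (input rate, output rate).
PRICES = {
    "google/gemini-2.5-flash-lite": (Decimal("0.10"), Decimal("0.40")),
    "openai/gpt-audio-mini": (Decimal("0.60"), Decimal("2.40")),
}

ONE_MILLION = Decimal(1_000_000)


def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper)."""
    price_in, price_out = PRICES[model_id]
    return (price_in * input_tokens + price_out * output_tokens) / ONE_MILLION


if __name__ == "__main__":
    # Hypothetical workload: 100,000 input tokens, 5,000 output tokens.
    for model_id in PRICES:
        print(model_id, estimate_cost(model_id, 100_000, 5_000))
```

`Decimal` is used instead of floats so that small per-token amounts do not pick up rounding noise when summed over many requests.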

How we ranked these

For Transcription, we weight models on audio input and low latency. A higher score means a better fit. Scores combine each model's public metadata (context length, modality support, tool calling, structured output, reasoning capability) with live pricing. See the full methodology for details.
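A capability-weighted score of this kind could be sketched as follows. This is an illustration only: the weights, the price bonus, the `ModelSpec` fields, and the example model IDs are all hypothetical, and the site's actual formula is not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Declared metadata for one model (hypothetical, simplified fields)."""
    model_id: str
    audio_input: bool
    low_latency: bool
    price_in: float  # USD per 1M input tokens


# Hypothetical weights, chosen only to illustrate the idea.
WEIGHTS = {"audio_input": 60.0, "low_latency": 30.0}
MAX_PRICE_BONUS = 10.0


def fit_score(spec: ModelSpec) -> float:
    """Sum the weights of matched capabilities, plus a bonus for low price."""
    score = 0.0
    if spec.audio_input:
        score += WEIGHTS["audio_input"]
    if spec.low_latency:
        score += WEIGHTS["low_latency"]
    # The bonus shrinks toward zero as the input price grows.
    score += MAX_PRICE_BONUS / (1.0 + spec.price_in)
    return score


def rank(specs: list[ModelSpec]) -> list[ModelSpec]:
    """Order models from best to worst fit; higher score means better."""
    return sorted(specs, key=fit_score, reverse=True)


if __name__ == "__main__":
    candidates = [
        ModelSpec("example/text-only-fast", False, True, 0.05),
        ModelSpec("example/audio-slow", True, False, 2.00),
        ModelSpec("example/audio-fast", True, True, 0.10),
    ]
    for spec in rank(candidates):
        print(spec.model_id, round(fit_score(spec), 2))
```

Because a score like this depends only on declared specs and price, models with identical metadata tie, which is consistent with several entries in the list sharing the same score.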

Related tasks