Voice · best for

Top picks for Transcription (2026)

Speech-to-text accuracy and speed. Ranked from 340 live models on the OpenRouter catalog, weighted for audio input, low latency, requires_audio.

What this is Ranked by capability match + real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index) + live pricing. Models need the right specs for Transcription, then benchmark performance refines the order. Full methodology →

#	Model	Score	In / 1M	Out / 1M	Context
1	Google: Gemini 3.5 Flash Litegoogle/gemini-3.5-flash-lite	123	$0.30	$2.50	1,048,576	Details →
2	Google: Gemini 3.1 Flash Litegoogle/gemini-3.1-flash-lite	123	$0.25	$1.50	1,048,576	Details →
3	NVIDIA: Nemotron 3 Nano Omni (free)nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free	123	Free	Free	256,000	Details →
4	Xiaomi: MiMo-V2.5xiaomi/mimo-v2.5	123	$0.14	$0.28	1,050,000	Details →
5	Google: Gemini 3.1 Flash Lite Previewgoogle/gemini-3.1-flash-lite-preview	123	$0.25	$1.50	1,048,576	Details →
6	Google: Gemini 3 Flash Previewgoogle/gemini-3-flash-preview	123	$0.50	$3.00	1,048,576	Details →
7	Google: Gemini 2.5 Flash Litegoogle/gemini-2.5-flash-lite	123	$0.10	$0.40	1,048,576	Details →
8	Google: Gemini 2.5 Flashgoogle/gemini-2.5-flash	123	$0.30	$2.50	1,048,576	Details →
9	Google: Gemini 3.6 Flashgoogle/gemini-3.6-flash	115	$1.50	$7.50	1,048,576	Details →
10	Thinking Machines: Inklingthinkingmachines/inkling	115	$1.00	$4.05	1,048,576	Details →
11	Meta: Muse Spark 1.1meta/muse-spark-1.1	115	$1.25	$4.25	1,048,576	Details →
12	Google: Gemini 3.5 Flashgoogle/gemini-3.5-flash	115	$1.50	$9.00	1,048,576	Details →
13	Google Gemini Pro Latest~google/gemini-pro-latest	115	$2.00	$12.00	1,048,576	Details →
14	Google Gemini Flash Latest~google/gemini-flash-latest	115	$1.50	$7.50	1,048,576	Details →
15	Google: Gemini 3.1 Pro Preview Custom Toolsgoogle/gemini-3.1-pro-preview-customtools	115	$2.00	$12.00	1,048,576	Details →

How we ranked these

For Transcription, we weight models on audio input, low latency, requires_audio. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores, Artificial Analysis intelligence/coding/agentic indices) and live pricing. See full methodology →

About Transcription

Transcription is the conversion of spoken audio into written text using AI speech recognition. You need this when you have recorded conversations, meetings, interviews, or lectures that require searchable, editable text records. What separates high-performing models from weak ones is accuracy on accented speech, background noise handling, and punctuation placement. Whisper and similar transformer-based models excel at diverse audio conditions; older RNN approaches fail noticeably on overlapping speakers or poor audio quality. Speed matters in production: cloud APIs add latency (500ms to 2 seconds per minute of audio), while local models run faster but require GPU memory. Real-world accuracy typically ranges from 85 percent on clean studio audio to 60-70 percent on noisy field recordings.

When to use: Use this when you have audio recordings that need to become searchable text, like interviews, podcasts, meetings, or lectures you want indexed or archived without manual typing.

Common questions

What is the best AI model for transcription accuracy?

OpenAI's Whisper and Anthropic's backend models currently lead on mixed-condition audio. Whisper handles accents and background noise better than older alternatives like DeepSpeech, achieving 85-95 percent accuracy on standard English speech. For specialized domains (medical, legal), fine-tuned models often outperform general ones but require more setup.

How much does AI transcription cost, and is it faster than manual transcription?

API pricing ranges from $0.01 to $0.25 per minute depending on the provider and model used. AI transcription is 50-100x faster than human typing, completing a 60-minute recording in 30-120 seconds depending on whether you use cloud or local processing, versus 4-6 hours of manual work.

Related tasks

Voice