
Top picks for Real-Time Chat (2026)

Models tuned for sub-second response. Ranked from 337 live models in the OpenRouter catalog, weighted for low latency and low cost.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models need the right specs for Real-Time Chat, then benchmark performance refines the order.

Prices are in USD per 1M tokens; context is in tokens.

# | Model | Model ID | Score | In / 1M | Out / 1M | Context
1 | NVIDIA: Nemotron 3 Nano Omni (free) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 118 | Free | Free | 256,000
2 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 118 | $0.14 | $0.28 | 1,048,576
3 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 118 | Free | Free | 262,144
4 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 118 | $0.06 | $0.33 | 262,144
5 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 118 | Free | Free | 262,144
6 | Google: Gemma 4 31B | google/gemma-4-31b-it | 118 | $0.12 | $0.36 | 262,144
7 | Qwen: Qwen3.5-9B | qwen/qwen3.5-9b | 118 | $0.10 | $0.15 | 262,144
8 | ByteDance Seed: Seed-2.0-Mini | bytedance-seed/seed-2.0-mini | 118 | $0.10 | $0.40 | 262,144
9 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 118 | $0.07 | $0.26 | 1,000,000
10 | ByteDance Seed: Seed 1.6 Flash | bytedance-seed/seed-1.6-flash | 118 | $0.07 | $0.30 | 262,144
11 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 118 | $0.10 | $0.40 | 1,048,576
12 | OpenAI: GPT-5 Nano | openai/gpt-5-nano | 118 | $0.05 | $0.40 | 400,000
13 | Google: Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 118 | $0.10 | $0.40 | 1,048,576
14 | OpenAI: GPT-4.1 Nano | openai/gpt-4.1-nano | 118 | $0.10 | $0.40 | 1,047,576
15 | StepFun: Step 3.7 Flash | stepfun/step-3.7-flash | 117 | $0.20 | $1.15 | 256,000
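
Because the In / 1M and Out / 1M columns are quoted per million tokens, the cost of a single chat turn is simple arithmetic. The sketch below is illustrative only: the prices are copied from the table above, while the turn size (500 input tokens, 150 output tokens) and the function name `turn_cost_usd` are assumptions made for the example.

```python
def turn_cost_usd(input_tokens: int, output_tokens: int,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in USD of one chat turn, given prices quoted per 1M tokens."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# (input price, output price) in USD per 1M tokens, taken from the table.
PRICES = {
    "xiaomi/mimo-v2.5": (0.14, 0.28),
    "openai/gpt-5-nano": (0.05, 0.40),
    "qwen/qwen3.5-9b": (0.10, 0.15),
}

# Assumed turn size for illustration: 500 tokens in, 150 tokens out.
for model_id, (in_price, out_price) in PRICES.items():
    cost = turn_cost_usd(500, 150, in_price, out_price)
    print(f"{model_id}: ${cost:.6f} per turn, "
          f"${cost * 1_000_000:.2f} per million turns")
```

Short chat turns are often input-heavy once conversation history is resent each turn, so the input price can matter more than the output price suggests.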

How we ranked these

For Real-Time Chat, we weight models on low latency and low cost. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Real-Time Chat

Real-Time Chat is the task of generating conversational responses in under one second, typically 200-800 ms per turn. You need this when users expect immediate feedback during dialogue, such as customer support bots, in-app assistants, or voice interfaces where latency breaks the illusion of conversation. A good model for this task combines a low parameter count with efficient inference: smaller fine-tuned models like Llama 2 7B or Mistral 7B generally respond faster than larger ones. Poor fits are either too large (slow to generate, or dependent on batching that adds delay) or poorly quantized (losing coherence to gain speed). The practical tradeoff: sub-second response often means accepting slightly lower reasoning depth or restricting the context window to 2K-4K tokens. Inference cost and latency grow with model size and context length, so a 70B-parameter model will rarely hit sub-second latency on commodity hardware.
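
One way to reason about the sub-second target is as a budget: network round trip, plus time to first token, plus generation time for the reply. The sketch below is a rough model, not a benchmark. The round-trip time, time to first token, and throughput figures are assumptions chosen for illustration, and the function names are hypothetical.

```python
def turn_latency_ms(network_rtt_ms: float, ttft_ms: float,
                    output_tokens: int, tokens_per_second: float) -> float:
    """Estimated end-to-end latency for a reply delivered all at once."""
    generation_ms = output_tokens / tokens_per_second * 1000
    return network_rtt_ms + ttft_ms + generation_ms


def max_reply_tokens(budget_ms: float, network_rtt_ms: float,
                     ttft_ms: float, tokens_per_second: float) -> int:
    """Longest reply (in tokens) that still fits inside the latency budget."""
    remaining_ms = budget_ms - network_rtt_ms - ttft_ms
    return max(0, int(remaining_ms / 1000 * tokens_per_second))


# Assumed figures: 50 ms round trip, 250 ms to first token, 150 tokens/s.
print(turn_latency_ms(50, 250, output_tokens=60, tokens_per_second=150))
print(max_reply_tokens(800, 50, 250, tokens_per_second=150))
```

When the reply is streamed, the delay the user perceives is closer to the round trip plus time to first token, which is why streaming is a common choice for chat interfaces.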

When to use: Use this when you're building a chatbot, voice assistant, or live support tool where users notice delays longer than a second and perceive the system as slow or unresponsive.

Common questions

Which AI models can actually respond in under one second for chat?

Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600 ms end-to-end, though that includes network latency.

How much does real-time chat latency cost compared to batch processing?

Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
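
To put the 2-5x estimate in concrete terms, the sketch below applies that multiplier to a batch bill. The baseline price of $0.40 per 1M tokens and the monthly volume of 500M tokens are hypothetical figures for illustration, and `realtime_cost_range` is a made-up helper name.

```python
def realtime_cost_range(batch_price_per_m: float, tokens_per_month: int,
                        low: float = 2.0, high: float = 5.0):
    """Return (batch, realtime_low, realtime_high) monthly cost in USD.

    low/high are the 2-5x multipliers quoted in the text above.
    """
    batch_cost = batch_price_per_m * tokens_per_month / 1_000_000
    return batch_cost, batch_cost * low, batch_cost * high


# Assumed inputs: $0.40 per 1M tokens in batch mode, 500M tokens per month.
batch, rt_low, rt_high = realtime_cost_range(0.40, 500_000_000)
print(f"batch: ${batch:.2f}/month")
print(f"real-time: ${rt_low:.2f} to ${rt_high:.2f}/month")
```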