Top picks for Real-Time Chat (2026)
Models tuned for sub-second response. Ranked from 337 live models on the OpenRouter catalog, weighted for low latency, low cost.
| # | Model | Score | In / 1M | Out / 1M | Context | |
|---|---|---|---|---|---|---|
| 1 | NVIDIA: Nemotron 3 Nano Omni (free)nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 118 | Free | Free | 256,000 | Details → |
| 2 | Xiaomi: MiMo-V2.5xiaomi/mimo-v2.5 | 118 | $0.14 | $0.28 | 1,048,576 | Details → |
| 3 | Google: Gemma 4 26B A4B (free)google/gemma-4-26b-a4b-it:free | 118 | Free | Free | 262,144 | Details → |
| 4 | Google: Gemma 4 26B A4B google/gemma-4-26b-a4b-it | 118 | $0.06 | $0.33 | 262,144 | Details → |
| 5 | Google: Gemma 4 31B (free)google/gemma-4-31b-it:free | 118 | Free | Free | 262,144 | Details → |
| 6 | Google: Gemma 4 31Bgoogle/gemma-4-31b-it | 118 | $0.12 | $0.36 | 262,144 | Details → |
| 7 | Qwen: Qwen3.5-9Bqwen/qwen3.5-9b | 118 | $0.10 | $0.15 | 262,144 | Details → |
| 8 | ByteDance Seed: Seed-2.0-Minibytedance-seed/seed-2.0-mini | 118 | $0.10 | $0.40 | 262,144 | Details → |
| 9 | Qwen: Qwen3.5-Flashqwen/qwen3.5-flash-02-23 | 118 | $0.07 | $0.26 | 1,000,000 | Details → |
| 10 | ByteDance Seed: Seed 1.6 Flashbytedance-seed/seed-1.6-flash | 118 | $0.07 | $0.30 | 262,144 | Details → |
| 11 | Google: Gemini 2.5 Flash Lite Preview 09-2025google/gemini-2.5-flash-lite-preview-09-2025 | 118 | $0.10 | $0.40 | 1,048,576 | Details → |
| 12 | OpenAI: GPT-5 Nanoopenai/gpt-5-nano | 118 | $0.05 | $0.40 | 400,000 | Details → |
| 13 | Google: Gemini 2.5 Flash Litegoogle/gemini-2.5-flash-lite | 118 | $0.10 | $0.40 | 1,048,576 | Details → |
| 14 | OpenAI: GPT-4.1 Nanoopenai/gpt-4.1-nano | 118 | $0.10 | $0.40 | 1,047,576 | Details → |
| 15 | StepFun: Step 3.7 Flashstepfun/step-3.7-flash | 117 | $0.20 | $1.15 | 256,000 | Details → |
How we ranked these
For Real-Time Chat, we weight models on low latency, low cost. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores, Artificial Analysis intelligence/coding/agentic indices) and live pricing. See full methodology →
About Real-Time Chat
Real-Time Chat is the task of generating conversational responses in under one second, typically 200-800ms per turn. You need this when users expect immediate feedback during dialogue, such as customer support bots, in-app assistants, or voice interfaces where latency breaks the illusion of conversation. A good model for this task combines low parameter count with efficient inference: smaller fine-tuned models like Llama 2 7B or Mistral 7B outperform larger ones here. Bad models are either too large (requiring batching that adds delay) or poorly quantized (losing coherence to gain speed). The practical tradeoff: sub-second response often means accepting slightly lower reasoning depth or restricting context window to 2K-4K tokens. Inference cost scales directly with model size and context length, so a 70B parameter model will rarely hit sub-second latency on commodity hardware. # WHEN_TO_USE Use this when you're building a chatbot, voice assistant, or live support tool where users notice delays longer than a second and will perceive the system as slow or unresponsive. # FAQ_Q1 Which AI models can actually respond in under one second for chat? # FAQ_A1 Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600ms end-to-end, though that includes network latency. # FAQ_Q2 How much does real-time chat latency cost compared to batch processing? # FAQ_A2 Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
When to use: Use this when you're building a chatbot, voice assistant, or live support tool where users notice delays longer than a second and will perceive the system as slow or unresponsive. # FAQ_Q1 Which AI models can actually respond in under one second for chat? # FAQ_A1 Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600ms end-to-end, though that includes network latency. # FAQ_Q2 How much does real-time chat latency cost compared to batch processing? # FAQ_A2 Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
Common questions
Which AI models can actually respond in under one second for chat? # FAQ_A1 Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600ms end-to-end, though that includes network latency. # FAQ_Q2 How much does real-time chat latency cost compared to batch processing? # FAQ_A2 Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
Llama 2 7B, Mistral 7B, and Phi-2 are production-proven choices, especially when quantized to 4-bit and deployed on GPU hardware with batch size 1. GPT-3.5-turbo via OpenAI's API typically hits 300-600ms end-to-end, though that includes network latency. # FAQ_Q2 How much does real-time chat latency cost compared to batch processing? # FAQ_A2 Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
How much does real-time chat latency cost compared to batch processing? # FAQ_A2 Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.
Real-time chat demands low-batch, high-concurrency infrastructure: you're paying for GPU reservation per user session rather than amortizing inference across batches. Expect 2-5x higher cost per token than batch APIs, though smaller models (7B-13B) keep total costs reasonable for high-volume applications.