Top picks for Cheap Bulk Inference (2026)
Lowest cost-per-million for high-volume jobs. Ranked from 340 live models on the OpenRouter catalog, weighted for low cost and low latency.
| # | Model | Model ID | Score | In / 1M tokens | Out / 1M tokens | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | NVIDIA: Nemotron 3 Nano Omni (free) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 138 | Free | Free | 256,000 |
| 2 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 138 | Free | Free | 262,144 |
| 3 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 138 | Free | Free | 262,144 |
| 4 | Qwen: Qwen3.5-9B | qwen/qwen3.5-9b | 137 | $0.04 | $0.15 | 262,144 |
| 5 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 137 | $0.14 | $0.28 | 1,048,576 |
| 6 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 137 | $0.06 | $0.33 | 262,144 |
| 7 | Google: Gemma 4 31B | google/gemma-4-31b-it | 137 | $0.12 | $0.36 | 262,144 |
| 8 | ByteDance Seed: Seed-2.0-Mini | bytedance-seed/seed-2.0-mini | 137 | $0.10 | $0.40 | 262,144 |
| 9 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 137 | $0.07 | $0.26 | 1,000,000 |
| 10 | ByteDance Seed: Seed 1.6 Flash | bytedance-seed/seed-1.6-flash | 137 | $0.07 | $0.30 | 262,144 |
| 11 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 137 | $0.10 | $0.40 | 1,048,576 |
| 12 | OpenAI: GPT-5 Nano | openai/gpt-5-nano | 137 | $0.05 | $0.40 | 400,000 |
| 13 | Google: Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 137 | $0.10 | $0.40 | 1,048,576 |
| 14 | OpenAI: GPT-4.1 Nano | openai/gpt-4.1-nano | 137 | $0.10 | $0.40 | 1,047,576 |
| 15 | DeepSeek: DeepSeek V4 Flash | deepseek/deepseek-v4-flash | 137 | $0.10 | $0.20 | 1,048,576 |
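List prices are easier to compare once they are turned into a per-job estimate, because the input-to-output token ratio of a workload can change which row is cheapest. The following is a minimal sketch, not the site's scoring method: the prices are copied from three rows of the table above, while the function name `job_cost` and the workload numbers are hypothetical.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "qwen/qwen3.5-9b": (0.04, 0.15),
    "openai/gpt-5-nano": (0.05, 0.40),
    "deepseek/deepseek-v4-flash": (0.10, 0.20),
}


def job_cost(model, docs, in_tokens_per_doc, out_tokens_per_doc):
    """Estimate the USD cost of a bulk job from list prices (hypothetical helper)."""
    in_price, out_price = PRICES[model]
    total_in = docs * in_tokens_per_doc
    total_out = docs * out_tokens_per_doc
    cost = (total_in * in_price + total_out * out_price) / 1_000_000
    return round(cost, 2)


if __name__ == "__main__":
    # Hypothetical workload: 100,000 documents, 2,000 input and 200 output tokens each.
    for model in PRICES:
        print(model, job_cost(model, 100_000, 2_000, 200))
```

For this assumed summarization-style workload the estimate comes to about $11 for Qwen3.5-9B, $18 for GPT-5 Nano, and $24 for DeepSeek V4 Flash. For an output-heavy job the order of the last two flips, since GPT-5 Nano's output price is twice DeepSeek V4 Flash's.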
How we ranked these
For Cheap Bulk Inference, we weight models on low cost and low latency. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.
About Cheap Bulk Inference
Cheap Bulk Inference is the task of running high-volume inference jobs where cost per million tokens is the primary metric. You need this when you're processing thousands of documents, running batch analysis, or generating content at scale, where latency doesn't matter but unit economics do. Good models for this task have aggressive per-token pricing, efficient context windows that avoid redundant processing, and stable availability for sustained workloads. Claude 3.5 Haiku and GPT-4o Mini excel here because they maintain quality at lower price points. The key tradeoff is that you'll sacrifice some reasoning capability and output quality compared to flagship models, so validate that your task doesn't require frontier performance. Batching requests into async jobs rather than real-time calls can further reduce costs by 20-50 percent, depending on the provider's batch pricing.
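One way to read "avoid redundant processing" is to pack several short documents into each request so that shared instructions are sent fewer times. The sketch below illustrates that idea under stated assumptions: `pack_documents` is a hypothetical helper, and the default token counter is a crude whitespace split that should be replaced by the target model's real tokenizer.

```python
def pack_documents(docs, max_tokens_per_request, count_tokens=lambda text: len(text.split())):
    """Greedily group documents so each group fits the per-request token budget.

    count_tokens defaults to a whitespace word count, which is only a rough
    stand-in for a real tokenizer (assumption for illustration).
    """
    batches, current, used = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if n > max_tokens_per_request:
            raise ValueError("document exceeds the per-request token budget")
        # Start a new request when adding this document would overflow the budget.
        if current and used + n > max_tokens_per_request:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = ["a b c", "d e", "f g h i", "j"]
    print(pack_documents(sample, max_tokens_per_request=5))
```

Leave headroom in the budget for the instructions and the expected output, and check that quality holds up when several documents share one request.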
When to use: Use this when you need to process large amounts of text or data through an AI model and your budget is tight, but you're willing to wait hours or days for results instead of getting answers immediately.
Common questions
Which AI model has the cheapest cost per million tokens for bulk processing?
Claude 3.5 Haiku offers production-quality output at around $0.80 per million input tokens, although several models in the ranking above list lower prices, including free tiers. GPT-4o Mini is a competitive alternative if you need broader API ecosystem support. For pure cost, smaller open-source models deployed on your own infrastructure (like Llama 2 7B) can be cheaper at scale, but require DevOps overhead.
How much can I save by batching inference requests instead of calling the API in real-time?
Batch processing APIs typically offer 50 percent discounts compared to real-time pricing, so a task costing $10 in real-time calls might cost $5 with batched requests. Processing time extends to 24 hours, but for non-urgent work like content classification, document summarization, or historical data analysis, batch mode is almost always the right choice economically.
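The arithmetic in this answer can be sketched in a few lines. This is an illustration only: the 50 percent discount and 24-hour window are the typical figures quoted above, not any specific provider's terms, and the helper names `batched_cost` and `should_batch` are hypothetical.

```python
def batched_cost(realtime_cost_usd, discount=0.50):
    """Cost of a job after a batch discount (0.50 is the typical figure quoted above)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return realtime_cost_usd * (1.0 - discount)


def should_batch(deadline_hours, batch_window_hours=24.0):
    """Batch only when the deadline can absorb the assumed turnaround window."""
    return deadline_hours >= batch_window_hours


if __name__ == "__main__":
    print(batched_cost(10.0))   # the $10 job from the answer above
    print(should_batch(72))     # results needed within three days
    print(should_batch(2))      # results needed within two hours
```

With the defaults, the $10 job costs $5 in batch mode, and the three-day deadline qualifies for batching while the two-hour deadline does not.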