Top picks for Cheap Bulk Inference (2026)

Lowest cost per million tokens for high-volume jobs. Ranked from 340 live models on the OpenRouter catalog, weighted for low cost and low latency.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first have the right specs for Cheap Bulk Inference; benchmark performance then refines the order. Full methodology →
| # | Model | Model ID | Score | In / 1M | Out / 1M | Context |
|---|-------|----------|-------|---------|----------|---------|
| 1 | NVIDIA: Nemotron 3 Nano Omni (free) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 138 | Free | Free | 256,000 |
| 2 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 138 | Free | Free | 262,144 |
| 3 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 138 | Free | Free | 262,144 |
| 4 | Qwen: Qwen3.5-9B | qwen/qwen3.5-9b | 137 | $0.04 | $0.15 | 262,144 |
| 5 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 137 | $0.14 | $0.28 | 1,048,576 |
| 6 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 137 | $0.06 | $0.33 | 262,144 |
| 7 | Google: Gemma 4 31B | google/gemma-4-31b-it | 137 | $0.12 | $0.36 | 262,144 |
| 8 | ByteDance Seed: Seed-2.0-Mini | bytedance-seed/seed-2.0-mini | 137 | $0.10 | $0.40 | 262,144 |
| 9 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 137 | $0.07 | $0.26 | 1,000,000 |
| 10 | ByteDance Seed: Seed 1.6 Flash | bytedance-seed/seed-1.6-flash | 137 | $0.07 | $0.30 | 262,144 |
| 11 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 137 | $0.10 | $0.40 | 1,048,576 |
| 12 | OpenAI: GPT-5 Nano | openai/gpt-5-nano | 137 | $0.05 | $0.40 | 400,000 |
| 13 | Google: Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 137 | $0.10 | $0.40 | 1,048,576 |
| 14 | OpenAI: GPT-4.1 Nano | openai/gpt-4.1-nano | 137 | $0.10 | $0.40 | 1,047,576 |
| 15 | DeepSeek: DeepSeek V4 Flash | deepseek/deepseek-v4-flash | 137 | $0.10 | $0.20 | 1,048,576 |
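
To make the per-million prices above concrete, here is a minimal sketch of how a job's cost could be estimated from input and output token counts. The three prices are copied from the table and will drift over time; the workload (10,000 documents at 2,000 input and 200 output tokens each) and the names `Pricing` and `job_cost` are assumptions made up for illustration, not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens, as listed in the ranking table."""
    input_per_m: float
    output_per_m: float


# Prices copied from the table above; treat them as a snapshot, not live data.
PRICES = {
    "qwen/qwen3.5-9b": Pricing(0.04, 0.15),
    "openai/gpt-5-nano": Pricing(0.05, 0.40),
    "deepseek/deepseek-v4-flash": Pricing(0.10, 0.20),
}


def job_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    total = input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    return total / 1_000_000


# Hypothetical workload: 10,000 documents, 2,000 input and 200 output tokens each.
DOCS = 10_000
costs = {
    model: job_cost(p, DOCS * 2_000, DOCS * 200)
    for model, p in PRICES.items()
}

# Print cheapest first.
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model}: ${cost:.2f}")
```

Under these assumed token counts the job comes to roughly $1.10, $1.80, and $2.40 respectively. Because output tokens cost more than input tokens on every paid row, the ordering can change for workloads that generate long outputs, so it is worth rerunning the estimate with your own input/output ratio.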

How we ranked these

For Cheap Bulk Inference, we weight models on low cost and low latency. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing. See full methodology →

About Cheap Bulk Inference

Cheap Bulk Inference is the task of running high-volume inference jobs where cost per million tokens is the primary metric. You need this when you're processing thousands of documents, running batch analysis, or generating content at scale, where latency matters less than unit economics. Good models for this task have aggressive per-token pricing, context windows large enough to handle each document in one pass rather than in repeated chunks, and stable availability for sustained workloads. Small, low-priced tiers such as Claude 3.5 Haiku and GPT-4o Mini are typical of this category because they maintain usable quality at lower price points. The key tradeoff is that you'll sacrifice some reasoning capability and output quality compared to flagship models, so validate that your task doesn't require frontier performance. Submitting requests as asynchronous batch jobs rather than real-time calls can further reduce costs by 20-50 percent where the provider offers a batch discount.

When to use: Use this when you need to process large amounts of text or data through an AI model and your budget is tight, but you're willing to wait hours or days for results instead of getting answers immediately.

Common questions

Which AI model has the cheapest cost per million tokens for bulk processing?

Among the models ranked in the table above, the three free-tier entries (Nemotron 3 Nano Omni, Gemma 4 26B A4B, and Gemma 4 31B) cost nothing per token, and the cheapest paid entry by input price is Qwen3.5-9B at $0.04 per million input tokens and $0.15 per million output tokens. By comparison, Claude 3.5 Haiku costs around $0.80 per million input tokens, well above every entry in the table; GPT-4o Mini is cheaper than that and may suit you if you need broader API ecosystem support. For pure cost, smaller open-source models deployed on your own infrastructure (like Llama 2 7B) can be cheaper at scale, but require DevOps overhead.

How much can I save by batching inference requests instead of calling the API in real time?

Batch processing APIs typically offer a 50 percent discount compared to real-time pricing, so a task costing $10 in real-time calls might cost $5 with batched requests. Results can take up to 24 hours to arrive, but for non-urgent work like content classification, document summarization, or historical data analysis, batch mode is almost always the right choice economically.
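
The arithmetic in this answer can be sketched in a few lines. This is only an illustration: `batch_cost` is a hypothetical helper, and the 50 percent default reflects the typical figure quoted above rather than any specific provider's guaranteed rate.

```python
def batch_cost(realtime_cost: float, discount: float = 0.50) -> float:
    """Return the cost after a batch discount given as a fraction (0.50 = 50 percent)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return realtime_cost * (1.0 - discount)


# The example from the text: a $10 real-time job at the typical 50 percent discount.
realtime = 10.00
batched = batch_cost(realtime)
savings = realtime - batched

print(f"real-time: ${realtime:.2f}")
print(f"batched:   ${batched:.2f}")
print(f"savings:   ${savings:.2f}")
```

With the default discount this reproduces the $10 to $5 example. Passing a different `discount` lets you check the result against whatever rate your provider actually publishes.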
