
Top picks for Cheap Bulk Inference (2026)

Lowest cost-per-million for high-volume jobs. Ranked from 340 live models on the OpenRouter catalog, weighted for low cost, low latency.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first have the right specs for Cheap Bulk Inference; benchmark performance then refines the order.

#	Model	Model ID	Score	In / 1M	Out / 1M	Context (tokens)
1	NVIDIA: Nemotron 3 Nano Omni (free)	nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free	138	Free	Free	256,000
2	Google: Gemma 4 26B A4B (free)	google/gemma-4-26b-a4b-it:free	138	Free	Free	262,144
3	Google: Gemma 4 31B (free)	google/gemma-4-31b-it:free	138	Free	Free	262,144
4	Qwen: Qwen3.5-9B	qwen/qwen3.5-9b	137	$0.04	$0.15	262,144
5	Xiaomi: MiMo-V2.5	xiaomi/mimo-v2.5	137	$0.14	$0.28	1,048,576
6	Google: Gemma 4 26B A4B	google/gemma-4-26b-a4b-it	137	$0.06	$0.33	262,144
7	Google: Gemma 4 31B	google/gemma-4-31b-it	137	$0.12	$0.36	262,144
8	ByteDance Seed: Seed-2.0-Mini	bytedance-seed/seed-2.0-mini	137	$0.10	$0.40	262,144
9	Qwen: Qwen3.5-Flash	qwen/qwen3.5-flash-02-23	137	$0.07	$0.26	1,000,000
10	ByteDance Seed: Seed 1.6 Flash	bytedance-seed/seed-1.6-flash	137	$0.07	$0.30	262,144
11	Google: Gemini 2.5 Flash Lite Preview 09-2025	google/gemini-2.5-flash-lite-preview-09-2025	137	$0.10	$0.40	1,048,576
12	OpenAI: GPT-5 Nano	openai/gpt-5-nano	137	$0.05	$0.40	400,000
13	Google: Gemini 2.5 Flash Lite	google/gemini-2.5-flash-lite	137	$0.10	$0.40	1,048,576
14	OpenAI: GPT-4.1 Nano	openai/gpt-4.1-nano	137	$0.10	$0.40	1,047,576
15	DeepSeek: DeepSeek V4 Flash	deepseek/deepseek-v4-flash	137	$0.10	$0.20	1,048,576
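The per-million prices in the table can be turned into a rough job estimate. The sketch below is illustrative only: the `PRICES` mapping copies three rows from the table, while the function name `job_cost` and the workload (10,000 documents at 2,000 input and 200 output tokens each) are assumptions made up for the example.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), copied from three rows of the table.
PRICES = {
    "qwen/qwen3.5-9b": (Decimal("0.04"), Decimal("0.15")),
    "openai/gpt-5-nano": (Decimal("0.05"), Decimal("0.40")),
    "deepseek/deepseek-v4-flash": (Decimal("0.10"), Decimal("0.20")),
}

MILLION = Decimal(1_000_000)


def job_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of a job from its total input and output token counts."""
    price_in, price_out = PRICES[model]
    return (price_in * input_tokens + price_out * output_tokens) / MILLION


# Hypothetical workload: 10,000 documents, 2,000 input and 200 output tokens each.
docs = 10_000
for model in PRICES:
    cost = job_cost(model, docs * 2_000, docs * 200)
    print(f"{model}: ${cost:.2f}")
```

Because input and output are priced separately, the cheapest model depends on the ratio of input to output tokens in your workload, so it is worth running the estimate with your own token counts.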

How we ranked these

For Cheap Bulk Inference, we weight models on low cost and low latency. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores, Artificial Analysis intelligence/coding/agentic indices) and live pricing.

About Cheap Bulk Inference

Cheap Bulk Inference is the task of running high-volume inference jobs where cost-per-million-tokens is the primary metric. You need this when you're processing thousands of documents, running batch analysis, or generating content at scale where latency doesn't matter but unit economics do. Good models for this task have aggressive per-token pricing, efficient context windows that avoid redundant processing, and stable availability for sustained workloads. Claude 3.5 Haiku and GPT-4o Mini excel here because they maintain quality at lower price points. The key tradeoff is that you'll sacrifice some reasoning capability and output quality compared to flagship models, so validate that your task doesn't require frontier performance. Batching requests into async jobs rather than real-time calls can further reduce costs by 20-50 percent through volume discounts.

When to use: Use this when you need to process large amounts of text or data through an AI model and your budget is tight, but you're willing to wait hours or days for results instead of getting answers immediately.

Common questions

Which AI model has the cheapest cost per million tokens for bulk processing?

In the ranking above, the three free-tier models cost nothing per token, and the cheapest paid entry is Qwen: Qwen3.5-9B at $0.04 per million input tokens and $0.15 per million output tokens. Claude 3.5 Haiku, at around $0.80 per million input tokens, costs considerably more than any model in the table, so it is not the cheapest choice. GPT-4o Mini is an alternative if you need broader API ecosystem support. For pure cost, smaller open-source models deployed on your own infrastructure (like Llama 2 7B) can be cheaper at scale, but require DevOps overhead.

How much can I save by batching inference requests instead of calling the API in real-time?

Batch processing APIs typically offer a 50 percent discount compared to real-time pricing, so a task costing $10 in real-time calls might cost $5 with batched requests. Results can take up to 24 hours to come back, but for non-urgent work like content classification, document summarization, or historical data analysis, batch mode is almost always the right choice economically.
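The arithmetic behind the $10 to $5 example can be written as a small helper. This is a minimal sketch: the function name `batched_cost` is hypothetical, and the 50 percent default is the typical figure quoted above, not a guarantee for any particular provider.

```python
def batched_cost(realtime_cost_usd: float, discount: float = 0.50) -> float:
    """Apply a batch discount, given as a fraction between 0 and 1, to a real-time cost."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return realtime_cost_usd * (1 - discount)


# A job that costs $10 in real-time calls, at the typical 50 percent discount.
print(batched_cost(10.0))  # 5.0
```

Pass a different `discount` to model a provider whose batch pricing differs from the typical 50 percent.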

Related tasks

Cost

Best for Self-Hosted / Local

Open-weights models you can run yourself.