
Top picks for Self-Hosted / Local (2026)

Open-weights models you can run yourself. Ranked from 333 live models on the OpenRouter catalog, weighted for low cost.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first meet the spec requirements for Self-Hosted / Local; benchmark performance then refines the order.

#	Model	Model ID	Score	In / 1M tokens	Out / 1M tokens	Context (tokens)
1	Nex AGI: Nex-N2-Mini	nex-agi/nex-n2-mini	118	$0.03	$0.10	262,144
2	NVIDIA: Nemotron 3 Nano Omni (free)	nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free	118	Free	Free	256,000
3	Google: Gemma 4 26B A4B (free)	google/gemma-4-26b-a4b-it:free	118	Free	Free	262,144
4	Google: Gemma 4 31B (free)	google/gemma-4-31b-it:free	118	Free	Free	262,144
5	Xiaomi: MiMo-V2.5	xiaomi/mimo-v2.5	117	$0.14	$0.28	1,048,576
6	Google: Gemma 4 26B A4B	google/gemma-4-26b-a4b-it	117	$0.07	$0.34	262,144
7	Google: Gemma 4 31B	google/gemma-4-31b-it	117	$0.12	$0.37	262,144
8	Qwen: Qwen3.5-9B	qwen/qwen3.5-9b	117	$0.10	$0.15	262,144
9	Qwen: Qwen3.5-Flash	qwen/qwen3.5-flash-02-23	117	$0.07	$0.26	1,000,000
10	ByteDance Seed: Seed 1.6 Flash	bytedance-seed/seed-1.6-flash	117	$0.07	$0.30	262,144
11	OpenAI: GPT-5 Nano	openai/gpt-5-nano	117	$0.05	$0.40	400,000
12	Mistral: Mistral Small 4	mistralai/mistral-small-2603	117	$0.15	$0.60	262,144
13	ByteDance Seed: Seed-2.0-Mini	bytedance-seed/seed-2.0-mini	117	$0.10	$0.40	262,144
14	Google: Gemini 2.5 Flash Lite	google/gemini-2.5-flash-lite	117	$0.10	$0.40	1,048,576
15	OpenAI: GPT-4.1 Nano	openai/gpt-4.1-nano	117	$0.10	$0.40	1,047,576

How we ranked these

For Self-Hosted / Local, we weight models toward low cost. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Self-Hosted / Local

Self-hosted / local deployment means running open-weights AI models on your own hardware without relying on cloud APIs. You need this when you require privacy, want to avoid per-token costs at scale, need offline capability, or operate in restricted network environments. A good model for local deployment balances inference speed and output quality within your hardware constraints, typically measured in tokens per second and VRAM requirements. Quantization (reducing model precision to 8-bit or 4-bit) is the single most important cost lever: relative to 16-bit weights, it cuts weight memory by roughly 50% at 8-bit and roughly 75% at 4-bit, usually with minimal quality loss, and it often makes the difference between running a model and not running it at all.
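The memory arithmetic behind quantization can be sketched in a few lines. This is a back-of-the-envelope estimate, not a sizing tool: it assumes a 16-bit baseline and counts model weights only, ignoring the KV cache, activations, and the small per-block metadata that real quantization formats add. The function names are illustrative.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


def savings_vs_16bit(bits_per_weight: int) -> float:
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


# Compare a 7B-parameter model at three precisions.
for bits in (16, 8, 4):
    size = weight_memory_gb(7, bits)
    saved = savings_vs_16bit(bits)
    print(f"7B at {bits}-bit: ~{size:.1f} GB of weights ({saved:.0%} saved)")
```

In practice, leave headroom on top of the weight figure: context length and runtime overhead add to it, so a model whose weights barely fit will usually not run comfortably.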

When to use: Choose this when you want to run an AI model on your own computer or server without sending data to external cloud services, whether to keep information private, save money on API fees, or work without an internet connection.

Common questions

What is the smallest model I can realistically run on a laptop?

Mistral 7B or Llama 2 7B quantized to 4-bit will run on most laptops with 8GB RAM using tools like Ollama or LM Studio, though CPU-only inference is noticeably slower than running on a GPU. For faster inference, aim for a GPU with at least 6-8GB of dedicated VRAM, which lets you run 4-bit-quantized 13B models at practical speeds (8GB is the more comfortable end of that range).
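Once a model is pulled, a local tool such as Ollama can be queried from a script. The sketch below assumes Ollama is running on its default local port (11434) and that a model tagged `mistral` has already been downloaded; both are assumptions about your setup, and the helper names are illustrative. It uses only the Python standard library.

```python
import json
import urllib.request

# Assumption: Ollama's local server is running at its default address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example call (requires a running server and a pulled model):
# print(generate("mistral", "Say hello in one sentence."))
```

Nothing leaves your machine in this flow: the request goes to localhost, which is the point of running locally.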

How much does it cost to run a model locally versus using an API?

Local deployment has near-zero marginal cost per inference after the initial hardware investment (electricity aside), while APIs charge per token, from well under $1 per million tokens for the small models ranked above to $0.001-$0.10 per thousand tokens for larger ones. Break-even therefore depends on volume: at a few thousand requests a month against low-cost APIs, hardware can take a long time to pay for itself; at millions of requests a month, self-hosting can be far cheaper.
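To see where break-even falls for your own workload, a rough calculation helps. The sketch below uses the Gemma 4 31B prices from the table ($0.12 in, $0.37 out per million tokens); the token volumes, hardware price, and power cost are hypothetical placeholders to replace with your own numbers.

```python
from typing import Optional


def monthly_api_cost(tokens_in: float, tokens_out: float,
                     price_in_per_m: float, price_out_per_m: float) -> float:
    """API spend per month in dollars, given per-million-token prices."""
    return tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m


def breakeven_months(hardware_cost: float, monthly_api: float,
                     monthly_power: float = 0.0) -> Optional[float]:
    """Months until hardware pays for itself, or None if it never does."""
    monthly_saving = monthly_api - monthly_power
    if monthly_saving <= 0:
        return None  # running locally costs more per month than the API
    return hardware_cost / monthly_saving


# Hypothetical workload: 1B input and 200M output tokens per month.
api = monthly_api_cost(1_000_000_000, 200_000_000, 0.12, 0.37)
# Hypothetical hardware ($1,500) and electricity ($20/month).
months = breakeven_months(1500, api, monthly_power=20)
print(f"API spend: ${api:.2f}/month, break-even after {months:.1f} months")
```

The same function returns None for light workloads, where the monthly API bill is smaller than the electricity needed to run the hardware.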
