
Top picks for Self-Hosted / Local (2026)

Open-weights models you can run yourself. Ranked from 340 live models on the OpenRouter catalog, weighted for low cost.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first meet the specs required for Self-Hosted / Local; benchmark performance then refines the order.
| # | Model | ID | Score | In / 1M | Out / 1M | Context |
|---|-------|----|-------|---------|----------|---------|
| 1 | NVIDIA: Nemotron 3 Nano Omni (free) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free | 118 | Free | Free | 256,000 |
| 2 | Google: Gemma 4 26B A4B (free) | google/gemma-4-26b-a4b-it:free | 118 | Free | Free | 262,144 |
| 3 | Google: Gemma 4 31B (free) | google/gemma-4-31b-it:free | 118 | Free | Free | 262,144 |
| 4 | Xiaomi: MiMo-V2.5 | xiaomi/mimo-v2.5 | 117 | $0.14 | $0.28 | 1,048,576 |
| 5 | Google: Gemma 4 26B A4B | google/gemma-4-26b-a4b-it | 117 | $0.06 | $0.33 | 262,144 |
| 6 | Google: Gemma 4 31B | google/gemma-4-31b-it | 117 | $0.12 | $0.36 | 262,144 |
| 7 | Qwen: Qwen3.5-9B | qwen/qwen3.5-9b | 117 | $0.04 | $0.15 | 262,144 |
| 8 | Qwen: Qwen3.5-Flash | qwen/qwen3.5-flash-02-23 | 117 | $0.07 | $0.26 | 1,000,000 |
| 9 | ByteDance Seed: Seed 1.6 Flash | bytedance-seed/seed-1.6-flash | 117 | $0.07 | $0.30 | 262,144 |
| 10 | OpenAI: GPT-5 Nano | openai/gpt-5-nano | 117 | $0.05 | $0.40 | 400,000 |
| 11 | Mistral: Mistral Small 4 | mistralai/mistral-small-2603 | 117 | $0.15 | $0.60 | 262,144 |
| 12 | ByteDance Seed: Seed-2.0-Mini | bytedance-seed/seed-2.0-mini | 117 | $0.10 | $0.40 | 262,144 |
| 13 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 117 | $0.10 | $0.40 | 1,048,576 |
| 14 | Google: Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 117 | $0.10 | $0.40 | 1,048,576 |
| 15 | OpenAI: GPT-4.1 Nano | openai/gpt-4.1-nano | 117 | $0.10 | $0.40 | 1,047,576 |

How we ranked these

For Self-Hosted / Local, we weight models toward low cost. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Self-Hosted / Local

Self-hosted / local deployment means running open-weights AI models on your own hardware without relying on cloud APIs. You need this when you require privacy, want to avoid per-token costs at scale, need offline capability, or operate in restricted network environments. A good model for local deployment balances inference speed and output quality within your hardware constraints, typically measured in tokens per second and VRAM requirements. Quantization (reducing model precision to 8-bit or 4-bit) is the single most important cost lever: relative to 16-bit weights it cuts weight memory by roughly 50% (8-bit) to 75% (4-bit), usually with modest quality loss, and it often makes the difference between running a model and not running it at all.
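The memory effect of quantization can be estimated with simple arithmetic. The sketch below is a rough illustration, not a sizing tool: it counts weights only, ignores the KV cache, activations, and per-format overhead, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same precision and no
    format overhead (scales, zero points, metadata) is counted.
    """
    return params_billion * bits_per_weight / 8


def reduction_vs_fp16(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to 16-bit storage."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(7, bits)
        saved = reduction_vs_fp16(bits)
        print(f"7B model at {bits}-bit: ~{size:.1f} GB weights, {saved:.0%} saved vs 16-bit")
```

Real quantized files tend to be somewhat larger than this estimate because most formats store extra scaling data alongside the weights.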

When to use: Use this when you want to run an AI model on your own computer or server without sending data to external cloud services, whether to keep information private, save money on API fees, or work without an internet connection.

Common questions

What is the smallest model I can realistically run on a laptop?

Mistral 7B or Llama 2 7B quantized to 4-bit will run on most laptops with 8GB RAM using tools like Ollama or LM Studio, though CPU inference is noticeably slower than on a GPU. For faster inference, aim for a GPU with at least 6-8GB of dedicated VRAM, which runs 7B models comfortably and, at the 8GB end, 13B models quantized to 4-bit at practical speeds.
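A quick way to sanity-check whether a given model size is plausible for a memory budget is sketched below. The `overhead_gb` default is a placeholder assumption standing in for the KV cache, runtime buffers, and whatever else shares the memory; the right value depends on context length and the tool you use, so treat the result as a first guess.

```python
def fits_in_memory(
    params_billion: float,
    bits_per_weight: float,
    budget_gb: float,
    overhead_gb: float = 1.5,  # hypothetical allowance, not a measured figure
) -> bool:
    """Return True if the estimated footprint fits within budget_gb.

    Footprint = weights (params * bits / 8, in GB) + a flat overhead guess.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb <= budget_gb


if __name__ == "__main__":
    for params, budget in ((7, 8), (13, 8), (13, 6)):
        ok = fits_in_memory(params, 4, budget)
        print(f"{params}B at 4-bit in {budget} GB: {'fits' if ok else 'does not fit'}")
```

On a laptop without a dedicated GPU, the budget is the RAM left over after the operating system and other applications, which is usually well below the installed total.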

How much does it cost to run a model locally versus using an API?

Local deployment has near-zero marginal cost per inference after the initial hardware investment, while APIs typically cost $0.001-$0.10 per thousand tokens depending on model size. If you're running thousands of inferences monthly, self-hosting breaks even within weeks; if you're running millions monthly, it's orders of magnitude cheaper.
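Whether self-hosting pays off depends heavily on your own volume and the API price you would otherwise pay, so it is worth running the numbers. The sketch below is a simplified model: the hardware cost, token volume, and running cost are hypothetical inputs, and it uses a single blended price per million tokens rather than separate input and output rates.

```python
from typing import Optional


def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """API spend per month at a single blended price per 1M tokens."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_months(
    hardware_cost: float,
    tokens_per_month: float,
    price_per_million: float,
    monthly_running_cost: float = 0.0,  # e.g. electricity; assumed flat
) -> Optional[float]:
    """Months until hardware is paid off by avoided API spend.

    Returns None when local running costs meet or exceed the API bill,
    in which case self-hosting never breaks even on cost alone.
    """
    saving = monthly_api_cost(tokens_per_month, price_per_million) - monthly_running_cost
    if saving <= 0:
        return None
    return hardware_cost / saving


if __name__ == "__main__":
    # Hypothetical scenario: $1,500 of hardware, 500M tokens per month,
    # compared against an API priced at $0.40 per 1M tokens.
    print(break_even_months(1500, 500_000_000, 0.40))
```

Plugging in a per-token price from the table above and your real monthly volume gives a more grounded estimate than a general rule of thumb.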
