Top picks for Math Proofs (2026)
Formal proof construction and verification. Ranked from 352 live models on the OpenRouter catalog, weighted for reasoning quality and context window.
What this is
A capability-matched shortlist, not a benchmark-tested winner. Models are scored by how well their declared specs (structured output, reasoning, context, modality, price) fit the Math Proofs task. Pair this list with benchmark sources like Artificial Analysis or LMSys Arena before you ship.
| # | Model | Model ID | Score | Input ($ / 1M tokens) | Output ($ / 1M tokens) | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | Xiaomi: MiMo-V2.5 | `xiaomi/mimo-v2.5` | 128 | $0.40 | $2.00 | 1,048,576 |
| 2 | Qwen: Qwen3.6 Plus | `qwen/qwen3.6-plus` | 128 | $0.33 | $1.95 | 1,000,000 |
| 3 | xAI: Grok 4.20 | `x-ai/grok-4.20` | 128 | $2.00 | $6.00 | 2,000,000 |
| 4 | OpenAI: GPT-5.4 Nano | `openai/gpt-5.4-nano` | 128 | $0.20 | $1.25 | 400,000 |
| 5 | OpenAI: GPT-5.4 Mini | `openai/gpt-5.4-mini` | 128 | $0.75 | $4.50 | 400,000 |
| 6 | OpenAI: GPT-5.4 | `openai/gpt-5.4` | 128 | $2.50 | $15.00 | 1,050,000 |
| 7 | Google: Gemini 3.1 Flash Lite Preview | `google/gemini-3.1-flash-lite-preview` | 128 | $0.25 | $1.50 | 1,048,576 |
| 8 | Qwen: Qwen3.5-Flash | `qwen/qwen3.5-flash-02-23` | 128 | $0.07 | $0.26 | 1,000,000 |
| 9 | Google: Gemini 3.1 Pro Preview Custom Tools | `google/gemini-3.1-pro-preview-customtools` | 128 | $2.00 | $12.00 | 1,048,576 |
| 10 | OpenAI: GPT-5.3-Codex | `openai/gpt-5.3-codex` | 128 | $1.75 | $14.00 | 400,000 |
| 11 | Google: Gemini 3.1 Pro Preview | `google/gemini-3.1-pro-preview` | 128 | $2.00 | $12.00 | 1,048,576 |
| 12 | Qwen: Qwen3.5 Plus 2026-02-15 | `qwen/qwen3.5-plus-02-15` | 128 | $0.26 | $1.56 | 1,000,000 |
| 13 | Google: Gemini 3 Flash Preview | `google/gemini-3-flash-preview` | 128 | $0.50 | $3.00 | 1,048,576 |
| 14 | OpenAI: GPT-5.2 | `openai/gpt-5.2` | 128 | $1.75 | $14.00 | 400,000 |
| 15 | Amazon: Nova 2 Lite | `amazon/nova-2-lite-v1` | 128 | $0.30 | $2.50 | 1,000,000 |
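
Since every model in the table shares the same score, price per request is one practical way to compare them. The sketch below converts the per-1M-token prices into the cost of a single request. The two price pairs are copied from the table; the workload size (20,000 input tokens, 8,000 output tokens) and the helper name `request_cost` are assumptions made for illustration only.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_million: float, out_per_million: float) -> float:
    """Return the USD cost of one request given per-1M-token prices."""
    return (input_tokens * in_per_million
            + output_tokens * out_per_million) / 1_000_000


# (input, output) USD prices per 1M tokens, copied from the table.
PRICES = {
    "qwen/qwen3.5-flash-02-23": (0.07, 0.26),
    "openai/gpt-5.4": (2.50, 15.00),
}

# Hypothetical workload: one long proof attempt (assumed sizes).
INPUT_TOKENS = 20_000
OUTPUT_TOKENS = 8_000

costs = {
    model_id: request_cost(INPUT_TOKENS, OUTPUT_TOKENS, p_in, p_out)
    for model_id, (p_in, p_out) in PRICES.items()
}

for model_id, cost in costs.items():
    print(f"{model_id}: ${cost:.4f} per request")
```

For this assumed workload the two models come out at roughly $0.0035 and $0.17 per request, so output-heavy proof generation is dominated by the output price.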
How we ranked these
For Math Proofs, we weight models on reasoning quality and context window. A higher score means a better fit. Scores combine each model's public metadata (context length, modality support, tool calling, structured output, reasoning capability) with live pricing.
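
A capability-fit score of this kind can be sketched as a weighted sum over declared specs. The following is a minimal illustration, not the actual scoring formula, which is not given here: the weights, the caps, the `ModelSpec` fields, and the two candidate models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    model_id: str
    context_length: int               # declared context window, in tokens
    supports_reasoning: bool
    supports_structured_output: bool
    output_price_per_million: float   # USD per 1M output tokens


# Hypothetical weights and caps: assumptions, not the site's real values.
WEIGHTS = {"reasoning": 3.0, "context": 2.0, "structured_output": 1.0, "price": 1.0}
CONTEXT_CAP = 1_000_000   # context beyond this earns no extra credit
PRICE_CAP = 20.0          # prices at or above this earn no price credit


def fit_score(m: ModelSpec) -> float:
    """Weighted sum of capability fit; higher means a better fit."""
    context_fit = min(m.context_length, CONTEXT_CAP) / CONTEXT_CAP
    price_fit = 1.0 - min(m.output_price_per_million, PRICE_CAP) / PRICE_CAP
    return (WEIGHTS["reasoning"] * m.supports_reasoning
            + WEIGHTS["context"] * context_fit
            + WEIGHTS["structured_output"] * m.supports_structured_output
            + WEIGHTS["price"] * price_fit)


def rank(models: list[ModelSpec]) -> list[ModelSpec]:
    """Sort models from best fit to worst."""
    return sorted(models, key=fit_score, reverse=True)


# Hypothetical candidates, not entries from the catalog.
candidates = [
    ModelSpec("example/model-b", 400_000, False, True, 1.0),
    ModelSpec("example/model-a", 1_000_000, True, True, 2.0),
]

ranked = rank(candidates)
for m in ranked:
    print(m.model_id, round(fit_score(m), 2))
```

Capping the context term keeps a very large window from outweighing reasoning support, which matches the stated emphasis on reasoning quality; a real implementation would need the published methodology to set these values.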
Related tasks (Research)
Best for Scientific Coding
Research-grade code in NumPy, JAX, and PyTorch.
Best for Literature Review
Synthesizing across many academic papers.
Best for Experiment Design
Designing rigorous A/B and lab experiments.
Best for Dataset Annotation
Annotating training data at scale.