Top picks for Function / Tool Calling (2026)
Reliable JSON tool-call generation. Ranked from 337 live models on the OpenRouter catalog, weighted for tool calling and structured output.
| # | Model | Model ID | Score | Input / 1M tokens | Output / 1M tokens | Context (tokens) |
|---|---|---|---|---|---|---|
| 1 | Anthropic: Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | 163 | $3.00 | $15.00 | 1,000,000 |
| 2 | Anthropic: Claude Opus 4.7 | anthropic/claude-opus-4.7 | 160 | $5.00 | $25.00 | 1,000,000 |
| 3 | OpenAI: GPT-5 | openai/gpt-5 | 159 | $1.25 | $10.00 | 400,000 |
| 4 | OpenAI: o3 | openai/o3 | 153 | $2.00 | $8.00 | 200,000 |
| 5 | Anthropic: Claude Opus 4.8 | anthropic/claude-opus-4.8 | 152 | $5.00 | $25.00 | 1,000,000 |
| 6 | DeepSeek: DeepSeek V3 | deepseek/deepseek-chat | 149 | $0.20 | $0.80 | 131,072 |
| 7 | OpenAI: GPT-4.1 | openai/gpt-4.1 | 142 | $2.00 | $8.00 | 1,047,576 |
| 8 | Google: Gemini 2.5 Pro | google/gemini-2.5-pro | 130 | $1.25 | $10.00 | 1,048,576 |
| 9 | Meta: Llama 4 Maverick | meta-llama/llama-4-maverick | 128 | $0.15 | $0.60 | 1,048,576 |
| 10 | OpenAI: o4 Mini High | openai/o4-mini-high | 128 | $1.10 | $4.40 | 200,000 |
| 11 | OpenAI: o3 Mini High | openai/o3-mini-high | 127 | $1.10 | $4.40 | 200,000 |
| 12 | Google: Gemini 2.5 Flash | google/gemini-2.5-flash | 126 | $0.30 | $2.50 | 1,048,576 |
| 13 | OpenAI: o3 Mini | openai/o3-mini | 126 | $1.10 | $4.40 | 200,000 |
| 14 | OpenAI: GPT-4.1 Mini | openai/gpt-4.1-mini | 123 | $0.40 | $1.60 | 1,047,576 |
| 15 | OpenAI: GPT-4o | openai/gpt-4o | 123 | $2.50 | $10.00 | 128,000 |
How we ranked these
For Function / Tool Calling, we weight models on tool calling and structured output. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.
About Function / Tool Calling
Function calling is the task of generating properly formatted JSON that maps user intent to specific tool or API invocations. You need it when building agents, chatbots, or automation systems that must reliably execute external functions rather than generate freeform text. A good model consistently produces valid, schema-compliant JSON with correct parameter mapping and no hallucinated fields; a poor one generates malformed JSON, invents tool names, or attaches arguments to the wrong function. The main cost consideration is that models with stricter schema adherence (like Claude 3.5 Sonnet with native tool_use) cause fewer parsing failures and retry loops, which can lower total token spend despite a higher per-call cost.
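To make the failure modes above concrete, here is a minimal sketch of a checker that flags malformed JSON, invented tool names, and hallucinated or missing fields. The `TOOLS` registry, the tool names, and the `{"name": ..., "arguments": ...}` shape are assumptions for illustration only; real provider APIs describe parameters with JSON Schema and use their own response formats.

```python
import json

# Hypothetical tool registry: tool name -> {parameter name: expected Python type}.
# This is a simplified stand-in for a JSON Schema definition.
TOOLS = {
    "get_weather": {"city": str, "unit": str},
    "send_email": {"to": str, "subject": str, "body": str},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems found in a model's raw tool-call output."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    spec = TOOLS[name]
    problems = []
    # Fields the model produced that the tool does not define.
    for field in args:
        if field not in spec:
            problems.append(f"hallucinated field: {field!r}")
    # Fields the tool requires that are absent or of the wrong type.
    for field, expected in spec.items():
        if field not in args:
            problems.append(f"missing field: {field!r}")
        elif not isinstance(args[field], expected):
            problems.append(f"wrong type for {field!r}")
    return problems
```

An empty list means the call passed every check; any other result is a reason to reject the call or re-prompt the model.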
When to use: Choose function calling when you need an AI to decide which real action to take (book a flight, query a database, send an email) rather than just talk about it. The AI should output a specific instruction that the computer can execute immediately.
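As a rough sketch of what "a specific instruction the computer can execute" might look like in practice, the snippet below routes a model's JSON output to an ordinary Python function. The handler names, their parameters, and the call format are hypothetical; the handlers only return strings instead of performing real actions.

```python
import json
from typing import Callable


def book_flight(origin: str, destination: str) -> str:
    # Placeholder: a real handler would call a booking API here.
    return f"booked {origin} -> {destination}"


def send_email(to: str, subject: str) -> str:
    # Placeholder: a real handler would hand off to a mail service.
    return f"email to {to}: {subject}"


# Hypothetical mapping from tool name to the function that implements it.
HANDLERS: dict[str, Callable[..., str]] = {
    "book_flight": book_flight,
    "send_email": send_email,
}


def dispatch(raw: str) -> str:
    """Parse a tool call emitted by the model and run the matching handler."""
    call = json.loads(raw)
    handler = HANDLERS.get(call["name"])
    if handler is None:
        raise ValueError(f"model requested unknown tool {call['name']!r}")
    return handler(**call["arguments"])


print(dispatch('{"name": "book_flight", "arguments": {"origin": "OSL", "destination": "LHR"}}'))
# prints: booked OSL -> LHR
```

Keeping the handler table explicit means the model can only trigger actions you have deliberately registered.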
Common questions
Which models are best at function calling without generating invalid JSON?
Claude 3.5 Sonnet and GPT-4 Turbo both excel here, with Claude's native tool_use mode offering the lowest error rate for schema compliance. Open-source models like Llama 2 70B can work with strict prompt engineering, but they require more retries and more validation logic around each call.
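The retry overhead mentioned above usually takes the form of a validate-and-re-prompt loop. The following is a minimal sketch under stated assumptions: `call_model` is a hypothetical callable standing in for whatever client you use, the required fields are made up, and the demo replaces the model with a canned sequence of replies.

```python
import json
from typing import Callable


def call_with_retries(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    required: set[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Re-prompt until the output is a JSON object containing all required fields."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt + feedback)
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\nPrevious output was not valid JSON ({exc.msg}). Return only JSON."
            continue
        if not isinstance(args, dict):
            feedback = "\nPrevious output was not a JSON object."
            continue
        missing = required - args.keys()
        if missing:
            feedback = f"\nPrevious output was missing fields: {sorted(missing)}."
            continue
        return args, attempt
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")


# Stand-in for a weaker model: truncated JSON, then a missing field, then success.
replies = iter(['{"city": "Oslo"', '{"city": "Oslo"}', '{"city": "Oslo", "unit": "C"}'])
args, attempts = call_with_retries(lambda p: next(replies), "Get the weather in Oslo.", {"city", "unit"})
print(args, attempts)
# prints: {'city': 'Oslo', 'unit': 'C'} 3
```

Every extra attempt resends the prompt, so a model that needs three tries costs roughly three times the input tokens of one that succeeds on the first.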
How much slower is function calling compared to regular text generation?
Function calling typically adds 10-20% latency because models must reason about which tool to call before generating JSON. If you're calling multiple tools in sequence (a multi-step agent), latency compounds, but intelligent caching and parallel tool execution can offset this cost.
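One way to see how parallel tool execution offsets compounding latency is to compare sequential and concurrent runs of independent calls. This sketch uses `asyncio` with simulated tools; the tool names and the 0.2-second delays are invented for illustration and say nothing about real tool or model latency.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    # Simulates an I/O-bound tool such as an HTTP request.
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequential(calls: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one, so delays add up.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # Independent calls run concurrently; total time is close to the slowest call.
    return list(await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls)))


def timed(coro) -> tuple[list[str], float]:
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


CALLS = [("search_flights", 0.2), ("check_calendar", 0.2), ("get_weather", 0.2)]
seq_result, seq_time = timed(run_sequential(CALLS))
par_result, par_time = timed(run_parallel(CALLS))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

This only helps when the calls do not depend on each other's results; a step that needs the output of an earlier tool still has to wait for it.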