Top picks for Function / Tool Calling (2026)

Reliable JSON tool-call generation. Ranked from 337 live models on the OpenRouter catalog, weighted for tool calling and structured output.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models first need the right specs for Function / Tool Calling; benchmark performance then refines the order.
# | Model | Model ID | Score | Input / 1M tokens | Output / 1M tokens | Context (tokens)
1 | Anthropic: Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 | 163 | $3.00 | $15.00 | 1,000,000
2 | Anthropic: Claude Opus 4.7 | anthropic/claude-opus-4.7 | 160 | $5.00 | $25.00 | 1,000,000
3 | OpenAI: GPT-5 | openai/gpt-5 | 159 | $1.25 | $10.00 | 400,000
4 | OpenAI: o3 | openai/o3 | 153 | $2.00 | $8.00 | 200,000
5 | Anthropic: Claude Opus 4.8 | anthropic/claude-opus-4.8 | 152 | $5.00 | $25.00 | 1,000,000
6 | DeepSeek: DeepSeek V3 | deepseek/deepseek-chat | 149 | $0.20 | $0.80 | 131,072
7 | OpenAI: GPT-4.1 | openai/gpt-4.1 | 142 | $2.00 | $8.00 | 1,047,576
8 | Google: Gemini 2.5 Pro | google/gemini-2.5-pro | 130 | $1.25 | $10.00 | 1,048,576
9 | Meta: Llama 4 Maverick | meta-llama/llama-4-maverick | 128 | $0.15 | $0.60 | 1,048,576
10 | OpenAI: o4 Mini High | openai/o4-mini-high | 128 | $1.10 | $4.40 | 200,000
11 | OpenAI: o3 Mini High | openai/o3-mini-high | 127 | $1.10 | $4.40 | 200,000
12 | Google: Gemini 2.5 Flash | google/gemini-2.5-flash | 126 | $0.30 | $2.50 | 1,048,576
13 | OpenAI: o3 Mini | openai/o3-mini | 126 | $1.10 | $4.40 | 200,000
14 | OpenAI: GPT-4.1 Mini | openai/gpt-4.1-mini | 123 | $0.40 | $1.60 | 1,047,576
15 | OpenAI: GPT-4o | openai/gpt-4o | 123 | $2.50 | $10.00 | 128,000

How we ranked these

For Function / Tool Calling, we weight models on tool calling and structured output. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing. See the full methodology for details.

About Function / Tool Calling

Function calling is the task of generating properly formatted JSON that maps user intent to specific tool or API invocations. You need it when building agents, chatbots, or automation systems that must reliably execute external functions rather than generate freeform text. A good model produces valid, schema-compliant JSON consistently, with correct parameter mapping and no hallucinated fields; a poor one generates malformed JSON, invents tool names, or misaligns arguments to the wrong functions. The main cost consideration is that stricter models (like Claude 3.5 Sonnet with native tool_use) reduce parsing failures and retry loops, lowering total token spend despite higher per-call cost.
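To make "valid, schema-compliant JSON" concrete, here is a minimal sketch of the checks an application might run on a model's raw tool-call output. The tool name get_weather, its parameters, and the {"name": ..., "arguments": ...} shape are assumptions for illustration, not any provider's actual wire format.

```python
import json

# Hypothetical tool registry: tool name -> {parameter: (expected type, required)}.
TOOL_SCHEMAS = {
    "get_weather": {"city": (str, True), "units": (str, False)},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return the problems found in a raw tool-call string (empty list if valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]  # invented tool name

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    schema = TOOL_SCHEMAS[name]
    problems = []
    for key, value in args.items():
        if key not in schema:
            problems.append(f"unexpected field: {key}")  # hallucinated field
        elif not isinstance(value, schema[key][0]):
            problems.append(f"wrong type for {key}")
    for key, (_, required) in schema.items():
        if required and key not in args:
            problems.append(f"missing required field: {key}")
    return problems
```

Each failure mode named above (malformed JSON, invented tool names, hallucinated fields) maps to one branch. A production system would more likely use a full JSON Schema validator than this hand-rolled check.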

When to use: Use this when you need an AI to decide which real action to take (book a flight, query a database, send an email) rather than just talk about it. The AI should output a specific instruction the computer can immediately execute.
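As a rough illustration of "an instruction the computer can immediately execute", the sketch below routes a model's tool call to a real Python function. The functions book_flight and send_email, their parameters, and the call format are hypothetical placeholders.

```python
import json
from typing import Callable


# Hypothetical stand-ins for real actions; a real system would call external APIs here.
def book_flight(origin: str, destination: str) -> str:
    return f"booked {origin} -> {destination}"


def send_email(to: str, subject: str) -> str:
    return f"emailed {to}: {subject}"


# Registry mapping the tool names the model may emit to the code that runs them.
TOOLS: dict[str, Callable[..., str]] = {
    "book_flight": book_flight,
    "send_email": send_email,
}


def dispatch(raw_call: str) -> str:
    """Parse a tool call and execute the matching function with its arguments."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        raise ValueError(f"model requested unknown tool: {call['name']}")
    return tool(**call["arguments"])
```

The model only chooses the tool and fills in the arguments; the application decides which functions exist in the registry and therefore which actions are possible at all.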

Common questions

Which models are best at function calling without generating invalid JSON?

Claude 3.5 Sonnet and GPT-4 Turbo both excel here, with Claude's native tool_use mode offering the lowest error rate for schema compliance. Open-weight models such as Llama 2 70B can work with strict prompt engineering, but they need more validation logic and incur more retries.
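The retry overhead mentioned above usually takes the form of a validate-and-retry loop. The sketch below assumes a hypothetical generate callable that stands in for any model API, and checks only for valid JSON and required fields.

```python
import json
from typing import Callable


def call_with_retries(
    generate: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    prompt: str,
    required: set[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Ask for JSON arguments until they parse and contain every required field.

    Returns the parsed arguments and the number of attempts used.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt + feedback)
        try:
            args = json.loads(raw)
        except json.JSONDecodeError:
            feedback = "\nPrevious output was not valid JSON. Return JSON only."
            continue
        if not isinstance(args, dict):
            feedback = "\nPrevious output was not a JSON object."
            continue
        missing = required - set(args)
        if not missing:
            return args, attempt
        feedback = f"\nPrevious output was missing fields: {sorted(missing)}."
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")
```

Every extra attempt resends the prompt and pays for another completion, which is why a model that fails validation more often can cost more in total even at a lower per-token price.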

How much slower is function calling compared to regular text generation?

Function calling typically adds 10-20% latency because models must reason about which tool to call before generating JSON. If you're calling multiple tools in sequence (a multi-step agent), latency compounds, but intelligent caching and parallel tool execution can offset this cost.
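Parallel tool execution can be sketched with asyncio. The two tools below are hypothetical and simulate network latency with a sleep; when the calls are independent, running them concurrently takes roughly as long as the slowest one rather than the sum of both.

```python
import asyncio
import time


# Hypothetical tools; the sleeps stand in for real network round trips.
async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.2)
    return f"weather for {city}"


async def fetch_calendar(user: str) -> str:
    await asyncio.sleep(0.2)
    return f"calendar for {user}"


async def run_parallel() -> tuple[list[str], float]:
    """Run two independent tool calls concurrently and report elapsed seconds."""
    start = time.perf_counter()
    results = await asyncio.gather(fetch_weather("Oslo"), fetch_calendar("ada"))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_parallel())
    print(results, f"{elapsed:.2f}s")
```

This only helps when the calls do not depend on each other's results. A multi-step agent whose next call needs the previous output still has to run those steps in sequence.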
