Top picks for Function / Tool Calling (2026)

Reliable JSON tool-call generation. Ranked from 337 live models on the OpenRouter catalog, weighted for tool calling and structured output.

What this is: Ranked by capability match, real benchmark scores (Aider Polyglot, Artificial Analysis Intelligence Index), and live pricing. Models must first have the right specs for Function / Tool Calling; benchmark performance then refines the order.

#	Model	Slug	Score	Input / 1M tokens	Output / 1M tokens	Context (tokens)
1	Anthropic: Claude Sonnet 4.6	anthropic/claude-sonnet-4.6	159	$3.00	$15.00	1,000,000
2	Anthropic: Claude Opus 4.7	anthropic/claude-opus-4.7	157	$5.00	$25.00	1,000,000
3	OpenAI: GPT-5.6 Terra	openai/gpt-5.6-terra	154	$2.50	$15.00	1,050,000
4	Anthropic: Claude Sonnet 5	anthropic/claude-sonnet-5	154	$2.00	$10.00	1,000,000
5	xAI: Grok 4.5	x-ai/grok-4.5	154	$2.00	$6.00	500,000
6	OpenAI: GPT-5.6 Luna	openai/gpt-5.6-luna	154	$1.00	$6.00	1,050,000
7	OpenAI: GPT-5.4	openai/gpt-5.4	152	$2.50	$15.00	1,050,000
8	Z.ai: GLM 5.2	z-ai/glm-5.2	151	$0.83	$2.60	1,048,576
9	Google: Gemini 3.5 Flash	google/gemini-3.5-flash	151	$1.50	$9.00	1,048,576
10	OpenAI: GPT-5	openai/gpt-5	150	$1.25	$10.00	400,000
11	MiniMax: MiniMax M3	minimax/minimax-m3	150	$0.30	$1.20	1,048,576
12	DeepSeek: DeepSeek V4 Pro	deepseek/deepseek-v4-pro	149	$0.43	$0.87	1,048,576
13	Anthropic: Claude Opus 4.8	anthropic/claude-opus-4.8	149	$5.00	$25.00	1,000,000
14	OpenAI: GPT-5.6 Sol	openai/gpt-5.6-sol	149	$5.00	$30.00	1,050,000
15	MoonshotAI: Kimi K2.6	moonshotai/kimi-k2.6	149	$0.68	$3.42	262,144

How we ranked these

For Function / Tool Calling, we weight models on tool calling and structured output. Scores combine each model's public specs with independent benchmark results (Aider Polyglot coding scores; Artificial Analysis intelligence, coding, and agentic indices) and live pricing.

About Function / Tool Calling

Function calling is the task of generating properly formatted JSON that maps user intent to specific tool or API invocations. You need it when building agents, chatbots, or automation systems that must reliably execute external functions rather than generate freeform text. A good model consistently produces valid, schema-compliant JSON with correct parameter mapping and no hallucinated fields; a poor one generates malformed JSON, invents tool names, or attaches arguments to the wrong function. The main cost consideration is that models with stricter schema adherence (like Claude 3.5 Sonnet with native tool_use) reduce parsing failures and retry loops, which can lower total token spend despite a higher per-call cost.

When to use: You need an AI to decide which real action to take (book a flight, query a database, send an email) rather than just talk about it. The model should output a specific instruction that the calling program can execute immediately.
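
To make the failure modes above concrete, here is a minimal sketch of checking a model's raw tool-call output before executing it. The tool registry, the tool names (`book_flight`, `send_email`), and the `{"name": ..., "arguments": ...}` payload shape are assumptions made up for illustration; real providers each define their own formats, and a production system would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool registry: tool name -> {parameter name: expected Python type}.
TOOLS = {
    "book_flight": {"origin": str, "destination": str, "passengers": int},
    "send_email": {"to": str, "subject": str, "body": str},
}

def validate_tool_call(raw: str) -> list[str]:
    """Return the problems found in a raw tool-call string (empty list if valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return [f"unknown tool: {name!r}"]  # the model invented a tool name

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    schema = TOOLS[name]
    problems = []
    for field in args:
        if field not in schema:
            problems.append(f"unexpected field: {field}")  # hallucinated field
    for field, expected in schema.items():
        if field not in args:
            problems.append(f"missing field: {field}")
        elif type(args[field]) is not expected:
            problems.append(f"wrong type for {field}")
    return problems
```

An empty list means the call is safe to dispatch; anything else can be logged or fed back to the model as a correction prompt.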

Common questions

Which models are best at function calling without generating invalid JSON?

Claude 3.5 Sonnet and GPT-4 Turbo both excel here, with Claude's native tool_use mode offering the lowest error rate for schema compliance. Open-source models like Llama 2 70B can work with strict prompt engineering, but they need more retries and more validation logic around each call.
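
The retry overhead mentioned above usually takes the form of a validate-and-re-prompt loop. The sketch below is one possible shape for it, not any provider's API: `generate` is a hypothetical callable standing in for a model request, and `make_flaky_model` is a stub that fails once so the loop can be exercised without a network call.

```python
import json
from typing import Callable

def call_with_retries(
    generate: Callable[[str], str],
    prompt: str,
    required: set[str],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Request JSON arguments, re-prompting with the error until they validate.

    Returns the parsed arguments and the number of attempts used.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt + feedback)
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\nPrevious output was not valid JSON ({exc.msg}). Return only JSON."
            continue
        if not isinstance(args, dict):
            feedback = "\nPrevious output was not a JSON object. Return only a JSON object."
            continue
        missing = required - set(args)
        if not missing:
            return args, attempt
        feedback = f"\nPrevious output was missing fields: {sorted(missing)}."
    raise ValueError(f"no valid tool call after {max_attempts} attempts")

def make_flaky_model() -> Callable[[str], str]:
    """Stub model: truncated JSON on the first call, valid JSON on the second."""
    outputs = iter([
        '{"to": "a@example.com"',
        '{"to": "a@example.com", "subject": "Hi"}',
    ])
    return lambda prompt: next(outputs)
```

Each extra attempt costs a full round trip and another set of input tokens, which is why a model that validates on the first try can be cheaper overall even at a higher per-token price.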

How much slower is function calling compared to regular text generation?

Function calling typically adds 10-20% latency because models must reason about which tool to call before generating JSON. If you're calling multiple tools in sequence (a multi-step agent), latency compounds, but intelligent caching and parallel tool execution can offset this cost.
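
As a rough illustration of why parallel tool execution helps, the sketch below compares running independent tool calls one after another with running them concurrently via `asyncio.gather`. `run_tool` is a hypothetical stand-in that only sleeps to simulate network latency; the delays are made-up values, not measurements.

```python
import asyncio
import time

async def run_tool(name: str, delay: float) -> str:
    """Stand-in for a real tool call; sleeps to simulate I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def run_sequential(calls: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one, so latencies add up.
    return [await run_tool(name, delay) for name, delay in calls]

async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # Independent calls overlap; total time approaches the slowest call.
    return list(await asyncio.gather(*(run_tool(n, d) for n, d in calls)))

def timed(runner, calls: list[tuple[str, float]]) -> tuple[list[str], float]:
    """Run one of the strategies above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return results, time.perf_counter() - start
```

This only applies when the tool calls do not depend on each other's results; a step that needs the output of an earlier tool still has to wait for it.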

Related tasks

Best for Agent Workflows

Best for Browser Automation

Best for RAG Pipelines

Best for Long-Context Q&A

Best for Coding Agents