How we rank

Methodology

Rankings blend real benchmark scores with capability matching and live pricing. Now on ranking methodology v3.

Ranking methodology v3: benchmark-blended rankings

Per-task rankings now blend independent benchmark data directly into every score. Capability matching remains the floor: a model still needs the right shape for the task. Benchmarks refine the ordering above that floor, so the models that perform best on independent evaluations rise to the top.

Live sources:

Aider Polyglot (Apache-2.0): multi-language coding benchmark. Models that score higher get a proportional bonus on code-related tasks (SQL generation, code review, bug fixing, coding agents).
Artificial Analysis Intelligence Index (MIT): provides three sub-indices (overall intelligence, coding, and agentic capability), blended into reasoning, tool-use, and structured-output tasks respectively.
EQ-Bench v3 (MIT): emotional reasoning and social intelligence benchmark across 74 frontier models. Blended into creative writing, screenwriting, and reasoning-heavy tasks.
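
The source-to-task routing described above can be pictured as a small lookup table. This is a minimal sketch, not the production mapping: the identifiers (aider_polyglot, aa_intelligence, sql-generation, and so on) are hypothetical labels for the benchmarks and task families named in the list.

```python
from collections import defaultdict

# Hypothetical identifiers: the list above names the benchmarks and the task
# families they feed, but not the keys used internally.
SOURCE_TASKS = {
    "aider_polyglot": ["sql-generation", "code-review", "bug-fixing", "coding-agents"],
    "aa_intelligence": ["reasoning"],
    "aa_coding": ["tool-use"],
    "aa_agentic": ["structured-output"],
    "eq_bench_v3": ["creative-writing", "screenwriting", "reasoning"],
}


def sources_by_task(source_tasks):
    """Invert the source -> tasks table into task -> sources."""
    by_task = defaultdict(list)
    for source, tasks in source_tasks.items():
        for task in tasks:
            by_task[task].append(source)
    return dict(by_task)


TASK_SOURCES = sources_by_task(SOURCE_TASKS)
```

Under this sketch a task can draw on more than one source: TASK_SOURCES["reasoning"] lists both the intelligence sub-index and EQ-Bench v3, while a task with no entry is ranked on capability alone.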

Coming next: MMLU math subtask scores from a community-maintained source.

How scoring works

Every model's declared specs are pulled from public catalogs daily: context window, input/output pricing, input modalities (text/audio/image/video), tool-calling, structured output, reasoning mode, and open weights. A per-task formula weights those dimensions for the task and produces a capability score. Benchmark data from the live sources above is then blended in proportionally, so models with strong independent evaluations get a score bonus on the tasks those benchmarks are relevant to.
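
As a rough illustration of the two steps, here is a minimal sketch. It assumes specs are already normalized to the range 0 to 1 (booleans count as 0 or 1), that the "floor" acts as a gate on required capabilities, and that the benchmark bonus is a multiplier capped by a hypothetical bonus parameter. The function names, the weights, and the 0.25 cap are illustrative, not the published formula.

```python
def capability_score(specs, weights, required):
    """Weighted average of declared specs; 0.0 if a required capability is missing."""
    if any(not specs.get(name) for name in required):
        return 0.0  # wrong shape for the task: the floor is not met
    total_weight = sum(weights.values())
    weighted = sum(w * float(specs.get(name, 0.0)) for name, w in weights.items())
    return weighted / total_weight


def blended_score(capability, benchmark, bonus=0.25):
    """Scale the capability score by a bonus proportional to a 0..1 benchmark score."""
    if benchmark is None:  # no independent data for this model
        return capability
    return capability * (1.0 + bonus * benchmark)


# Hypothetical weights for a structured-output task.
weights = {"tool_calling": 2.0, "structured_output": 1.0, "context": 1.0}
specs = {"tool_calling": True, "structured_output": True, "context": 0.5}

cap = capability_score(specs, weights, required=["structured_output"])
final = blended_score(cap, benchmark=0.8)
```

Because the bonus multiplies the capability score, a model that fails the capability gate stays at zero however well it benchmarks, which is one way to keep capability matching as the floor.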

The final score answers three questions: which models have the right shape for this task, which of them perform well on independent benchmarks, and what do they cost? Scores are refreshed every morning so the rankings keep up with new launches and price cuts.

What the score isn't

The score is not a head-to-head evaluation on your specific task. "Best for SQL Generation" means the model has the right capabilities, competitive pricing, and strong independent benchmark results, not that we ran your exact SQL suite against it. The value is the time saved building a shortlist: instead of re-reading 333+ model cards, you get a daily, vendor-neutral ranking informed by both specs and real performance data.

License transparency

All benchmark sources are open-licensed (Apache-2.0 or MIT). Deliberately excluded: LMSYS Chatbot Arena (CC-BY-NC-4.0, incompatible with commercial redistribution via our API). We acknowledge Arena Elo as a valuable signal but cannot incorporate it under that license.
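
The license rule amounts to an allowlist check before a source is admitted. The sketch below is illustrative only: the function name and the data layout are assumptions, and the license strings are the ones stated on this page.

```python
# Licenses this page states are compatible with commercial redistribution.
ALLOWED_LICENSES = {"Apache-2.0", "MIT"}


def usable_sources(source_licenses):
    """Return the benchmark sources whose license is on the allowlist."""
    return [name for name, lic in source_licenses.items() if lic in ALLOWED_LICENSES]


source_licenses = {
    "Aider Polyglot": "Apache-2.0",
    "Artificial Analysis Intelligence Index": "MIT",
    "EQ-Bench v3": "MIT",
    "LMSYS Chatbot Arena": "CC-BY-NC-4.0",  # non-commercial, so excluded
}

admitted = usable_sources(source_licenses)
```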

Independence

No commercial relationship with any model vendor. No pay-to-rank. Every model is scored from the same public fields by the same formula. If you build your own app on the API, you get the exact inputs and weights in the response. No black box.