Benchmark Reference

Last verified: April 2026. Model capabilities shift rapidly. Use this page to understand what each benchmark means, but always check the authoritative leaderboards for current rankings of newly released models.

The Golden Rule of Benchmarks

Trust your eyes over leaderboards.

Benchmarks are easily gamed. Test questions leak into training data, whether by accident or because labs tune their models specifically for popular tests (data contamination). A model that scores 90% on a coding benchmark might still fail to write a simple script for your specific, undocumented API.

Use leaderboards to identify the tier a model belongs in, but verify its usefulness by running it on your own data.

What Benchmarks Actually Measure

When you look at a model card, you’ll see a wall of acronyms. Here are the ones that actually matter today:

Chatbot Arena (LMArena, formerly LMSYS)

What it is: Human vibes. Two anonymous models answer the same prompt, and an actual human votes on which answer is better.
Why it matters: It is the most-used public benchmark for subjective quality, because real users do the rating. With no fixed question set to memorize, it is also harder to game than static tests.
Good score: >1350 (an Elo-style rating) is frontier-level. >1250 is excellent local quality.
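
To make the rating scale concrete, here is a minimal sketch of how pairwise votes turn into Elo-style ratings. It uses the textbook online Elo update as an assumption for illustration; the published Arena leaderboard fits a statistical model over all votes rather than updating one vote at a time, and the K-factor of 32 is a conventional choice, not an Arena parameter.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A's answer is preferred over model B's under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one vote: the winner gains what the loser gives up."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A 100-point gap means the higher-rated model wins roughly 64% of votes.
    print(round(expected_score(1350, 1250), 2))  # 0.64
    # Two equally rated models: the winner takes 16 points from the loser.
    print(update(1300, 1300, a_won=True))  # (1316.0, 1284.0)
```

Read this way, the 100-point gap between the two thresholds above is a clear but not overwhelming preference: the lower-rated model still wins about one vote in three.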

SWE-bench

What it is: Real software engineering. The model is given a real GitHub issue from a popular Python repository (one that maintainers later fixed with a merged pull request) and asked to produce a patch that resolves it.
Why it matters: It measures agentic capability. The model has to understand a massive codebase, plan a fix, write the code, and ensure it doesn’t break existing tests.
Good score: >50% (Resolved) is frontier. >30% is highly capable local.
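
The "Resolved" percentage can be sketched as follows. This is a simplified illustration, not the official harness: the real evaluation applies the patch and runs the repository's tests in an isolated environment. The FAIL_TO_PASS and PASS_TO_PASS names follow the SWE-bench dataset fields; the test names in the example are hypothetical.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """An instance counts as resolved only if the patch fixes the issue
    without breaking anything that already worked.

    test_results maps a test name to True (passed) or False (failed)
    after the model's patch has been applied. Missing tests count as failed.
    """
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    unbroken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of benchmark instances that were resolved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    # Hypothetical instance: the patch fixes the bug but breaks an existing test.
    results = {"test_issue_fixed": True, "test_existing_behaviour": False}
    print(is_resolved(results, ["test_issue_fixed"], ["test_existing_behaviour"]))  # False
```

The all-or-nothing rule is why scores on this benchmark are lower than on function-level coding tests: a patch that fixes the issue but breaks one existing test earns no credit.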

GPQA Diamond

What it is: PhD-level science. 198 multiple-choice questions in biology, physics, and chemistry, written by domain experts so that skilled non-experts with full web access still get most of them wrong.
Why it matters: Tests deep reasoning and specialized knowledge that a model cannot get by simply parroting Wikipedia.
Good score: >80% is frontier-level reasoning.

MMLU (Massive Multitask Language Understanding)

What it is: Knowledge recall across 57 subjects: math, science, law, medicine, history, and more. Tests whether a model has broad factual coverage.
Why it matters now: MMLU was the dominant benchmark from 2020 to 2023, but top models now score above 90%, making score differences between them meaningless. Use it to confirm a model has basic knowledge coverage. Don’t use it to compare frontier models to each other.
Good score: >90% for frontier, >75% for capable local models.

HumanEval

What it is: 164 Python programming problems from OpenAI, evaluated by pass@k: the probability that at least one of k generated samples passes the problem’s unit tests.
Why it matters: It was the first widely used coding benchmark and established the pass@k evaluation style. The problems are now well-known and contamination is high. For coding comparisons, use LiveCodeBench or SWE-bench instead.
Good score: >85% for frontier models.
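
For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the chance that a random draw of k samples contains at least one passing solution. A minimal sketch (function and argument names are illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the unit tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 3 of 10 samples pass: one attempt succeeds about 30% of the time,
    # five attempts succeed about 92% of the time.
    print(round(pass_at_k(10, 3, 1), 3))  # 0.3
    print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

The benchmark score is the average of this value over all problems. When comparing published numbers, check which k was used: pass@1 and pass@10 for the same model can differ widely.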

LiveCodeBench

What it is: Coding problems from competitive programming sites (LeetCode, Codeforces, AtCoder), collected continuously and tagged with release dates so a model can be scored only on problems published after its training cutoff.
Why it matters: Continuous updates make it much harder to contaminate than HumanEval. Problems are also harder, so the benchmark still differentiates between frontier models.
Good score: >50% is strong frontier performance.
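
The contamination defence amounts to a date filter. Below is a minimal sketch of the idea under stated assumptions: the Problem record, its field names, and the example data are hypothetical and are not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # when the problem was first published
    solved: bool      # whether the model's solution passed all tests


def post_cutoff_accuracy(problems: list[Problem], training_cutoff: date) -> Optional[float]:
    """Score the model only on problems it cannot have seen during training."""
    fresh = [p for p in problems if p.released > training_cutoff]
    if not fresh:
        return None  # nothing published after the cutoff yet
    return sum(p.solved for p in fresh) / len(fresh)


if __name__ == "__main__":
    problems = [
        Problem("old-1", date(2025, 3, 1), True),
        Problem("new-1", date(2025, 11, 1), True),
        Problem("new-2", date(2025, 12, 1), False),
    ]
    # Only the two problems released after the cutoff are counted.
    print(post_cutoff_accuracy(problems, date(2025, 6, 30)))  # 0.5
```

A large gap between a model's accuracy on older problems and on post-cutoff problems is a warning sign that the older score reflects memorization.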

Humanity’s Last Exam (HLE) and Tau2-bench (τ²)

What it is: HLE tests expert-level reasoning across many disciplines, while Tau2-bench tests practical business agent performance (e.g., customer-service conversations that require tool and API calls).
Why it matters: Arena, SWE-bench, and GPQA rose to prominence as MMLU saturated. HLE and Tau2-bench continue that shift toward tasks that require actual reasoning rather than knowledge recall.

Benchmarks to Treat Skeptically

MMLU: Saturated for frontier models. Scores above 90% are now common enough that differences between models are not meaningful. Use it to check whether a small or local model has general knowledge coverage, not to rank top-tier APIs.

GSM8K: Grade-school math reasoning. Also saturated at the top — most frontier models score above 95%. It was useful for distinguishing models in 2022 and 2023; it no longer is.

HumanEval: Contamination risk is high because the 164 problems are public and well-known. A model that scores well might have seen the problems during training. LiveCodeBench is a better choice for coding comparisons.
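
Small test sets add a further reason for caution: sampling noise. The sketch below estimates a rough 95% interval around a reported accuracy using the normal approximation to the binomial. It assumes the questions are independent and equally weighted, which is a simplification.

```python
from math import sqrt


def accuracy_interval(accuracy: float, num_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a benchmark accuracy.

    accuracy is a fraction in [0, 1]; z = 1.96 gives roughly 95% coverage.
    """
    if not 0.0 <= accuracy <= 1.0 or num_questions <= 0:
        raise ValueError("accuracy must be in [0, 1] and num_questions positive")
    standard_error = sqrt(accuracy * (1.0 - accuracy) / num_questions)
    low = max(0.0, accuracy - z * standard_error)
    high = min(1.0, accuracy + z * standard_error)
    return low, high


if __name__ == "__main__":
    # An 85% score on HumanEval's 164 problems spans roughly 80% to 90%.
    low, high = accuracy_interval(0.85, 164)
    print(f"{low:.3f} to {high:.3f}")  # 0.795 to 0.905
```

Under these assumptions, two models a few points apart on a 164-problem benchmark are not clearly distinguishable, even before contamination is considered.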

Production Evaluation Frameworks (Developer)

Public leaderboards only go so far. A better approach is to run automated evaluation on your own data:

DeepEval / Confident AI: An open-source framework (DeepEval) and its hosted platform (Confident AI) that treat prompt evaluations like standard software unit tests in the style of pytest, so regressions and model “drift” in production are caught automatically.
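
The unit-test idea can be sketched in plain Python without any framework. The example below deliberately does not use DeepEval's API: stub_model is a hypothetical stand-in for a call to your own model, and the cases, the substring check, and the 0.9 threshold are illustrative assumptions.

```python
from typing import Callable

# Hypothetical evaluation cases drawn from your own domain.
EVAL_CASES = [
    {"prompt": "Which HTTP status code means Not Found?", "must_contain": "404"},
    {"prompt": "Which HTTP status code means OK?", "must_contain": "200"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    canned = {
        "Which HTTP status code means Not Found?": "That is 404.",
        "Which HTTP status code means OK?": "That is 200.",
    }
    return canned.get(prompt, "")


def run_evals(model: Callable[[str], str], cases: list[dict[str, str]]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    passed = sum(case["must_contain"] in model(case["prompt"]) for case in cases)
    return passed / len(cases)


def test_model_meets_threshold() -> None:
    """pytest-style test: fail the build if quality drops below the bar."""
    assert run_evals(stub_model, EVAL_CASES) >= 0.9


if __name__ == "__main__":
    print(run_evals(stub_model, EVAL_CASES))  # 1.0
```

Saved in a file that pytest collects, the test function runs on every commit or model swap, which is how a regression surfaces before users see it. Substring matching is the crudest possible check; dedicated frameworks replace it with richer metrics.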

Current Local Model Tier List

Note: These are estimates based on Q4 quantized local performance, not raw uncompressed API performance.

Tier 1: Frontier (Local Heavyweights)

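Models requiring 64GB or more of RAM, and far more for the largest entries: at Q4 (roughly half a byte per parameter), a 671B model needs well over 300GB for its weights alone. These rival top commercial APIs.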

DeepSeek V3 / R1 (671B) - Highest SWE-bench Verified scores in this tier and efficient chain-of-thought reasoning.
Kimi K2.5 - Exceptional at GPQA Diamond and agentic coding.
GLM-5 (744B) - Dominant in multi-disciplinary systems engineering tasks.
GPT-oss 120B - Frequently matches proprietary models on AIME and MMLU-Pro.

Tier 2: The 32GB Sweet Spot

Models hitting high scores on Arena while remaining practical to run locally.

Qwen 3.5 35B-A3B
Gemma 4 31B / 26B-A4B
Nemotron 3 Nano 30B

Tier 3: The 16GB Workhorses

The best models for laptops.

Qwen 3.5 9B - The undisputed champion of this weight class.
Llama 3.3 8B
Gemma 4 E4B
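
The RAM tiers above can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is an approximation under stated assumptions. It ignores the KV cache and runtime overhead, and real Q4 formats typically use somewhat more than 4 bits per weight, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory for model weights in GB (1 GB = 10**9 bytes).

    params_billions is the TOTAL parameter count. For mixture-of-experts
    models, all experts must be loaded even though only a few are active
    per token, so use the total, not the active, count.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("inputs must be positive")
    return params_billions * bits_per_weight / 8.0


if __name__ == "__main__":
    for name, size in [("9B", 9), ("35B", 35), ("120B", 120), ("671B", 671)]:
        print(f"{name}: ~{weight_memory_gb(size):.1f} GB at 4-bit")
    # 9B: ~4.5 GB, 35B: ~17.5 GB, 120B: ~60.0 GB, 671B: ~335.5 GB
```

The same function shows why quantization matters for local use: at 16 bits per weight, a 9B model needs about 18GB for weights, which no longer fits comfortably in the 16GB tier.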

Live Leaderboards to Check

To see where a model ranks today, check these trusted sources:

Chatbot Arena (LMArena, formerly LMSYS): The definitive human-preference leaderboard.
Hugging Face Open LLM Leaderboard: An automated benchmark suite for open-weight models. Confirm that it is still being updated before relying on it.
SWE-bench Leaderboard: If you care about coding and agents, check here first.
Aider LLM Leaderboard: Excellent practical coding benchmark based on actual code editing tasks.

Module 9 covers how to evaluate models against your own tasks.