
LLM Benchmarks Explained: MMLU, Arena ELO, and What Actually Matters
Every time a new model launches, the same thing happens. The company publishes a blog post with a table full of numbers — 89.3% on MMLU, 92.1% on HumanEval, 1287 Arena ELO — and the internet collectively loses its mind debating whether those numbers mean anything. Half the people on Twitter are crowning it the new GOAT. The other half are pointing out the benchmark was probably in the training data.
Both sides have a point. LLM benchmarks are simultaneously the best tool we have for comparing models and deeply flawed in ways that matter. If you're a developer trying to choose a model for your project, you need to know what these numbers actually tell you — and more importantly, what they don't.
The Big Benchmarks, Explained Simply
Let's run through the ones you'll see cited most often. I'll skip the academic jargon and just tell you what they measure.
MMLU (Massive Multitask Language Understanding)
MMLU is a multiple-choice test covering 57 subjects — everything from abstract algebra to world history to clinical medicine. Think of it as the SAT for LLMs. A score of 90% means the model got 90% of questions right across all subjects.
Why people care: It's broad. A model that scores well on MMLU generally has wide-ranging knowledge. It's been the de facto "general intelligence" benchmark since GPT-4's launch.
Why it's overrated: Multiple choice is a terrible proxy for how people actually use LLMs. You don't send your chatbot a four-option quiz. You ask it to write code, analyze a contract, or explain a concept. A model can ace MMLU and still produce mediocre real-world output. And with scores clustering in the 85-92% range for top models, the differences are often within noise.
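To make that concrete, here's a minimal sketch of what scoring an MMLU-style item looks like. The example question, prompt wording, and ask_model stub are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring (illustrative, not the official harness).

items = [
    {
        "question": "Which vessel carries oxygenated blood away from the heart?",
        "choices": {"A": "Pulmonary artery", "B": "Aorta", "C": "Vena cava", "D": "Pulmonary vein"},
        "answer": "B",
    },
    # ...the real test set has ~14,000 questions across 57 subjects
]

def ask_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your provider's SDK."""
    return "B"  # canned response so the sketch runs end to end

def mmlu_style_accuracy(items) -> float:
    correct = 0
    for item in items:
        options = "\n".join(f"{label}. {text}" for label, text in item["choices"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)

print(mmlu_style_accuracy(items))  # 1.0 here; 0.90 would be reported as "90% on MMLU"
```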
HumanEval (Code Generation)
HumanEval presents the model with 164 Python programming problems and checks whether the generated code passes unit tests. Pass@1 is the fraction of problems the model solves on its first attempt: no retries, no cherry-picking the best of several samples.
Why people care: If you're using an LLM for coding tasks, this is directly relevant. It measures functional correctness, not just "looks right."
The catch: 164 problems is a tiny sample. And these are all self-contained function-level tasks — the kind of thing Copilot handles well. Real coding involves understanding large codebases, debugging across files, and working with frameworks. HumanEval doesn't test any of that. Also, the problems have been around since 2021. Some models have likely seen them (or close variants) during training.
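If you want the precise definition behind that Pass@1 number, the HumanEval paper uses an unbiased estimator: generate n samples per problem, count how many (c) pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. A small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c passed the tests."""
    if n - c < k:
        return 1.0  # every k-subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples generated for one problem, 140 passed the unit tests.
print(pass_at_k(200, 140, 1))   # 0.7  -> with k=1 this is just c / n
print(pass_at_k(200, 140, 10))  # ~1.0 -> pass@10 is far more forgiving

# The reported benchmark score is this estimate averaged over all 164 problems.
```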
MATH
A dataset of 12,500 competition-level math problems spanning algebra, geometry, number theory, and more. Problems range from AMC-level (high school math competitions) to AIME and beyond.
Why people care: Math requires actual reasoning, not just pattern matching. A model that scores well on MATH can genuinely work through multi-step logical problems. It's one of the harder benchmarks to game.
The catch: Unless you're building a math tutor, competition-level math ability doesn't correlate strongly with the tasks most developers care about. It's a good signal for reasoning capability, though.
Arena ELO (Chatbot Arena / LMSYS)
This is the one I actually pay attention to. Chatbot Arena works like this: real users submit prompts and get responses from two anonymous models side by side. They pick which response they prefer. Over millions of these head-to-head battles, each model gets an ELO rating — the same system used in chess.
Why it matters most: It measures what people actually care about — which response do real humans prefer? It's resistant to gaming because the prompts come from real users, not a fixed dataset. And it captures qualities that benchmarks miss: tone, helpfulness, formatting, creativity, following instructions well.
The limitation: It skews toward chatbot-style interactions. If your use case is structured data extraction or code generation, the Arena ranking might not reflect your experience. Also, ELO differences under ~30 points are basically noise.
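To put that ~30-point caveat in perspective, here's a minimal sketch of the textbook Elo math behind the leaderboard idea: an expected-win-probability curve and the per-battle rating update. (The Arena's published methodology has more statistical machinery than this, so treat it as intuition rather than their exact pipeline.)

```python
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 4.0):
    """One head-to-head vote; a small k suits millions of noisy comparisons."""
    delta = k * ((1.0 if a_won else 0.0) - expected_win_prob(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A 30-point gap barely changes who wins any given battle:
print(expected_win_prob(1287, 1257))  # ~0.54 -- the "higher-rated" model wins about 54% of votes
```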
MT-Bench
A set of 80 multi-turn questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). A separate LLM (usually GPT-4) judges the quality of responses on a 1-10 scale.
Why people care: It tests multi-turn conversation, which is closer to real usage than single-shot benchmarks. And the categories cover practical use cases.
The catch: Having GPT-4 judge other models creates an obvious bias. Models that "sound like" GPT-4 tend to score higher. 80 questions is also pretty small. It's useful as one signal among many, not as a definitive ranking.
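The judging mechanism itself is simple to sketch. Below is the general shape of an LLM-as-judge scorer: hand the judge model the question and the candidate's answer, ask for a 1-10 rating, and average across questions. The prompt wording and the call_llm stub are assumptions for illustration; MT-Bench's real judge prompts are more elaborate.

```python
import re
from statistics import mean

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the user's
question on a scale of 1 to 10 for helpfulness, accuracy, and depth. Reply with the number only.

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion call to the judge model."""
    return "7"  # canned reply so the sketch runs; replace with your provider's SDK

def judge_score(question: str, answer: str, judge_model: str = "gpt-4") -> int:
    reply = call_llm(judge_model, JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0  # treat unparseable replies as a failed judgment

def benchmark_score(qa_pairs: list[tuple[str, str]]) -> float:
    """Average judge score over all (question, model_answer) pairs."""
    return mean(judge_score(q, a) for q, a in qa_pairs)

print(benchmark_score([("Explain TCP slow start.", "TCP slow start ramps up the send rate...")]))  # 7.0 with the canned judge
```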
The Benchmark Paradox
Here's something that doesn't get talked about enough: the better a benchmark becomes at measuring model quality, the more incentive companies have to optimize for it specifically. This is Goodhart's Law in action — when a measure becomes a target, it ceases to be a good measure.
We've already seen this play out. Some models score suspiciously well on MMLU while underperforming on novel, held-out tests. Benchmark contamination — where test questions leak into training data — is a real and documented problem. Research from 2023 showed that several popular models had likely been exposed to benchmark data during training.
This is why Arena ELO is so valuable: you can't optimize for it without actually making your model better at responding to arbitrary human requests. The "test set" is infinite and constantly changing.
What to Actually Look at When Choosing a Model
Forget the leaderboard for a second. Here's what I actually recommend:
- Test with YOUR prompts. Take 20-50 real prompts from your application and run them through the top 3-4 models you're considering. Judge the outputs yourself. This takes an afternoon and tells you more than any benchmark (a minimal harness sketch follows this list).
- Check Arena ELO for general quality. If a model is top-10 on Arena, it's probably good. If it's ranked 30th, it probably has meaningful quality gaps for conversational tasks.
- Look at task-specific benchmarks. If you're doing code, look at HumanEval and SWE-bench. If you're doing math, look at MATH and GSM8K. Don't rely on MMLU to tell you if a model is good at coding.
- Consider the practical stuff. Context window, latency, price per token, rate limits, uptime. A model that's 2% better on benchmarks but 3x the price and 2x slower might not be the right call.
- Watch the community. Real developer experiences on Reddit, Twitter, and HN are genuinely more useful than benchmark tables. If everyone building coding tools is switching to Claude, that tells you something.
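As promised in the first bullet, here's a minimal harness for running your own prompts through a shortlist of models. The model names, the query_model stub, and the CSV output are assumptions; wire in whichever SDKs you actually use and judge the outputs by hand.

```python
import csv

# Placeholder names for whatever models you're shortlisting.
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]

def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call; replace with your provider's SDK."""
    return f"[{model} response to: {prompt[:40]}...]"

def run_eval(prompts: list[str], out_path: str = "eval_results.csv") -> None:
    """Run every prompt through every candidate and dump a side-by-side CSV for manual review."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", *CANDIDATE_MODELS])
        for prompt in prompts:
            writer.writerow([prompt, *(query_model(m, prompt) for m in CANDIDATE_MODELS)])

# 20-50 real prompts pulled from your application logs:
run_eval([
    "Summarize this support ticket: ...",
    "Extract the line items from this invoice text: ...",
])
```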
You Don't Have to Pick Just One
This is where routing changes the game. Instead of agonizing over whether GPT-4o or Claude Opus 4 is "better" — a question that depends entirely on the task — you can use both. Route coding tasks to Claude, creative tasks to GPT-4o, simple queries to something fast and cheap like GPT-4o-mini.
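A routing layer can be as simple as a rule that inspects each request before choosing a model. The sketch below uses naive keyword rules and placeholder model names to show the shape of the idea; a real router would use your own task taxonomy (or a small classifier) and the model IDs your provider actually exposes.

```python
# Naive rule-based router: placeholder model names and illustrative rules only.
CODING_HINTS = ("def ", "stack trace", "compile error", "unit test", "refactor", "```")
SHORT_QUERY_WORDS = 30

def route(prompt: str) -> str:
    """Pick a model name for this request."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in CODING_HINTS):
        return "strong-coding-model"      # route coding tasks to your best code model
    if len(prompt.split()) < SHORT_QUERY_WORDS:
        return "fast-cheap-model"         # simple queries don't need the flagship
    return "general-purpose-flagship"     # longer creative or analytical requests

print(route("Why does this stack trace point at a null pointer?"))  # strong-coding-model
print(route("What's the capital of France?"))                       # fast-cheap-model
```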
The benchmark debate becomes less of a high-stakes decision and more of a tuning exercise. Browse our model comparison page to see how models stack up across benchmarks, pricing, and capabilities — then build a routing strategy that plays to each model's strengths.
Because the right model for your use case isn't the one with the highest MMLU score. It's the one that gives your users the best experience at a price you can afford. And sometimes, that's a different model for every request.