LLM Benchmarks & Leaderboard
Compare 455+ AI models across intelligence, coding, math, arena ELO, and speed. Data from Artificial Analysis, Aider, LMSYS Arena, and Open LLM Leaderboard.
Intelligence: Composite intelligence scores, MMLU-PRO, GPQA, and general reasoning
Data Sources
About LLM Benchmarks
LLM benchmarks measure the capabilities of large language models across key dimensions. Our leaderboard aggregates data from multiple sources to give a broad view of model performance. Intelligence benchmarks such as MMLU-PRO and GPQA test knowledge and reasoning. Coding benchmarks from Aider and LiveCodeBench measure practical programming ability. Math benchmarks, including MATH-500 and AIME, test mathematical reasoning. Arena ELO ratings reflect real human preferences in blind comparisons.
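Arena ELO ratings come from pairwise human votes between anonymized models. The exact rating scheme used by LMSYS (a Bradley-Terry style fit over all votes) isn't reproduced here, but a minimal sketch of the classic online Elo update, with an assumed K-factor of 32, shows how a single blind comparison shifts two models' ratings:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update two Elo ratings after one blind comparison.

    score_a is 1.0 if model A won, 0.0 if it lost, 0.5 for a tie.
    The K-factor of 32 is an assumption; real arena leaderboards tune
    this or fit a Bradley-Terry model over the full vote history instead.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1250-rated model beats a 1300-rated model in a blind vote;
# the winner gains roughly 18 points and the loser drops by the same amount.
print(elo_update(1250, 1300, score_a=1.0))
```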
Speed metrics show real-world API performance: output tokens per second measures generation throughput, while time-to-first-token (TTFT) measures initial response latency, which is critical for interactive applications.
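As a rough illustration of how these two speed metrics are measured, the sketch below times a streaming response. `stream_tokens()` is a hypothetical stand-in for any client that yields output chunks as they arrive; the metric definitions (first-chunk latency for TTFT, completion tokens divided by generation time for throughput) are the standard ones, and the whitespace token count is a simplification of real tokenizer-based counting.

```python
import time
from typing import Iterable, Iterator

def measure_speed(chunks: Iterable[str]) -> dict[str, float]:
    """Measure TTFT and output tokens/sec from a stream of text chunks.

    `chunks` can be any iterable yielding pieces of the response as they
    arrive (e.g. a streaming API client). Token counting here is a crude
    whitespace split; real benchmarks use the model's own tokenizer.
    """
    start = time.perf_counter()
    ttft = None
    text = ""
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        text += chunk
    total = time.perf_counter() - start
    n_tokens = len(text.split())
    generation_time = max(total - (ttft or 0.0), 1e-9)
    return {
        "ttft_s": ttft or 0.0,
        "output_tokens_per_s": n_tokens / generation_time,
    }

def stream_tokens() -> Iterator[str]:
    # Hypothetical stand-in for a streaming completion client.
    for word in ["Hello", " world", ", this", " is", " a", " test", "."]:
        time.sleep(0.05)
        yield word

print(measure_speed(stream_tokens()))
```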