Live data from 5+ benchmark sources

LLM Benchmarks & Leaderboard

Compare 455+ AI models across intelligence, coding, math, Arena ELO, and speed. Data from Artificial Analysis, Aider, LMSYS Arena, and Open LLM Leaderboard.

121 models ranked in this category (top 100 shown below)

🧠 Intelligence: Composite intelligence scores, MMLU-PRO, GPQA, and general reasoning (a sketch of how such a composite can be computed follows the table)

| # | Model | Score |
|---|-------|-------|
| 🥇 | Microsoft: Phi 4 | 4865.0 |
| 🥈 | Meta: Llama 3 70B Instruct | 4674.0 |
| 🥉 | Sao10K: Llama 3 Euryale 70B v2.1 | 4551.0 |
| 4 | NVIDIA: Llama 3.1 Nemotron 70B Instruct | 4354.0 |
| 5 | Nous: Hermes 3 70B Instruct | 4141.0 |
| 6 | WizardLM-2 8x22B | 3996.0 |
| 7 | Nous: DeepHermes 3 Mistral 24B Preview | 3989.0 |
| 8 | Mistral: Mixtral 8x22B Instruct | 3870.0 |
| 9 | Google: Gemma 2 27B | 3835.0 |
| 10 | Qwen2.5 Coder 32B Instruct | 3792.0 |
| 11 | Google: Gemma 2 9B | 3195.0 |
| 12 | Sao10K: Llama 3 8B Lunaris | 3097.0 |
| 13 | Mistral: Mixtral 8x7B Instruct | 2991.0 |
| 14 | Meta: Llama 3 8B Instruct | 2960.0 |
| 15 | NeverSleep: Lumimaid v0.2 8B | 2929.0 |
| 16 | Mistral: Mistral Nemo | 2797.0 |
| 17 | Qwen: Qwen2.5 Coder 7B Instruct | 2614.0 |
| 18 | Meta: Llama 3.2 3B Instruct (free) | 2439.0 |
| 19 | Mistral: Mistral 7B Instruct | 2306.0 |
| 20 | Mistral: Mistral 7B Instruct v0.3 | 2306.0 |
| 21 | NousResearch: Hermes 2 Pro - Llama-3 8B | 2280.0 |
| 22 | Mistral: Mistral 7B Instruct v0.2 | 1908.0 |
| 23 | Mistral: Mistral 7B Instruct v0.1 | 1572.0 |
| 24 | Meta: Llama 3.2 1B Instruct | 824.0 |
| 25 | Ministral 3B | 103.0 |
| 26 | o4 Mini | 80.0 |
| 27 | O4 Mini Deep Research | 80.0 |
| 28 | OpenAI: o4 Mini High | 80.0 |
| 29 | o3 | 78.0 |
| 30 | Gemini 2.5 Pro | 73.0 |
| 31 | Google: Gemini 2.5 Pro Preview 05-06 | 73.0 |
| 32 | Claude Opus 4 | 72.0 |
| 33 | Claude Opus 4.6 | 72.0 |
| 34 | Claude Opus 4.1 | 72.0 |
| 35 | Claude Opus 4.5 | 72.0 |
| 36 | DeepSeek R1 Distill Llama 70B | 70.0 |
| 37 | DeepSeek R1 | 70.0 |
| 38 | DeepSeek R1 0528 | 70.0 |
| 39 | Claude Sonnet 4.5 | 70.0 |
| 40 | Deepseek Chat | 70.0 |
| 41 | DeepSeek R1 | 70.0 |
| 42 | Deepseek R1 Distill Llama 70b | 70.0 |
| 43 | Deepseek R1 Distill Qwen 32b | 70.0 |
| 44 | Deepseek R1 Distill Qwen 14b | 70.0 |
| 45 | TNG: DeepSeek R1T2 Chimera | 70.0 |
| 46 | DeepSeek: R1 0528 (free) | 70.0 |
| 47 | TNG: DeepSeek R1T Chimera | 70.0 |
| 48 | DeepSeek: R1 Distill Qwen 32B | 70.0 |
| 49 | DeepSeek: R1 Distill Llama 70B | 70.0 |
| 50 | DeepSeek: R1 | 70.0 |
| 51 | o1 | 68.0 |
| 52 | Claude Sonnet 4 | 67.0 |
| 53 | Gemini 2.5 Flash Lite | 65.0 |
| 54 | Gemini 2.5 Flash | 65.0 |
| 55 | Gemini 2.5 Flash Image (Nano Banana) | 65.0 |
| 56 | Google: Gemini 2.5 Flash Preview 09-2025 | 65.0 |
| 57 | Google: Gemini 2.5 Flash Lite Preview 09-2025 | 65.0 |
| 58 | Grok 3 | 64.0 |
| 59 | xAI: Grok 3 Mini Beta | 64.0 |
| 60 | xAI: Grok 3 Beta | 64.0 |
| 61 | Grok 3 Fast Beta | 64.0 |
| 62 | Grok 3 Mini Fast Beta | 64.0 |
| 63 | o3 Mini | 62.9 |
| 64 | OpenAI: o3 Mini High | 62.9 |
| 65 | DeepSeek V3.1 | 59.0 |
| 66 | DeepSeek V3 | 59.0 |
| 67 | DeepSeek V3 0324 Fast | 59.0 |
| 68 | DeepSeek V3 0324 | 59.0 |
| 69 | Deepseek V3 Turbo | 59.0 |
| 70 | Deepseek V3 0324 | 59.0 |
| 71 | Nex AGI: DeepSeek V3.1 Nex N1 | 59.0 |
| 72 | DeepSeek: DeepSeek V3.2 Speciale | 59.0 |
| 73 | DeepSeek: DeepSeek V3.2 | 59.0 |
| 74 | DeepSeek: DeepSeek V3.2 Exp | 59.0 |
| 75 | DeepSeek: DeepSeek V3.1 Terminus (exacto) | 59.0 |
| 76 | DeepSeek V3 0324 | 59.0 |
| 77 | DeepSeek-V3.1 | 59.0 |
| 78 | DeepSeek V3.2 Thinking | 59.0 |
| 79 | Llama 4 Maverick 17b 128e Instruct Fp8 | 58.0 |
| 80 | Meta: Llama 4 Maverick | 58.0 |
| 81 | GPT-4.1 | 57.0 |
| 82 | GPT-4.1 | 57.0 |
| 83 | OpenAI: GPT-4 Turbo (older v1106) | 57.0 |
| 84 | Claude 3.5 Sonnet | 56.2 |
| 85 | Qwen: QwQ 32B | 56.0 |
| 86 | Grok 3 Mini | 55.0 |
| 87 | GPT-4o | 54.4 |
| 88 | Chatgpt 4o | 54.4 |
| 89 | OpenAI: GPT-4o Audio | 54.4 |
| 90 | OpenAI: GPT-4o-mini Search Preview | 54.4 |
| 91 | OpenAI: GPT-4o Search Preview | 54.4 |
| 92 | OpenAI: GPT-4 Turbo | 54.4 |
| 93 | Gemini 2.0 Flash 001 | 53.0 |
| 94 | Google: Gemini 2.0 Flash Lite | 53.0 |
| 95 | Gemini 2.0 Flash | 53.0 |
| 96 | Gemini 2.0 Flash Lite | 53.0 |
| 97 | Meta: Llama 4 Scout | 52.0 |
| 98 | Meta Llama 3.1 405B Instruct | 51.0 |
| 99 | Nous: Hermes 3 405B Instruct (free) | 51.0 |
| 100 | Meta: Llama 3.1 405B (base) | 51.0 |
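
The composite intelligence scores above fold several sub-benchmarks into one number. The exact weighting used here is not published, so the sketch below is only a hypothetical illustration: the sub-benchmark names (MMLU-PRO, GPQA) come from this page, but the weights and the `composite_score` helper are assumptions.

```python
# Hypothetical composite-score sketch. The leaderboard's real formula is not
# published; the sub-benchmark weights below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SubScores:
    mmlu_pro: float   # accuracy, 0-100
    gpqa: float       # accuracy, 0-100
    reasoning: float  # general reasoning benchmark, 0-100

WEIGHTS = {"mmlu_pro": 0.40, "gpqa": 0.35, "reasoning": 0.25}  # assumed

def composite_score(s: SubScores) -> float:
    """Weighted average of sub-benchmark accuracies, on a 0-100 scale."""
    return (WEIGHTS["mmlu_pro"] * s.mmlu_pro
            + WEIGHTS["gpqa"] * s.gpqa
            + WEIGHTS["reasoning"] * s.reasoning)

print(round(composite_score(SubScores(mmlu_pro=78.0, gpqa=65.0, reasoning=70.0)), 1))
# prints roughly 71.5
```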

About LLM Benchmarks

LLM benchmarks measure the capabilities of large language models across key dimensions, and our leaderboard aggregates data from 5+ sources to give a broad, cross-source view of model performance. Intelligence benchmarks like MMLU-PRO and GPQA test knowledge and reasoning. Coding benchmarks from Aider and LiveCodeBench measure practical programming ability. Math benchmarks, including MATH-500 and AIME, test mathematical reasoning. Arena ELO ratings reflect real human preferences in blind pairwise comparisons.
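
Because those sources report on very different scales (Arena-style Elo ratings sit in the low thousands, while accuracy benchmarks top out at 100), scores must be normalized before they can be averaged. Here is a minimal sketch assuming simple per-source min-max normalization; the function and sample data are illustrative, not this site's actual pipeline:

```python
# Min-max normalize each source to [0, 1] so Elo-scale and accuracy-scale
# scores become comparable. Illustrative sketch, not the site's pipeline.

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {model: (s - lo) / span for model, s in scores.items()}

# Hypothetical per-source scores for three models.
arena_elo = {"model-a": 1310.0, "model-b": 1255.0, "model-c": 1180.0}  # Elo scale
mmlu_pro  = {"model-a": 78.0,   "model-b": 65.0,   "model-c": 70.0}    # 0-100 scale

normalized = [min_max_normalize(src) for src in (arena_elo, mmlu_pro)]

# Average each model's normalized scores across the two sources.
for model in arena_elo:
    avg = sum(src[model] for src in normalized) / len(normalized)
    print(f"{model}: {avg:.3f}")
```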

Speed metrics show real-world API performance: output tokens per second measures generation throughput, while time-to-first-token (TTFT) measures initial response latency, which is critical for interactive applications.
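
Both metrics are easy to measure yourself against any streaming endpoint. In the sketch below, `fake_token_stream` is a stand-in for a real streaming client; note that conventions differ on whether TTFT is included in the throughput window, and this version excludes it.

```python
import time
from typing import Iterable, Iterator

def fake_token_stream() -> Iterator[str]:
    """Stand-in for a real streaming LLM client; replace with your API call."""
    time.sleep(0.30)           # simulated time-to-first-token
    for token in ["Measured", " over", " a", " fake", " stream", "."]:
        time.sleep(0.02)       # simulated inter-token latency
        yield token

def measure_speed(stream: Iterable[str]) -> tuple[float, float]:
    """Return (TTFT in seconds, output tokens/sec over the generation window)."""
    start = time.perf_counter()
    first = end = start
    count = 0
    for _ in stream:
        count += 1
        now = time.perf_counter()
        if count == 1:
            first = now
        end = now
    ttft = first - start
    # Throughput measured after the first token arrives; some tools use the
    # full request window instead, which folds TTFT into the rate.
    tps = (count - 1) / (end - first) if count > 1 else 0.0
    return ttft, tps

ttft, tps = measure_speed(fake_token_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.1f} tok/s")
```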