Code Arena

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Last Updated: Jan 16, 2026
Total Votes: 105,851
Total Models: 34

Rank  Rank Spread  Score  CI       Votes   Organization  License
1     1◄─►1        1510   +10/-10  6,717   Anthropic     Proprietary
2     2◄─►4        1478   +10/-10  6,326   Anthropic     Proprietary
3     2◄─►4        1477   +16/-16  1,691   OpenAI        Proprietary
4     2◄─►5        1467   +8/-8    13,138  Google        Proprietary
5     4◄─►6        1450   +9/-9    6,563   Google        Proprietary
6     5◄─►6        1447   +10/-10  4,833   Z.ai          MIT
7     7◄─►9        1422   +9/-9    6,387   MiniMax       MIT
8     7◄─►10       1416   +10/-10  4,649   Google        Proprietary
9     7◄─►14       1401   +15/-15  1,628   OpenAI        Proprietary
10    8◄─►14       1398   +12/-12  3,928   OpenAI        Proprietary
11    9◄─►15       1393   +9/-9    6,587   OpenAI        Proprietary
12    9◄─►14       1393   +8/-8    10,271  Anthropic     Proprietary
13    9◄─►15       1391   +8/-8    9,118   Anthropic     Proprietary
14    9◄─►15       1386   +8/-8    11,837  Anthropic     Proprietary
15    12◄─►17      1373   +12/-12  2,996   DeepSeek      MIT
16    15◄─►18      1361   +8/-8    8,883   Z.ai          MIT
17    15◄─►18      1356   +8/-8    9,179   OpenAI        Proprietary
18    16◄─►20      1343   +11/-11  3,215   Xiaomi        MIT
19    18◄─►20      1337   +8/-8    8,901   Moonshot      Modified MIT
20    18◄─►21      1335   +9/-9    6,659   OpenAI        Proprietary
21    20◄─►21      1318   +8/-8    8,990   MiniMax       Apache 2.0
22    22◄─►25      1297   +8/-8    10,012  Anthropic     Proprietary
23    22◄─►25      1295   +11/-11  3,932   DeepSeek      MIT
24    22◄─►25      1291   +10/-10  5,127   DeepSeek      MIT
25    22◄─►26      1286   +8/-8    9,832   Alibaba       Apache 2.0
26    25◄─►27      1264   +15/-15  1,956   KwaiKAT       Proprietary
27    26◄─►29      1248   +17/-17  1,538   OpenAI        Proprietary
28    27◄─►30      1235   +12/-12  4,424   xAI           Proprietary
29    27◄─►31      1226   +20/-20  1,038   Mistral       Apache 2.0
30    29◄─►31      1210   +13/-13  3,454   Google        Proprietary
31    28◄─►31      1209   +19/-19  1,265   xAI           Proprietary
32    32◄─►33      1157   +22/-22  970     xAI           Proprietary
33    32◄─►34      1144   +21/-21  1,015   xAI           Proprietary
34    33◄─►34      1102   +22/-22  1,020   Mistral       Proprietary
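
The page does not say how the rank spread is computed. One rule that appears consistent with the figures above is interval overlap: a model's best possible rank is one plus the number of models whose lower bound lies above its upper bound, and its worst possible rank is one plus the number of other models whose upper bound reaches its lower bound. The sketch below applies that assumed rule to the score and interval values of the top seven rows; the function name `rank_spreads` and the rule itself are illustrative assumptions, not the leaderboard's documented method.

```python
# (score, ci_plus, ci_minus) for the top seven rows of the leaderboard.
ROWS = [
    (1510, 10, 10),
    (1478, 10, 10),
    (1477, 16, 16),
    (1467, 8, 8),
    (1450, 9, 9),
    (1447, 10, 10),
    (1422, 9, 9),
]


def rank_spreads(rows):
    """Return (best_rank, worst_rank) per row under an assumed overlap rule."""
    intervals = [(score - minus, score + plus) for score, plus, minus in rows]
    spreads = []
    for i, (lo_i, hi_i) in enumerate(intervals):
        # Models whose whole interval sits above this one are surely ahead.
        better = sum(
            1 for j, (lo_j, _) in enumerate(intervals) if j != i and lo_j > hi_i
        )
        # Models whose interval reaches this one's lower bound might be ahead.
        not_worse = sum(
            1 for j, (_, hi_j) in enumerate(intervals) if j != i and hi_j >= lo_i
        )
        spreads.append((1 + better, 1 + not_worse))
    return spreads


for rank, (best, worst) in enumerate(rank_spreads(ROWS), start=1):
    print(f"{rank}: {best}-{worst}")
```

For ranks 1 through 6 this reproduces the published spreads. The seventh row comes out narrower than the published 7 to 9 only because rows 8 onward are left out of the sample, and because the published scores are rounded, the rule may not match every row exactly.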

Leaderboard Plots

Fraction of Model A Wins for All Non-tied A vs. B Battles

Battle Count for Each Combination of Models (without Ties)

Confidence Intervals on Model Strength (via Bootstrapping)

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)
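
The quantities named in these plot titles can be sketched from a raw battle log. The example below is a minimal illustration on an invented log: the model names, the `(model_a, model_b, winner)` record layout, and all function names are hypothetical. It counts non-tied battles per pair, derives pairwise win fractions, averages them with every opponent weighted equally, and bootstraps an interval. Because the page does not specify the rating model behind "Model Strength", the bootstrap here resamples the average win rate rather than a fitted rating.

```python
import random
from collections import defaultdict

# Hypothetical battle log; winner is "a", "b", or "tie".
BATTLES = [
    ("m1", "m2", "a"),
    ("m1", "m2", "a"),
    ("m2", "m1", "a"),
    ("m1", "m3", "a"),
    ("m3", "m1", "tie"),
    ("m2", "m3", "b"),
    ("m2", "m3", "a"),
    ("m3", "m2", "b"),
]


def pairwise_stats(battles):
    """Return (wins, counts) over non-tied battles only.

    wins[x][y]   = non-tied battles between x and y that x won
    counts[x][y] = non-tied battles between x and y (symmetric)
    """
    wins = defaultdict(lambda: defaultdict(int))
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, winner in battles:
        if winner == "tie":
            continue
        victor, loser = (a, b) if winner == "a" else (b, a)
        wins[victor][loser] += 1
        counts[a][b] += 1
        counts[b][a] += 1
    return wins, counts


def average_win_rate(battles, model):
    """Mean win fraction against each opponent met, opponents weighted equally."""
    wins, counts = pairwise_stats(battles)
    fractions = [wins[model][opp] / n for opp, n in counts[model].items() if n > 0]
    return sum(fractions) / len(fractions) if fractions else float("nan")


def bootstrap_interval(battles, model, rounds=200, seed=0):
    """Percentile interval (2.5% to 97.5%) from resampling battles with replacement."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        rate = average_win_rate(resample, model)
        if rate == rate:  # skip NaN when the model has no non-tied battle
            samples.append(rate)
    samples.sort()
    lo = samples[int(0.025 * (len(samples) - 1))]
    hi = samples[int(0.975 * (len(samples) - 1))]
    return lo, hi


wins, counts = pairwise_stats(BATTLES)
print(counts["m1"]["m2"], wins["m1"]["m2"] / counts["m1"]["m2"])
print(average_win_rate(BATTLES, "m1"))
print(bootstrap_interval(BATTLES, "m1"))
```

Averaging per-opponent fractions, instead of pooling all battles, keeps a frequently sampled pairing from dominating the result, which is one reading of "Uniform Sampling" in the plot title.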