Code Arena

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Last Updated

Jan 16, 2026

Total Votes

105,851

Total Models

34

Rank	Rank Spread	Model	Score	95% CI	Votes	Organization	License
1	1◄─►1	claude-opus-4-5-20251101-thinking-32k	1510	+10/-10	6,717	Anthropic	Proprietary
2	2◄─►4	claude-opus-4-5-20251101	1478	+10/-10	6,326	Anthropic	Proprietary
3	2◄─►4	gpt-5.2-high	1477	+16/-16	1,691	OpenAI	Proprietary
4	2◄─►5	gemini-3-pro	1467	+8/-8	13,138	Google	Proprietary
5	4◄─►6	gemini-3-flash	1450	+9/-9	6,563	Google	Proprietary
6	5◄─►6	glm-4.7	1447	+10/-10	4,833	Z.ai	MIT
7	7◄─►9	minimax-m2.1-preview	1422	+9/-9	6,387	MiniMax	MIT
8	7◄─►10	gemini-3-flash (thinking-minimal)	1416	+10/-10	4,649	Google	Proprietary
9	7◄─►14	gpt-5.2	1401	+15/-15	1,628	OpenAI	Proprietary
10	8◄─►14	gpt-5-medium	1398	+12/-12	3,928	OpenAI	Proprietary
11	9◄─►15	gpt-5.1-medium	1393	+9/-9	6,587	OpenAI	Proprietary
12	9◄─►14	claude-sonnet-4-5-20250929-thinking-32k	1393	+8/-8	10,271	Anthropic	Proprietary
13	9◄─►15	claude-opus-4-1-20250805	1391	+8/-8	9,118	Anthropic	Proprietary
14	9◄─►15	claude-sonnet-4-5-20250929	1386	+8/-8	11,837	Anthropic	Proprietary
15	12◄─►17	deepseek-v3.2-thinking	1373	+12/-12	2,996	DeepSeek	MIT
16	15◄─►18	glm-4.6	1361	+8/-8	8,883	Z.ai	MIT
17	15◄─►18	gpt-5.1	1356	+8/-8	9,179	OpenAI	Proprietary
18	16◄─►20	mimo-v2-flash (non-thinking)	1343	+11/-11	3,215	Xiaomi	MIT
19	18◄─►20	kimi-k2-thinking-turbo	1337	+8/-8	8,901	Moonshot	Modified MIT
20	18◄─►21	gpt-5.1-codex	1335	+9/-9	6,659	OpenAI	Proprietary
21	20◄─►21	minimax-m2	1318	+8/-8	8,990	MiniMax	Apache 2.0
22	22◄─►25	claude-haiku-4-5-20251001	1297	+8/-8	10,012	Anthropic	Proprietary
23	22◄─►25	deepseek-v3.2	1295	+11/-11	3,932	DeepSeek	MIT
24	22◄─►25	deepseek-v3.2-exp	1291	+10/-10	5,127	DeepSeek	MIT
25	22◄─►26	qwen3-coder-480b-a35b-instruct	1286	+8/-8	9,832	Alibaba	Apache 2.0
26	25◄─►27	KAT-Coder-Pro-V1	1264	+15/-15	1,956	KwaiKAT	Proprietary
27	26◄─►29	gpt-5.1-codex-mini	1248	+17/-17	1,538	OpenAI	Proprietary
28	27◄─►30	grok-4-1-fast-reasoning	1235	+12/-12	4,424	xAI	Proprietary
29	27◄─►31	mistral-large-3	1226	+20/-20	1,038	Mistral	Apache 2.0
30	29◄─►31	gemini-2.5-pro	1210	+13/-13	3,454	Google	Proprietary
31	28◄─►31	grok-4.1-thinking	1209	+19/-19	1,265	xAI	Proprietary
32	32◄─►33	grok-4-fast-reasoning	1157	+22/-22	970	xAI	Proprietary
33	32◄─►34	grok-code-fast-1	1144	+21/-21	1,015	xAI	Proprietary
34	33◄─►34	devstral-medium-2507	1102	+22/-22	1,020	Mistral	Proprietary
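
To help read score gaps in the table, here is a minimal sketch that turns two scores into an expected head-to-head win probability. It assumes the scores sit on an Elo-style logistic scale, where a 400-point gap corresponds to 10:1 odds. The page does not state this, so treat the formula and the `scale` parameter as assumptions. The function name is hypothetical.

```python
def expected_win_probability(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under an ASSUMED Elo-style logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


# Scores copied from the table above.
top = 1510     # claude-opus-4-5-20251101-thinking-32k
second = 1478  # claude-opus-4-5-20251101

p = expected_win_probability(top, second)
print(f"Expected win probability of rank 1 over rank 2: {p:.3f}")
```

Under this assumption, the 32-point gap between the top two entries corresponds to a win probability of roughly 0.55, so even a clear rank lead implies only a modest head-to-head edge.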

Code Arena: Leaderboard Plots (Style Control Removed)

Fraction of Model A Wins for All Non-tied A vs. B Battles
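
A pairwise win fraction like the one this plot names could be computed as in the sketch below. The record layout `(model_a, model_b, winner)` and the winner labels `"model_a"`, `"model_b"`, and `"tie"` are assumptions, and the model names and battles are made-up placeholders, not leaderboard data.

```python
from collections import defaultdict


def win_fractions(battles):
    """Return {(x, y): fraction of non-tied x-vs-y battles won by x}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for a, b, winner in battles:
        if winner == "tie":
            continue  # ties are excluded, as the plot title states
        # Count the battle for both orderings so the result is symmetric.
        totals[(a, b)] += 1
        totals[(b, a)] += 1
        if winner == "model_a":
            wins[(a, b)] += 1
        else:
            wins[(b, a)] += 1
    return {pair: wins[pair] / n for pair, n in totals.items()}


# Hypothetical battles between placeholder models "m1" and "m2".
battles = [
    ("m1", "m2", "model_a"),  # m1 wins
    ("m2", "m1", "model_a"),  # m2 wins
    ("m1", "m2", "tie"),      # ignored
    ("m1", "m2", "model_a"),  # m1 wins
]
fractions = win_fractions(battles)
print(fractions)
```

Counting each battle under both orderings means the fractions for `(x, y)` and `(y, x)` always sum to 1, regardless of which side a model was shown on.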

Battle Count for Each Combination of Models (without Ties)

Confidence Intervals on Model Strength (via Bootstrapping)
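
The bootstrapping idea behind intervals such as the `+10/-10` values in the table can be sketched as follows: resample the battles with replacement, recompute the statistic each time, and read off percentiles. The page does not describe how the leaderboard fits its scores, so this sketch uses a plain win rate as a stand-in statistic. The function name, parameters, and outcome data are all hypothetical.

```python
import random


def bootstrap_interval(samples, statistic, rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        statistic([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(rounds)
    )
    lower = estimates[int((alpha / 2) * rounds)]
    upper = estimates[int((1 - alpha / 2) * rounds) - 1]
    return lower, upper


def win_rate(outcomes):
    """Stand-in statistic: share of battles won (1 = win, 0 = loss)."""
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for one placeholder model: 60 wins, 40 losses.
outcomes = [1] * 60 + [0] * 40
lower, upper = bootstrap_interval(outcomes, win_rate)
print(f"win rate {win_rate(outcomes):.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Swapping `win_rate` for a rating fit would give intervals on model strength. The interval narrows as the number of battles grows, which fits the table's pattern of wider intervals for models with fewer votes.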

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)
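
One way to read "uniform sampling and no ties" is that each model's win probability is averaged over every other model with equal weight. The sketch below does this for three scores from the table. It assumes an Elo-style logistic link between score gaps and win probability, which the page does not confirm, and it limits the pool to three models only for brevity.

```python
def expected_win_probability(score_a, score_b, scale=400.0):
    """ASSUMED Elo-style logistic model; not confirmed by the source page."""
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


def average_win_rates(scores):
    """Mean predicted win probability of each model against all others."""
    rates = {}
    for name, score in scores.items():
        opponents = [s for other, s in scores.items() if other != name]
        rates[name] = sum(
            expected_win_probability(score, s) for s in opponents
        ) / len(opponents)
    return rates


# Scores copied from the first three rows of the table above.
scores = {
    "claude-opus-4-5-20251101-thinking-32k": 1510,
    "claude-opus-4-5-20251101": 1478,
    "gpt-5.2-high": 1477,
}
rates = average_win_rates(scores)
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Because every opponent is weighted equally, the result does not depend on how many battles each pairing actually received.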