Leaderboard by category

Each question bank is ranked separately. Aggregate rankings can hide that a model is great at refusing fabricated entities but terrible at debunked misconceptions, or vice versa.

Hard-IDK — honest ignorance vs false certificate

Questions whose correct answer is "I don't know" (fabricated entities, debunked myths, post-cutoff events). Does the model admit it doesn't know?

| Model | Provider | N | Honest ignorance ↑ | False certificate ↓ |
| --- | --- | --- | --- | --- |
| DeepSeek V3.1 | DeepSeek | 35 | 64% [44–83] | 28% [10–47] |
| Grok 3 | xAI | 35 | 60% [39–81] | 40% [20–61] |
| Claude Sonnet 4.6 | Anthropic | 35 | 56% [36–77] | 8% [0–20] |
| Qwen3 Max | Alibaba | 35 | 56% [37–75] | 36% [19–55] |
| Llama 4 Maverick | Meta | 35 | 28% [12–48] | 60% [40–79] |
| GLM 4.5 Air (free) | Z.AI | 35 | 24% [8–42] | 0% |
| GPT-5 Chat | OpenAI | 35 | 20% [4–37] | 68% [48–86] |
| Gemini 2.5 Pro | Google | 35 | 16% [3–31] | 56% [36–76] |
| Nemotron Nano 9B (free) | NVIDIA | 35 | 12% | 0% |
| Qwen3 Next 80B (free) | Alibaba | 35 | 12% | 0% |
| Llama 3.3 70B (free) | Meta | 35 | 8% | 4% |
| Gemma 3 27B (free) | Google | 35 | 0% | 0% |
| GPT-OSS 120B (free) | OpenAI | 35 | 0% | 0% |
| Llama 3.2 3B (free) | Meta | 35 | 0% | 8% |
| Hermes 3 Llama 405B (free) | NousResearch | 35 | 0% | 8% |
| Always-IDK | baseline | 35 | 100% | 0% |
| Uniform-Random | baseline | 35 | 0% | 68% |
| Confident-Plausible | baseline | 35 | 0% | 100% |
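
The two hard-IDK columns are simple rates over the 35 questions. Below is a minimal sketch, assuming each graded response is bucketed as an abstention, a confident answer, or something in between (the two rates don't always sum to 100%, so a third bucket must exist). The bucket names and function are hypothetical illustrations, not NEO's actual grading schema.

```python
# Hypothetical hard-IDK scoring sketch. Bucket labels ("abstain",
# "confident_answer", "hedged") are illustrative assumptions.

def hard_idk_rates(labels: list[str]) -> tuple[float, float]:
    """Return (honest ignorance, false certificate) as fractions of N."""
    n = len(labels)
    honest = sum(1 for lab in labels if lab == "abstain") / n
    false_cert = sum(1 for lab in labels if lab == "confident_answer") / n
    return honest, false_cert

# 35 questions: 21 abstentions, 14 confident answers.
labels = ["abstain"] * 21 + ["confident_answer"] * 14
print(hard_idk_rates(labels))  # (0.6, 0.4) -- matches Grok 3's row
```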

False-confidence — common misconceptions and garden-path traps

Plausible misattributions and confidently wrong cultural priors. Does the model override the wrong-but-fluent answer?

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | Anthropic | 40 | 100% [100–100] | 0.015 |
| GPT-5 Chat | OpenAI | 40 | 95% [88–100] | 0.042 |
| Qwen3 Max | Alibaba | 40 | 92% [85–100] | 0.077 |
| Grok 3 | xAI | 40 | 82% [70–92] | 0.187 |
| Llama 4 Maverick | Meta | 40 | 82% [69–92] | 0.155 |
| DeepSeek V3.1 | DeepSeek | 40 | 81% [67–94] | 0.296 |
| Gemini 2.5 Pro | Google | 40 | 72% [56–86] | 0.256 |
| Confident-Plausible | baseline | 40 | 0% | 0.897 |
| Uniform-Random | baseline | 40 | 0% | 0.270 |
| Always-IDK | baseline | 40 | 0% | 0.003 |
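
For reference, the Brier column, under the usual binary definition (an assumption here, since this page doesn't spell it out), is the mean squared gap between the model's stated confidence and the 0/1 correctness of its answer; lower is better.

```python
# Standard binary Brier score: mean of (confidence - outcome)^2.
# Assumes NEO elicits a numeric confidence per answer; the exact
# elicitation format is not documented in this section.

def brier(confidences: list[float], correct: list[bool]) -> float:
    return sum((p - float(c)) ** 2
               for p, c in zip(confidences, correct)) / len(confidences)

# A model answering everything at 0.95 confidence but right 80% of the time:
print(brier([0.95] * 10, [True] * 8 + [False] * 2))  # 0.1825
```

This squared-gap shape is consistent with the baseline rows above: a confidently wrong baseline (Confident-Plausible) lands near 0.9, while an always-abstaining one scores near zero.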

SimpleQA — short-form factual recall

Short-form factual questions drawn from OpenAI's SimpleQA. Tests calibration on real but obscure facts.

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Qwen3 Max | Alibaba | 50 | 60% [47–74] | 0.321 |
| Gemini 2.5 Pro | Google | 50 | 53% [39–67] | 0.451 |
| Grok 3 | xAI | 50 | 51% [34–69] | 0.368 |
| GPT-5 Chat | OpenAI | 50 | 43% [29–57] | 0.467 |
| Claude Sonnet 4.6 | Anthropic | 50 | 35% [23–50] | 0.345 |

Common-sense — absurd-premise refusal & physical reasoning

Counting, math traps, physical reasoning. Tests whether the model engages with reality rather than pattern-matching to fluent prose.

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Grok 3 | xAI | 40 | 100% [100–100] | 0.002 |
| Gemini 2.5 Pro | Google | 40 | 100% [100–100] | 0.000 |
| GPT-5 Chat | OpenAI | 40 | 100% [100–100] | 0.000 |
| Claude Sonnet 4.6 | Anthropic | 40 | 95% [88–100] | 0.049 |
| Qwen3 Max | Alibaba | 40 | 95% [88–100] | 0.051 |
| Llama 4 Maverick | Meta | 40 | 90% [79–97] | 0.104 |
| DeepSeek V3.1 | DeepSeek | 40 | 88% [78–98] | 0.140 |
| Confident-Plausible | baseline | 40 | 0% | 0.894 |
| Uniform-Random | baseline | 40 | 0% | 0.265 |

Bracketed numbers are 95% bootstrap confidence intervals (1000 resamples, fixed seed). Brier scores under 0.10 indicate excellent calibration; above 0.30, accuracy and confidence are visibly mismatched. Free / open-weight models were graded only on hard-IDK in this run.
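
A minimal sketch of the bootstrap described above: 1000 resamples with a fixed seed, taking the 2.5th and 97.5th percentiles of the resampled means. The resample count and fixed seed come from the text; the percentile method and the seed value are assumptions.

```python
import random

def bootstrap_ci(scores: list[float], resamples: int = 1000, seed: int = 0):
    """95% percentile bootstrap CI on the mean of per-question scores."""
    rng = random.Random(seed)  # fixed seed -> reproducible intervals
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    return means[int(0.025 * resamples)], means[int(0.975 * resamples)]

# Example: 50 SimpleQA questions, 30 correct (60% accuracy).
lo, hi = bootstrap_ci([1.0] * 30 + [0.0] * 20)
print(f"[{lo:.0%}-{hi:.0%}]")  # roughly [46%-74%], cf. Qwen3 Max's row
```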