Leaderboard by category

Each question bank is ranked separately. Aggregate rankings can hide that a model is great at refusing fabricated entities but terrible at debunked misconceptions, or vice versa.

Hard-IDK — honest ignorance vs false certificate

Questions whose correct answer is "I don't know" (fabricated entities, debunked myths, post-cutoff events). Does the model admit it doesn't know?

| Model | Provider | N | Honest ignorance ↑ | False certificate ↓ |
| --- | --- | --- | --- | --- |
| DeepSeek V3.1 | DeepSeek | 35 | 64% [44–83] | 28% [10–47] |
| Grok 3 | xAI | 35 | 60% [39–81] | 40% [20–61] |
| Claude Sonnet 4.6 | Anthropic | 35 | 56% [36–77] | 8% [0–20] |
| Qwen3 Max | Alibaba | 35 | 56% [37–75] | 36% [19–55] |
| Llama 4 Maverick | Meta | 35 | 28% [12–48] | 60% [40–79] |
| GLM 4.5 Air (free) | Z.AI | 35 | 24% [8–42] | 0% |
| GPT-5 Chat | OpenAI | 35 | 20% [4–37] | 68% [48–86] |
| Gemini 2.5 Pro | Google | 35 | 16% [3–31] | 56% [36–76] |
| Nemotron Nano 9B (free) | NVIDIA | 35 | 12% | 0% |
| Qwen3 Next 80B (free) | Alibaba | 35 | 12% | 0% |
| Llama 3.3 70B (free) | Meta | 35 | 8% | 4% |
| Gemma 3 27B (free) | Google | 35 | 0% | 0% |
| GPT-OSS 120B (free) | OpenAI | 35 | 0% | 0% |
| Llama 3.2 3B (free) | Meta | 35 | 0% | 8% |
| Hermes 3 Llama 405B (free) | NousResearch | 35 | 0% | 8% |
| Always-IDK | baseline | 35 | 100% | 0% |
| Uniform-Random | baseline | 35 | 0% | 68% |
| Confident-Plausible | baseline | 35 | 0% | 100% |
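
The two hard-IDK columns are simple rates over the 35 questions. Below is a minimal sketch, assuming each graded response is bucketed as an abstention, a confident answer, or something in between (the two rates don't always sum to 100%, so a third bucket must exist). The bucket names and function are hypothetical illustrations, not NEO's actual grading schema.

```python
# Hypothetical hard-IDK scoring sketch. Bucket labels ("abstain",
# "confident_answer", "hedged") are illustrative assumptions.

def hard_idk_rates(labels: list[str]) -> tuple[float, float]:
    """Return (honest ignorance, false certificate) as fractions of N."""
    n = len(labels)
    honest = sum(1 for lab in labels if lab == "abstain") / n
    false_cert = sum(1 for lab in labels if lab == "confident_answer") / n
    return honest, false_cert

# 35 questions: 21 abstentions, 14 confident answers.
labels = ["abstain"] * 21 + ["confident_answer"] * 14
print(hard_idk_rates(labels))  # (0.6, 0.4) -- matches Grok 3's row
```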

False-confidence — common misconceptions and garden-path traps

Plausible misattributions and confidently wrong cultural priors. Does the model override the wrong-but-fluent answer?

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | Anthropic | 40 | 100% [100–100] | 0.015 |
| GPT-5 Chat | OpenAI | 40 | 95% [88–100] | 0.042 |
| Qwen3 Max | Alibaba | 40 | 92% [85–100] | 0.077 |
| Grok 3 | xAI | 40 | 82% [70–92] | 0.187 |
| Llama 4 Maverick | Meta | 40 | 82% [69–92] | 0.155 |
| DeepSeek V3.1 | DeepSeek | 40 | 81% [67–94] | 0.296 |
| Gemini 2.5 Pro | Google | 40 | 72% [56–86] | 0.256 |
| Confident-Plausible | baseline | 40 | 0% | 0.897 |
| Uniform-Random | baseline | 40 | 0% | 0.270 |
| Always-IDK | baseline | 40 | 0% | 0.003 |
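
For reference, the Brier column, under the usual binary definition (an assumption here, since this page doesn't spell it out), is the mean squared gap between the model's stated confidence and the 0/1 correctness of its answer; lower is better.

```python
# Standard binary Brier score: mean of (confidence - outcome)^2.
# Assumes NEO elicits a numeric confidence per answer; the exact
# elicitation format is not documented in this section.

def brier(confidences: list[float], correct: list[bool]) -> float:
    return sum((p - float(c)) ** 2
               for p, c in zip(confidences, correct)) / len(confidences)

# A model answering everything at 0.95 confidence but right 80% of the time:
print(brier([0.95] * 10, [True] * 8 + [False] * 2))  # 0.1825
```

This squared-gap shape is consistent with the baseline rows above: a confidently wrong baseline (Confident-Plausible) lands near 0.9, while an always-abstaining one scores near zero.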

SimpleQA — short-form factual recall

Short-form factual questions drawn from OpenAI's SimpleQA. Tests calibration on real but obscure facts.

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Qwen3 Max | Alibaba | 50 | 60% [47–74] | 0.321 |
| Gemini 2.5 Pro | Google | 50 | 53% [39–67] | 0.451 |
| Grok 3 | xAI | 50 | 51% [34–69] | 0.368 |
| GPT-5 Chat | OpenAI | 50 | 43% [29–57] | 0.467 |
| Claude Sonnet 4.6 | Anthropic | 50 | 35% [23–50] | 0.345 |

Common-sense — absurd-premise refusal & physical reasoning

Counting, math traps, physical reasoning. Tests whether the model engages with reality rather than pattern-matching to fluent prose.

| Model | Provider | N | Accuracy ↑ | Brier ↓ |
| --- | --- | --- | --- | --- |
| Grok 3 | xAI | 40 | 100% [100–100] | 0.002 |
| Gemini 2.5 Pro | Google | 40 | 100% [100–100] | 0.000 |
| GPT-5 Chat | OpenAI | 40 | 100% [100–100] | 0.000 |
| Claude Sonnet 4.6 | Anthropic | 40 | 95% [88–100] | 0.049 |
| Qwen3 Max | Alibaba | 40 | 95% [88–100] | 0.051 |
| Llama 4 Maverick | Meta | 40 | 90% [79–97] | 0.104 |
| DeepSeek V3.1 | DeepSeek | 40 | 88% [78–98] | 0.140 |
| Confident-Plausible | baseline | 40 | 0% | 0.894 |
| Uniform-Random | baseline | 40 | 0% | 0.265 |

Bracketed numbers are 95% bootstrap confidence intervals (1000 resamples, fixed seed). Brier scores under 0.10 indicate excellent calibration; above 0.30, accuracy and confidence are visibly mismatched. Free / open-weight models were graded only on hard-IDK in this run.
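
A minimal sketch of the bootstrap described above: 1000 resamples with a fixed seed, taking the 2.5th and 97.5th percentiles of the resampled means. The resample count and fixed seed come from the text; the percentile method and the seed value are assumptions.

```python
import random

def bootstrap_ci(scores: list[float], resamples: int = 1000, seed: int = 0):
    """95% percentile bootstrap CI on the mean of per-question scores."""
    rng = random.Random(seed)  # fixed seed -> reproducible intervals
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    return means[int(0.025 * resamples)], means[int(0.975 * resamples)]

# Example: 50 SimpleQA questions, 30 correct (60% accuracy).
lo, hi = bootstrap_ci([1.0] * 30 + [0.0] * 20)
print(f"[{lo:.0%}-{hi:.0%}]")  # roughly [46%-74%], cf. Qwen3 Max's row
```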