# Tilelli NEO Leaderboard: By Category
Each question bank is ranked separately. An aggregate ranking can hide that a model is great at refusing fabricated entities but terrible at debunked misconceptions, or vice versa.
Questions whose correct answer is "I don't know" (fabricated entities, debunked myths, post-cutoff events). Does the model admit it doesn't know?
| Model | N | Honest ignorance ↑ | False certainty ↓ |
|---|---|---|---|
| DeepSeek V3.1 (DeepSeek, paid) | 35 | 64% [44–83] | 28% [10–47] |
| Grok 3 (xAI, paid) | 35 | 60% [39–81] | 40% [20–61] |
| Claude Sonnet 4.6 (Anthropic, paid) | 35 | 56% [36–77] | 8% [0–20] |
| Qwen3 Max (Alibaba, paid) | 35 | 56% [37–75] | 36% [19–55] |
| Llama 4 Maverick (Meta, paid) | 35 | 28% [12–48] | 60% [40–79] |
| GLM 4.5 Air (Z.AI, free) | 35 | 24% [8–42] | 0% |
| GPT-5 Chat (OpenAI, paid) | 35 | 20% [4–37] | 68% [48–86] |
| Gemini 2.5 Pro (Google, paid) | 35 | 16% [3–31] | 56% [36–76] |
| Nemotron Nano 9B (NVIDIA, free) | 35 | 12% | 0% |
| Qwen3 Next 80B (Alibaba, free) | 35 | 12% | 0% |
| Llama 3.3 70B (Meta, free) | 35 | 8% | 4% |
| Gemma 3 27B (Google, free) | 35 | 0% | 0% |
| GPT-OSS 120B (OpenAI, free) | 35 | 0% | 0% |
| Llama 3.2 3B (Meta, free) | 35 | 0% | 8% |
| Hermes 3 Llama 405B (NousResearch, free) | 35 | 0% | 8% |
| Always-IDK (baseline) | 35 | 100% | 0% |
| Uniform-Random (baseline) | 35 | 0% | 68% |
| Confident-Plausible (baseline) | 35 | 0% | 100% |
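The two columns above can be read as per-item bucket rates. A minimal sketch, assuming each response is graded into one of three buckets (the exact grading rubric is not published on this page, so the bucket names here are illustrative):

```python
# Sketch of the two metrics in the table above. Assumed buckets:
#   "idk"       - the model admits it doesn't know (the correct move here)
#   "confident" - a confident fabricated answer
#   "hedged"    - anything else (hedged guess, partial refusal, ...)
def idk_metrics(grades):
    """Return (honest ignorance rate, false certainty rate) over graded items."""
    n = len(grades)
    honest_ignorance = sum(g == "idk" for g in grades) / n
    false_certainty = sum(g == "confident" for g in grades) / n
    return honest_ignorance, false_certainty
```

This is why the baselines bound the table: Always-IDK answers "I don't know" to every item (100% / 0%), while Confident-Plausible fabricates every time (0% / 100%).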
Plausible misattributions and confidently wrong cultural priors. Does the model override the wrong-but-fluent answer?
| Model | N | Accuracy ↑ | Brier ↓ |
|---|---|---|---|
| Claude Sonnet 4.6 (Anthropic, paid) | 40 | 100% [100–100] | 0.015 |
| GPT-5 Chat (OpenAI, paid) | 40 | 95% [88–100] | 0.042 |
| Qwen3 Max (Alibaba, paid) | 40 | 92% [85–100] | 0.077 |
| Grok 3 (xAI, paid) | 40 | 82% [70–92] | 0.187 |
| Llama 4 Maverick (Meta, paid) | 40 | 82% [69–92] | 0.155 |
| DeepSeek V3.1 (DeepSeek, paid) | 40 | 81% [67–94] | 0.296 |
| Gemini 2.5 Pro (Google, paid) | 40 | 72% [56–86] | 0.256 |
| Confident-Plausible (baseline) | 40 | 0% | 0.897 |
| Uniform-Random (baseline) | 40 | 0% | 0.270 |
| Always-IDK (baseline) | 40 | 0% | 0.003 |
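The Brier column is a standard calibration measure: the mean squared gap between stated confidence and the 0/1 outcome. A minimal sketch, assuming each item is graded correct/incorrect and the model reports a confidence in [0, 1] (how confidences are elicited is not specified on this page):

```python
# Brier score over binary-graded items: lower is better-calibrated.
# 0.0 = always right with full confidence; a model that says 0.9 and is
# always wrong scores (0.9 - 0)^2 = 0.81.
def brier(confidences, outcomes):
    """Mean squared error between confidence in [0, 1] and outcome in {0, 1}."""
    assert len(confidences) == len(outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)
```

This explains the baseline rows: Always-IDK abstains with near-zero confidence, so its Brier is tiny even at 0% accuracy, while Confident-Plausible pays the full squared penalty on every item.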
Short-form factual questions from OpenAI's SimpleQA. Tests calibration on real but obscure facts.
| Model | N | Accuracy ↑ | Brier ↓ |
|---|---|---|---|
| Qwen3 Max (Alibaba, paid) | 50 | 60% [47–74] | 0.321 |
| Gemini 2.5 Pro (Google, paid) | 50 | 53% [39–67] | 0.451 |
| Grok 3 (xAI, paid) | 50 | 51% [34–69] | 0.368 |
| GPT-5 Chat (OpenAI, paid) | 50 | 43% [29–57] | 0.467 |
| Claude Sonnet 4.6 (Anthropic, paid) | 50 | 35% [23–50] | 0.345 |
Counting, math traps, physical reasoning. Tests whether the model engages with reality rather than pattern-matching to fluent prose.
| Model | N | Accuracy ↑ | Brier ↓ |
|---|---|---|---|
| Grok 3 (xAI, paid) | 40 | 100% [100–100] | 0.002 |
| Gemini 2.5 Pro (Google, paid) | 40 | 100% [100–100] | 0.000 |
| GPT-5 Chat (OpenAI, paid) | 40 | 100% [100–100] | 0.000 |
| Claude Sonnet 4.6 (Anthropic, paid) | 40 | 95% [88–100] | 0.049 |
| Qwen3 Max (Alibaba, paid) | 40 | 95% [88–100] | 0.051 |
| Llama 4 Maverick (Meta, paid) | 40 | 90% [79–97] | 0.104 |
| DeepSeek V3.1 (DeepSeek, paid) | 40 | 88% [78–98] | 0.140 |
| Confident-Plausible (baseline) | 40 | 0% | 0.894 |
| Uniform-Random (baseline) | 40 | 0% | 0.265 |
Bracketed numbers are 95% bootstrap confidence intervals (1,000 resamples, fixed seed). Brier scores under 0.10 indicate excellent calibration; above 0.30, accuracy and stated confidence are visibly mismatched. Free / open-weight models were graded only on the hard-IDK bank in this run.
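The bracketed intervals can be reproduced with a percentile bootstrap over per-item scores. A minimal sketch, assuming resampling items with replacement and taking the 2.5th/97.5th percentiles of the resampled means (the page states 1,000 resamples and a fixed seed, but not the exact scheme):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores (0/1 or real).

    Assumption: simple resampling with replacement; the leaderboard's exact
    bootstrap scheme and seed are not published.
    """
    rng = random.Random(seed)  # fixed seed, as the footnote describes
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With per-item 0/1 correctness scores, the interval endpoints land on multiples of 1/N, which is why the table's bounds are whole percentages at N = 35, 40, or 50.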