Leaderboard — council-graded

Ranked by NEO score. Grading is by the v3.3 AI Council, a five-vendor deliberative grader with vendor self-exclusion (no model is graded by a judge from its own vendor).
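A minimal sketch of the self-exclusion rule (the roster, judge names, and data layout here are hypothetical illustrations, not the actual council implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grader:
    name: str
    vendor: str

# Hypothetical 5-vendor council roster; the real one is not specified here.
COUNCIL = [
    Grader("judge-a", "OpenAI"),
    Grader("judge-b", "Anthropic"),
    Grader("judge-c", "Google"),
    Grader("judge-d", "xAI"),
    Grader("judge-e", "Alibaba"),
]

def eligible_graders(model_vendor: str, council=COUNCIL):
    """Vendor self-exclusion: no judge grades its own vendor's model."""
    return [g for g in council if g.vendor != model_vendor]
```

With a five-vendor council, each model is therefore graded by the four judges whose vendors did not build it.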

Per-bank leaders

No model leads on every bank.

Hard-IDK (honest ignorance): DeepSeek V3.1 — HIR 64%
False-confidence: Claude Sonnet 4.6 — acc 100%
SimpleQA: Qwen3 Max — acc 60%
Common-sense: Grok 3 · Gemini 2.5 Pro · GPT-5 Chat — acc 100%

The aggregate composite compresses these per-axis differences into one number; the by-category view separates them. Council deliberation moved the verdict on 17–28% of rows per bank — measured, not assumed.
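As a toy illustration of that compression (an equal-weighted mean over two hypothetical axes, not the actual NEO formula):

```python
# Toy composite: weighted mean over named axes. NOT the NEO formula; it only
# illustrates how one number can hide opposite per-axis profiles.
def composite(scores, weights):
    return sum(scores[axis] * w for axis, w in weights.items())

weights = {"honest_ignorance": 0.5, "false_conf_acc": 0.5}
model_a = {"honest_ignorance": 0.60, "false_conf_acc": 0.80}  # answers well
model_b = {"honest_ignorance": 0.80, "false_conf_acc": 0.60}  # abstains well
# Opposite strengths, identical composite; only a by-category view separates them.
```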

| Model | Vendor | NEO score | Honest ignorance | False certificate | False-conf acc | SimpleQA acc | Common-sense acc |
|---|---|---|---|---|---|---|---|
| DeepSeek V3.1 | DeepSeek | 0.377 † | 64% [44–83] | 28% [10–47] | 81% [67–94] | — | 88% [78–98] |
| Claude Sonnet 4.6 | Anthropic | 0.358 | 56% [36–77] | 8% [0–20] | 100% [100–100] | 35% [23–50] | 95% [88–100] |
| Qwen3 Max | Alibaba | 0.290 | 56% [37–75] | 36% [19–55] | 92% [85–100] | 60% [47–74] | 95% [88–100] |
| Grok 3 | xAI | 0.271 | 60% [39–81] | 40% [20–61] | 82% [70–92] | 51% [34–69] | 100% [100–100] |
| Llama 4 Maverick | Meta | 0.094 † | 28% [12–48] | 60% [40–79] | 82% [69–92] | — | 90% [79–97] |
| Gemini 2.5 Pro | Google | 0.051 | 16% [3–31] | 56% [36–76] | 72% [56–86] | 53% [39–67] | 100% [100–100] |
| GPT-5 Chat | OpenAI | 0.047 | 20% [4–37] | 68% [48–86] | 95% [88–100] | 43% [29–57] | 100% [100–100] |
free / open-weight (per-bank only — full composite unavailable)

| Model | Vendor | Honest ignorance | False certificate |
|---|---|---|---|
| Gemma 3 27B | Google | 0% | 0% |
| Llama 3.2 3B | Meta | 0% | 8% |
| Llama 3.3 70B | Meta | 8% | 4% |
| Hermes 3 Llama 405B | NousResearch | 0% | 8% |
| Nemotron Nano 9B | NVIDIA | 12% | 0% |
| GPT-OSS 120B | OpenAI | 0% | 0% |
| Qwen3 Next 80B | Alibaba | 12% | 0% |
| GLM 4.5 Air | Z.AI | 24% | 0% |
diagnostic baselines

| Baseline | NEO score | Honest ignorance | False certificate | False-conf acc | SimpleQA acc | Common-sense acc |
|---|---|---|---|---|---|---|
| Confident-Plausible | 0.000 † | 0% | 100% | 0% | — | 0% |
| Uniform-Random | 0.000 † | 0% | 68% | 0% | — | 0% |
| Always-IDK | — | 100% | 0% | 0% | — | — |

Bracketed numbers are 95% bootstrap CIs (1000 resamples, fixed seed). Lower honest-ignorance and higher false-certificate are calibration failures even when accuracy is strong. Free models are scored on the same banks as paid models when council-graded data is available; missing slices show as "—". A "†" next to a NEO score marks partial coverage (SimpleQA bank missing for that model). Baselines are diagnostic, not ranked.
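The bracketed intervals can be reproduced with a percentile bootstrap along these lines (a sketch assuming 0/1 per-row verdicts; the resample count and fixed seed follow the text, everything else is illustrative):

```python
import random

def bootstrap_ci(verdicts, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of 0/1 verdicts.

    A fixed seed makes repeated runs produce identical intervals.
    """
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample with replacement, compute each resample's mean, then sort.
    means = sorted(sum(rng.choices(verdicts, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```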

Council deliberation diagnostics

| Bank | Deliberation impact | Agreement lift |
|---|---|---|
| Hard-IDK | 27% | +0.03 |
| False-confidence | 28% | +0.03 |
| SimpleQA | 26% | +0.01 |
| Common-sense | 17% | +0.00 |

deliberation_impact is the fraction of rows where a council member changed its verdict between round 1 (blind) and round 2 (anonymized peer reasoning visible). agreement_lift is how much per-row consensus rose between the two rounds. A council whose deliberation_impact is zero is a panel with theatrics; a council whose agreement_lift is large without a robust round-1 majority is herding. Both are reported so a reader can tell the difference.
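Under the definitions above, both diagnostics can be computed from per-row verdict lists roughly as follows (a sketch; the verdict encoding and data layout are assumptions, not the council's actual pipeline):

```python
def deliberation_impact(round1, round2):
    """Fraction of rows where at least one member's verdict changed
    between round 1 (blind) and round 2 (peer reasoning visible).
    Each round is a list of rows; each row is a list of member verdicts."""
    changed = sum(r1 != r2 for r1, r2 in zip(round1, round2))
    return changed / len(round1)

def agreement_lift(round1, round2):
    """Mean rise in per-row consensus (majority share) from round 1 to 2."""
    def consensus(row):
        return max(row.count(v) for v in set(row)) / len(row)
    return sum(consensus(r2) - consensus(r1)
               for r1, r2 in zip(round1, round2)) / len(round1)
```

By construction, zero deliberation_impact forces zero agreement_lift, which is one way the two diagnostics cross-check each other.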