Ranked by NEO score. Grading is by the v3.3 AI Council, a 5-vendor deliberative grader with vendor self-exclusion.

Per-bank leaders:

- Hard-IDK (honest ignorance): DeepSeek V3.1 (HIR, honest-ignorance rate, 64%)
- False-confidence: Claude Sonnet 4.6 (acc 100%)
- SimpleQA: Qwen3 Max (acc 60%)
- Common-sense: Grok 3, Gemini 2.5 Pro, GPT-5 Chat (acc 100%)
The aggregate composite compresses these per-axis differences into one number; the by-category view separates them. Council deliberation moved the verdict on 17–28% of rows per bank — measured, not assumed.
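The report does not spell out the NEO formula, so purely as an illustration of how any composite compresses per-axis differences, here is a generic weighted mean over [0, 1] axis scores; the function name, weights, and axis labels are assumptions, not the actual NEO definition. Two models with opposite calibration profiles can collapse to the same composite, which is exactly why the by-category view matters.

```python
def composite(axes, weights=None):
    """Collapse per-axis scores in [0, 1] into one ranking number.

    Illustrative only: a plain (optionally weighted) mean, NOT the
    actual NEO formula, which this report does not specify.
    """
    if weights is None:
        weights = {name: 1.0 for name in axes}
    total = sum(weights.values())
    return sum(axes[name] * weights[name] for name in axes) / total

# Opposite calibration profiles, identical composite:
a = composite({"honest_ignorance": 0.6, "false_conf_acc": 0.8})
b = composite({"honest_ignorance": 0.4, "false_conf_acc": 1.0})
```

Here `a` and `b` are equal even though the first model admits ignorance far more often, so the single number hides the difference the per-axis columns expose.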
| Model | NEO score | Honest ignorance | False certificate | False-conf acc | SimpleQA acc | Common-sense acc |
|---|---|---|---|---|---|---|
| DeepSeek V3.1 (DeepSeek, paid) | 0.377 † | 64% [44–83] | 28% [10–47] | 81% [67–94] | — | 88% [78–98] |
| Claude Sonnet 4.6 (Anthropic, paid) | 0.358 | 56% [36–77] | 8% [0–20] | 100% [100–100] | 35% [23–50] | 95% [88–100] |
| Qwen3 Max (Alibaba, paid) | 0.290 | 56% [37–75] | 36% [19–55] | 92% [85–100] | 60% [47–74] | 95% [88–100] |
| Grok 3 (xAI, paid) | 0.271 | 60% [39–81] | 40% [20–61] | 82% [70–92] | 51% [34–69] | 100% [100–100] |
| Llama 4 Maverick (Meta, paid) | 0.094 † | 28% [12–48] | 60% [40–79] | 82% [69–92] | — | 90% [79–97] |
| Gemini 2.5 Pro (Google, paid) | 0.051 | 16% [3–31] | 56% [36–76] | 72% [56–86] | 53% [39–67] | 100% [100–100] |
| GPT-5 Chat (OpenAI, paid) | 0.047 | 20% [4–37] | 68% [48–86] | 95% [88–100] | 43% [29–57] | 100% [100–100] |
| free / open-weight (per-bank only; full composite unavailable) | | | | | | |
| Gemma 3 27B (Google, free) | — | 0% | 0% | — | — | — |
| Llama 3.2 3B (Meta, free) | — | 0% | 8% | — | — | — |
| Llama 3.3 70B (Meta, free) | — | 8% | 4% | — | — | — |
| Hermes 3 Llama 405B (NousResearch, free) | — | 0% | 8% | — | — | — |
| Nemotron Nano 9B (NVIDIA, free) | — | 12% | 0% | — | — | — |
| GPT-OSS 120B (OpenAI, free) | — | 0% | 0% | — | — | — |
| Qwen3 Next 80B (Alibaba, free) | — | 12% | 0% | — | — | — |
| GLM 4.5 Air (Z.AI, free) | — | 24% | 0% | — | — | — |
| diagnostic baselines | | | | | | |
| Confident-Plausible (baseline) | 0.000 † | 0% | 100% | 0% | — | 0% |
| Uniform-Random (baseline) | 0.000 † | 0% | 68% | 0% | — | 0% |
| Always-IDK (baseline) | — | 100% | 0% | 0% | — | — |
Bracketed numbers are 95% bootstrap confidence intervals (1000 resamples, fixed seed). A lower honest-ignorance rate and a higher false-certificate rate are calibration failures even when accuracy is strong. Free models are scored on the same banks as paid models where council-graded data is available; missing slices show as "—". A "†" next to a NEO score marks partial coverage (the SimpleQA bank is missing for that model). Baselines are diagnostic, not ranked.
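The intervals above can be reproduced with a percentile bootstrap over the row-level 0/1 grades. This is a minimal sketch assuming that setup; the function name, percentile convention, and seed value are assumptions, not the evaluation pipeline's actual code.

```python
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a proportion (e.g., per-bank accuracy).

    `outcomes` is a list of 0/1 row-level grades. The RNG seed is fixed
    so the interval is reproducible across runs, as in the table above.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement, record the mean of each resample.
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi
```

With a 25-row bank graded 16/25 correct (64%), the returned interval brackets the point estimate, and calling the function twice with the same seed gives identical bounds.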
| Bank | Deliberation impact | Agreement lift |
|---|---|---|
| Hard-IDK | 27% | +0.03 |
| False-confidence | 28% | +0.03 |
| SimpleQA | 26% | +0.01 |
| Common-sense | 17% | +0.00 |
`deliberation_impact` is the fraction of rows where a council member changed verdict between round 1 (blind) and round 2 (anonymized peer reasoning visible). `agreement_lift` is how much consensus rose between the two rounds. A council whose deliberation impact is zero is a panel with theatrics; a council whose agreement lift is large without a robust round-1 majority is herding. Both are reported so a reader can tell the difference.
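The two definitions above can be sketched directly. This is an illustrative implementation under stated assumptions: verdicts are arbitrary hashable labels, consensus is the share of members voting with each row's modal verdict, and the function and argument names are mine, not the pipeline's actual API.

```python
def consensus(votes):
    """Fraction of council members voting with the row's modal verdict."""
    return max(votes.count(v) for v in set(votes)) / len(votes)

def deliberation_metrics(round1_votes, round2_votes):
    """Compute (deliberation_impact, agreement_lift) for one bank.

    Each argument is a list of rows; each row is a list of per-member
    verdicts (round 1 = blind, round 2 = after anonymized peer
    reasoning). Names and structure are illustrative assumptions.
    """
    n = len(round1_votes)
    # Impact: a row counts if ANY member changed verdict between rounds.
    changed = sum(
        1 for r1, r2 in zip(round1_votes, round2_votes)
        if any(a != b for a, b in zip(r1, r2))
    )
    # Lift: mean per-row rise in consensus from round 1 to round 2.
    lift = sum(
        consensus(r2) - consensus(r1)
        for r1, r2 in zip(round1_votes, round2_votes)
    ) / n
    return changed / n, lift
```

On a toy bank where one of two rows flips from a 3–2 split to unanimity, this yields an impact of 0.5 and a lift of 0.2, and the herding signature the text warns about would show up as a large lift on rows with weak round-1 majorities.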