Leaderboard — council-graded

Ranked by NEO score. Grading is by the v3.3 AI Council, a five-vendor deliberative grader with vendor self-exclusion (no model is graded by a judge from its own vendor).
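A minimal sketch of the self-exclusion rule (the roster, judge names, and data layout here are hypothetical illustrations, not the actual council implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grader:
    name: str
    vendor: str

# Hypothetical 5-vendor council roster; the real one is not specified here.
COUNCIL = [
    Grader("judge-a", "OpenAI"),
    Grader("judge-b", "Anthropic"),
    Grader("judge-c", "Google"),
    Grader("judge-d", "xAI"),
    Grader("judge-e", "Alibaba"),
]

def eligible_graders(model_vendor: str, council=COUNCIL):
    """Vendor self-exclusion: no judge grades its own vendor's model."""
    return [g for g in council if g.vendor != model_vendor]
```

With a five-vendor council, each model is therefore graded by the four judges whose vendors did not build it.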

Per-bank leaders

No model leads on every bank.

Hard-IDK (honest ignorance): DeepSeek V3.1 — HIR 64%
False-confidence: Claude Sonnet 4.6 — acc 100%
SimpleQA: Qwen3 Max — acc 60%
Common-sense: Grok 3 · Gemini 2.5 Pro · GPT-5 Chat — acc 100%

The aggregate composite compresses these per-axis differences into one number; the by-category view separates them. Council deliberation moved the verdict on 17–28% of rows per bank — measured, not assumed.
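As a toy illustration of that compression (an equal-weighted mean over two hypothetical axes, not the actual NEO formula):

```python
# Toy composite: weighted mean over named axes. NOT the NEO formula; it only
# illustrates how one number can hide opposite per-axis profiles.
def composite(scores, weights):
    return sum(scores[axis] * w for axis, w in weights.items())

weights = {"honest_ignorance": 0.5, "false_conf_acc": 0.5}
model_a = {"honest_ignorance": 0.60, "false_conf_acc": 0.80}  # answers well
model_b = {"honest_ignorance": 0.80, "false_conf_acc": 0.60}  # abstains well
# Opposite strengths, identical composite; only a by-category view separates them.
```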

| Model | Vendor | NEO score | Honest ignorance | False certificate | False-conf acc | SimpleQA acc | Common-sense acc |
|---|---|---|---|---|---|---|---|
| DeepSeek V3.1 | DeepSeek | 0.377 † | 64% [44–83] | 28% [10–47] | 81% [67–94] | — | 88% [78–98] |
| Claude Sonnet 4.6 | Anthropic | 0.358 | 56% [36–77] | 8% [0–20] | 100% [100–100] | 35% [23–50] | 95% [88–100] |
| Qwen3 Max | Alibaba | 0.290 | 56% [37–75] | 36% [19–55] | 92% [85–100] | 60% [47–74] | 95% [88–100] |
| Grok 3 | xAI | 0.271 | 60% [39–81] | 40% [20–61] | 82% [70–92] | 51% [34–69] | 100% [100–100] |
| Llama 4 Maverick | Meta | 0.094 † | 28% [12–48] | 60% [40–79] | 82% [69–92] | — | 90% [79–97] |
| Gemini 2.5 Pro | Google | 0.051 | 16% [3–31] | 56% [36–76] | 72% [56–86] | 53% [39–67] | 100% [100–100] |
| GPT-5 Chat | OpenAI | 0.047 | 20% [4–37] | 68% [48–86] | 95% [88–100] | 43% [29–57] | 100% [100–100] |
free / open-weight (per-bank only — full composite unavailable)

| Model | Vendor | Honest ignorance | False certificate |
|---|---|---|---|
| Gemma 3 27B | Google | 0% | 0% |
| Llama 3.2 3B | Meta | 0% | 8% |
| Llama 3.3 70B | Meta | 8% | 4% |
| Hermes 3 Llama 405B | NousResearch | 0% | 8% |
| Nemotron Nano 9B | NVIDIA | 12% | 0% |
| GPT-OSS 120B | OpenAI | 0% | 0% |
| Qwen3 Next 80B | Alibaba | 12% | 0% |
| GLM 4.5 Air | Z.AI | 24% | 0% |
diagnostic baselines

| Baseline | NEO score | Honest ignorance | False certificate | False-conf acc | SimpleQA acc | Common-sense acc |
|---|---|---|---|---|---|---|
| Confident-Plausible | 0.000 † | 0% | 100% | 0% | — | 0% |
| Uniform-Random | 0.000 † | 0% | 68% | 0% | — | 0% |
| Always-IDK | — | 100% | 0% | 0% | — | — |

Bracketed numbers are 95% bootstrap CIs (1000 resamples, fixed seed). Lower honest-ignorance and higher false-certificate are calibration failures even when accuracy is strong. Free models are scored on the same banks as paid models when council-graded data is available; missing slices show as "—". A "†" next to a NEO score marks partial coverage (SimpleQA bank missing for that model). Baselines are diagnostic, not ranked.
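The bracketed intervals can be reproduced with a percentile bootstrap along these lines (a sketch assuming 0/1 per-row verdicts; the resample count and fixed seed follow the text, everything else is illustrative):

```python
import random

def bootstrap_ci(verdicts, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of 0/1 verdicts.

    A fixed seed makes repeated runs produce identical intervals.
    """
    rng = random.Random(seed)
    n = len(verdicts)
    # Resample with replacement, compute each resample's mean, then sort.
    means = sorted(sum(rng.choices(verdicts, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```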

Council deliberation diagnostics

| Bank | Deliberation impact | Agreement lift |
|---|---|---|
| Hard-IDK | 27% | +0.03 |
| False-confidence | 28% | +0.03 |
| SimpleQA | 26% | +0.01 |
| Common-sense | 17% | +0.00 |

deliberation_impact is the fraction of rows where a council member changed its verdict between round 1 (blind) and round 2 (anonymized peer reasoning visible). agreement_lift is how much per-row consensus rose between the two rounds. A council whose deliberation_impact is zero is a panel with theatrics; a council whose agreement_lift is large without a robust round-1 majority is herding. Both are reported so a reader can tell the difference.
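Under the definitions above, both diagnostics can be computed from per-row verdict lists roughly as follows (a sketch; the verdict encoding and data layout are assumptions, not the council's actual pipeline):

```python
def deliberation_impact(round1, round2):
    """Fraction of rows where at least one member's verdict changed
    between round 1 (blind) and round 2 (peer reasoning visible).
    Each round is a list of rows; each row is a list of member verdicts."""
    changed = sum(r1 != r2 for r1, r2 in zip(round1, round2))
    return changed / len(round1)

def agreement_lift(round1, round2):
    """Mean rise in per-row consensus (majority share) from round 1 to 2."""
    def consensus(row):
        return max(row.count(v) for v in set(row)) / len(row)
    return sum(consensus(r2) - consensus(r1)
               for r1, r2 in zip(round1, round2)) / len(round1)
```

By construction, zero deliberation_impact forces zero agreement_lift, which is one way the two diagnostics cross-check each other.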