NEO reproducibility — tests, code, how to re-grade

Tilelli / NEO / Reproducibility

Reproducibility — 148 tests, the protocol, the code

An honesty benchmark has to be honest about its own pipeline. This page is the audit surface.

148
pytest passing
last run on the v3.3 council pipeline

test files
covering scoring, grader, banks, council, baselines

1,015

test items in the banks
+ raw runs + council deliberation traces

vendors on the council
vendor self-exclusion enforced in code

Run the tests yourself

Tests live at /workspace/NEO/tests/. They cover the scoring functions, the deliberative council, bootstrap CIs, bank parsing, the grader, and per-bank slicing. Pytest is not in the system Python on these machines — it is in the Mosaic virtualenv. The exact command from project_neo.md:

# run the full suite (≈ 8 seconds, no API calls)
/workspace/cool-projects/mosaic/venv/bin/python -m pytest /workspace/NEO/tests/ -q

# expected:
# ........................... 148 passed in 7.94s

What each test file covers

test_baselines.py — three diagnostic baselines (Always-IDK, Uniform-Random, Confident-Plausible). These are not ranked; they exist so a reader can see where the floor is.
test_bootstrap_ci.py — 1000-resample, fixed-seed bootstrap CIs. Every percentage on the leaderboard ships with a bracket.
test_build_council_leaderboard.py — end-to-end build of the council-graded JSON from per-row deliberation traces. The artifact every leaderboard page consumes.
test_calibration_parse.py — parsing the "confidence: 0.NN" + "answer:" two-line protocol that every probe enforces.
test_compute_signature.py — compute-signature-per-difficulty proxies (tokens/sec, billing).
test_council.py — two-round deliberation: round-1 blind verdicts, round-2 anonymized peer reasoning, ties resolve to incorrect, vendor self-exclusion enforced.
test_cross_bank_correlation.py — per-model behavior across banks. Universal over-confidence on P10 falls out of this.
test_grader.py — the LLM-grader scaffold (prompt, retries, schema validation). What a single judge does before the council wraps it.
test_mock_personas.py — synthetic personas that mimic Sonnet / Sycophant / Fabricator. They exist so the test suite never depends on a live API.
test_openrouter_client.py — OpenRouter call wrapper, timeouts, error handling.
test_p1_p7_banks.py — the P1–P7 probe banks (false-confidence, certificate-or-confess, stated-emotion incoherence, self-knowledge, sycophancy, pattern-vs-reason, self-contradiction).
test_pairwise_significance.py — pairwise model-vs-model significance under bootstrap. Used to flag which leaderboard rows are inside the noise.
test_runner.py — the run loop: read bank, ask model, parse, score, log.
test_scoring.py — the core composite: geomean(fc, sqa, cs) × HIR × (1 − FCR).
test_scoring_certificate.py — certificate-or-confess judge: accept-or-refuse, no partial credit for plausible-sounding wrong derivations.
test_scoring_false_confidence.py — false-confidence Brier and ECE.
test_scoring_paraphrase.py — paraphrase consistency across 5 rephrasings.
test_slice_by_category.py — per-bank slicing for the by-category leaderboard.
conftest.py — shared fixtures (deterministic seeds, mock OpenRouter, synthetic banks).
__init__.py — package marker.

Re-grade with your own council

If you do not trust our judges, swap them out. The council is configured by a YAML list — five members from disjoint vendors. The grader entrypoint is scripts/regrade_with_council.py. Estimated cost across the v3.2 banks at OpenRouter list prices: $4–8.

# expected layout:
NEO/
├── banks/                        # 1,015 items across 13 probes
├── runs/                         # raw model outputs (per-bank, per-model)
├── grading/
│   ├── council_round_1/          # blind verdicts + reasoning
│   ├── council_round_2/          # after seeing anonymized peers
│   └── council_leaderboard.json  # final aggregate
├── scripts/
│   ├── regrade_with_council.py
│   └── compute_signature.py
└── tests/                        # the 148 above

What gets disclosed on every run

deliberation_impact — fraction of rows where any council member changed verdict between round 1 (blind) and round 2 (anonymized peer reasoning visible). Bank-level numbers are on the leaderboard.
agreement_lift — how much council consensus rose between the two rounds.
Per-row reasoning — both rounds, every member, in the raw JSON. Not summarized away.
Vendor self-exclusion — when grading a model from vendor V, the council member from V is dropped from that row's council. Enforced at the dispatcher level, not as a post-hoc filter.
Bootstrap CIs — 1000 resamples, fixed seed (so the brackets are byte-identical across reruns).

License + status

Apache 2.0 over the code; CC BY 4.0 over the text and figures. Banks, code, council deliberation traces, and reliability diagrams will move to a public repository once the v3.3 release is ready. Today the local source-of-truth is /workspace/NEO/ on the build machine.