Tilelli / NEO / Reproducibility
An honesty benchmark has to be honest about its own pipeline. This page is the audit surface.
Tests live at /workspace/NEO/tests/. They cover the scoring functions, the deliberative council, bootstrap CIs, bank parsing, the grader, and per-bank slicing. Pytest is not in the system Python on these machines — it is in the Mosaic virtualenv. The exact command from project_neo.md:
# run the full suite (≈ 8 seconds, no API calls) /workspace/cool-projects/mosaic/venv/bin/python -m pytest /workspace/NEO/tests/ -q # expected: # ........................... 148 passed in 7.94s
test_baselines.py — three diagnostic baselines (Always-IDK, Uniform-Random, Confident-Plausible). These are not ranked; they exist so a reader can see where the floor is.test_bootstrap_ci.py — 1000-resample, fixed-seed bootstrap CIs. Every percentage on the leaderboard ships with a bracket.test_build_council_leaderboard.py — end-to-end build of the council-graded JSON from per-row deliberation traces. The artifact every leaderboard page consumes.test_calibration_parse.py — parsing the "confidence: 0.NN" + "answer:" two-line protocol that every probe enforces.test_compute_signature.py — compute-signature-per-difficulty proxies (tokens/sec, billing).test_council.py — two-round deliberation: round-1 blind verdicts, round-2 anonymized peer reasoning, ties resolve to incorrect, vendor self-exclusion enforced.test_cross_bank_correlation.py — per-model behavior across banks. Universal over-confidence on P10 falls out of this.test_grader.py — the LLM-grader scaffold (prompt, retries, schema validation). What a single judge does before the council wraps it.test_mock_personas.py — synthetic personas that mimic Sonnet / Sycophant / Fabricator. They exist so the test suite never depends on a live API.test_openrouter_client.py — OpenRouter call wrapper, timeouts, error handling.test_p1_p7_banks.py — the P1–P7 probe banks (false-confidence, certificate-or-confess, stated-emotion incoherence, self-knowledge, sycophancy, pattern-vs-reason, self-contradiction).test_pairwise_significance.py — pairwise model-vs-model significance under bootstrap. Used to flag which leaderboard rows are inside the noise.test_runner.py — the run loop: read bank, ask model, parse, score, log.test_scoring.py — the core composite: geomean(fc, sqa, cs) × HIR × (1 − FCR).test_scoring_certificate.py — certificate-or-confess judge: accept-or-refuse, no partial credit for plausible-sounding wrong derivations.test_scoring_false_confidence.py — false-confidence Brier and ECE.test_scoring_paraphrase.py — paraphrase consistency across 5 rephrasings.test_slice_by_category.py — per-bank slicing for the by-category leaderboard.conftest.py — shared fixtures (deterministic seeds, mock OpenRouter, synthetic banks).__init__.py — package marker.If you do not trust our judges, swap them out. The council is configured by a YAML list — five members from disjoint vendors. The grader entrypoint is scripts/regrade_with_council.py. Estimated cost across the v3.2 banks at OpenRouter list prices: $4–8.
# expected layout: NEO/ ├── banks/ # 1,015 items across 13 probes ├── runs/ # raw model outputs (per-bank, per-model) ├── grading/ │ ├── council_round_1/ # blind verdicts + reasoning │ ├── council_round_2/ # after seeing anonymized peers │ └── council_leaderboard.json # final aggregate ├── scripts/ │ ├── regrade_with_council.py │ └── compute_signature.py └── tests/ # the 148 above
Apache 2.0 over the code; CC BY 4.0 over the text and figures. Banks, code, council deliberation traces, and reliability diagrams will move to a public repository once the v3.3 release is ready. Today the local source-of-truth is /workspace/NEO/ on the build machine.