
Reproducibility — 148 tests, the protocol, the code

An honesty benchmark has to be honest about its own pipeline. This page is the audit surface.

148 pytest tests passing (last run on the v3.3 council pipeline)
20 test files (covering scoring, grader, banks, council, baselines)
1,015 test items in the banks (plus raw runs and council deliberation traces)
5 vendors on the council (vendor self-exclusion enforced in code)

Run the tests yourself

Tests live at /workspace/NEO/tests/. They cover the scoring functions, the deliberative council, bootstrap CIs, bank parsing, the grader, and per-bank slicing. pytest is not installed for the system Python on these machines; it is installed in the Mosaic virtualenv, so invoke it through that interpreter. The exact command from project_neo.md:

# run the full suite (≈ 8 seconds, no API calls)
/workspace/cool-projects/mosaic/venv/bin/python -m pytest /workspace/NEO/tests/ -q

# expected:
# ........................... 148 passed in 7.94s
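
To check the pass count from a script rather than by eye, the summary line can be parsed. The following is a minimal sketch, not part of the NEO codebase: the helper names passed_count and run_suite are hypothetical, and the paths are copied from the command above. It assumes pytest's usual quiet-mode summary of the form "148 passed in 7.94s".

```python
import re
import subprocess
from typing import Optional

# Paths copied from the documented command; adjust for your machine.
PYTHON = "/workspace/cool-projects/mosaic/venv/bin/python"
TESTS = "/workspace/NEO/tests/"

# Matches the pass count in pytest's final summary line,
# e.g. "148 passed in 7.94s" or "1 failed, 147 passed in 8.01s".
PASSED_RE = re.compile(r"(\d+) passed\b")


def passed_count(pytest_output: str) -> Optional[int]:
    """Return the pass count from the last summary-like line, or None if absent."""
    for line in reversed(pytest_output.splitlines()):
        match = PASSED_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def run_suite(expected: int = 148) -> bool:
    """Run the suite; True only if pytest exits 0 and reports `expected` passes."""
    result = subprocess.run(
        [PYTHON, "-m", "pytest", TESTS, "-q"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and passed_count(result.stdout) == expected
```

Calling run_suite() on the build machine should return True if the suite is in the state described above. Checking the exit code as well as the count guards against a run that reports 148 passes alongside errors.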

What each test file covers

Re-grade with your own council

If you do not trust our judges, swap them out. The council is configured as a YAML list of five members, each from a different vendor. The grader entrypoint is scripts/regrade_with_council.py. Estimated cost of re-grading the v3.2 banks at OpenRouter list prices: $4–8.
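
The two constraints on a replacement council (five members from disjoint vendors, and no judge grading a model from its own vendor) can be sketched as below. This is an illustration under assumptions, not the project's implementation: the YAML schema is not shown on this page, so the sketch uses plain Python objects, and Judge, validate_council, eligible_judges, and the vendor names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Judge:
    model: str   # hypothetical model identifier
    vendor: str  # vendor that ships the model


def validate_council(council: List[Judge], size: int = 5) -> None:
    """Raise ValueError unless the council has `size` judges from distinct vendors."""
    if len(council) != size:
        raise ValueError(f"expected {size} judges, got {len(council)}")
    vendors = [judge.vendor for judge in council]
    if len(set(vendors)) != len(vendors):
        raise ValueError("council vendors must be disjoint")


def eligible_judges(council: List[Judge], candidate_vendor: str) -> List[Judge]:
    """Self-exclusion: drop any judge that shares a vendor with the graded model."""
    return [judge for judge in council if judge.vendor != candidate_vendor]


# Placeholder council; these are not the project's real judges.
COUNCIL = [
    Judge(f"judge-{vendor}", vendor)
    for vendor in ("vendor-a", "vendor-b", "vendor-c", "vendor-d", "vendor-e")
]
```

With this placeholder council, a model from vendor-c would be graded by the four remaining judges, while a model from a vendor outside the council would be graded by all five.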

# expected layout:
NEO/
├── banks/                        # 1,015 items across 13 probes
├── runs/                         # raw model outputs (per-bank, per-model)
├── grading/
│   ├── council_round_1/          # blind verdicts + reasoning
│   ├── council_round_2/          # after seeing anonymized peers
│   └── council_leaderboard.json  # final aggregate
├── scripts/
│   ├── regrade_with_council.py
│   └── compute_signature.py
└── tests/                        # the 148 above
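
Before starting a re-grade, it may help to confirm that a checkout matches the layout above. The sketch below is a hypothetical helper, not a script shipped with NEO; the relative paths are taken from the tree, and missing_paths is an assumed name.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the layout listing.
EXPECTED = [
    "banks",
    "runs",
    "grading/council_round_1",
    "grading/council_round_2",
    "grading/council_leaderboard.json",
    "scripts/regrade_with_council.py",
    "scripts/compute_signature.py",
    "tests",
]


def missing_paths(root: str) -> List[str]:
    """Return the expected entries that do not exist under `root`, in listing order."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]
```

Calling missing_paths("/workspace/NEO") returns an empty list when every entry is present. The grading/ outputs are presumably written by a council run, so they may legitimately be absent before the first re-grade.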

What gets disclosed on every run

License + status

Apache 2.0 over the code; CC BY 4.0 over the text and figures. Banks, code, council deliberation traces, and reliability diagrams will move to a public repository once the v3.3 release is ready. Today the local source-of-truth is /workspace/NEO/ on the build machine.