Open-model evaluation lab

Public leaderboards answer "which model is best?" — but never "best for my use-case, on my hardware, with my latency budget." So I built the lab that answers it, and it runs every night.

Two halves

Best Open Models is the aggregation half: concurrent fetchers pull releases and results from Hugging Face leaderboards, GitHub, RSS, and arena sites into PostgreSQL — 8,351 distinct models, 246,567 benchmark rows across LLM, VLM, embedding, ASR, TTS, and image/video generation. On top: deployability-aware re-ranking (what actually fits in 24 GB VRAM), supersedence tracking, Pareto-frontier reports, and a daily narrative digest written by an extraction model. Incremental by content-hash, so a nightly run that finds nothing new costs almost nothing.
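The deployability filter and the Pareto-frontier report can be sketched together. This is a minimal illustration, not the lab's actual code: the Candidate type, the model names, and every score and VRAM figure below are hypothetical, and it assumes a single quality score per model and a single VRAM estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float    # higher is better (hypothetical aggregate score)
    vram_gb: float  # estimated VRAM needed to serve the model


def deployable(cands, budget_gb=24.0):
    """Keep only candidates whose estimated VRAM fits the budget."""
    return [c for c in cands if c.vram_gb <= budget_gb]


def pareto_frontier(cands):
    """Keep candidates that no other candidate beats on both score and VRAM."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on VRAM ties, best score first.
    for c in sorted(cands, key=lambda c: (c.vram_gb, -c.score)):
        if c.score > best_score:  # strictly better than everything cheaper
            frontier.append(c)
            best_score = c.score
    return frontier


# Hypothetical catalogue, for illustration only.
catalogue = [
    Candidate("tiny-3b", 52.0, 7.0),
    Candidate("mid-14b-q4", 68.0, 10.5),
    Candidate("old-13b", 55.0, 12.0),
    Candidate("big-32b-q4", 74.0, 21.0),
    Candidate("mid-14b-fp16", 69.0, 30.0),
    Candidate("huge-70b-q4", 80.0, 42.0),
]

frontier = pareto_frontier(deployable(catalogue))
frontier_names = [c.name for c in frontier]
print(frontier_names)
```

Filtering before computing the frontier matters: a model that needs 42 GB would otherwise sit on the frontier without being runnable on a 24 GB card.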

The benchmark workspace is the measurement half: six homegrown text benchmarks (theory-of-mind, procedural-error-finding, instruction following, claim provenance, code vulnerabilities, retrieval) plus a fast screening gate, run locally against every interesting open-weights release on one RTX 3090.

The part I'm proudest of: catching my own benchmark lying

An audit found my screening gate's headline correlation (ρ = 0.935 with the full suite) was in-sample fiction — items had been selected on the same models they were scored on. Honest held-out measurement: ρ = 0.607. I rebuilt the grading layer rather than the narrative:

Scorer drift eliminated by construction: the gate now imports the canonical verifiers from each source benchmark instead of carrying drifted copies.
Content-first scoring for thinking models: 11.6% of all rows had been scored off the reasoning channel; correct answers were failing strict verifiers because chain-of-thought lines counted as list items. Fixing this resurrected reasoning models that were being buried (one moved from rank 11 to rank 2).
An LLM judge with receipts: string-matching can't grade free-text theory-of-mind answers ("still in the house" vs "in her grandmother's house" is the same answer). A local Gemma-4-31B judge now grades equivalence, calibrated against string-match where both apply, with disagreements queued for review and scoring_method persisted so judge and string scores never silently mix. 13,662 judged answers and counting.
Item Response Theory for the gate: fitting a 2PL IRT model over the response matrix (44 models × 377 items) and selecting items by test information, the tinyBenchmarks way. Validated with model-split cross-validation only. Held-out ρ for the IRT-selected gate: 0.968, against 0.607 for the heuristic it replaces.
Contamination canaries: vendored public items (IFEval, ProcessBench) are paired with held-out homegrown complements and verbatim-vs-paraphrase canary pairs, so training-set leakage shows up as a measurable gap instead of silent score inflation.
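To make "selecting items by test information" concrete, here is a small sketch of 2PL item information and a greedy top-k pick. It assumes the discrimination (a) and difficulty (b) parameters have already been estimated; the fitting step is not shown. The item ids, parameter values, and ability grid are hypothetical, and summing information over a fixed grid is one simple choice that may differ from what the lab or tinyBenchmarks actually does.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_items(items, thetas, k):
    """Pick the k items with the most information summed over an ability grid."""
    def total_information(item):
        _, a, b = item
        return sum(item_information(t, a, b) for t in thetas)

    return sorted(items, key=total_information, reverse=True)[:k]


# Hypothetical (item_id, discrimination a, difficulty b) triples.
items = [
    ("q1", 0.4, 0.0),   # weakly discriminating
    ("q2", 1.8, 0.2),
    ("q3", 1.5, -0.5),
    ("q4", 1.2, 3.5),   # too hard for most of the ability range
    ("q5", 2.0, 1.0),
]
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]  # assumed ability grid

selected = select_items(items, thetas, k=3)
selected_ids = [item_id for item_id, _, _ in selected]
print(selected_ids)
```

For the held-out number to stay honest, the parameters and the selection would both have to come from the training split of models only, with ρ measured on models the selection never saw.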

Why it matters

Evaluation is the load-bearing wall of applied AI, and most of it is done badly. This lab is a working demonstration of doing it honestly on consumer hardware: psychometric item selection, judge calibration, channel-aware scoring, contamination policy — with every correction journaled and re-scorable in place from stored raw responses.

Status & limits: the 35-item gate's split-half reliability (~0.78) is capped by its item count, which works out to about ±17 points of uncertainty on each model's score. The IRT-selected replacement lands after the current judge backfill and GPU validation runs complete. Confidence intervals on per-use-case rankings are next on the roadmap.
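Why reliability is an item-count limit can be illustrated with the Spearman-Brown prediction formula, which projects reliability when a test is lengthened with comparable items. This is a sketch under assumptions: it treats ~0.78 as the reliability of the 35-item gate as a whole (the text does not say whether the figure is already corrected), and it assumes any added items behave like the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after scaling test length by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


current = 0.78  # assumed reliability of the 35-item gate
doubled = spearman_brown(current, 2.0)  # projection for a 70-item gate
print(round(doubled, 3))  # 0.876
```

Under those assumptions, doubling the gate would lift reliability from about 0.78 to about 0.88. Selecting more informative items is the other lever, since it raises reliability without adding length.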

Stack