Open-model evaluation lab
A self-hosted answer to "what's the best open model for X?" — 8,351 models tracked, benchmarked across modalities, with psychometric test design and an LLM judge I actually calibrated.
Public leaderboards answer "which model is best?" — but never "best for my use-case, on my hardware, with my latency budget." So I built the lab that answers it, and it runs every night.
Two halves
Best Open Models is the aggregation half: concurrent fetchers pull releases and results from Hugging Face leaderboards, GitHub, RSS, and arena sites into PostgreSQL — 8,351 distinct models, 246,567 benchmark rows across LLM, VLM, embedding, ASR, TTS, and image/video generation. On top: deployability-aware re-ranking (what actually fits in 24 GB VRAM), supersedence tracking, Pareto-frontier reports, and a daily narrative digest written by an extraction model. Incremental by content-hash, so a nightly run that finds nothing new costs almost nothing.
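The deployability-aware re-ranking and Pareto-frontier reports can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `Candidate` fields, the helper names, the choice of score versus latency as the two axes, and the made-up model entries. It is not the lab's actual schema or ranking code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; field names are illustrative, not the lab's schema.
    name: str
    score: float       # benchmark score, higher is better
    vram_gb: float     # estimated memory footprint at the chosen quantization
    latency_ms: float  # measured or estimated latency, lower is better


def deployable(cands, vram_budget_gb=24.0):
    """Keep only candidates that fit within the VRAM budget."""
    return [c for c in cands if c.vram_gb <= vram_budget_gb]


def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.latency_ms <= b.latency_ms
    strictly_better = a.score > b.score or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(cands):
    """Return the non-dominated candidates, ordered by latency."""
    frontier = [c for c in cands if not any(dominates(o, c) for o in cands)]
    return sorted(frontier, key=lambda c: c.latency_ms)


# Made-up entries, for illustration only.
candidates = [
    Candidate("model-a", 71.0, 18.0, 900.0),
    Candidate("model-b", 64.0, 9.0, 300.0),
    Candidate("model-c", 62.0, 12.0, 450.0),   # dominated by model-b
    Candidate("model-d", 80.0, 48.0, 1500.0),  # does not fit in 24 GB
]
frontier = pareto_frontier(deployable(candidates))
frontier_names = [c.name for c in frontier]
```

Filtering before computing the frontier matters: a model that does not fit the budget would otherwise dominate, and hide, the ones that do.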
The benchmark workspace is the measurement half: six homegrown text benchmarks (theory-of-mind, procedural-error-finding, instruction following, claim provenance, code vulnerabilities, retrieval) plus a fast screening gate, run locally against every interesting open-weights release on one RTX 3090.
The part I'm proudest of: catching my own benchmark lying
An audit found my screening gate's headline correlation (ρ = 0.935 with the full suite) was in-sample fiction — items had been selected on the same models they were scored on. Honest held-out measurement: ρ = 0.607. I rebuilt the grading layer rather than the narrative:
- Scorer drift eliminated by construction — the gate now imports the canonical verifiers from each source benchmark instead of carrying drifted copies.
- Content-first scoring for thinking models — 11.6% of all rows had been scored off the reasoning channel; correct answers were failing strict verifiers because chain-of-thought lines counted as list items. Fixing this resurrected reasoning models that were being buried (one moved from rank 11 to rank 2).
- An LLM judge with receipts — string-matching can't grade free-text theory-of-mind answers ("still in the house" vs "in her grandmother's house" is the same answer). A local Gemma-4-31B judge now grades equivalence, calibrated against string-match where both apply, with disagreements queued for review and scoring_method persisted so judge and string scores never silently mix. 13,662 judged answers and counting.
- Item Response Theory for the gate — fitting a 2PL IRT model over the response matrix (44 models × 377 items) and selecting items by test information, the tinyBenchmarks way. Validated with model-split cross-validation only. Held-out ρ for the IRT-selected gate: 0.968, against 0.607 for the heuristic it replaces.
- Contamination canaries — vendored public items (IFEval, ProcessBench) are paired with held-out homegrown complements and verbatim-vs-paraphrase canary pairs, so training-set leakage shows up as a measurable gap instead of silent score inflation.
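For the Item Response Theory step in the list above, a rough sketch of what "selecting items by test information" can mean under a 2PL model. It assumes the discrimination `a` and difficulty `b` of each item have already been fitted (fitting is omitted), and it uses a simple top-k rule over a grid of abilities. The function names, the grid, and the item parameters are made up for illustration; the lab's actual selection procedure may differ.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that a model of ability theta gets the item right."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_items(items, k, theta_grid):
    """Pick the k items with the most information summed over theta_grid.

    items maps item_id -> (a, b). Illustrative rule, not the lab's exact one.
    """
    def total_info(item_id):
        a, b = items[item_id]
        return sum(item_information(t, a, b) for t in theta_grid)

    return sorted(items, key=total_info, reverse=True)[:k]


# Made-up item parameters, for illustration only.
items = {
    "q1": (2.0, 0.0),  # sharp, centred on the ability range
    "q2": (0.3, 0.0),  # barely discriminates
    "q3": (1.5, 0.5),  # good, slightly harder
    "q4": (1.5, 6.0),  # far too hard for every ability on the grid
}
theta_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
gate = select_items(items, 2, theta_grid)
```

An item contributes most where its difficulty sits near the abilities being compared and its discrimination is high, which is why the weakly discriminating and out-of-range items drop out. To keep the held-out correlation honest, the selection would be run on one split of models and scored on another, as the model-split cross-validation in the list describes.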
Why it matters
Evaluation is the load-bearing wall of applied AI, and most of it is done badly. This lab is a working demonstration of doing it honestly on consumer hardware: psychometric item selection, judge calibration, channel-aware scoring, contamination policy — with every correction journaled and re-scorable in place from stored raw responses.
Status & limits — the 35-item gate's split-half reliability (~0.78) is capped by its item count, which leaves ±17 points of uncertainty on each model's score; the IRT-selected replacement lands after the current judge backfill and GPU validation runs complete. Confidence intervals on per-use-case rankings are next on the roadmap.
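One way to see why 35 items implies roughly ±17 points: treat each item as an independent pass/fail trial and take the worst-case (p = 0.5) normal-approximation 95% interval. This is an assumption about where the figure comes from, not something stated above, and the helper names are hypothetical.

```python
import math


def worst_case_half_width(n_items, z=1.96):
    """Half-width of a normal-approximation interval for an accuracy measured
    on n_items independent pass/fail items, at the worst case p = 0.5."""
    return z * math.sqrt(0.25 / n_items)


def items_needed(target_half_width, z=1.96):
    """Smallest item count whose worst-case half-width is at or below the target."""
    return math.ceil(0.25 * (z / target_half_width) ** 2)


# Assumption: items are independent and scored pass/fail.
gate_points = 100 * worst_case_half_width(35)  # about 16.6 points
needed_for_10 = items_needed(0.10)             # items for a +/-10 point worst case
```

Under these assumptions the half-width shrinks only with the square root of the item count, so tightening the interval by adding items is expensive. That is the motivation for choosing more informative items rather than simply more of them.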