Speech-enhancement benchmark harness

Dashcam audio is wind, traffic, and a voice you need, and there is no clean reference recording. That silently invalidates the standard metrics: PESQ, STOI, and SI-SDR all require one. Most people compute them anyway. This harness doesn't.

Instead, every candidate model receives byte-identical prepared input and identical loudness-normalized post-processing. Quality is judged by reference-free signals: spectral inspection, side-by-side listening, and, sharpest of the three, an ASR-hallucination check that runs Whisper on the enhanced output and flags content that wasn't in the original. An enhancer that invents intelligible speech is worse than noise, and this catches it.
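
A minimal sketch of the hallucination check, under one stated assumption: "content that wasn't in the original" is approximated by transcribing both the original noisy clip and the enhanced clip, then flagging runs of words that appear only in the enhanced transcript. The function names and the `min_words` threshold are illustrative, not taken from the harness. Transcription is left outside the block so the comparison can be tested on plain strings.

```python
import difflib
import re


def normalize(text):
    """Lowercase and split into word tokens so punctuation and casing
    differences between two ASR passes do not count as new content."""
    return re.findall(r"[a-z0-9']+", text.lower())


def inserted_phrases(original_text, enhanced_text, min_words=2):
    """Return word runs present in the enhanced transcript but not in the
    original one. Runs shorter than min_words are ignored as ordinary ASR
    jitter (hypothetical threshold, tune per dataset)."""
    a = normalize(original_text)
    b = normalize(enhanced_text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    flagged = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean the enhanced side has words
        # that the original side does not have at this position.
        if tag in ("insert", "replace") and j2 - j1 >= min_words:
            flagged.append(" ".join(b[j1:j2]))
    return flagged


if __name__ == "__main__":
    noisy = "turn left at the light"
    enhanced = "turn left at the light thank you for watching"
    print(inserted_phrases(noisy, enhanced))
```

With the openai-whisper package, the two input strings could come from `whisper.load_model("base").transcribe(path)["text"]`; the model size is an arbitrary choice here. A flagged phrase is a prompt for a human to listen, not a verdict, since the original transcript of a very noisy clip may itself be incomplete.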

The unglamorous part that makes it reproducible: each model lives in its own isolated virtualenv (four mutually incompatible PyTorch stacks, orchestrated by subprocess), runs are resumable and cached, and output is both machine-readable JSON and an HTML report with embedded players and spectrograms — engineers get data, decisions get made by ear and eye.
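
The isolation and caching described above can be sketched roughly as follows. This is an illustration under assumptions, not the harness's actual code: the per-model runner script, its `--input`/`--output` flags, the cache file naming, and the POSIX `bin/python` virtualenv layout are all hypothetical.

```python
import hashlib
import subprocess
from pathlib import Path


def cache_key(model, input_path):
    """Hash the model name together with the exact input bytes, so a cached
    result is reused only for byte-identical prepared input."""
    digest = hashlib.sha256()
    digest.update(model.encode("utf-8"))
    digest.update(b"\0")
    digest.update(Path(input_path).read_bytes())
    return digest.hexdigest()[:16]


def build_command(venv_dir, script, src, dst):
    """Invoke the model's runner with the interpreter from its own virtualenv,
    so incompatible PyTorch stacks never share a process.
    Assumes a POSIX layout; Windows uses Scripts/python.exe instead."""
    python = Path(venv_dir) / "bin" / "python"
    return [str(python), str(script), "--input", str(src), "--output", str(dst)]


def enhance(model, venv_dir, script, src, cache_dir, run=subprocess.run):
    """Run one model on one clip, skipping the work if a cached output exists.
    `run` is injectable so the flow can be exercised without real models."""
    cache_dir = Path(cache_dir)
    dst = cache_dir / f"{model}-{cache_key(model, src)}.wav"
    if dst.exists():
        return dst  # resumable: finished work is never repeated
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted run cannot leave
    # a truncated file that later looks like a valid cache hit.
    tmp = dst.with_suffix(".partial")
    run(build_command(venv_dir, script, src, tmp), check=True)
    tmp.replace(dst)
    return dst
```

Keying the cache on content rather than on file name or modification time is a design choice in this sketch: it ties "resumable" to the same byte-identical guarantee the comparison depends on.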

Status & limits — a tool with a verdict (MossFormer2 and DeepFilterNet led for this domain), not a service. Shown because measurement design under missing ground truth is the transferable skill.

Stack