Normalization rules turned out to be a major source of these score discrepancies, so we decided to release a fully reproducible evaluation framework. You can test it yourself with our full repo.
It includes:
- The normalization rules we use
- Scoring scripts
- Dataset coverage (conversational, noisy, multilingual)
- The full eval pipeline
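To illustrate why normalization matters so much for scoring: two transcripts that differ only in casing or punctuation should not be penalized by WER. The exact rules live in the repo; this is just a minimal sketch, with a hypothetical `normalize` step (lowercasing, punctuation stripping, whitespace collapsing) applied before a standard edit-distance WER:

```python
import re

def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase, drop punctuation, collapse whitespace.
    # Real pipelines also handle numbers, abbreviations, diacritics, etc.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level Levenshtein distance / reference length.
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Without the `normalize` step, "Hello, world!" vs. "hello world" would score two substitution errors; with it, the WER is 0.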
We also published a detailed comparison using this framework across 8 leading STT providers, 7 datasets, and 74 hours of audio. You can see it here: https://www.gladia.io/competitors/benchmarks
Feedback welcome!