Normalization rules turned out to be a major source of these score discrepancies, so we decided to release a fully reproducible evaluation framework. You can test it yourself with our full repo.
It includes:
- The normalization rules we use
- Scoring scripts
- Dataset coverage (conversational, noisy, multilingual)
- The full eval pipeline
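To illustrate why normalization matters so much for scoring: two transcripts that differ only in casing or punctuation should not be penalized by WER. The exact rules live in the repo; this is just a minimal sketch, with a hypothetical `normalize` step (lowercasing, punctuation stripping, whitespace collapsing) applied before a standard edit-distance WER:

```python
import re

def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase, drop punctuation, collapse whitespace.
    # Real pipelines also handle numbers, abbreviations, diacritics, etc.
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level Levenshtein distance / reference length.
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Without the `normalize` step, "Hello, world!" vs. "hello world" would score two substitution errors; with it, the WER is 0.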
We also published a detailed comparison using this framework across 8 leading STT providers, 7 datasets, and 74 hours of audio. You can see it here: https://www.gladia.io/competitors/benchmarks
Feedback welcome!