[2602.12249] "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

arXiv - AI · 4 min read

Summary

This paper examines why speech recognition models fail on short, high-stakes utterances, specifically U.S. street names, finding a 44% average transcription error rate across 15 commercial models and proposing a synthetic data approach that substantially improves accuracy.

Why It Matters

The findings expose a critical gap between the performance of speech recognition systems on standard benchmarks and their reliability in real-world deployments. This is particularly relevant for accessibility and safety in navigation systems, especially for speakers whose primary language is not English, who bear disproportionately large routing errors.

Key Takeaways

  • Speech models average a 44% transcription error rate on U.S. street names.
  • Routing distance errors are twice as large for non-English primary speakers as for English primary speakers.
  • Fine-tuning with fewer than 1,000 synthetic samples, generated with open-source text-to-speech models, improves street name transcription accuracy by nearly 60%.
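The 44% figure above is an utterance-level error rate: the fraction of street-name recordings whose transcript fails to match the reference. A minimal sketch of how such a rate can be computed; the normalization rules and the example pairs below are illustrative assumptions, not the paper's evaluation code:

```python
def normalize(s: str) -> list[str]:
    # Lowercase, drop punctuation, split into words.
    return "".join(c for c in s.lower() if c.isalnum() or c.isspace()).split()

def utterance_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (reference, hypothesis) pairs whose normalized
    transcripts do not match exactly: an utterance-level error rate."""
    errors = sum(normalize(ref) != normalize(hyp) for ref, hyp in pairs)
    return errors / len(pairs)

# Invented example pairs, in the spirit of hard-to-transcribe street names.
pairs = [
    ("Via Rodeo", "Via Rodeo"),                   # correct
    ("Kosciuszko Street", "Costume Go Street"),   # mis-transcription
    ("Guadalupe Drive", "Guadalupe Drive"),       # correct
    ("Schoenherr Road", "Shown Her Road"),        # mis-transcription
]
print(utterance_error_rate(pairs))  # 0.5
```

Note that this exact-match criterion is stricter than word error rate, which is why a system with a low benchmark WER can still fail on nearly half of these short utterances.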

Computer Science > Artificial Intelligence
arXiv:2602.12249 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 16 Feb 2026 (this version, v2)]

Title: "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

Abstract: Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (…)
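The abstract describes the mitigation, generating diverse pronunciations of named entities with open-source text-to-speech models, only at a high level. Below is a hedged sketch of the text side of such a pipeline, where each street name is expanded into several prompt variants before being handed to a TTS model; the expansion table and the commented-out `synthesize` call are assumptions for illustration, not the paper's implementation:

```python
# Sketch of the prompt-variant step of a synthetic-data pipeline.
# The expansion table and the TTS call are illustrative assumptions;
# the paper itself uses open-source text-to-speech models.

EXPANSIONS = {"St": "Street", "Rd": "Road", "Dr": "Drive", "Ave": "Avenue"}

def tts_prompts(street_name: str) -> list[str]:
    """Return textual variants of a street name so that synthesized
    audio covers both abbreviated and spelled-out readings."""
    words = street_name.split()
    prompts = {street_name}
    if words and words[-1] in EXPANSIONS:
        prompts.add(" ".join(words[:-1] + [EXPANSIONS[words[-1]]]))
    return sorted(prompts)

for prompt in tts_prompts("Kosciuszko St"):
    # audio = tts.synthesize(prompt)            # hypothetical TTS call
    # dataset.append((audio, "Kosciuszko St"))  # audio paired with reference text
    print(prompt)
```

Each synthesized clip would be paired with the canonical street name as its reference transcript, yielding the fine-tuning pairs the abstract reports needing fewer than 1,000 of.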

Related Articles

Machine Learning

[R] Architecture Determines Optimization: Deriving Weight Updates from Network Topology (seeking arXiv endorsement - cs.LG)

Abstract: We derive neural network weight updates from first principles without assuming gradient descent or a specific loss function. St...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] ML project (XGBoost + Databricks + MLflow) — how to talk about “production issues” in interviews?

Hey all, I recently built an end-to-end fraud detection project using a large banking dataset: Trained an XGBoost model Used Databricks f...

Reddit - Machine Learning · 1 min ·
Machine Learning

[D] The memory chip market lost tens of billions over a paper this community would have understood in 10 minutes

TurboQuant was teased recently, and tens of billions were wiped from the memory chip market within 48 hours, but anyone in this community who read the pa...

Reddit - Machine Learning · 1 min ·
Machine Learning

Copilot is ‘for entertainment purposes only,’ according to Microsoft’s terms of use | TechCrunch

AI skeptics aren’t the only ones warning users not to unthinkingly trust models’ outputs — that’s what the AI companies say themselves in...

TechCrunch - AI · 3 min ·
