[2604.08562] Neural networks for Text-to-Speech evaluation
Computer Science > Computation and Language
arXiv:2604.08562 (cs)
[Submitted on 17 Mar 2026]
Title: Neural networks for Text-to-Speech evaluation
Authors: Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin, Nikita Shevtsov
Abstract: Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy on the SOMOS dataset. For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that na...
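The two headline metrics in the abstract, RMSE for absolute (MOS) prediction and accuracy for relative (SBS) prediction, can be illustrated with a minimal sketch. The function names `rmse` and `sbs_accuracy` and all numbers below are hypothetical toy values chosen for illustration; they are not taken from the paper, and the paper's exact evaluation protocol (for example, how the human inter-rater baseline is computed) may differ.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between predicted and reference MOS values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def sbs_accuracy(score_a, score_b, human_prefers_a):
    """Fraction of pairs where the higher-scored sample matches the human choice.

    Assumption: the model emits one scalar score per sample and is taken
    to prefer sample A whenever score_a > score_b.
    """
    if not (len(score_a) == len(score_b) == len(human_prefers_a)) or not score_a:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        (a > b) == prefers_a
        for a, b, prefers_a in zip(score_a, score_b, human_prefers_a)
    )
    return hits / len(human_prefers_a)


# Toy MOS example (hypothetical values on a 1-5 scale).
predicted_mos = [3.0, 4.0, 2.0, 5.0]
reference_mos = [3.5, 3.5, 2.5, 4.5]
mos_error = rmse(predicted_mos, reference_mos)

# Toy SBS example (hypothetical scores and human preferences).
model_score_a = [0.9, 0.2, 0.6, 0.4]
model_score_b = [0.1, 0.8, 0.7, 0.3]
human_prefers_a = [True, False, True, True]
pair_accuracy = sbs_accuracy(model_score_a, model_score_b, human_prefers_a)

print(f"RMSE: {mos_error:.2f}, SBS accuracy: {pair_accuracy:.2%}")
```

Under this reading, a lower RMSE means predictions sit closer to the reference MOS, which is the sense in which ~0.40 is compared against the human inter-rater figure of 0.62.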