[2604.08562] Neural networks for Text-to-Speech evaluation

[2604.08562] Neural networks for Text-to-Speech evaluation

arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2604.08562: Neural networks for Text-to-Speech evaluation

Computer Science > Computation and Language arXiv:2604.08562 (cs) [Submitted on 17 Mar 2026] Title:Neural networks for Text-to-Speech evaluation Authors:Ilya Trofimenko, David Kocharyan, Aleksandr Zaitsev, Pavel Repnikov, Mark Levin, Nikita Shevtsov View a PDF of the paper titled Neural networks for Text-to-Speech evaluation, by Ilya Trofimenko and 5 other authors View PDF HTML (experimental) Abstract:Ensuring that Text-to-Speech (TTS) systems deliver human-perceived quality at scale is a central challenge for modern speech technologies. Human subjective evaluation protocols such as Mean Opinion Score (MOS) and Side-by-Side (SBS) comparisons remain the de facto gold standards, yet they are expensive, slow, and sensitive to pervasive assessor biases. This study addresses these barriers by formulating, and implementing a suite of novel neural models designed to approximate expert judgments in both relative (SBS) and absolute (MOS) settings. For relative assessment, we propose NeuralSBS, a HuBERT-backed model achieving 73.7% accuracy (on SOMOS dataset). For absolute assessment, we introduce enhancements to MOSNet using custom sequence-length batching, as well as WhisperBert, a multimodal stacking ensemble that combines Whisper audio features and BERT textual embeddings via weak learners. Our best MOS models achieve a Root Mean Square Error (RMSE) of ~0.40, significantly outperforming the human inter-rater RMSE baseline of 0.62. Furthermore, our ablation studies reveal that na...

Originally published on April 13, 2026. Curated by AI News.

Related Articles

Llms

Transformer Math Explorer [P]

This is an interactive math reference for transformer models, presented via dataflow graphs, all the way down to elementary math. Covers ...

Reddit - Machine Learning · 1 min ·
Machine Learning

how much of your time goes into environment setup vs actual model work?

For most people I've talked to, it's embarrassingly high. New machine? Set up CUDA again. New team member? Good luck for reproducing the ...

Reddit - ML Jobs · 1 min ·
Machine Learning

How much can a video generated by the same diffusion model differ across GPU architectures if the initial noise latent is fixed? [D]

Hi! I am trying to sanity-check an assumption for diffusion video generation reproducibility. Suppose I run the same video diffusion mode...

Reddit - Machine Learning · 1 min ·
Llms

I am not an "anti" like this guy, but still an interesting video of person interacting with chat 4o

(Posting Here because removed by Chatgpt Complaints moderators because the model here is 4o, and refuse to believe there were any safety ...

Reddit - Artificial Intelligence · 1 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime