[2602.14785] SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

arXiv - Machine Learning, February 17, 2026

Summary

The paper proposes SA-SSL-MOS, a speech quality assessment method that pairs a self-supervised learning (SSL) model with a spectrogram branch to estimate mean opinion scores (MOS) for multi-rate speech sampled at 16-48 kHz.

Why It Matters

SSL models used for speech quality assessment are pretrained on 16 kHz speech, so they discard the high-frequency information present in audio recorded at higher sampling rates. This work targets that limitation. More accurate MOS predictions for multi-rate speech could benefit applications in telecommunications and audio processing.
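To make the high-frequency limitation concrete: a signal sampled at rate fs can only represent frequencies up to fs/2 (the Nyquist limit), so a model that takes 16 kHz input sees nothing above 8 kHz. The snippet below is a minimal sketch of that arithmetic. The function names are hypothetical and are not taken from the paper.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def discarded_band_hz(source_rate_hz: float, model_rate_hz: float = 16_000.0):
    """Return the (low, high) band lost when audio is resampled to model_rate_hz.

    Returns None when the source rate does not exceed the model rate,
    because nothing is lost in that case.
    """
    if source_rate_hz <= model_rate_hz:
        return None
    return nyquist_hz(model_rate_hz), nyquist_hz(source_rate_hz)


# Sampling rates spanning the 16-48 kHz range mentioned in the abstract.
for rate in (16_000, 32_000, 48_000):
    print(rate, discarded_band_hz(rate))
```

For 48 kHz input, the band between 8 kHz and 24 kHz never reaches a 16 kHz SSL encoder, which is the gap a spectrogram branch is meant to cover.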

Key Takeaways

Introduces a spectrogram-augmented self-supervised learning method.
Addresses the challenge of limited MOS-labeled datasets for multi-rate speech.
Aims to improve generalization with a two-step training scheme: pre-training on a large 48 kHz dataset, then fine-tuning on a smaller multi-rate dataset.
Highlights the importance of high-frequency information in speech assessment.
Experimental results indicate significant performance enhancements.

Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.14785 (eess)
[Submitted on 16 Feb 2026]

Title: SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Authors: Fengyuan Cao, Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee

Abstract: Designing a speech quality assessment (SQA) system for estimating mean-opinion-score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises due to the limited availability of a MOS-labeled training dataset comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present in higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. ...
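The abstract does not spell out how the two branches are combined, so the following is only an illustrative sketch of one possible parallel-branch design, not the authors' architecture. It assumes late fusion by concatenation, a time-averaged log-magnitude spectrogram as the high-frequency feature, a random vector standing in for the SSL encoder output, and an untrained linear head. All names, dimensions, and parameters are hypothetical.

```python
import numpy as np


def log_spectrogram(wave, n_fft=1024, hop=512):
    """Log-magnitude STFT of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def fuse_and_score(ssl_embedding, wave_48k, weights, bias):
    """Concatenate both branches and apply a linear head (assumed fusion)."""
    # Spectrogram branch: average over time to get one vector per utterance.
    spec_embedding = log_spectrogram(wave_48k).mean(axis=0)
    fused = np.concatenate([ssl_embedding, spec_embedding])
    return float(fused @ weights + bias)


rng = np.random.default_rng(0)
wave_48k = rng.standard_normal(48_000)       # one second of noise at 48 kHz
ssl_embedding = rng.standard_normal(768)     # stand-in for an SSL encoder output
weights = rng.standard_normal(768 + 513) * 0.01  # untrained linear head
score = fuse_and_score(ssl_embedding, wave_48k, weights, bias=3.0)
print(log_spectrogram(wave_48k).shape, type(score).__name__)
```

With a 1024-point FFT at 48 kHz, the 513 frequency bins span 0 to 24 kHz, so the spectrogram branch carries the content above 8 kHz that a 16 kHz SSL encoder cannot see. In a real system both branches and the head would be trained, for example with the two-step scheme the abstract describes.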


