[2602.14785] SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment
Summary
The paper proposes SA-SSL-MOS, a speech quality assessment method that augments a self-supervised learning (SSL) model with spectrogram features to estimate mean opinion scores (MOS) for multi-rate speech sampled at 16-48 kHz.
Why It Matters
SSL models used for speech quality assessment are pretrained on 16 kHz speech, so they discard the high-frequency information present at higher sampling rates. This work addresses that limitation, and more accurate MOS prediction across sampling rates could benefit applications in telecommunications and audio processing.
Key Takeaways
- Introduces a spectrogram-augmented self-supervised learning method built on a parallel-branch architecture.
- Addresses the challenge of limited MOS-labeled datasets for multi-rate speech.
- Improves generalization with a two-step training scheme: pre-training on a large 48 kHz dataset, then fine-tuning on a smaller multi-rate dataset.
- Highlights the importance of high-frequency information in speech assessment.
- Experimental results indicate significant performance enhancements.
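
The takeaway about high-frequency information comes down to simple Nyquist arithmetic: a signal sampled at a given rate can only represent frequencies up to half that rate. The sketch below works this out for a model pretrained on 16 kHz speech. The function names and the intermediate rates (24 and 32 kHz) are illustrative assumptions, not taken from the paper.

```python
from typing import Optional, Tuple

# SSL encoders are pretrained on 16 kHz speech (stated in the abstract).
SSL_PRETRAIN_RATE_HZ = 16_000


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


def band_lost_to_ssl(sample_rate_hz: int) -> Optional[Tuple[float, float]]:
    """Return the (low, high) band in Hz that is discarded when audio is
    resampled to the SSL pretraining rate, or None if nothing is lost."""
    ssl_limit = nyquist_hz(SSL_PRETRAIN_RATE_HZ)
    native_limit = nyquist_hz(sample_rate_hz)
    if native_limit <= ssl_limit:
        return None
    return (ssl_limit, native_limit)


# 16 and 48 kHz are the endpoints named in the abstract;
# 24 and 32 kHz are illustrative intermediate rates.
for rate in (16_000, 24_000, 32_000, 48_000):
    print(rate, band_lost_to_ssl(rate))
```

For 48 kHz speech this gives a lost band of 8 kHz to 24 kHz, which is two thirds of the representable bandwidth.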
Electrical Engineering and Systems Science > Audio and Speech Processing
arXiv:2602.14785 (eess)
[Submitted on 16 Feb 2026]
Title: SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment
Authors: Fengyuan Cao, Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee
Abstract: Designing a speech quality assessment (SQA) system for estimating mean-opinion-score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises due to the limited availability of a MOS-labeled training dataset comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present in higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. ...
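
To make the parallel-branch idea concrete, here is a minimal data-flow sketch: one branch sees the 16 kHz signal, the other sees the full-band 48 kHz signal, and their features are concatenated before a MOS head. This is not the authors' architecture. Every function here is a hypothetical stand-in: `ssl_branch` replaces a pretrained SSL encoder with two crude statistics, `spectral_branch` replaces a spectrogram encoder with naive DFT band energies, and the weights in `predict_mos` are untrained placeholders. Squashing the output to the 1-5 MOS scale with a sigmoid is likewise an assumption.

```python
import math
from typing import List, Sequence


def ssl_branch(wave_16k: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a pretrained SSL encoder: returns the
    mean and mean energy of the 16 kHz signal, not learned features."""
    n = len(wave_16k)
    mean = sum(wave_16k) / n
    energy = sum(x * x for x in wave_16k) / n
    return [mean, energy]


def spectral_branch(wave_48k: Sequence[float], n_bands: int = 3) -> List[float]:
    """Hypothetical stand-in for the spectrogram branch: log energy in
    roughly equal-width bands from 0 Hz to Nyquist, via a naive DFT."""
    n = len(wave_48k)
    n_bins = n // 2 + 1
    band_energy = [0.0] * n_bands
    for k in range(n_bins):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(wave_48k))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(wave_48k))
        band = min(k * n_bands // n_bins, n_bands - 1)
        band_energy[band] += re * re + im * im
    return [math.log(e + 1e-9) for e in band_energy]


def fuse(wave_16k: Sequence[float], wave_48k: Sequence[float]) -> List[float]:
    """Run both branches in parallel and concatenate their features."""
    return ssl_branch(wave_16k) + spectral_branch(wave_48k)


def predict_mos(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear head with a sigmoid squashed to the 1-5 MOS scale (assumed)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 + 4.0 / (1.0 + math.exp(-score))


# A 12 kHz tone sampled at 48 kHz. A properly band-limited 16 kHz version
# cannot contain it (Nyquist limit 8 kHz), so it is modeled here as silence.
tone_48k = [math.sin(2 * math.pi * 12_000 * t / 48_000) for t in range(96)]
tone_16k = [0.0] * 32

features = fuse(tone_16k, tone_48k)
weights = [0.0, 0.0, 0.1, 0.1, 0.1]  # untrained placeholder weights
print(features)
print(predict_mos(features, weights, bias=0.0))
```

In this toy case the 16 kHz branch returns all-zero features, while the middle band of the 48 kHz branch (roughly 8-16 kHz) carries the tone's energy, which is the kind of information the parallel branch is meant to preserve.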