[2602.13928] voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models
Summary
The paper presents voice2mode, a method for classifying four singing phonation modes using embeddings from self-supervised speech models, demonstrating an absolute accuracy improvement of roughly 12-15% over traditional spectral baselines.
Why It Matters
This research highlights the potential of self-supervised models in enhancing phonation mode classification in singing, which can impact music technology, vocal training, and speech recognition applications. The findings suggest a shift from traditional feature extraction to leveraging advanced machine learning techniques.
Key Takeaways
- voice2mode classifies four phonation modes: breathy, neutral, flow, and pressed.
- Utilizes embeddings from self-supervised models like HuBERT and wav2vec2 for improved accuracy.
- Achieved ~95.7% accuracy with an SVM on early-layer HuBERT embeddings, an absolute improvement of ~12-15% over the best traditional spectral baseline.
- Lower-layer embeddings are more effective for phonation classification than the top layers, which are specialized for ASR.
- Demonstrates the transferability of speech models to singing applications.
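The pipeline described above (frame-level SSL embeddings → global temporal pooling → lightweight classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding function is a synthetic stand-in (in practice, frame-level features would be extracted from a chosen HuBERT or wav2vec2 layer), and scikit-learn is assumed for the SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_layer_embeddings(class_idx, num_frames=50, dim=768):
    """Hypothetical stand-in for one recording's frame-level embeddings.

    In voice2mode these would come from a self-supervised model layer,
    shape (num_frames, hidden_dim). Here we simulate separable classes
    by shifting the mean per phonation-mode index.
    """
    return rng.normal(loc=class_idx, scale=1.0, size=(num_frames, dim))

def temporal_pool(frames):
    """Global temporal (mean) pooling: (T, D) -> (D,)."""
    return frames.mean(axis=0)

labels = ["breathy", "neutral", "flow", "pressed"]
X, y = [], []
for idx, lab in enumerate(labels):
    for _ in range(40):  # 40 simulated recordings per mode
        X.append(temporal_pool(fake_layer_embeddings(idx)))
        y.append(lab)
X, y = np.array(X), np.array(y)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)  # lightweight classifier on pooled embeddings
acc = clf.score(X_te, y_te)
print(f"accuracy: {acc:.2f}")
```

Because the pooled embedding is a fixed-length vector per recording, any off-the-shelf classifier (SVM, XGBoost) can be trained on it without fine-tuning the speech model itself.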
Computer Science > Sound
arXiv:2602.13928 (cs) [Submitted on 14 Feb 2026]
Title: voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models
Authors: Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Abstract: We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower lay...