[2602.13928] voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models

arXiv - Machine Learning

Summary

The paper presents voice2mode, a method for classifying four singing phonation modes using self-supervised speech models, demonstrating significant accuracy improvements over traditional methods.

Why It Matters

This research highlights the potential of self-supervised models in enhancing phonation mode classification in singing, which can impact music technology, vocal training, and speech recognition applications. The findings suggest a shift from traditional feature extraction to leveraging advanced machine learning techniques.

Key Takeaways

  • voice2mode classifies four phonation modes: breathy, neutral, flow, and pressed.
  • Utilizes embeddings from self-supervised models like HuBERT and wav2vec2 for improved accuracy.
  • Achieved ~95.7% accuracy using an SVM on early-layer HuBERT embeddings, an absolute improvement of ~12-15% over the best traditional spectral baseline.
  • Lower-layer embeddings are more effective for phonation classification than the top layers, which are specialized for ASR.
  • Demonstrates the transferability of speech models to singing applications.

Computer Science > Sound · arXiv:2602.13928 (cs) · Submitted on 14 Feb 2026

Title: voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models

Authors: Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Abstract: We present voice2mode, a method for classifying four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower lay…
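The pipeline described above can be sketched end-to-end: pool frame-level embeddings from one transformer layer over time, then fit a lightweight classifier on the pooled vectors. The sketch below is a minimal, hedged illustration only: it uses random class-dependent toy data in place of real HuBERT/wav2vec2 features, and a nearest-centroid rule in place of the paper's SVM/XGBoost, so it stays dependency-free; `fake_clip` and the dimensions are invented for demonstration.

```python
import numpy as np

MODES = ["breathy", "neutral", "flow", "pressed"]

def temporal_pool(frame_embeddings):
    """Global mean pooling over time: (T, D) -> (D,)."""
    return frame_embeddings.mean(axis=0)

# Toy stand-in for frame-level embeddings from one transformer layer.
# (In the paper these come from HuBERT / wav2vec2; here each mode is
# a Gaussian cluster so the pipeline shape is visible.)
rng = np.random.default_rng(0)
def fake_clip(mode_idx, frames=100, dim=16):
    return rng.normal(loc=mode_idx, scale=0.3, size=(frames, dim))

# Build pooled clip-level embeddings, a few clips per mode.
X, y = [], []
for idx, mode in enumerate(MODES):
    for _ in range(5):
        X.append(temporal_pool(fake_clip(idx)))
        y.append(idx)
X, y = np.stack(X), np.array(y)

# Lightweight classifier on pooled embeddings: nearest class centroid
# (the paper uses SVM / XGBoost; a centroid rule keeps this runnable
# without extra dependencies).
centroids = np.stack([X[y == k].mean(axis=0) for k in range(len(MODES))])
def predict(pooled):
    return MODES[int(np.argmin(np.linalg.norm(centroids - pooled, axis=1)))]

print(predict(temporal_pool(fake_clip(2))))  # prints "flow"
```

In the real method, `fake_clip` would be replaced by hidden states from a chosen HuBERT or wav2vec2 layer, and the paper's layer-wise finding suggests picking an early layer rather than the final ASR-specialized one.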
