
[2602.21772] UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

arXiv - AI, February 26, 2026

Summary

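UniWhisper is an efficient continual multi-task training framework that casts heterogeneous audio tasks into a single instruction and answer format. Evaluated on 20 tasks spanning speech, environmental sound, and music, its encoder outperforms Whisper with both MLP probes and kNN.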

Why It Matters

This research addresses the limitations of current audio encoders that excel in specific domains but struggle in others. By proposing a unified training approach, UniWhisper aims to improve the robustness of audio representations, which is crucial for applications in speech recognition, environmental sound classification, and music analysis.

Key Takeaways

UniWhisper uses a continual multi-task training framework that covers speech, environmental sound, and music tasks.
It reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN across 20 tasks, compared to 0.64 and 0.46 for Whisper.
The model is trained on 38k hours of public audio.
UniWhisper retains strong speech performance while improving representations for environmental sound and music.
The unified instruction and answer format enables standard next-token training without task-specific heads and losses.
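
To make the unified instruction and answer format concrete, here is a minimal Python sketch of how labeled examples from different tasks could be mapped to the same text-pair shape. The task names, prompt wording, file names, and the `Sample` and `to_sample` helpers are all hypothetical; the paper's actual templates are not given in this summary.

```python
from dataclasses import dataclass

# Hypothetical instruction templates, one per task (assumed wording).
TEMPLATES = {
    "asr": "Transcribe the speech.",
    "sound_event": "Which sound event occurs in the audio?",
    "music_genre": "What is the genre of this music?",
}


@dataclass
class Sample:
    audio_path: str   # input audio for the encoder
    instruction: str  # text prompt describing the task
    answer: str       # text target for next-token training


def to_sample(task: str, audio_path: str, label: str) -> Sample:
    """Map a (task, audio, label) triple to a single text-pair format."""
    return Sample(audio_path, TEMPLATES[task], str(label))


# Three different tasks end up with the same record structure.
samples = [
    to_sample("asr", "clip_001.wav", "turn on the lights"),
    to_sample("sound_event", "clip_002.wav", "dog bark"),
    to_sample("music_genre", "clip_003.wav", "jazz"),
]

for s in samples:
    print(f"{s.audio_path} | {s.instruction} -> {s.answer}")
```

Because every target is plain text, one decoder and one next-token loss can serve all tasks, which is the property the takeaway above describes.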

arXiv:2602.21772 (cs)
[Submitted on 25 Feb 2026]

Title: UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation
Authors: Yuxuan Chen, Peize He, Haoyuan Xu, Junzi Zhang

Abstract: A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.21772 [cs.SD] (or arXiv:2602.21772v1 [cs.SD] for this version)
https://doi.org/10.48550/arXiv.2602.21772
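
The kNN evaluation mentioned in the abstract can be illustrated with a short sketch. It assumes embeddings have already been extracted from a frozen encoder and uses cosine similarity with a majority vote; the value of k, the distance metric, and the way per-task scores are normalized and weighted are not specified in this summary, so the choices below are assumptions for illustration only.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=5):
    """Classify test embeddings by majority vote over the k nearest
    training embeddings under cosine similarity (assumed metric)."""
    # L2-normalize rows so the dot product equals cosine similarity.
    train_n = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    test_n = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = test_n @ train_n.T                   # shape (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # k most similar per row
    nn_labels = train_y[nn_idx]                 # shape (n_test, k)
    # Labels must be non-negative integers for np.bincount.
    return np.array([np.bincount(row).argmax() for row in nn_labels])


def knn_accuracy(train_x, train_y, test_x, test_y, k=5):
    preds = knn_predict(train_x, train_y, test_x, k)
    return float(np.mean(preds == test_y))


def weighted_average(scores, weights):
    """Plain weighted mean of per-task scores (assumed aggregation)."""
    s = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float((s * w).sum() / w.sum())


# Toy embeddings standing in for frozen encoder outputs.
train_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[1.0, 0.05], [0.05, 1.0]])
test_y = np.array([0, 1])

acc = knn_accuracy(train_x, train_y, test_x, test_y, k=3)
print("kNN accuracy:", acc)
```

Since kNN has no trainable parameters, it measures how well the encoder's embedding space already separates classes, which complements the shallow MLP probe.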
