[2602.20113] StyleStream: Real-Time Zero-Shot Voice Style Conversion
Summary
StyleStream introduces a real-time zero-shot voice style conversion system that transforms an input utterance to match a target speaker's timbre, accent, and emotion by disentangling linguistic content from style attributes, achieving state-of-the-art performance.
Why It Matters
This research addresses two limitations of previous voice style conversion methods, limited conversion quality and the lack of real-time operation, by enabling streaming processing, which is crucial for applications in virtual assistants, gaming, and entertainment. The ability to convert voice styles without prior training on target styles opens new avenues for personalized user experiences.
Key Takeaways
- StyleStream achieves real-time voice style conversion with an end-to-end latency of 1 second.
- The system utilizes a Destylizer and a Stylizer for effective content-style disentanglement.
- It operates in a zero-shot manner, requiring no prior training on target styles.
- Robust performance is ensured through text supervision and a constrained information bottleneck.
- This advancement has significant implications for applications in AI-driven voice synthesis.
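The Destylizer/Stylizer split described above can be sketched as a minimal streaming pipeline. This is a hypothetical illustration of the data flow only, not the authors' implementation: the function bodies (coarse quantization as a stand-in for the information bottleneck, energy matching as a stand-in for DiT-based style conditioning) are placeholder assumptions.

```python
# Hypothetical sketch of StyleStream's two-stage pipeline (not the paper's code).
# destylize() and stylize() stand in for the Destylizer and the DiT Stylizer.

def destylize(chunk):
    # Remove style: keep only a coarsely quantized "content" code per sample,
    # discarding fine-grained detail -- a toy information bottleneck.
    return [round(x, 1) for x in chunk]

def stylize(content, style_ref):
    # Reintroduce target style: here, scale content by the reference's mean
    # energy as a placeholder for conditioning on reference speech.
    gain = sum(abs(s) for s in style_ref) / max(len(style_ref), 1)
    return [c * gain for c in content]

def convert_stream(chunks, style_ref):
    # Non-autoregressive: each chunk is converted independently of the
    # others, which is what makes low, bounded latency possible.
    for chunk in chunks:
        yield stylize(destylize(chunk), style_ref)

out = list(convert_stream([[0.12, -0.34], [0.56, 0.78]], style_ref=[0.5, 1.5]))
```

The key design point the sketch mirrors is that no step conditions on previously generated output, so each audio chunk can be processed as soon as it arrives.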
arXiv:2602.20113 [cs.SD] · Submitted on 23 Feb 2026
Title: StyleStream: Real-Time Zero-Shot Voice Style Conversion
Authors: Yisi Liu, Nicholas Lee, Gopala Anumanchipalli
Abstract: Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second. Samples and real-time demo: this https URL.
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)