[2509.19852] Eliminating stability hallucinations in LLM-based TTS models via attention guidance
Summary
This paper addresses stability hallucinations in LLM-based TTS models by enhancing the attention mechanism, proposing a new alignment metric, and demonstrating reduced speech-synthesis errors on standard benchmarks.
Why It Matters
Stability hallucinations in text-to-speech systems can significantly impact user experience and application reliability. By improving alignment between text and speech, this research contributes to the development of more accurate and stable TTS models, which is crucial for advancements in AI-driven communication technologies.
Key Takeaways
- Introduces a novel metric, Optimal Alignment Score (OAS), to evaluate text-speech alignment.
- Implements attention guidance to reduce stability hallucinations in synthesized speech.
- Validates the approach on the Seed-TTS-Eval and CV3-Eval test sets, reducing stability hallucinations in CosyVoice2.
- Integrates OAS into CosyVoice2 training to help the LLM learn continuous, stable text-speech alignment.
- Addresses a critical issue in TTS systems, improving overall user experience.
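The paper does not spell out the OAS computation here beyond "Viterbi over text-speech alignment," so the following is only an illustrative sketch: a dynamic program over an attention matrix that, at each speech step, either stays on the current text token or advances to the next one. The monotonic stay-or-advance transition model, the log-score accumulation, and the per-step normalization are all assumptions, not the paper's definition.

```python
import numpy as np

def optimal_alignment_score(attn, eps=1e-8):
    """Hypothetical OAS-style metric over an attention matrix.

    attn: (T_speech, N_text) array of attention weights from a TTS LLM.
    Runs a Viterbi-style DP along monotonic paths from the first to the
    last text token and returns the normalized best-path score in (0, 1].
    """
    T, N = attn.shape
    log_a = np.log(attn + eps)
    # dp[t, n] = best log-score of a monotonic path ending at (t, n)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = log_a[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]                       # keep attending to token n
            move = dp[t - 1, n - 1] if n > 0 else -np.inf  # advance to token n
            dp[t, n] = max(stay, move) + log_a[t, n]
    # geometric-mean per-step score along the best full path
    return float(np.exp(dp[T - 1, N - 1] / T))
```

Under this sketch, a sharply diagonal attention map (clean monotonic alignment) scores near 1, while diffuse or repetitive attention, the signature of omissions and loops, scores low, which is what makes such a quantity usable as a training signal.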
Computer Science > Sound — arXiv:2509.19852 (cs)

This paper has been withdrawn by ShiMing Wang.

[Submitted on 24 Sep 2025 (v1), last revised 13 Feb 2026 (this version, v2)]

Title: Eliminating stability hallucinations in LLM-based TTS models via attention guidance

Authors: ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling

Abstract: This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 wi...
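The abstract's idea of using a pre-trained teacher's attention to guide a student model can be loosely pictured as a distillation term added to the student's training loss. The KL-divergence form below is an assumption for illustration, not the paper's actual objective, and the tensor shapes are hypothetical:

```python
import numpy as np

def attention_guidance_loss(student_attn, teacher_attn, eps=1e-8):
    """Hypothetical attention-distillation loss.

    Both inputs are (batch, T_speech, N_text) attention distributions over
    text tokens (each row sums to 1).  KL(teacher || student), averaged over
    batch and speech steps, pulls the student's alignment toward the
    teacher's presumably stable one.
    """
    kl = teacher_attn * (np.log(teacher_attn + eps) - np.log(student_attn + eps))
    return float(kl.sum(axis=-1).mean())
```

The loss is zero when the student's attention matches the teacher's and grows as the student's alignment drifts, so adding it to the ordinary TTS objective penalizes the unstable attention patterns associated with repeated or omitted speech.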