[2509.19852] Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance

arXiv - AI 3 min read Article

Summary

This paper addresses stability hallucinations (repeated or omitted speech) in LLM-based TTS models by enhancing the attention mechanism, proposing a new text-speech alignment metric, and demonstrating reduced synthesis errors on standard test sets.

Why It Matters

Stability hallucinations in text-to-speech systems can significantly impact user experience and application reliability. By improving alignment between text and speech, this research contributes to the development of more accurate and stable TTS models, which is crucial for advancements in AI-driven communication technologies.

Key Takeaways

  • Introduces a novel metric, Optimal Alignment Score (OAS), to evaluate text-speech alignment.
  • Implements attention guidance to reduce stability hallucinations in synthesized speech.
  • Reduces stability hallucinations on the Seed-TTS-Eval and CV3-Eval test sets.
  • Enhances training methods for LLMs to achieve continuous and stable alignment.
  • Addresses a critical issue in TTS systems, improving overall user experience.
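
The Optimal Alignment Score above uses the Viterbi algorithm to judge how well the model's attention supports a monotonic text-to-speech alignment. A minimal sketch of that idea, assuming attention weights are available as a (T_text, T_speech) NumPy matrix (the paper's exact formulation may differ):

```python
import numpy as np

def optimal_alignment_score(attn):
    """Viterbi-style dynamic program: find the best monotonic path
    through an attention matrix from (0, 0) to (-1, -1).

    attn: (T_text, T_speech) attention weights; rows are text tokens,
    columns are speech tokens. A well-aligned utterance admits a
    high-weight monotonic path; hallucinations (repeats, omissions)
    break that path and lower the score. Illustrative only, not the
    paper's exact metric.
    """
    T, S = attn.shape
    logp = np.log(attn + 1e-9)
    dp = np.full((T, S), -np.inf)
    dp[0, 0] = logp[0, 0]          # alignment must start at the first tokens
    for s in range(1, S):
        for t in range(T):
            stay = dp[t, s - 1]    # keep attending to the same text token
            move = dp[t - 1, s - 1] if t > 0 else -np.inf  # advance one text token
            dp[t, s] = logp[t, s] + max(stay, move)
    # normalize by path length so scores are comparable across utterances
    return dp[-1, -1] / S
```

A near-diagonal attention matrix (good alignment) scores higher than a diffuse, uniform one, which is the property a training objective built on such a score would reward.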

Computer Science > Sound

arXiv:2509.19852 (cs)

This paper has been withdrawn by ShiMing Wang.

[Submitted on 24 Sep 2025 (v1), last revised 13 Feb 2026 (this version, v2)]

Title: Eliminating Stability Hallucinations in LLM-Based TTS Models via Attention Guidance

Authors: ShiMing Wang, ZhiHao Du, Yang Xiang, TianYu Zhao, Han Zhao, Qian Chen, XianGang Li, HanJie Guo, ZhenHua Ling

Abstract: This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism. First, we analyzed the alignment mechanism between text tokens and speech tokens in LLMs. We then proposed a metric termed the Optimal Alignment Score (OAS), which employs the Viterbi algorithm to evaluate text-speech alignment quality. Subsequently, OAS was integrated into the training of CosyVoice2 to assist LLMs in learning continuous, stable alignment. Additionally, the pre-trained attention value is employed to guide the training of the student CosyVoice2 via chain-of-thought (CoT), which further reduces stability hallucinations in synthesized speech. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate that the proposed methods can effectively reduce the stability hallucinations of CosyVoice2 wi...
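The abstract also describes using pre-trained (teacher) attention values to guide the training of a student CosyVoice2. A generic way to express such attention guidance, shown here as a hypothetical auxiliary loss rather than the paper's actual formulation, is a KL-divergence term pulling student attention toward teacher attention:

```python
import numpy as np

def attention_guidance_loss(student_attn, teacher_attn, eps=1e-9):
    """Mean row-wise KL divergence KL(teacher || student) between two
    (T_text, T_speech) attention matrices, each row normalized to a
    distribution over speech tokens. A hypothetical auxiliary term
    that would be added to the student's TTS training loss; the
    paper's CoT-based guidance may differ.
    """
    s = student_attn / (student_attn.sum(-1, keepdims=True) + eps)
    t = teacher_attn / (teacher_attn.sum(-1, keepdims=True) + eps)
    kl_rows = np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)
    return float(np.mean(kl_rows))
```

The loss is zero when the student's attention matches the teacher's and grows as the two diverge, so minimizing it transfers the teacher's stable alignment behavior to the student.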
