[2602.19574] CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

Summary

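The paper presents CTC-TTS, an LLM-based dual-streaming text-to-speech system that uses a CTC-based aligner and a bi-word interleaving strategy to improve text-speech alignment and reduce latency. The authors report that it outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks.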

Why It Matters

Demand for real-time speech synthesis is growing, yet most LLM-based TTS systems are not designed for low-latency dual-streaming synthesis. CTC-TTS targets natural-sounding speech at low latency, which makes it relevant to conversational AI, virtual assistants, and accessibility tools.

Key Takeaways

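  • CTC-TTS replaces the GMM-HMM based forced aligner used in prior work (MFA) with a CTC-based aligner; the authors describe MFA-style toolkits as pipeline-heavy and less flexible than neural aligners.
  • The system introduces a bi-word based interleaving strategy for arranging text and speech tokens in training sequences.
  • Two variants are proposed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency.
  • Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks.
  • The authors provide speech samples through a link in the abstract.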
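
The difference between fixed-ratio and alignment-driven interleaving mentioned in the takeaways can be illustrated with a small sketch. The function names, the toy tokens, and the 1:2 ratio are assumptions made for illustration. The sketch does not attempt to reproduce the paper's bi-word scheme, whose details are not given in the abstract; it only shows how a word-level alignment could determine where speech tokens are placed.

```python
from typing import List, Sequence


def fixed_ratio_interleave(text: Sequence[str], speech: Sequence[str],
                           n_text: int = 1, n_speech: int = 2) -> List[str]:
    """Alternate n_text text tokens with n_speech speech tokens, ignoring alignment."""
    if n_text < 1 or n_speech < 1:
        raise ValueError("chunk sizes must be positive")
    out: List[str] = []
    t = s = 0
    while t < len(text) or s < len(speech):
        out.extend(text[t:t + n_text])
        t += n_text
        out.extend(speech[s:s + n_speech])
        s += n_speech
    return out


def aligned_interleave(words: Sequence[str],
                       speech_per_word: Sequence[Sequence[str]]) -> List[str]:
    """Follow each word with exactly the speech tokens an aligner assigned to it."""
    if len(words) != len(speech_per_word):
        raise ValueError("need one speech-token group per word")
    out: List[str] = []
    for word, tokens in zip(words, speech_per_word):
        out.append(word)
        out.extend(tokens)
    return out


# Hypothetical alignment: "hi" spans one speech token, "there" spans three.
words = ["hi", "there"]
speech_per_word = [["s0"], ["s1", "s2", "s3"]]
flat_speech = [tok for group in speech_per_word for tok in group]

print(fixed_ratio_interleave(words, flat_speech))
# ['hi', 's0', 's1', 'there', 's2', 's3']  (s1 is emitted before its word)
print(aligned_interleave(words, speech_per_word))
# ['hi', 's0', 'there', 's1', 's2', 's3']
```

In the fixed-ratio output, the speech token `s1` belongs to "there" but is generated before that word has been seen, which is the kind of mismatch an aligner-driven layout avoids.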

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2602.19574 (eess)
[Submitted on 23 Feb 2026]

Title: CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

Authors: Hanwen Liu, Saierdaer Yusuyin, Hao Huang, Zhijian Ou

Abstract: Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text-speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text-speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC based aligner and introduces a bi-word based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at this https URL.

Comments: ...
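
The two layouts named in the abstract, concatenation along the sequence length and stacking along the feature dimension, differ mainly in the shape of what the LLM consumes. The following is a shape-only sketch under stated assumptions: the function names and toy embeddings are hypothetical, whole sequences are combined rather than interleaved chunks, and the equal-length requirement for stacking is an assumption of this sketch, since the abstract does not say how positions are paired.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # one row per position, one column per feature


def concat_along_length(text_emb: Sequence[Sequence[float]],
                        speech_emb: Sequence[Sequence[float]]) -> Matrix:
    """(T_text, d) and (T_speech, d) -> (T_text + T_speech, d)."""
    return [list(row) for row in text_emb] + [list(row) for row in speech_emb]


def stack_along_feature(text_emb: Sequence[Sequence[float]],
                        speech_emb: Sequence[Sequence[float]]) -> Matrix:
    """(T, d_text) and (T, d_speech) -> (T, d_text + d_speech)."""
    if len(text_emb) != len(speech_emb):
        raise ValueError("stacking needs the same number of positions in both streams")
    return [list(t) + list(s) for t, s in zip(text_emb, speech_emb)]


# Toy embeddings: two positions, two features each.
text_emb = [[1.0, 2.0], [3.0, 4.0]]
speech_emb = [[5.0, 6.0], [7.0, 8.0]]

long_layout = concat_along_length(text_emb, speech_emb)
wide_layout = stack_along_feature(text_emb, speech_emb)

print(len(long_layout), len(long_layout[0]))  # 4 2
print(len(wide_layout), len(wide_layout[0]))  # 2 4
```

The stacked layout halves the number of positions in this toy case, so a model would take fewer autoregressive steps over it. That is one plausible reason a feature-stacked variant could have lower latency, though the abstract itself does not give the mechanism.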
