
[2602.19574] CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

arXiv - AI February 24, 2026 3 min read Article

Summary

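The paper presents CTC-TTS, an LLM-based dual-streaming text-to-speech system that uses a CTC-based aligner in place of MFA and a bi-word based interleaving strategy for text and speech tokens. The authors report that it outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks.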

Why It Matters

As demand for real-time speech synthesis grows, low-latency TTS matters more. According to the abstract, most LLM-based TTS systems generate natural speech but are not designed for low-latency dual-streaming synthesis. CTC-TTS targets that gap, which makes it relevant to conversational AI, virtual assistants, and accessibility tools.

Key Takeaways

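CTC-TTS replaces GMM-HMM based forced-alignment toolkits such as MFA, which the authors describe as pipeline-heavy and less flexible than neural aligners, with a CTC-based aligner.
The system introduces a bi-word based interleaving strategy for arranging text and speech tokens in training sequences.
Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency.
Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks.
The authors provide speech samples online.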
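
To make the contrast between fixed-ratio and alignment-driven interleaving concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the function names, the 1:3 ratio, and the per-word spans are hypothetical, and it shows a generic word-level scheme rather than the bi-word strategy, whose details are not given in this summary.

```python
from typing import List, Sequence, Tuple


def fixed_ratio_interleave(
    text: Sequence[str], speech: Sequence[str], n_text: int = 1, n_speech: int = 3
) -> List[str]:
    """Alternate n_text text tokens with n_speech speech tokens.

    The ratio is fixed in advance, so it ignores how long each word
    actually lasts in the audio (hypothetical 1:3 default).
    """
    out: List[str] = []
    t = s = 0
    while t < len(text) or s < len(speech):
        out.extend(text[t:t + n_text])
        t += n_text
        out.extend(speech[s:s + n_speech])
        s += n_speech
    return out


def aligned_interleave(
    words: Sequence[str], speech: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> List[str]:
    """Emit each word followed by the speech tokens aligned to it.

    `spans` holds one (start, end) index pair per word, as an aligner
    might produce; the values used below are made up for illustration.
    """
    out: List[str] = []
    for word, (start, end) in zip(words, spans):
        out.append(word)
        out.extend(speech[start:end])
    return out


words = ["hi", "there"]
speech = [f"s{i}" for i in range(8)]

fixed = fixed_ratio_interleave(words, speech)
aligned = aligned_interleave(words, speech, spans=[(0, 2), (2, 8)])

print(fixed)    # "there" is followed by s3..s5 regardless of its real duration
print(aligned)  # "there" is followed by exactly the tokens aligned to it
```

With a fixed ratio, the speech tokens that follow a word may belong to a neighbouring word; with alignment spans, each word is grouped with its own speech tokens.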

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2602.19574 (eess) [Submitted on 23 Feb 2026]

Title: CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

Authors: Hanwen Liu, Saierdaer Yusuyin, Hao Huang, Zhijian Ou

Abstract: Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text-speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM based forced-alignment toolkits (e.g., MFA), which are pipeline-heavy and less flexible than neural aligners; fixed-ratio interleaving of text and speech tokens struggles to capture text-speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC based aligner and introduces a bi-word based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at this https URL.
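
The difference between combining two streams along the sequence length and along the feature dimension can be sketched in a few lines of Python. This is one plausible reading of the two variant descriptions, not the authors' code: the function names, the toy embeddings, and the zero-padding of the shorter stream are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def concat_along_sequence(text_emb: List[Vector], speech_emb: List[Vector]) -> List[Vector]:
    """Place the two streams one after another.

    Result has len(text_emb) + len(speech_emb) positions, each of width d.
    """
    return list(text_emb) + list(speech_emb)


def stack_along_features(
    text_emb: List[Vector], speech_emb: List[Vector], pad_value: float = 0.0
) -> List[Vector]:
    """Join the two streams position by position.

    Result has max(len(text_emb), len(speech_emb)) positions, each of
    width d_text + d_speech. The shorter stream is padded with a constant
    vector (an assumption; a real system would likely use a learned pad).
    """
    text_pad = [pad_value] * len(text_emb[0])
    speech_pad = [pad_value] * len(speech_emb[0])
    n = max(len(text_emb), len(speech_emb))
    stacked: List[Vector] = []
    for i in range(n):
        t = text_emb[i] if i < len(text_emb) else text_pad
        s = speech_emb[i] if i < len(speech_emb) else speech_pad
        stacked.append(t + s)
    return stacked


# Toy embeddings: 2 text positions and 3 speech positions, width 2 each.
text_emb = [[1.0, 2.0], [3.0, 4.0]]
speech_emb = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

seq = concat_along_sequence(text_emb, speech_emb)
feat = stack_along_features(text_emb, speech_emb)

print(len(seq), len(seq[0]))    # 5 positions of width 2
print(len(feat), len(feat[0]))  # 3 positions of width 4
```

Under this reading, stacking along the feature dimension yields a shorter sequence for the model to process, which is consistent with the abstract associating CTC-TTS-F with lower latency.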
