[2602.19166] CosyAccent: Duration-Controllable Accent Normalization Using Source-Synthesis Training Data

Summary

The paper presents CosyAccent, a non-autoregressive approach to accent normalization trained on "source-synthesis" data: the L2-accented inputs are synthesized while the targets remain authentic native speech. The model improves naturalness and offers explicit control over output duration without requiring any real L2 recordings.

Why It Matters

Accent normalization converts L2-accented speech toward native pronunciation, which is crucial for intelligibility and naturalness, particularly in multilingual contexts. This research addresses two common pitfalls in the field, learning from TTS artifacts and rigid duration modeling, by constructing training data so that the target side is always authentic native speech, potentially leading to more effective applications in speech technology.

Key Takeaways

  • CosyAccent constructs its training data via a "source-synthesis" methodology: source L2 speech is generated synthetically, while targets are authentic native recordings (sketched below).
  • Because TTS output appears only on the input side, the model avoids learning TTS artifacts, and no real L2 data is needed for training.
  • The non-autoregressive design models rhythm implicitly for prosodic naturalness while still offering explicit control over total output duration.
  • Despite training without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.
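As a concrete illustration of the first two takeaways, here is a minimal sketch of how a source-synthesis training pair could be assembled. The `AccentedTTS` class and its `synthesize` method are hypothetical placeholders, not the paper's actual pipeline; the point is only that synthetic speech appears on the input side while the target stays authentic.

```python
# Sketch of "source-synthesis" pair construction (illustrative only).
# AccentedTTS is a hypothetical stand-in for any TTS system that can
# render a transcript with an L2 accent.
from dataclasses import dataclass


@dataclass
class TrainingPair:
    source_wav: bytes  # synthetic L2-accented speech (model input)
    target_wav: bytes  # authentic native recording (training target)
    text: str          # shared transcript


class AccentedTTS:
    """Hypothetical TTS that renders text with a chosen L2 accent."""

    def synthesize(self, text: str, accent: str) -> bytes:
        raise NotImplementedError("plug in a real accented TTS here")


def build_pairs(native_corpus, tts, accent="L2"):
    """Pair each authentic (text, native_wav) with a synthesized source.

    Because TTS output only ever appears as the input, the model is
    trained to remove its artifacts rather than reproduce them, and no
    real L2 recordings are required.
    """
    return [
        TrainingPair(tts.synthesize(text, accent=accent), native_wav, text)
        for text, native_wav in native_corpus
    ]
```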

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2602.19166 (eess) [Submitted on 22 Feb 2026]

Title: CosyAccent: Duration-Controllable Accent Normalization Using Source-Synthesis Training Data

Authors: Qibing Bai, Shuhao Shi, Shuai Wang, Yukai Ju, Yannan Wang, Haizhou Li

Abstract: Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generating source L2 speech and using authentic native speech as the training target, our approach avoids learning from TTS artifacts and, crucially, requires no real L2 data in training. Alongside this data strategy, we introduce CosyAccent, a non-autoregressive model that resolves the trade-off between prosodic naturalness and duration control. CosyAccent implicitly models rhythm for flexibility yet offers explicit control over total output duration. Experiments show that, despite being trained without any real L2 speech, CosyAccent achieves significantly improved content preservation and superior naturalness compared to strong baselines trained on real-world data.

Subjects: Audio and Speech Processing (eess.AS)
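The naturalness-versus-control trade-off described in the abstract can be made concrete with a small numerical sketch: let a model predict per-unit durations freely (implicit rhythm), then rescale them so they sum to a requested total (explicit duration control). This is a generic illustration under those assumptions, not CosyAccent's actual mechanism.

```python
import numpy as np


def rescale_durations(predicted: np.ndarray, total_frames: int) -> np.ndarray:
    """Rescale freely predicted per-unit durations to an exact total.

    Relative proportions (the predicted rhythm) are preserved; only the
    overall length is constrained. Largest-remainder rounding guarantees
    the integer frame counts sum exactly to `total_frames`.
    """
    scaled = predicted * (total_frames / predicted.sum())
    frames = np.floor(scaled).astype(int)
    remainder = total_frames - frames.sum()
    # Give the leftover frames to the units with the largest fractional parts.
    order = np.argsort(scaled - frames)[::-1]
    frames[order[:remainder]] += 1
    return frames


# Example: keep the output exactly as long as a 200-frame source utterance
# while preserving the model's predicted relative rhythm.
pred = np.array([3.2, 7.9, 4.4, 6.1])
print(rescale_durations(pred, 200))  # integer durations summing to 200
```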
