[2512.04552] RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS


Summary

The paper presents Robust Reward Policy Optimization (RRPO), a novel framework designed to enhance emotional text-to-speech (TTS) systems by mitigating reward hacking through a robust reward model.

Why It Matters

This research addresses a critical failure mode in emotional TTS: reinforcement learning policies can exploit flaws in the reward model (reward hacking), earning high scores while degrading perceptual quality. By aligning the reward signal more closely with human perception, RRPO improves the quality and expressiveness of synthesized speech, making emotional TTS more dependable in real-world use.

Key Takeaways

  • RRPO employs a hybrid regularization scheme to improve reward model robustness.
  • The framework effectively reduces reward hacking, enhancing emotional expressiveness in TTS.
  • Ablation studies demonstrate strong cross-lingual generalization of the robust reward model.
  • Subjective evaluations confirm significant improvements in naturalness over existing baselines.
  • The findings have implications for developing more reliable emotional TTS applications.

Computer Science > Sound
arXiv:2512.04552 (cs) [Submitted on 4 Dec 2025 (v1), last revised 15 Feb 2026 (this version, v3)]

Title: RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
Authors: Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li

Abstract: Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional...
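The summary does not detail RRPO's hybrid regularization scheme, but the reward-hacking problem it targets can be illustrated with a toy sketch. Below, a hypothetical reward model contains a smooth "genuine emotion" signal plus a narrow, brittle spike that a policy could exploit for spurious reward; smoothing the reward under input perturbations (one simple robustness technique, not necessarily the paper's) dilutes the spike while preserving the genuine signal. All names and values here are illustrative assumptions.

```python
import random

random.seed(0)

def vanilla_reward(x):
    # Toy reward model over a 1-D "output feature" x:
    # a smooth genuine-emotion term (peaks at x = 0, the honest optimum)
    # plus a narrow artifact spike at x = 1.5 that a policy can exploit.
    genuine = 1.0 / (1.0 + abs(x))
    artifact = 2.0 if abs(x - 1.5) < 0.05 else 0.0
    return genuine + artifact

def robust_reward(x, noise_scale=0.3, n_samples=200):
    # Smoothing-style regularization (illustrative only): average the
    # reward under Gaussian input perturbations. Narrow artifact spikes
    # are diluted, while the smooth genuine signal largely survives.
    total = sum(
        vanilla_reward(x + random.gauss(0.0, noise_scale))
        for _ in range(n_samples)
    )
    return total / n_samples

# A "hacked" output sits on the artifact spike; an honest one does not.
print(vanilla_reward(1.5))                       # inflated by the spurious spike
print(robust_reward(1.5))                        # spike diluted under smoothing
print(vanilla_reward(0.0), robust_reward(0.0))   # genuine signal mostly preserved
```

Under this sketch, a policy optimizing the smoothed reward has little incentive to chase the artifact spike, which is the qualitative behavior RRPO's robust RM is reported to achieve.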
