[2512.04552] RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS
Summary
The paper presents Robust Reward Policy Optimization (RRPO), a novel framework designed to enhance emotional text-to-speech (TTS) systems by mitigating reward hacking through a robust reward model.
Why It Matters
This research addresses a critical challenge in emotional TTS systems: reinforcement-learning-based training can degrade into reward hacking, where the policy inflates its reward score without actually improving perceptual quality. By aligning the reward signal more closely with human perception, RRPO improves the quality and expressiveness of synthesized speech, making it better suited to real-world use.
Key Takeaways
- RRPO employs a hybrid regularization scheme to improve reward model robustness.
- The framework effectively reduces reward hacking, enhancing emotional expressiveness in TTS.
- Ablation studies demonstrate strong cross-lingual generalization of the robust reward model.
- Subjective evaluations confirm significant improvements in naturalness over existing baselines.
- The findings have implications for developing more reliable emotional TTS applications.
Computer Science > Sound · arXiv:2512.04552 (cs)
[Submitted on 4 Dec 2025 (v1), last revised 15 Feb 2026 (this version, v3)]
Authors: Cong Wang, Changfeng Gao, Yang Xiang, Zhihao Du, Keyu An, Han Zhao, Qian Chen, Xiangang Li, Yingming Gao, Ya Li
Abstract: Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but they are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts that earn spurious rewards at the cost of degraded perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme yields a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional...
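The reward-hacking failure mode the abstract describes can be made concrete with a toy sketch. The snippet below is a generic perturbation-consistency regularization, not necessarily RRPO's actual hybrid scheme (the paper does not give its components here); `spiky_reward` and `smooth_reward` are hypothetical stand-ins for a hackable vanilla RM and a perception-aligned RM. The idea: an artifact that triggers a narrow reward spike scores poorly once the reward is averaged over small input perturbations and penalized for variability.

```python
import numpy as np

def spiky_reward(x):
    # Toy "hackable" reward: a narrow spike a policy could exploit
    # with an acoustic-artifact-like input (reward hacking analogue).
    return np.where(np.abs(x - 2.0) < 0.05, 10.0, 0.0)

def smooth_reward(x):
    # Toy perception-aligned reward: highest near x = 2.0, degrades gently.
    return np.clip(1.0 - np.abs(x - 2.0), 0.0, 1.0)

def consistent_reward(reward_fn, x, n=64, sigma=0.5, lam=1.0, seed=0):
    """Perturbation-consistency regularization (a generic robustness
    technique, hypothetical here): score an input by its mean reward
    under small input noise, minus a penalty on the reward's spread.
    Spurious, artifact-driven spikes collapse under this scoring."""
    rng = np.random.default_rng(seed)
    noisy = x + sigma * rng.standard_normal((n,) + np.shape(x))
    r = reward_fn(noisy)
    return r.mean(axis=0) - lam * r.std(axis=0)
```

At the exploit point `x = 2.0` the spiky reward looks far better raw (10 vs. 1), but under the consistency-regularized score it falls below the smooth, perception-aligned reward, which is the qualitative behavior a robust RM is meant to enforce.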