[2510.14628] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis
Computer Science > Computation and Language
arXiv:2510.14628 (cs)
[Submitted on 16 Oct 2025 (v1), last revised 7 Apr 2026 (this version, v2)]
Title: RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis
Authors: Qing Yang, Zhenghao Liu, Yangfan Du, Pengcheng Huang, Tong Xiao
Abstract: Recent advances in Text-To-Speech (TTS) synthesis have achieved near-human speech quality in neutral speaking styles. However, most existing approaches either depend on costly emotion annotations or optimize surrogate objectives that fail to adequately capture perceptual emotional quality. As a result, the generated speech, while semantically accurate, often lacks expressive and emotionally rich characteristics. To address these limitations, we propose RLAIF-SPA, a novel framework that integrates Reinforcement Learning from AI Feedback (RLAIF) to directly optimize both emotional expressiveness and intelligibility without human supervision. Specifically, RLAIF-SPA incorporates Automatic Speech Recognition (ASR) to provide semantic accuracy feedback, while leveraging structured reward modeling to evaluate prosodic-emotional consistency. RLAIF-SPA enables more precise and nuanced control over expressive speech generation along four structured evaluation dimensions: Structure, Emotion, Speed...
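The abstract pairs an ASR-based semantic signal with structured per-dimension scores. As a rough illustration only, the sketch below shows one way such signals could be folded into a single scalar reward. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `word_error_rate` and `combined_reward`, the use of word error rate as the semantic-accuracy measure, the [0, 1] score range, the plain averaging of dimension scores, and the `semantic_weight` parameter. The ASR transcript and the dimension scores are taken as given inputs, so no ASR or judge model is called.

```python
from typing import Mapping


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def combined_reward(
    reference: str,
    transcript: str,
    dimension_scores: Mapping[str, float],
    semantic_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Blend a semantic score (1 - WER, floored at 0) with the mean of the
    structured dimension scores. Scores are assumed to lie in [0, 1]."""
    if not dimension_scores:
        raise ValueError("dimension_scores must not be empty")
    semantic = max(0.0, 1.0 - word_error_rate(reference, transcript))
    prosodic = sum(dimension_scores.values()) / len(dimension_scores)
    return semantic_weight * semantic + (1.0 - semantic_weight) * prosodic


if __name__ == "__main__":
    # Illustrative scores only; in practice an AI judge would supply them.
    scores = {"Structure": 0.8, "Emotion": 0.6, "Speed": 0.7}
    print(combined_reward("the cat sat", "the dog sat", scores))
```

The dimension scores are passed as a mapping so that any set of evaluation dimensions can be supplied without changing the function; how the paper actually weights or aggregates them is not stated in the text above.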