[2603.19831] Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?
Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2603.19831 (eess)

[Submitted on 20 Mar 2026]

Title: Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

Authors: Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik

Abstract: Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-gr...
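The abstract describes two core components: an MoE module that fuses linguistic and gesture features into a style representation, and a gesture-speech alignment loss over their temporal correspondence. The sketch below illustrates one plausible reading of both, assuming frame-aligned text and gesture features; all class names, dimensions, and the cosine-similarity form of the loss are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only: the module/function names, feature dimensions,
# and loss formulation are assumptions inferred from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureStyleMoE(nn.Module):
    """Hypothetical style extractor: softly routes concatenated
    text/gesture features through a small set of expert MLPs."""

    def __init__(self, text_dim=256, gesture_dim=128, style_dim=256, n_experts=4):
        super().__init__()
        fused_dim = text_dim + gesture_dim
        self.gate = nn.Linear(fused_dim, n_experts)  # per-frame routing weights
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(fused_dim, style_dim), nn.GELU(),
                          nn.Linear(style_dim, style_dim))
            for _ in range(n_experts)
        )

    def forward(self, text_feats, gesture_feats):
        # text_feats:    (B, T, text_dim)    frame-aligned linguistic features
        # gesture_feats: (B, T, gesture_dim) hand-motion features (same T assumed)
        fused = torch.cat([text_feats, gesture_feats], dim=-1)
        weights = F.softmax(self.gate(fused), dim=-1)               # (B, T, E)
        expert_out = torch.stack(
            [expert(fused) for expert in self.experts], dim=-2)     # (B, T, E, D)
        style = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)    # (B, T, D)
        return style  # would condition the downstream LLM-based speech decoder


def gesture_speech_alignment_loss(gesture_emb, prosody_emb):
    """One plausible alignment loss: encourage frame-wise agreement between
    gesture and prosody embeddings projected into a shared space.
    Both inputs: (B, T, D)."""
    sim = F.cosine_similarity(gesture_emb, prosody_emb, dim=-1)  # (B, T)
    return (1.0 - sim).mean()


# Shape check with random tensors:
moe = GestureStyleMoE()
style = moe(torch.randn(2, 100, 256), torch.randn(2, 100, 128))   # (2, 100, 256)
loss = gesture_speech_alignment_loss(style, torch.randn(2, 100, 256))
```

Soft (rather than top-k) expert routing is used here purely for brevity; a per-frame loss likewise stands in for whatever temporal-correspondence objective the paper actually optimizes.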