[2510.12834] Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Computer Science > Sound
arXiv:2510.12834 (cs.SD)
[Submitted on 13 Oct 2025 (v1), last revised 27 Mar 2026 (this version, v3)]

Title: Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Authors: Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin

Abstract: Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, which weakens synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech input. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
MSC classes: 68T07
Cite as: arXiv:2510.12834 [cs.SD] (or arXiv:2510.12834v3 [cs.SD] for this version), https://doi.org/10.48550/arXiv.2510.12834
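The abstract's core idea, a single discrete autoregressive sequence that carries both modalities, can be illustrated with a minimal sketch. The interleaving ratio, vocabulary offsets, and token ids below are assumptions for illustration only; the paper's actual tokenization and interleaving scheme are not specified in the abstract.

```python
# Hypothetical vocabulary layout: speech and gesture tokens share one id
# space, separated by an offset, so one autoregressive model can emit both.
SPEECH_OFFSET = 0       # assumed: speech token ids in [0, 1024)
GESTURE_OFFSET = 1024   # assumed: gesture token ids start at 1024

def interleave(speech, gesture, ratio=2):
    """Merge two token streams into one sequence, inserting one gesture
    token after every `ratio` speech tokens (ratio is an assumption)."""
    merged, gi = [], 0
    for i, s in enumerate(speech):
        merged.append(SPEECH_OFFSET + s)
        if (i + 1) % ratio == 0 and gi < len(gesture):
            merged.append(GESTURE_OFFSET + gesture[gi])
            gi += 1
    # Flush any remaining gesture tokens at the end of the sequence.
    merged.extend(GESTURE_OFFSET + g for g in gesture[gi:])
    return merged

def deinterleave(merged):
    """Split a predicted sequence back into per-modality streams, which
    would then feed the modality-specific decoders."""
    speech = [t for t in merged if t < GESTURE_OFFSET]
    gesture = [t - GESTURE_OFFSET for t in merged if t >= GESTURE_OFFSET]
    return speech, gesture

speech_tokens = [1, 2, 3, 4, 5]
gesture_tokens = [10, 20]
merged = interleave(speech_tokens, gesture_tokens)
# merged == [1, 2, 1034, 3, 4, 1044, 5]
assert deinterleave(merged) == (speech_tokens, gesture_tokens)
```

Because both streams live in one sequence, a single next-token objective models their joint timing, which is the property the abstract credits for improved synchrony over sequential pipelines.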