[2510.12834] Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Computer Science > Sound
arXiv:2510.12834 (cs.SD)
[Submitted on 13 Oct 2025 (v1), last revised 27 Mar 2026 (this version, v3)]

Title: Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Authors: Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin

Abstract: Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, which weakens synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech input. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
MSC classes: 68T07
Cite as: arXiv:2510.12834 [cs.SD] (or arXiv:2510.12834v3 [cs.SD] for this version), https://doi.org/10.48550/arXiv.2510.12834
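The abstract's core idea, a single discrete autoregressive sequence that carries both modalities, can be illustrated with a minimal sketch. The interleaving ratio, vocabulary offsets, and token ids below are assumptions for illustration only; the paper's actual tokenization and interleaving scheme are not specified in the abstract.

```python
# Hypothetical vocabulary layout: speech and gesture tokens share one id
# space, separated by an offset, so one autoregressive model can emit both.
SPEECH_OFFSET = 0       # assumed: speech token ids in [0, 1024)
GESTURE_OFFSET = 1024   # assumed: gesture token ids start at 1024

def interleave(speech, gesture, ratio=2):
    """Merge two token streams into one sequence, inserting one gesture
    token after every `ratio` speech tokens (ratio is an assumption)."""
    merged, gi = [], 0
    for i, s in enumerate(speech):
        merged.append(SPEECH_OFFSET + s)
        if (i + 1) % ratio == 0 and gi < len(gesture):
            merged.append(GESTURE_OFFSET + gesture[gi])
            gi += 1
    # Flush any remaining gesture tokens at the end of the sequence.
    merged.extend(GESTURE_OFFSET + g for g in gesture[gi:])
    return merged

def deinterleave(merged):
    """Split a predicted sequence back into per-modality streams, which
    would then feed the modality-specific decoders."""
    speech = [t for t in merged if t < GESTURE_OFFSET]
    gesture = [t - GESTURE_OFFSET for t in merged if t >= GESTURE_OFFSET]
    return speech, gesture

speech_tokens = [1, 2, 3, 4, 5]
gesture_tokens = [10, 20]
merged = interleave(speech_tokens, gesture_tokens)
# merged == [1, 2, 1034, 3, 4, 1044, 5]
assert deinterleave(merged) == (speech_tokens, gesture_tokens)
```

Because both streams live in one sequence, a single next-token objective models their joint timing, which is the property the abstract credits for improved synchrony over sequential pipelines.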