[2510.12834] Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

arXiv:2510.12834 (cs)

[Submitted on 13 Oct 2025 (v1), last revised 27 Mar 2026 (this version, v3)]

Authors: Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin

Abstract: Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

MSC classes: 68T07

Cite as: arXiv:2510.12834 [cs.SD] (or arXiv:2510.12834v3 [cs.SD] for this version)

https://doi.org/10.48550/arXiv.2510.12834
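To make the idea of "interleaved token sequences" concrete, here is a minimal sketch of how two discrete token streams could be merged into one sequence for a single autoregressive model and split apart again for modality-specific decoders. It is an illustration only, not the paper's implementation: the fixed chunk pattern, the `SPEECH_VOCAB` size, the vocabulary-offset trick, and the function names `interleave` and `deinterleave` are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical speech codebook size; the abstract does not state one.
SPEECH_VOCAB = 1024


def interleave(speech: List[int], gesture: List[int],
               speech_per_gesture: int = 2) -> List[int]:
    """Merge two token streams into one sequence using a fixed pattern:
    `speech_per_gesture` speech tokens, then one gesture token.
    Gesture ids are shifted by SPEECH_VOCAB so the vocabularies do not collide.
    (The fixed ratio is an assumption made for this sketch.)"""
    out: List[int] = []
    s = 0
    for g in gesture:
        out.extend(speech[s:s + speech_per_gesture])
        s += speech_per_gesture
        out.append(g + SPEECH_VOCAB)
    out.extend(speech[s:])  # append any speech tokens left over
    return out


def deinterleave(seq: List[int]) -> Tuple[List[int], List[int]]:
    """Split a merged sequence back into per-modality streams,
    ready to be handed to separate speech and gesture decoders."""
    speech = [t for t in seq if t < SPEECH_VOCAB]
    gesture = [t - SPEECH_VOCAB for t in seq if t >= SPEECH_VOCAB]
    return speech, gesture


if __name__ == "__main__":
    merged = interleave([1, 2, 3, 4, 5], [7, 8])
    print(merged)
    print(deinterleave(merged))
```

Because both modalities live in one sequence, each next-token prediction can condition on the recent history of both streams, which is the intuition behind the synchrony benefit the abstract claims over sequential pipelines.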

Originally published on March 30, 2026. Curated by AI News.

