[2602.17770] CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Summary
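The paper introduces CLUTCH, an LLM-based model for generating hand motions from text, together with 3D-HIW, a new dataset of 32K in-the-wild 3D hand-motion sequences with aligned text. A new motion tokenizer and a geometric refinement stage aim to improve animation fidelity and text-motion alignment beyond studio-captured settings.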
Why It Matters
This research addresses the limitations of existing hand motion modeling methods, which often rely on constrained datasets. By introducing a large-scale dataset and innovative modeling techniques, CLUTCH has the potential to enhance applications in robotics, animation, and human-computer interaction, making it significant for advancing the field of computer vision and machine learning.
Key Takeaways
- The paper introduces a new dataset, '3D Hands in the Wild' (3D-HIW), with 32K 3D hand-motion sequences and aligned text.
- The model employs a novel VQ-VAE architecture called SHIFT for improved hand motion tokenization.
- A geometric refinement stage finetunes the LLM, co-supervised with a reconstruction loss, to improve animation quality.
- CLUTCH sets a new benchmark for text-to-motion and motion-to-text tasks in real-world scenarios.
- The research aims to bridge the gap between studio-captured data and in-the-wild applications.
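
To make the tokenization idea in the takeaways concrete, the following is a minimal sketch of the generic vector-quantization step that VQ-VAE tokenizers build on: each continuous latent vector is replaced by the index of its nearest codebook entry, and those indices are the discrete tokens a language model can consume. This is not the SHIFT architecture, whose details are not given here. The function names, the 2-D latents, and the four-entry codebook are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def detokenize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token (the quantized latents)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 4-entry codebook and one latent per motion frame.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9], [0.1, 1.2]]

tokens = tokenize(latents, codebook)
reconstruction = detokenize(tokens, codebook)
print(tokens)
print(reconstruction)
```

In a real VQ-VAE the latents come from a learned encoder, the codebook is trained jointly with it, and a decoder maps the quantized latents back to motion parameters; the sketch covers only the lookup step.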
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.17770 (cs)
[Submitted on 19 Feb 2026]
Title: CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Authors: Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
Abstract: Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocent...