[2603.23481] VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Computer Science > Robotics

arXiv:2603.23481 (cs)

[Submitted on 24 Mar 2026]

Title: VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

Abstract: Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via lightweight modality-transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactil...
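The abstract only states the high-level idea of augmenting a frozen pretrained video transformer with tactile streams through lightweight finetuning. The following is a minimal conceptual sketch of that kind of fusion, not the authors' implementation: all module names, dimensions, the adapter design, and the action head are illustrative assumptions.

```python
# Conceptual sketch (not the paper's code): tactile readings are projected into
# tokens in the video-token embedding space, mixed by a frozen "pretrained"
# video transformer, and refined by a small trainable adapter before an action head.
import torch
import torch.nn as nn


class TactileAdapterFusion(nn.Module):
    def __init__(self, video_dim=768, tactile_dim=32, action_dim=7):
        super().__init__()
        # Stand-in for a pretrained video transformer; weights are frozen so
        # only the tactile pathway and adapter are finetuned.
        layer = nn.TransformerEncoderLayer(d_model=video_dim, nhead=8, batch_first=True)
        self.video_backbone = nn.TransformerEncoder(layer, num_layers=4)
        for p in self.video_backbone.parameters():
            p.requires_grad = False

        # Lightweight tactile encoder: raw tactile readings -> tokens in video space.
        self.tactile_proj = nn.Sequential(
            nn.Linear(tactile_dim, video_dim),
            nn.GELU(),
            nn.Linear(video_dim, video_dim),
        )
        # Small trainable adapter applied residually after cross-modal mixing.
        self.adapter = nn.Sequential(
            nn.Linear(video_dim, video_dim // 4),
            nn.GELU(),
            nn.Linear(video_dim // 4, video_dim),
        )
        # Action head: pooled multimodal features -> a single action vector.
        self.action_head = nn.Linear(video_dim, action_dim)

    def forward(self, video_tokens, tactile_signals):
        # video_tokens:    (B, T_v, video_dim)   tokens from the video stream
        # tactile_signals: (B, T_t, tactile_dim) raw tactile sensor readings
        tactile_tokens = self.tactile_proj(tactile_signals)       # (B, T_t, video_dim)
        fused = torch.cat([video_tokens, tactile_tokens], dim=1)  # concatenate token sequences
        mixed = self.video_backbone(fused)                        # frozen cross-modal attention
        mixed = mixed + self.adapter(mixed)                       # residual lightweight adapter
        pooled = mixed.mean(dim=1)                                # simple token pooling
        return self.action_head(pooled)                           # predicted action


if __name__ == "__main__":
    model = TactileAdapterFusion()
    video = torch.randn(2, 16, 768)     # batch of 2, 16 video tokens
    tactile = torch.randn(2, 8, 32)     # batch of 2, 8 tactile readings
    print(model(video, tactile).shape)  # torch.Size([2, 7])
```

Only the tactile projection, adapter, and action head carry gradients here, which is one plausible reading of "lightweight modality transfer finetuning"; the actual VTAM architecture is described in the paper itself.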