[2605.07474] ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Computer Science > Computer Vision and Pattern Recognition
arXiv:2605.07474 (cs) [Submitted on 8 May 2026]

Title: ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Authors: Yuhao Zhou, Yunpeng Zhu, Yang Zhou, Jindi Lyu, Jian Lan, Zhangyuan Wang, Dan Si, Thomas Seidl, Qing Ye, Jiancheng Lyu

Abstract: Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet con...
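
The abstract only sketches the per-client pipeline at a high level. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the instruction set, feature dimensions, classifier architecture, and the FedAvg-style aggregation are all assumptions introduced here to show how a client could recover the language modality from local vision-action pairs and how a server could combine client models without seeing raw data.

```python
# Illustrative sketch only. The instruction list, dimensions, classifier design,
# and FedAvg aggregation are assumptions, not details taken from the paper.
import torch
import torch.nn as nn

# Hypothetical predefined instruction set shared by all clients.
INSTRUCTIONS = ["pick up the cube", "open the drawer", "push the button"]

class InstructionClassifier(nn.Module):
    """Maps a (vision, action) pair to an index into the predefined instruction set."""
    def __init__(self, vision_dim=512, action_dim=7, num_instructions=len(INSTRUCTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_instructions),
        )

    def forward(self, vision_feat, action):
        return self.net(torch.cat([vision_feat, action], dim=-1))

def build_triplets(classifier, vision_feats, actions):
    """Recover the missing language modality on a client: label each local
    vision-action pair with its most likely instruction, yielding
    (vision, language, action) triplets for VLA training."""
    with torch.no_grad():
        logits = classifier(vision_feats, actions)
        idx = logits.argmax(dim=-1)
    return [(v, INSTRUCTIONS[i], a) for v, i, a in zip(vision_feats, idx.tolist(), actions)]

def fedavg(client_states):
    """Plain FedAvg on the server: average client model parameters so that
    raw vision-action data never leaves the clients."""
    avg = {k: torch.zeros_like(v) for k, v in client_states[0].items()}
    for state in client_states:
        for k, v in state.items():
            avg[k] += v / len(client_states)
    return avg
```

As a usage sketch, each client would run `build_triplets` over its local buffer of vision-action pairs, train its local VLA model on the resulting triplets, and send only the model's `state_dict()` to the server, which calls `fedavg` over the collected states before broadcasting the averaged weights back.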