[2511.05275] TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
Summary
The paper presents TwinVLA, a modular framework for bimanual manipulation that composes two copies of a pretrained single-arm Vision-Language-Action (VLA) model into a coordinated bimanual policy, improving data efficiency and performance without requiring extensive bimanual data.
Why It Matters
TwinVLA addresses the challenge of adapting existing single-arm models to bimanual tasks, a key step toward more capable robotic manipulation. By improving data efficiency, it enables broader applications in robotics without costly proprietary bimanual datasets.
Key Takeaways
- TwinVLA composes two copies of a pretrained single-arm VLA into a coordinated bimanual policy.
- It is more data-efficient than monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data.
- It outperforms the comparably sized monolithic RDT-1B model without requiring any bimanual pretraining.
- It narrows the gap to the state-of-the-art $\pi_0$ model while training only on public data.
- This approach offers a scalable solution for high-performance robotic manipulation.
arXiv:2511.05275 (cs.RO) [Submitted on 7 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Authors: Hokyun Im, Euijin Jeong, Andrey Kolobov, Jianlong Fu, Youngwoon Lee
Abstract
Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data an...
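To make the composition idea concrete, below is a minimal PyTorch sketch of the twin-policy pattern the abstract describes: two copies of one pretrained single-arm policy whose features are fused so each arm acts in coordination with the other. Everything here is an illustrative assumption, not the authors' implementation: `SingleArmVLA`, `TwinBimanualPolicy`, the `fuse` layer, and all dimensions are hypothetical stand-ins for the paper's actual backbone and coordination mechanism.

```python
import copy
import torch
import torch.nn as nn

class SingleArmVLA(nn.Module):
    """Stand-in for a pretrained single-arm vision-language-action policy.
    The real backbone (vision encoder, language conditioning, action head)
    is reduced to a toy encoder/decoder pair for illustration."""
    def __init__(self, obs_dim: int = 512, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)      # toy observation encoder
        self.action_head = nn.Linear(hidden, act_dim)  # toy per-arm action decoder

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.action_head(torch.relu(self.encoder(obs)))

class TwinBimanualPolicy(nn.Module):
    """Composes two copies of one pretrained single-arm policy into a
    bimanual policy. Coordination is modeled as a simple cross-arm feature
    fusion layer; the paper's actual coordination mechanism may differ."""
    def __init__(self, pretrained: SingleArmVLA, hidden: int = 256):
        super().__init__()
        # Both twins start from the same pretrained single-arm weights.
        self.left = copy.deepcopy(pretrained)
        self.right = copy.deepcopy(pretrained)
        # Assumed fusion layer letting each arm condition on the other.
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, obs_left: torch.Tensor, obs_right: torch.Tensor):
        h_l = torch.relu(self.left.encoder(obs_left))
        h_r = torch.relu(self.right.encoder(obs_right))
        shared = torch.relu(self.fuse(torch.cat([h_l, h_r], dim=-1)))
        # Each twin decodes its own arm's action from the coordinated feature.
        return self.left.action_head(shared), self.right.action_head(shared)

if __name__ == "__main__":
    base = SingleArmVLA()                    # pretend this is pretrained
    policy = TwinBimanualPolicy(base)
    a_l, a_r = policy(torch.randn(1, 512), torch.randn(1, 512))
    print(a_l.shape, a_r.shape)              # torch.Size([1, 7]) for each arm
```

The design point the sketch captures is that both twins inherit single-arm pretraining rather than being trained from scratch on bimanual mixtures, which is what underlies the abstract's data-efficiency claim.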