[2511.05275] TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
Summary
The paper presents TwinVLA, a modular framework for bimanual manipulation that composes two copies of a pretrained single-arm Vision-Language-Action (VLA) model into a coordinated bimanual policy, improving data efficiency and performance without requiring extensive bimanual data.
Why It Matters
TwinVLA addresses the challenge of adapting existing single-arm models to bimanual tasks, a key step toward more capable robotic manipulation. By improving data efficiency, it enables broader applications in robotics without costly proprietary bimanual datasets.
Key Takeaways
- TwinVLA composes two copies of a pretrained single-arm VLA into a coordinated bimanual policy.
- It is more data-efficient than monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data.
- It outperforms the comparably sized monolithic RDT-1B model without requiring any bimanual pretraining.
- It narrows the gap to the state-of-the-art $\pi_0$ model while training only on public data.
- This approach offers a scalable solution for high-performance robotic manipulation.
arXiv:2511.05275 (cs.RO) [Submitted on 7 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Authors: Hokyun Im, Euijin Jeong, Andrey Kolobov, Jianlong Fu, Youngwoon Lee
Abstract
Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data an...
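To make the composition idea concrete, below is a minimal PyTorch sketch of the twin-policy pattern the abstract describes: two copies of one pretrained single-arm policy whose features are fused so each arm acts in coordination with the other. Everything here is an illustrative assumption, not the authors' implementation: `SingleArmVLA`, `TwinBimanualPolicy`, the `fuse` layer, and all dimensions are hypothetical stand-ins for the paper's actual backbone and coordination mechanism.

```python
import copy
import torch
import torch.nn as nn

class SingleArmVLA(nn.Module):
    """Stand-in for a pretrained single-arm vision-language-action policy.
    The real backbone (vision encoder, language conditioning, action head)
    is reduced to a toy encoder/decoder pair for illustration."""
    def __init__(self, obs_dim: int = 512, act_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)      # toy observation encoder
        self.action_head = nn.Linear(hidden, act_dim)  # toy per-arm action decoder

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.action_head(torch.relu(self.encoder(obs)))

class TwinBimanualPolicy(nn.Module):
    """Composes two copies of one pretrained single-arm policy into a
    bimanual policy. Coordination is modeled as a simple cross-arm feature
    fusion layer; the paper's actual coordination mechanism may differ."""
    def __init__(self, pretrained: SingleArmVLA, hidden: int = 256):
        super().__init__()
        # Both twins start from the same pretrained single-arm weights.
        self.left = copy.deepcopy(pretrained)
        self.right = copy.deepcopy(pretrained)
        # Assumed fusion layer letting each arm condition on the other.
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, obs_left: torch.Tensor, obs_right: torch.Tensor):
        h_l = torch.relu(self.left.encoder(obs_left))
        h_r = torch.relu(self.right.encoder(obs_right))
        shared = torch.relu(self.fuse(torch.cat([h_l, h_r], dim=-1)))
        # Each twin decodes its own arm's action from the coordinated feature.
        return self.left.action_head(shared), self.right.action_head(shared)

if __name__ == "__main__":
    base = SingleArmVLA()                    # pretend this is pretrained
    policy = TwinBimanualPolicy(base)
    a_l, a_r = policy(torch.randn(1, 512), torch.randn(1, 512))
    print(a_l.shape, a_r.shape)              # torch.Size([1, 7]) for each arm
```

The design point the sketch captures is that both twins inherit single-arm pretraining rather than being trained from scratch on bimanual mixtures, which is what underlies the abstract's data-efficiency claim.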