[2511.05275] TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

arXiv - Machine Learning · 3 min · Article

Summary

The paper presents TwinVLA, a modular framework that composes two pretrained single-arm Vision-Language-Action models into a coordinated bimanual policy, improving data efficiency and performance without requiring extensive bimanual data.

Why It Matters

TwinVLA addresses the challenge of adapting existing single-arm models for bimanual tasks, which is crucial for advancing robotic manipulation capabilities. By improving data efficiency, it enables broader applications in robotics without the need for costly proprietary datasets.

Key Takeaways

  • TwinVLA combines two pretrained single-arm models for bimanual tasks.
  • It improves data efficiency compared to traditional monolithic models.
  • The framework outperforms existing models without requiring bimanual pretraining.
  • TwinVLA narrows the performance gap with state-of-the-art models using public data.
  • This approach offers a scalable solution for high-performance robotic manipulation.

Computer Science > Robotics
arXiv:2511.05275 (cs)
[Submitted on 7 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v2)]

Title: TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
Authors: Hokyun Im, Euijin Jeong, Andrey Kolobov, Jianlong Fu, Youngwoon Lee

Abstract: Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data an...
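The abstract describes the composition only at a high level. As a rough illustration, here is a minimal, hypothetical Python sketch of the "twin" idea: two copies of a pretrained single-arm policy share the same observation and instruction, each predicts the action for its own arm, and the per-arm outputs are concatenated into one bimanual command. All names (SingleArmVLA, TwinVLA, predict) and the 7-DoF per-arm action layout are illustrative assumptions, not the paper's actual interface; the real model presumably couples the twins internally for coordination.

```python
# Hypothetical sketch of composing two single-arm policies into a
# bimanual policy. Class and method names are illustrative only.
import numpy as np

class SingleArmVLA:
    """Stand-in for a pretrained single-arm vision-language-action policy."""
    def __init__(self, action_dim: int = 7):
        self.action_dim = action_dim  # e.g., 6-DoF pose delta + gripper

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA would run a vision-language backbone here;
        # this placeholder just returns a zero action of the right shape.
        return np.zeros(self.action_dim)

class TwinVLA:
    """Composes two single-arm policies into one bimanual policy."""
    def __init__(self, left: SingleArmVLA, right: SingleArmVLA):
        self.left, self.right = left, right

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Each twin sees the shared observation and instruction and
        # outputs the action for its own arm; the coordination mechanism
        # inside the actual model is not modeled in this sketch.
        a_left = self.left.predict(image, instruction)
        a_right = self.right.predict(image, instruction)
        return np.concatenate([a_left, a_right])  # 14-D bimanual action

if __name__ == "__main__":
    policy = TwinVLA(SingleArmVLA(), SingleArmVLA())
    obs = np.zeros((224, 224, 3))  # dummy camera image
    action = policy.predict(obs, "fold the towel with both arms")
    print(action.shape)  # (14,)
```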

Related Articles

LLMs

World models will be the next big thing, bye-bye LLMs

I was at Nvidia's GTC conference recently and honestly, it was one of the most eye-opening events I've attended in a while. There was a lot...

Reddit - Artificial Intelligence · 1 min ·
Machine Learning

[D] Got my first offer after months of searching — below posted range, contract-to-hire, and worried it may pause my search. Do I take it?

I could really use some outside perspective. I’m a senior ML/CV engineer in Canada with about 5–6 years across research and industry. Mas...

Reddit - Machine Learning · 1 min ·
Machine Learning

[Research] AI training is bad, so I started a research project

Hello, I started researching AI training. Q: Why? R: Because AI training is bad right now. Q: What do you mean it's bad? R: Like when ...

Reddit - Machine Learning · 1 min ·
Machine Learning

[P] Unix philosophy for ML pipelines: modular, swappable stages with typed contracts

We built an open-source prototype that applies Unix philosophy to retrieval pipelines. Each stage (PII redaction, chunking, dedup, embedd...

Reddit - Machine Learning · 1 min ·