[2602.08392] ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs


arXiv - AI 4 min read


Computer Science > Robotics
arXiv:2602.08392 (cs)
[Submitted on 9 Feb 2026 (v1), last revised 4 Apr 2026 (this version, v2)]

Title: ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
Authors: Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li

Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced the landscape of embodied AI, yet transitioning to synchronized bimanual coordination introduces formidable challenges in multi-stream multimodal integration. We introduce ST-BiBench, a comprehensive multi-tier framework for evaluating spatio-temporal multimodal coordination. Our approach centers on Strategic Coordination Planning, assessing high-level cross-modal reasoning over multiple action and perception streams. To investigate the "proximity paradox", where semantically coherent plans fail to align with spatially grounded visual inputs, we incorporate Foundational Spatial Grounding to verify workspace awareness and arm-selection logic. Furthermore, we probe model frontiers through Fine-Grained Action Control, investigating whether MLLMs can directly synthesize high-dimensional continuous action modalities (16-Dim) from complex multimodal metadata. Evaluating 30+ state-of-the-art MLLMs, we uncover a persistent ...

Originally published on April 07, 2026. Curated by AI News.

Related Articles

When Robots Have Their ChatGPT Moment, Remember These Pincers | WIRED

From sorting chicken nuggets to screwing in light bulbs, Eka’s robots are eerily lifelike. But do they have real physical smarts?

Wired - AI · 13 min

87% Cost Savings & Sub-3s Latency: I built a "Warm-Cache" harness for persistent Claude agents

**The "Goldfish Problem" is expensive. I decided to fix the plumbing.** Most Claude implementations leave 90% of their money on the table...

Reddit - Artificial Intelligence · 1 min ·
Llms

What are people using for low-latency autocomplete in production? [P]

I’ve been looking into autocomplete/typeahead systems recently, especially in contexts where latency really matters (e.g. search-as-you-t...

Reddit - Machine Learning · 1 min
General Motors is adding Gemini to four million cars | The Verge

General Motors is planning to bring Google’s Gemini AI assistant to around four million vehicles across the US.

The Verge - AI · 4 min

