[2508.21112] EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Summary
EO-1 is introduced as a unified embodied foundation model for general robot control, combining multimodal reasoning with action generation through interleaved vision-text-action pre-training on a large, high-quality dataset.
Why It Matters
This research addresses a key limitation of current vision-language-action models in robotics: their inability to interleave reasoning and physical interaction with human-level flexibility. The EO-1 model and its dataset, EO-Data1.5M, could significantly advance the field of embodied intelligence, impacting a range of applications in robotics and AI.
Key Takeaways
- EO-1 integrates multimodal inputs for enhanced robot control.
- The EO-Data1.5M dataset supports interleaved vision-text-action learning.
- Innovative training methods improve generalization in robotic tasks.
- The model aims for human-like flexibility in multimodal reasoning.
- Research findings could influence future developments in embodied AI.
Computer Science > Robotics · arXiv:2508.21112 (cs)
[Submitted on 28 Aug 2025 (v1), last revised 25 Feb 2026 (this version, v5)]
Authors: Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Xuelong Li
Abstract: The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 rests on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multim...
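The abstract's first pillar, a single architecture that consumes image, text, video, and action inputs as one interleaved token stream, can be sketched in miniature. Everything below is a hypothetical illustration of interleaved sequence construction: the `Segment` type, the sentinel ids, and the token values are assumptions for exposition, not the authors' actual interface or tokenizer.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    modality: str       # "image", "text", "video", or "action" (assumed labels)
    tokens: List[int]   # modality-specific token ids after encoding (illustrative)

def build_interleaved_sequence(segments: List[Segment]) -> List[int]:
    """Flatten heterogeneous segments into one token stream, marking each
    modality boundary with a sentinel id (values are illustrative only)."""
    SEP = {"image": -1, "text": -2, "video": -3, "action": -4}
    stream: List[int] = []
    for seg in segments:
        stream.append(SEP[seg.modality])  # modality-boundary sentinel
        stream.extend(seg.tokens)         # then the segment's own tokens
    return stream

# Example episode in the interleaved style the paper describes:
# observe (image), reason (text), then act (action).
episode = [
    Segment("image", [101, 102, 103]),
    Segment("text", [7, 8]),
    Segment("action", [42, 43]),
]
seq = build_interleaved_sequence(episode)
print(len(seq))  # 3 sentinels + 7 tokens = 10 ids
```

The point of the sketch is only that, once every modality maps into a shared id space, a single sequence model can attend across observation, reasoning, and action steps in whatever order the data interleaves them.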