[2602.12684] Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
Summary
Xiaomi-Robotics-0 is an open-sourced vision-language-action (VLA) model optimized for fast, smooth real-time execution, reporting high success rates on robotic tasks in both simulation and the real world.
Why It Matters
This work advances robotics by optimizing a vision-language-action model for real-time control: by masking inference latency and stitching consecutive action predictions together smoothly, robots can act continuously instead of pausing between model calls. Open-sourcing the code and model checkpoints lowers the barrier to follow-up research and development, potentially enabling more innovative applications in robotics and AI.
Key Takeaways
- Xiaomi-Robotics-0 achieves high performance in real-time robotic tasks.
- The model is pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, broadening its action-generation capabilities.
- Post-training techniques enable asynchronous execution, hiding inference latency during real-robot rollouts.
- The model has been validated in both simulation and real-world scenarios, showing superior success rates.
- Code and model checkpoints are available for public use, fostering community research.
Computer Science > Robotics
arXiv:2602.12684 (cs) [Submitted on 13 Feb 2026]

Title: Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Authors: Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou

Abstract: In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate ...
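The deployment idea described above, running inference for the next action chunk asynchronously while the current chunk executes, then aligning the timesteps of the two chunks so the rollout has no repeats or gaps, can be illustrated with a minimal sketch. All names here (`PolicyStub`, `CHUNK_LEN`, the trigger point at half a chunk) are illustrative assumptions, not the paper's actual implementation, which this abstract does not detail.

```python
# Hypothetical sketch of asynchronous action-chunk execution with
# timestep alignment. PolicyStub stands in for the VLA model.
import threading
import queue

CHUNK_LEN = 16  # actions per predicted chunk (assumed)


class PolicyStub:
    """Stand-in policy: maps an observation to a chunk of timestamped actions."""

    def predict_chunk(self, obs, start_t):
        # Each action is labeled with the control timestep it applies to.
        return [(start_t + i, f"action@{start_t + i}") for i in range(CHUNK_LEN)]


def async_rollout(policy, horizon=48):
    executed = []
    t = 0  # current control timestep
    chunk = policy.predict_chunk(obs=None, start_t=0)
    pending = queue.Queue(maxsize=1)

    def infer(start_t):
        # Inference runs concurrently with execution of the current chunk.
        pending.put(policy.predict_chunk(obs=None, start_t=start_t))

    while t < horizon:
        # Kick off inference for the next chunk partway through this one,
        # so a fresh chunk is ready before the current one runs out.
        trigger = chunk[0][0] + CHUNK_LEN // 2
        worker = threading.Thread(target=infer, args=(trigger,))
        worker.start()

        # Execute the current chunk up to the point where the new chunk
        # is scheduled to take over.
        for step_t, action in chunk:
            if step_t >= trigger:
                break
            executed.append((step_t, action))
            t = step_t + 1

        worker.join()
        new_chunk = pending.get()
        # Timestep alignment: drop actions in the new chunk whose timesteps
        # were already executed, so consecutive chunks splice together
        # without repeated or skipped steps.
        chunk = [(s, a) for (s, a) in new_chunk if s >= t]

    return executed
```

Running `async_rollout(PolicyStub())` yields a trace whose timesteps are strictly consecutive, which is the "continuous and seamless" property the abstract attributes to aligning consecutive chunks; in a real deployment the trigger point would be chosen from the measured inference latency rather than a fixed half-chunk.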