[2511.16175] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Summary
The paper introduces Mantis, a Vision-Language-Action model that disentangles visual foresight prediction from the action backbone, achieving strong performance in action prediction, comprehension, and reasoning.
Why It Matters
Mantis addresses significant challenges in Vision-Language-Action models, such as high-dimensional visual state predictions and poor reasoning capabilities. By disentangling visual foresight, it improves model efficiency and effectiveness, which is crucial for advancements in AI applications like robotics and automation.
Key Takeaways
- Mantis employs Disentangled Visual Foresight to improve action prediction.
- The model achieves a 96.7% success rate on the LIBERO benchmark.
- It outperforms existing models in instruction-following and reasoning.
- Mantis is pretrained on diverse datasets, enhancing its generalization capabilities.
- The code and model weights are available for the open-source community.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2511.16175 (cs)
[Submitted on 20 Nov 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Authors: Yi Yang, Xueqi Li, Yiyang Chen, Jin Song, Yihan Wang, Zipeng Xiao, Jiadi Su, You Qiaoben, Pengfei Liu, Zhijie Deng
Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting a VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably introduces information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities because they neglect language supervision. This paper introduces Mantis, a novel framework featuring Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta querie...
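To make the abstract's architecture concrete, here is a minimal sketch of the Disentangled Visual Foresight idea: learnable meta queries pool foresight information out of the backbone's token sequence, and a stand-in for the DiT head predicts the next visual state as a residual update to the current one. All names, dimensions, and the simplified attention/MLP internals are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class DisentangledForesightHead:
    """Hypothetical DVF-style head: meta queries cross-attend over backbone
    tokens; a small MLP (standing in for the DiT) predicts the next compact
    visual state, with the current state added back via a residual connection."""

    def __init__(self, n_queries=4, d_model=32, d_state=16):
        # Learnable meta queries that extract foresight-relevant features.
        self.queries = rng.normal(size=(n_queries, d_model))
        # Stand-in for the DiT head: a single linear projection to the state delta.
        self.w_out = rng.normal(size=(n_queries * d_model, d_state)) * 0.01

    def __call__(self, backbone_tokens, current_state):
        # Cross-attention: each meta query attends over the backbone's tokens.
        scores = self.queries @ backbone_tokens.T / np.sqrt(backbone_tokens.shape[1])
        pooled = softmax(scores) @ backbone_tokens          # (n_queries, d_model)
        # Predict only the change in state ...
        delta = pooled.reshape(-1) @ self.w_out
        # ... and add the current state back (the residual connection), so the
        # head models the state transition rather than the full visual state.
        return current_state + delta

tokens = rng.normal(size=(10, 32))   # backbone token features (10 tokens)
state = rng.normal(size=16)          # compact current visual state
head = DisentangledForesightHead()
next_state = head(tokens, state)
```

The residual connection is the key design point: because the current state is injected directly, a plain next-state prediction loss only has to supervise the (much simpler) transition, which is what lets the backbone's capacity stay focused on action and language rather than on reconstructing full visual states.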