[2602.15872] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Summary
The paper presents MARVL, a method that uses a fine-tuned Vision-Language Model (VLM) to produce dense rewards for robotic manipulation. It decomposes each task into multi-stage subtasks and applies task direction projection so that the reward is sensitive to the trajectory.
Why It Matters
Dense reward functions are pivotal for efficient robotic reinforcement learning, but most are engineered by hand, which limits scalability and automation. VLM-based rewards could remove that bottleneck, yet naive versions often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. MARVL targets these three weaknesses.
Key Takeaways
- MARVL enhances reward design for robotic manipulation using VLMs.
- It decomposes tasks into multi-stage subtasks and uses task direction projection for trajectory sensitivity.
- Empirical results show MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark.
- The approach improves sample efficiency and robustness in sparse-reward tasks.
- MARVL fine-tunes a VLM for spatial and semantic consistency, addressing the weak spatial grounding and limited task-semantic understanding of naive VLM rewards.
arXiv:2602.15872 (cs)
[Submitted on 28 Jan 2026]
Title: MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, Xiangkun Li, ShengHua Wan, Xiaohai Hu, Yuan Lei, Le Gan, De-chuan Zhan
Abstract: Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
...
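The abstract names two ingredients, multi-stage subtasks and task direction projection, without giving formulas. The sketch below is one possible reading, not the authors' implementation. It assumes each subtask has a start and a goal embedding, measures progress as the observation's projection onto the start-to-goal direction, and adds that to the count of completed stages. The names `direction_progress` and `multi_stage_reward`, the 2-D toy vectors, and the reward normalization are all hypothetical; a real system would obtain the embeddings from a VLM.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def direction_progress(obs: Vector, start: Vector, goal: Vector) -> float:
    """Fraction of the way from start to goal, measured along the
    start-to-goal direction and clipped to [0, 1] (assumed formulation)."""
    direction = [g - s for g, s in zip(goal, start)]
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return 0.0  # degenerate stage: start and goal coincide
    offset = [o - s for o, s in zip(obs, start)]
    return min(1.0, max(0.0, dot(offset, direction) / norm_sq))


def multi_stage_reward(
    obs: Vector, stages: List[Tuple[Vector, Vector]], stage_idx: int
) -> float:
    """Dense reward in [0, 1]: completed stages plus progress in the
    current stage, divided by the number of stages (hypothetical scaling)."""
    start, goal = stages[stage_idx]
    return (stage_idx + direction_progress(obs, start, goal)) / len(stages)


# Toy 2-D vectors standing in for VLM embeddings (made-up values).
stages = [
    ([0.0, 0.0], [1.0, 0.0]),  # stage 0, e.g. "reach the object"
    ([1.0, 0.0], [1.0, 1.0]),  # stage 1, e.g. "push it to the target"
]

# Halfway through stage 0; the off-direction component (0.3) is ignored.
print(multi_stage_reward([0.5, 0.3], stages, stage_idx=0))  # 0.25
```

Because only the component along the stage direction counts, motion that does not advance the current subtask earns no extra reward, which is one way a reward could be made sensitive to the trajectory rather than only to the final state.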