
[2602.15872] MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

arXiv - Machine Learning February 19, 2026 3 min read Article

Summary

The paper presents MARVL, a method that uses a fine-tuned Vision-Language Model (VLM) to produce dense rewards for robotic manipulation. It splits each task into stages so that the reward follows actual task progress.

Why It Matters

Dense rewards make robotic reinforcement learning efficient, but most are engineered by hand, which limits scalability and automation. MARVL targets the weaknesses of naive VLM-derived rewards (misalignment with task progress, weak spatial grounding, and limited understanding of task semantics), pointing toward automated reward design for manipulation tasks.

Key Takeaways

MARVL enhances reward design for robotic manipulation using VLMs.
It decomposes tasks into multi-stage subtasks and uses task direction projection to make the reward sensitive to the trajectory.
Empirical results show MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark.
The approach improves sample efficiency and robustness on sparse-reward manipulation tasks.
MARVL fine-tunes the VLM for spatial and semantic consistency to address weak spatial grounding and limited task understanding.
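
The abstract mentions multi-stage decomposition and "task direction projection" but does not say how either is computed. The sketch below is an illustration only, not the authors' implementation. It assumes each stage boundary is represented by an anchor vector (for example, an embedding) and scores an observation by how far it has moved along the direction from the current stage's start anchor to its goal anchor. The names `direction_projection`, `multi_stage_reward`, and `stage_anchors` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def direction_projection(obs: Vector, start: Vector, goal: Vector) -> float:
    """Fraction of the start->goal displacement covered by obs, clipped to [0, 1]."""
    direction = [g - s for g, s in zip(goal, start)]
    offset = [o - s for o, s in zip(obs, start)]
    norm_sq = _dot(direction, direction)
    if norm_sq == 0.0:
        return 0.0  # degenerate stage: start and goal anchors coincide
    return min(1.0, max(0.0, _dot(offset, direction) / norm_sq))


def multi_stage_reward(
    obs: Vector, stage_anchors: Sequence[Vector], stage_index: int
) -> float:
    """Dense reward in [0, 1]: completed stages plus progress in the current one.

    stage_anchors holds N + 1 vectors for N stages (assumed representation).
    """
    num_stages = len(stage_anchors) - 1
    if not 0 <= stage_index < num_stages:
        raise ValueError("stage_index out of range")
    progress = direction_projection(
        obs, stage_anchors[stage_index], stage_anchors[stage_index + 1]
    )
    return (stage_index + progress) / num_stages


if __name__ == "__main__":
    anchors = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]  # toy two-stage task
    print(multi_stage_reward([0.5, 0.3], anchors, stage_index=0))
    print(multi_stage_reward([1.0, 0.5], anchors, stage_index=1))
```

In this toy version, motion perpendicular to the stage direction earns nothing and motion away from the goal is clipped to zero, which is one simple way a reward can be made sensitive to the trajectory rather than only to the final state.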

Computer Science > Robotics

arXiv:2602.15872 (cs) [Submitted on 28 Jan 2026]

Title: MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, Xiangkun Li, ShengHua Wan, Xiaohai Hu, Yuan Lei, Le Gan, De-chuan Zhan

Abstract: Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL: Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) ...

