[2505.13180] ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Computer Science > Artificial Intelligence

arXiv:2505.13180 (cs)

[Submitted on 19 May 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Authors: Matteo Merler, Nicola Dainese, Minttu Alakuijala, Giovanni Bonetta, Pietro Ferrazzi, Yu Tian, Bernardo Magnini, Pekka Marttinen

Abstract: Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans, and recent work has extended this idea to visual domains using Vision-Language Models (VLMs). However, a rigorous comparison with methods that plan directly with VLMs is missing, due to a lack of visual benchmarks that support symbolic planning. We present ViPlan, the first open-source benchmark for comparing VLM-grounded symbolic approaches (VLM-as-grounder) with direct VLM planning methods (VLM-as-planner). ViPlan introduces a series of increasingly challenging tasks in two visual domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We find that VLM-as-grounder methods outperform direct VLM planning in Blocksworld (solving 46% of the tasks against 9%), where image grounding is both crucial and accurate. However, in the hous...
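The VLM-as-grounder setup described in the abstract can be illustrated with a toy Blocksworld sketch: a grounder maps an observation to symbolic predicates, and a classical planner then searches over those predicates (in contrast to a VLM-as-planner, which would choose actions from the image directly). This is a minimal illustration, not ViPlan's actual API; `ground_image`, `plan_bfs`, and the predicate vocabulary are all assumptions made for the example, and the VLM is stood in for by a hand-written stub.

```python
from collections import deque

# Predicates are tuples: ("on", A, B) means block A sits on block B;
# ("on-table", A) means A sits on the table. A block is "clear" if
# nothing is on top of it. States are frozensets of predicates.

def ground_image(image):
    """Stand-in for the VLM grounder: maps an observation to predicates.
    Here the 'image' is just a dict {block: support-or-None} (assumption)."""
    facts = set()
    for block, support in image.items():
        facts.add(("on-table", block) if support is None else ("on", block, support))
    return frozenset(facts)

def clear_blocks(state, blocks):
    covered = {f[2] for f in state if f[0] == "on"}
    return [b for b in blocks if b not in covered]

def successors(state, blocks):
    """Legal moves: move a clear block onto the table or onto another clear block."""
    for b in clear_blocks(state, blocks):
        src = next((f for f in state if f[0] == "on" and f[1] == b), None)
        base = state - ({src} if src else {("on-table", b)})
        if src:  # b is on another block: it can be moved to the table
            yield base | {("on-table", b)}
        for t in clear_blocks(state, blocks):
            if t != b and not (src and src[2] == t):
                yield base | {("on", b, t)}

def plan_bfs(start, goal, blocks):
    """Symbolic planner: breadth-first search over predicate states.
    Returns the list of states after each move, or None if unreachable."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:
            return path
        for nxt in successors(state, blocks):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

# A on B; B and C on the table. Goal: C on A.
start = ground_image({"A": "B", "B": None, "C": None})
plan = plan_bfs(start, frozenset({("on", "C", "A")}), {"A", "B", "C"})
```

The benchmark's point is that this pipeline is only as good as the grounding step: if the (real) VLM misreads the scene, the planner searches from a wrong start state, which is why grounding accuracy drives the Blocksworld results reported above.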