[2602.20119] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Summary
NovaPlan is a framework for zero-shot long-horizon robotic manipulation that integrates closed-loop video language planning with geometrically grounded execution, enabling robots to complete complex multi-step tasks without task-specific training or demonstrations.
Why It Matters
This research addresses a significant challenge in robotics: performing complex, multi-step tasks without prior demonstrations. By combining high-level semantic reasoning with low-level physical execution, NovaPlan extends what robots can accomplish in real-world scenarios, with potential to advance automation across industries.
Key Takeaways
- NovaPlan enables robots to perform long-horizon tasks with zero-shot learning.
- The framework integrates video language models with closed-loop execution monitoring, so failed steps trigger autonomous re-planning.
- Robots can autonomously recover from errors during task execution, enhancing reliability.
- Utilizes both object keypoints and human hand poses to inform robot actions.
- Demonstrated effectiveness on complex assembly tasks and the Functional Manipulation Benchmark.
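The closed-loop recovery behavior in the takeaways above can be illustrated with a minimal sketch. The planner, executor, and failure injection here are invented stand-ins, not NovaPlan's actual components; the point is the pattern of executing sub-goals, detecting a single-step failure, and re-planning from the failed step.

```python
# Illustrative sketch of a closed-loop plan/execute/monitor/replan cycle.
# All functions are hypothetical stand-ins for the VLM planner and robot
# executor described in the paper summary.

def plan(task, failed_subgoal=None):
    """Stand-in VLM planner: decompose a task into sub-goals. A real
    planner would condition re-planning on the observed failure; here
    we simply return the same decomposition."""
    return ["grasp_part", "align_part", "insert_part"]

def execute(subgoal, attempt):
    """Stand-in executor: fail the first insertion attempt so the
    recovery path is exercised."""
    return not (subgoal == "insert_part" and attempt == 0)

def run(task, max_retries=2):
    """Execute sub-goals in order; on failure, re-plan and resume
    from the failed step. Returns (execution log, overall success)."""
    log = []
    subgoals = plan(task)
    attempts = {}
    i = 0
    while i < len(subgoals):
        sg = subgoals[i]
        ok = execute(sg, attempts.get(sg, 0))
        log.append((sg, ok))
        if ok:
            i += 1
            continue
        attempts[sg] = attempts.get(sg, 0) + 1
        if attempts[sg] > max_retries:
            return log, False
        # Closed loop: ask the planner again, then retry the failed step.
        subgoals = plan(task, failed_subgoal=sg)
        i = subgoals.index(sg)
    return log, True
```

Running `run("assemble")` produces a log in which `insert_part` first fails and then succeeds after re-planning, mirroring the single-step failure recovery described above.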
Computer Science > Robotics
arXiv:2602.20119 (cs)
[Submitted on 23 Feb 2026]
Title: NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris
Abstract: Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We ...
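The switching mechanism mentioned in the abstract, choosing between object keypoints and human hand poses as the action reference, can be sketched as a confidence-based selection rule. This is a hypothetical illustration under assumed interfaces: the `Prior` type, the confidence scores, and the threshold are invented here, not taken from the paper.

```python
# Hedged sketch: pick the more reliable of two kinematic priors
# (object keypoints vs. human hand pose), falling back when one is
# degraded by occlusion or depth noise. Names and scores are illustrative.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Prior:
    name: str                      # "keypoints" or "hand_pose"
    reference_pose: Tuple[float, float, float]  # e.g. end-effector target
    confidence: float              # 0..1, e.g. from visibility/depth checks

def select_reference(keypoint_prior: Prior,
                     hand_prior: Prior,
                     min_conf: float = 0.3) -> Optional[Prior]:
    """Return the higher-confidence prior above the threshold, or None
    if both are unreliable (caller would hold position or re-plan)."""
    candidates = [p for p in (keypoint_prior, hand_prior)
                  if p.confidence >= min_conf]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.confidence)
```

For example, if the object keypoints are heavily occluded (confidence 0.2) while the hand pose remains well tracked (confidence 0.8), the rule selects the hand-pose reference, matching the behavior the abstract attributes to the switching mechanism.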