[2602.18374] Zero-shot Interactive Perception
Summary
The paper presents Zero-Shot Interactive Perception (ZS-IP), a framework that enhances robotic manipulation through a memory-driven Vision Language Model, improving performance in complex environments.
Why It Matters
As robotics increasingly integrates AI for complex tasks, ZS-IP offers a novel approach to enhance robots' interaction capabilities in partially observable scenarios. This advancement could significantly impact fields such as automation, manufacturing, and service robotics, where effective manipulation of objects is crucial.
Key Takeaways
- ZS-IP combines multi-strategy manipulation with a memory-driven Vision Language Model.
- The Enhanced Observation module introduces pushlines for improved visual perception.
- ZS-IP outperforms traditional methods on pushing tasks while leaving non-target objects undisturbed.
- The framework is tested on a 7-DOF Franka Panda arm in diverse scenarios.
- This research addresses challenges in occlusion and ambiguity in robotic tasks.
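The takeaways above mention pushlines, a 2D visual augmentation tailored to pushing actions. The paper's exact construction is not given here, so the sketch below is a minimal, hypothetical version: it generates labeled push-direction candidates as short segments that approach an object from several angles and end at its boundary, the kind of overlay a VLM could then select from by label. The function name, parameters, and geometry are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pushline_candidates(centroid, radius, n_lines=8, length=60.0):
    """Hypothetical pushline generator (illustrative, not the paper's method).

    Each candidate is a labeled 2D segment of the given push `length` that
    starts outside the object and ends at its boundary (`radius` from the
    centroid), pointing through the centroid -- i.e. a straight push direction.
    Returns a list of (label, start_xy, end_xy) tuples.
    """
    cx, cy = centroid
    lines = []
    for i in range(n_lines):
        theta = 2 * np.pi * i / n_lines          # evenly spaced approach angles
        d = np.array([np.cos(theta), np.sin(theta)])
        start = np.array([cx, cy]) + (radius + length) * d  # approach point
        end = np.array([cx, cy]) + radius * d               # contact point
        lines.append((str(i), start, end))
    return lines
```

In use, each labeled segment would be drawn on the observation image so the VLM can answer a query like "which push reveals the hidden object?" with a label rather than raw coordinates.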
Abstract
Computer Science > Robotics | arXiv:2602.18374 (cs) | Submitted on 20 Feb 2026
Authors: Venkatesh Sripada, Frank Guerin, Amir Ghalamzan
Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenarios...
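The abstract outlines a three-component loop: an Enhanced Observation module, a memory-guided action module, and a controller that executes the VLM's chosen action. The sketch below wires these together in one perception-action cycle. All names (`Memory`, `interactive_perception_step`, the callables passed in) are assumptions made for illustration; the real system's interfaces are not specified in this summary.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Minimal context memory (illustrative): stores past
    (query, action, outcome) triples and surfaces recent entries
    as extra context for the VLM prompt."""
    entries: list = field(default_factory=list)

    def lookup(self, query, k=3):
        # Return up to the k most recent entries matching this query.
        return [e for e in self.entries if e["query"] == query][-k:]

    def record(self, query, action, outcome):
        self.entries.append({"query": query, "action": action, "outcome": outcome})

def interactive_perception_step(query, observe, augment, vlm, controller, memory):
    """One hypothetical ZS-IP cycle:
    observe -> augment (keypoints + pushlines) -> memory lookup ->
    VLM picks an action -> controller executes it -> record the outcome."""
    image = observe()
    annotated = augment(image)              # EO module: visual overlays
    context = memory.lookup(query)          # memory-guided reasoning
    action = vlm(annotated, query, context) # e.g. {"type": "push", "target": "3"}
    outcome = controller(action)            # push / pull / grasp execution
    memory.record(query, action, outcome)
    return action, outcome
```

With stub callables, one call runs a full cycle and grows the memory by one entry, which the next cycle's `lookup` can then exploit.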