[2602.21531] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Summary
The paper introduces LiLo-VLA, a modular framework for long-horizon manipulation in robotics, enhancing performance through object-centric policies and robust failure recovery.
Why It Matters
As robots increasingly operate in unstructured environments, mastering long-horizon manipulation becomes crucial. LiLo-VLA addresses the challenges of skill sequencing and environmental sensitivity, offering a promising path toward more adaptable and efficient general-purpose robots.
Key Takeaways
- LiLo-VLA enables zero-shot generalization to new long-horizon tasks.
- The framework decouples transport and interaction for enhanced robustness.
- Achieves a 69% success rate in simulations and 85% in real-world tasks.
- Modularity allows for dynamic replanning and effective failure recovery.
- Significantly outperforms existing models such as Pi0.5 and OpenVLA-OFT.
Computer Science > Robotics · arXiv:2602.21531 (cs) · [Submitted on 25 Feb 2026]
Title: LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Authors: Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding, Gedas Bertasius, Daniel Szafir
Abstract: General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the c...
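The decoupled control flow the abstract describes can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: every class, field, and function name here (`Observation`, `ReachingModule`, `InteractionModule`, `step`, `run_skill_sequence`) is an assumption for exposition only.

```python
# Hypothetical sketch of a LiLo-VLA-style modular control loop.
# All names are illustrative assumptions, not the paper's actual code.

from dataclasses import dataclass

@dataclass
class Observation:
    robot_pose: tuple      # global robot / end-effector pose
    object_crop: str       # object-centric view of the target object
    near_target: bool      # whether the transport phase has finished

class ReachingModule:
    """Handles global transport toward the object of interest."""
    def act(self, obs: Observation) -> dict:
        return {"type": "move", "goal": obs.robot_pose}

class InteractionModule:
    """Object-centric policy that sees only the isolated object crop,
    keeping it robust to irrelevant visual features elsewhere in the scene."""
    def act(self, obs: Observation) -> dict:
        return {"type": "manipulate", "input": obs.object_crop}

def step(obs: Observation, reaching: ReachingModule,
         interaction: InteractionModule) -> dict:
    # Decoupling: transport until near the target, then hand off
    # to the object-centric interaction policy.
    if not obs.near_target:
        return reaching.act(obs)
    return interaction.act(obs)

def run_skill_sequence(skills, execute):
    """Failure recovery via replanning: if an atomic skill fails,
    retry (reuse) it instead of letting the failure cascade."""
    completed = []
    for skill in skills:
        while not execute(skill):
            pass  # replan / retry the same atomic skill
        completed.append(skill)
    return completed
```

Because the interaction policy only ever consumes the object crop, swapping the transport strategy or reordering skills does not require retraining it, which is one plausible reading of how modularity enables zero-shot composition of long-horizon tasks.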