[2602.20659] Recursive Belief Vision Language Model
Summary
The Recursive Belief Vision Language Model (RB-VLA) addresses limitations in current vision-language-action models by introducing a belief-centric architecture that enhances long-horizon manipulation capabilities under partial observability.
Why It Matters
This research matters because it tackles challenges faced by existing vision-language-action models in complex tasks that require sustained reasoning and memory efficiency. By improving task execution and reducing inference latency, RB-VLA has implications for robotics and AI applications where long-term planning and adaptability are critical.
Key Takeaways
- RB-VLA improves long-horizon robotic manipulation by maintaining a compact latent belief state.
- The model reduces inference latency by up to 5x compared to existing approaches.
- It eliminates memory growth across timesteps, enhancing efficiency.
- The belief module significantly boosts success rates in multi-stage tasks.
- RB-VLA outperforms prior models on benchmarks for pick-and-place and stacking tasks.
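The constant-memory property in the takeaways above can be illustrated with a minimal sketch: a recursive update b_t = f(b_{t-1}, a_{t-1}, o_t) that folds each new action and observation into a fixed-size latent vector, so memory does not grow with the episode. The dimensions, the GRU-style gating, and the random weights below are all illustrative assumptions, not the paper's learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
BELIEF_DIM, OBS_DIM, ACT_DIM = 8, 6, 3
IN_DIM = BELIEF_DIM + OBS_DIM + ACT_DIM

# Random projections standing in for learned parameters (assumption).
W_z = rng.standard_normal((IN_DIM, BELIEF_DIM)) * 0.1  # update gate
W_h = rng.standard_normal((IN_DIM, BELIEF_DIM)) * 0.1  # candidate state

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def belief_update(belief, action, observation):
    """One recursive step: fold the latest action and observation into
    the fixed-size belief; raw observations are never stored."""
    x = np.concatenate([belief, action, observation])
    z = sigmoid(x @ W_z)        # how much of the old belief to overwrite
    h = np.tanh(x @ W_h)        # candidate new belief
    return (1.0 - z) * belief + z * h

belief = np.zeros(BELIEF_DIM)
for t in range(100):            # 100 timesteps, memory stays constant
    obs = rng.standard_normal(OBS_DIM)
    act = rng.standard_normal(ACT_DIM)
    belief = belief_update(belief, act, obs)

print(belief.shape)  # → (8,): same size after any number of steps
```

Because only `belief` carries history forward, the rollout never accumulates a context window of past frames, which is the mechanism behind the "eliminates memory growth across timesteps" claim.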
Computer Science > Artificial Intelligence
arXiv:2602.20659 (cs), submitted on 24 Feb 2026
Title: Recursive Belief Vision Language Model
Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Patel
Abstract: Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust ...
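The abstract's control scheme, one VLM query for intent, then per-step belief updates conditioning the action policy, can be sketched as a loop. Everything below is a stand-in assumption: the stub functions are placeholders for the learned VLM, belief module, and diffusion policy head, which the paper does not specify at this level of detail.

```python
import numpy as np

rng = np.random.default_rng(1)

def query_vlm_for_intent(instruction):
    # Stand-in for a single VLM call that returns a task-intent embedding.
    return rng.standard_normal(4)

def update_belief(belief, action, observation):
    # Stand-in recursive belief update; the real module is learned with
    # self-supervised world-model objectives.
    return np.tanh(0.9 * belief + 0.05 * observation + 0.05 * action.sum())

def policy(belief, intent):
    # Stand-in for the conditional diffusion policy: in RB-VLA the action
    # is denoised conditioned jointly on (belief, intent).
    return np.tanh(belief[:2] + intent[:2])

intent = query_vlm_for_intent("stack the red block on the blue block")
vlm_calls = 1                           # queried once for high-level intent
belief = np.zeros(4)

for t in range(50):                     # long rollout, no further VLM calls
    observation = rng.standard_normal(4)
    action = policy(belief, intent)
    belief = update_belief(belief, action, observation)

print(vlm_calls)  # → 1
```

Keeping the VLM out of the inner loop is what the abstract credits for the latency reduction: the expensive model supplies task specification once, while the cheap recurrent belief handles phase tracking at every control step.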