[2602.15882] FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
Summary
FUTURE-VLA introduces a unified architecture that casts long-horizon robot control and future forecasting as a single sequence-generation task, enabling real-time trajectory prediction over long multi-view visual histories.
Why It Matters
This research addresses a key obstacle to deploying vision-language models on robots: the prohibitive latency of processing long-horizon video histories and generating high-dimensional future predictions. By keeping inference latency constant while ingesting extensive histories, FUTURE-VLA could make robotic systems markedly more responsive in dynamic environments, and its prediction-guided human-in-the-loop mechanism points toward safer, more interactive human-robot collaboration.
Key Takeaways
- FUTURE-VLA reformulates long-horizon control as a sequence-generation task.
- Utilizes a dual-sided efficiency paradigm for real-time performance.
- Achieves state-of-the-art success rates on multiple benchmarks.
- Enables interactive execution gating for dynamic behavior validation.
- Maintains constant inference latency while processing extensive multi-view histories.
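The paper describes interactive execution gating only at a high level, and no implementation is given here. As a rough illustration of the idea, the sketch below (all names such as `Proposal` and `gated_execute` are invented for this example) shows a gate that receives an action chunk together with the model's predicted visual look-ahead, and executes the chunk only if an approval callback, which could be a human operator reviewing the look-ahead, accepts it:

```python
# Hypothetical sketch of prediction-guided execution gating.
# These names and this interface are invented for illustration; the paper
# does not publish an API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    actions: List[List[float]]   # candidate action chunk (e.g. joint targets)
    lookahead: object            # predicted future observation, shown for review


def gated_execute(proposal: Proposal,
                  approve: Callable[[Proposal], bool],
                  execute: Callable[[List[float]], None]) -> bool:
    """Execute the action chunk only if the gate approves its look-ahead."""
    if not approve(proposal):
        return False             # rejected: robot holds and the model re-plans
    for action in proposal.actions:
        execute(action)
    return True


# Example gate policy: auto-approve short chunks, reject long ones so a
# human can review the predicted look-ahead before execution.
log = []
p = Proposal(actions=[[0.1, 0.2], [0.2, 0.3]], lookahead=None)
ok = gated_execute(p, approve=lambda pr: len(pr.actions) <= 5,
                   execute=log.append)
```

The point of the pattern is that the gate inspects a *prediction* of what the robot is about to do, not the partially executed behavior itself, so rejection is cheap and happens before any motion.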
Computer Science > Robotics
arXiv:2602.15882 (cs)
[Submitted on 5 Feb 2026]
Title: FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
Authors: Jingjing Fan, Yushan Liu, Shoujie Li, Botao Ren, Siyuan Li, Xiao-Ping Zhang, Wenbo Ding, Zhidong Deng
Abstract: General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based...
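The abstract attributes the constant inference latency to a temporally adaptive compression strategy, but does not specify it. One simple way such a scheme *could* work, sketched below with invented names and a linear recency weighting that is purely an assumption, is to allocate a fixed token budget across the history, spending more tokens on recent frames and letting the oldest frames round down, so total input length (and hence latency) stays bounded no matter how long the history grows:

```python
# Hypothetical token-budget schedule for a temporally adaptive compressor.
# The weighting and the function itself are assumptions for illustration;
# the paper's actual compression strategy is not specified here.
def token_schedule(num_frames: int, budget: int) -> list:
    """Per-frame token counts, oldest frame first; sum never exceeds `budget`.

    Newer frames receive proportionally more tokens (linear recency
    weighting). Old frames whose share rounds down to zero are dropped
    entirely, i.e. compressed away.
    """
    weights = range(1, num_frames + 1)        # weight grows with recency
    total = sum(weights)
    return [budget * w // total for w in weights]


# A short history spreads the budget over every frame ...
short = token_schedule(8, 64)
# ... while a long history stays within the same budget by discarding
# or coarsening the oldest frames, keeping input length bounded.
long = token_schedule(100, 64)
```

Because integer division only rounds shares down, the schedule can never exceed the budget, which is the property that would keep per-step inference cost constant as the history lengthens.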