[2602.20200] Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

arXiv - AI · 4 min read

Summary

The paper presents OptimusVLA, a dual-memory framework for robotic manipulation that pairs a Global Prior Memory with a Local Consistency Memory to make action generation faster and more robust.

Why It Matters

As robotics increasingly relies on advanced AI models for manipulation tasks, OptimusVLA addresses critical limitations in existing Vision-Language-Action frameworks, improving both performance and inference speed. This innovation is significant for applications in automation and AI-driven robotics.

Key Takeaways

  • OptimusVLA introduces Global Prior Memory and Local Consistency Memory to enhance action generation in robotic manipulation.
  • The framework significantly reduces inference time and improves task success rates across various benchmarks.
  • By leveraging historical action sequences, OptimusVLA ensures better temporal consistency and task awareness.

Computer Science > Robotics · arXiv:2602.20200 (cs) · [Submitted on 22 Feb 2026]

Title: Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Authors: Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

Abstract: Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. Such models typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraints imposed by the action history and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from ...
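To make the two memory ideas concrete, here is a minimal sketch of how a policy might (a) initialize denoising from a retrieved task-level prior instead of isotropic Gaussian noise, and (b) blend a new action chunk toward recent history for temporal consistency. All function names, the cosine-similarity retrieval, and the smoothing rule are illustrative assumptions, not the paper's actual method or API.

```python
import numpy as np

def retrieve_prior(memory, query, eps=1e-8):
    """Return the stored action sequence whose key embedding is most
    similar to the query task embedding (cosine similarity).
    `memory` is a list of (key_embedding, action_sequence) pairs."""
    keys = np.stack([k for k, _ in memory])
    sims = keys @ query / (np.linalg.norm(keys, axis=1)
                           * np.linalg.norm(query) + eps)
    return memory[int(np.argmax(sims))][1]

def init_from_prior(memory, query, noise_scale=0.1, rng=None):
    """GPM-style initialization (sketch): start the denoising trajectory
    near a retrieved prior rather than from pure Gaussian noise, shrinking
    the distributional gap the policy must bridge."""
    rng = rng if rng is not None else np.random.default_rng(0)
    prior = retrieve_prior(memory, query)
    return prior + noise_scale * rng.standard_normal(prior.shape)

def enforce_consistency(history, action, alpha=0.8):
    """LCM-style conditioning (sketch): smooth the new action chunk toward
    the most recent executed chunk to discourage temporally inconsistent
    jumps. A real implementation would condition the policy itself."""
    if not history:
        return action
    return alpha * action + (1.0 - alpha) * history[-1]
```

With `noise_scale=0.0`, `init_from_prior` returns the retrieved prior exactly; in practice a small nonzero scale keeps sample diversity while staying close to the task-level prior.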

Related Articles

[2603.18940] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought
LLMs

Abstract page for arXiv paper 2603.18940: Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty ...

arXiv - Machine Learning · 3 min ·
[2512.20620] Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness Using Neural Network Interpretability Maps
Machine Learning

Abstract page for arXiv paper 2512.20620: Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness ...

arXiv - Machine Learning · 4 min ·
[2512.13607] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
Machine Learning

Abstract page for arXiv paper 2512.13607: Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

arXiv - Machine Learning · 4 min ·
[2512.02650] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Machine Learning

Abstract page for arXiv paper 2512.02650: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

arXiv - Machine Learning · 3 min ·