[2602.12691] ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training


arXiv - AI · 4 min read

Summary

The paper presents ALOE, an action-level off-policy evaluation framework for post-training vision-language-action models with online reinforcement learning, and demonstrates improved learning efficiency on real-world manipulation tasks.

Why It Matters

ALOE addresses a limitation of the conservative on-policy value estimation used in prior work: for stability, it avoids directly evaluating the current high-capacity policy, which limits learning effectiveness. By enabling off-policy evaluation over mixed data sources (historical policies and intermittent human interventions), ALOE makes fuller use of real-world experience when post-training vision-language-action systems, which are increasingly relevant in robotics and AI applications.
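The on-policy vs. off-policy distinction can be illustrated with the classic tabular TD updates. This is generic RL background, not ALOE's algorithm; all names here are illustrative:

```python
import numpy as np

# Generic RL background, not ALOE's algorithm: the on-policy vs
# off-policy distinction in tabular TD learning. SARSA bootstraps
# from the action the behavior policy actually took; Q-learning
# bootstraps from the greedy action, so it can evaluate the current
# policy on transitions generated by older policies.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action actually taken at s2."""
    target = r + gamma * Q[s2, a2]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy: the target evaluates the greedy action at s2,
    regardless of which policy generated the transition."""
    target = r + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((4, 2))
# A single replayed transition (s=0, a=1, r=1.0, s'=2) from an
# older policy still yields a valid off-policy update:
q_learning_update(Q, s=0, a=1, r=1.0, s2=2)
print(Q[0, 1])  # → 0.1
```

The off-policy form is what lets a learner reuse transitions from historical policies and human interventions, exactly the mixed data sources the paper targets.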

Key Takeaways

  • ALOE improves learning efficiency for vision-language-action models.
  • The framework utilizes action-level evaluation to enhance credit assignment.
  • It supports stable policy improvement in real-world manipulation tasks.
  • ALOE demonstrates effectiveness across diverse tasks, including smartphone packing and laundry folding.
  • The approach reliably reintroduces off-policy evaluation into real-world VLA reinforcement learning.

Computer Science > Robotics
arXiv:2602.12691 (cs) · Submitted on 13 Feb 2026

Title: ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training
Authors: Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, Yi Chen, Wei Shan, Maoqing Yao

Abstract: We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences inste...
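The chunking-based temporal-difference bootstrapping mentioned in the abstract can be sketched as an H-step TD target computed at action-chunk boundaries. This is a hedged illustration of the general idea only; the function name, signature, and discounting scheme below are assumptions, not the paper's actual estimator:

```python
import numpy as np

# Hedged sketch of chunking-based temporal-difference bootstrapping:
# a VLA policy emits a chunk of H low-level actions, and the value
# target for the chunk bootstraps from a critic estimate at the next
# chunk boundary instead of a full Monte Carlo return. The names and
# discounting scheme here are illustrative assumptions, not the
# paper's actual estimator.

def chunk_td_target(rewards, next_value, gamma=0.99):
    """H-step TD target for one action chunk.

    rewards: per-step rewards inside the chunk (length H)
    next_value: critic estimate of the state value at the next
        chunk boundary, V(s_{t+H})
    """
    rewards = np.asarray(rewards, dtype=float)
    H = len(rewards)
    discounts = gamma ** np.arange(H)
    return float(discounts @ rewards + gamma**H * next_value)

# A 3-step chunk with rewards [0, 0, 1] and V(s_{t+3}) = 0.5:
target = chunk_td_target([0.0, 0.0, 1.0], next_value=0.5)
print(round(target, 4))  # → 1.4652
```

Bootstrapping at chunk boundaries rather than waiting for full trajectories is what gives the action-level credit assignment the takeaways refer to.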

Related Articles

Improving AI models’ ability to explain their predictions

AI News - General · 9 min · Machine Learning

[D] TMLR reviews seem more reliable than ICML/NeurIPS/ICLR

This year I submitted a paper to ICML for the first time. I have also experienced the review process at TMLR and ICLR. From my observatio...

Reddit - Machine Learning · 1 min · Machine Learning

[D] icml, no rebuttal ack so far..

Almost all the papers I reviewed have received at least one ack, but I haven’t gotten a single rebuttal acknowledgment yet. Is there anyo...

Reddit - Machine Learning · 1 min
UMKC Announces New Master of Science in Artificial Intelligence

UMKC announces a new Master of Science in Artificial Intelligence program aimed at addressing workforce demand for AI expertise, set to l...

AI News - General · 4 min · AI Infrastructure