[2602.12691] ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training
Summary
The paper presents ALOE, an action-level off-policy evaluation framework aimed at enhancing vision-language-action models through reinforcement learning, demonstrating improved efficiency in real-world tasks.
Why It Matters
ALOE addresses the limitations of traditional on-policy evaluation methods in reinforcement learning, which can hinder the learning process of complex models. By allowing for off-policy evaluation, it enhances the training effectiveness of vision-language-action systems, which are increasingly relevant in robotics and AI applications.
Key Takeaways
- ALOE improves learning efficiency for vision-language-action models.
- The framework utilizes action-level evaluation to enhance credit assignment.
- It supports stable policy improvement in real-world manipulation tasks.
- ALOE demonstrates effectiveness across diverse tasks, including smartphone packing and laundry folding.
- The approach restores reliable off-policy evaluation, which prior work avoided in favor of conservative on-policy estimation.
Computer Science > Robotics · arXiv:2602.12691 (cs)
Submitted on 13 Feb 2026
Authors: Rushuai Yang, Hecheng Wang, Chiming Liu, Xiaohan Yan, Yunlong Wang, Xuan Du, Shuoyu Yue, Yongcheng Liu, Chuheng Zhang, Lizhe Qi, Yi Chen, Wei Shan, Maoqing Yao
Abstract: We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences inste…
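The abstract describes chunking-based temporal-difference bootstrapping, where values are bootstrapped at the granularity of action chunks rather than single steps. The paper does not publish its implementation here, so the following is a minimal illustrative sketch of generic chunk-level TD targets, not ALOE's actual algorithm; the function name, signature, and the simple discounted-sum formulation are all assumptions.

```python
import numpy as np

def chunked_td_targets(rewards, values, chunk_len, gamma=0.99):
    """Illustrative chunk-level TD bootstrap targets (hypothetical helper).

    rewards:   per-step rewards, shape (T,)
    values:    value estimates V(s_t) for t = 0..T, shape (T + 1,);
               values[T] bootstraps past the final chunk
    chunk_len: number of primitive actions per chunk (k)

    Each chunk starting at step t receives the target
        sum_{i=0}^{k-1} gamma^i * r_{t+i}  +  gamma^k * V(s_{t+k}),
    i.e. the TD backup skips ahead one whole chunk instead of one step.
    """
    T = len(rewards)
    targets = []
    for t in range(0, T, chunk_len):
        k = min(chunk_len, T - t)  # trailing chunk may be shorter
        ret = 0.0
        for i in range(k):
            ret += (gamma ** i) * rewards[t + i]
        ret += (gamma ** k) * values[t + k]  # bootstrap at the chunk boundary
        targets.append(ret)
    return np.array(targets)
```

Bootstrapping only at chunk boundaries shortens the effective horizon by a factor of the chunk length, which is one common motivation for evaluating action sequences rather than individual actions.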