[2602.19313] TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
Summary
The paper introduces TOPReward, a method that reads robotic task-progress estimates out of a pretrained video Vision-Language Model's token probabilities, providing dense zero-shot rewards for reinforcement learning in robotics.
Why It Matters
TOPReward addresses the challenges of low sample efficiency and sparse rewards in reinforcement learning for robotics. By utilizing pretrained Vision-Language Models, it offers a more effective way to estimate task progress, which is crucial for advancing robotic capabilities in real-world applications.
Key Takeaways
- TOPReward improves task progress estimation in robotics using token probabilities.
- Achieves 0.947 mean Value-Order Correlation, outperforming existing methods.
- Demonstrates versatility for applications like success detection and behavior cloning.
Computer Science > Robotics — arXiv:2602.19313 (cs)
Submitted on 22 Feb 2026
Title: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
Abstract: While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order ...
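The core idea, reading a progress estimate out of a model's token probabilities rather than its generated text, can be illustrated with a minimal sketch. This is not the paper's implementation: the logit values, the set of candidate progress tokens, and the binning scheme below are all hypothetical, standing in for whatever a real video VLM would assign when asked about task progress.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def progress_from_token_logits(logits, progress_bins):
    """Probability-weighted progress estimate.

    Instead of parsing a generated number (prone to numerical
    misrepresentation), take the logits the model assigns to a set of
    candidate progress tokens and compute the expected progress value.
    """
    probs = softmax(logits)
    return sum(p * b for p, b in zip(probs, progress_bins))

# Hypothetical logits for candidate tokens "0", "25", "50", "75", "100"
# (fractions of task completion), as a VLM might score them for one frame.
bins = [0.0, 0.25, 0.50, 0.75, 1.0]
logits = [0.1, 0.4, 2.3, 1.1, 0.2]
score = progress_from_token_logits(logits, bins)
```

Because the estimate is an expectation over the full distribution rather than the argmax token, it varies smoothly as the model's confidence shifts between adjacent bins, which is what makes such scores usable as a dense reward signal.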