[2602.03584] $V_0$: A Generalist Value Model for Any Policy at State Zero
Computer Science > Computation and Language
arXiv:2602.03584 (cs)
[Submitted on 3 Feb 2026 (v1), last revised 31 Mar 2026 (this version, v2)]
Title: $V_0$: A Generalist Value Model for Any Policy at State Zero
Authors: Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Abstract: Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance...
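
The group-average baseline that the abstract attributes to GRPO can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names and the binary example rewards are hypothetical, and only the mean-subtraction step named in the abstract is shown (any further normalization used in practice is omitted).

```python
from statistics import fmean


def group_mean_baseline(rewards):
    """Average reward over a group of rollouts sampled for one prompt."""
    if not rewards:
        raise ValueError("need at least one rollout reward")
    return fmean(rewards)


def group_relative_advantages(rewards):
    """Advantage of each rollout relative to the group-average baseline."""
    baseline = group_mean_baseline(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rewards for four rollouts of the same prompt (1.0 = success).
rewards = [1.0, 0.0, 0.0, 1.0]
baseline = group_mean_baseline(rewards)
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one. Because the baseline is estimated from the sampled group itself, a small group gives a noisy estimate, which is the sampling cost the abstract points to.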