[2602.17025] WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
Summary
The paper introduces WS-GRPO, a method that improves rollout efficiency in language model training by converting terminal correctness rewards into guidance over partial trajectories, reducing unnecessary deliberation while maintaining accuracy.
Why It Matters
As language models become more complex, optimizing their reasoning efficiency is crucial. WS-GRPO addresses the challenges of overthinking in model training, offering a solution that balances accuracy and efficiency, which is vital for practical applications in AI.
Key Takeaways
- WS-GRPO enhances rollout efficiency by converting terminal rewards into guidance for partial trajectories.
- The method reduces redundant deliberation while maintaining the accuracy of language models.
- WS-GRPO provides a solution to the calibration issues associated with global length penalties in reasoning tasks.
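To ground the takeaways above: WS-GRPO builds on standard GRPO, whose core step normalizes each rollout's terminal reward against the statistics of its sampled group. A minimal sketch of that group-relative advantage (the baseline WS-GRPO extends, not the paper's full method):

```python
import statistics

def group_relative_advantages(rewards):
    """Standard GRPO advantage: normalize each rollout's terminal reward
    by the mean and standard deviation of its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: no relative signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# A group of 4 rollouts scored by final-answer correctness (1 = correct).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the advantage is relative within the group, longer rollouts get more opportunities to edge out their group peers, which is the overthinking incentive the paper targets.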
Computer Science > Machine Learning arXiv:2602.17025 (cs) [Submitted on 19 Feb 2026]
Title: WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
Authors: Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, Junda Wu
Abstract: Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation creates more chances to realize relative gains, leading to inefficient reasoning and overthinking and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice: (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that genuinely require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final-answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories ...
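The abstract is truncated, so the paper's exact construction is not shown here. One common way to turn terminal rewards into guidance over partial trajectories, shown purely as a hypothetical illustration (the function name and data layout are invented for this sketch), is a Monte Carlo value estimate: score each prefix by the mean terminal correctness of the rollouts continued from it.

```python
def prefix_guidance(prefix_completions):
    """Hypothetical illustration (not the paper's method): score a partial
    trajectory by the mean terminal correctness of continuations sampled
    from it. `prefix_completions` maps a prefix length (in tokens) to the
    0/1 correctness outcomes of its sampled continuations."""
    return {k: sum(v) / len(v) for k, v in prefix_completions.items()}

# If a 64-token prefix already yields mostly correct answers while the
# 128-token extension does not improve on it, the extra deliberation
# after token 64 is likely redundant.
scores = prefix_guidance({64: [1, 1, 0, 1], 128: [1, 0, 0, 0]})
```

A correctness-aware signal of this kind can reward stopping once a prefix is already likely to succeed, avoiding the calibration problem of a single global length penalty.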