[2603.22117] On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
Computer Science > Machine Learning

arXiv:2603.22117 (cs)

[Submitted on 23 Mar 2026]

Title: On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

Authors: Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou

Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they focus primarily on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, and that it can be captured by the signed, token-level log-probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ identifies sparse yet reasoning-critical updates more effectively than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to ...
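To make the two central quantities concrete, here is a minimal sketch of (a) computing the signed, token-level $\Delta\log p = \log p_{\text{RLVR}} - \log p_{\text{base}}$ and (b) log-space extrapolation of the next-token distribution. The checkpoint names, helper functions, and the coefficient `alpha` are illustrative assumptions, not taken from the paper; log-space interpolation with `alpha > 1` is one plausible reading of "amplifying the policy along the learned $\Delta\log p$ direction," not necessarily the authors' exact formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoints: a base model and its RLVR-tuned counterpart,
# assumed to share the same tokenizer and vocabulary.
BASE_ID = "base-model"        # placeholder name
RLVR_ID = "rlvr-tuned-model"  # placeholder name

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID).eval()
rlvr = AutoModelForCausalLM.from_pretrained(RLVR_ID).eval()


@torch.no_grad()
def delta_log_p(text: str) -> torch.Tensor:
    """Signed, token-level difference: log p_RLVR(token) - log p_base(token)."""
    ids = tok(text, return_tensors="pt").input_ids  # (1, T)

    def token_log_probs(model) -> torch.Tensor:
        logits = model(ids).logits[:, :-1]  # position t predicts token t+1
        logp = torch.log_softmax(logits, dim=-1)
        # Gather the log-probability of each actually observed next token.
        return logp.gather(-1, ids[:, 1:, None]).squeeze(-1)  # (1, T-1)

    return (token_log_probs(rlvr) - token_log_probs(base)).squeeze(0)


@torch.no_grad()
def extrapolated_next_logits(ids: torch.Tensor, alpha: float = 1.5) -> torch.Tensor:
    """Amplify the RLVR update direction in log-probability space:
    log p_ext ∝ log p_base + alpha * (log p_RLVR - log p_base).
    alpha = 1 recovers the RLVR model; alpha > 1 extrapolates past it."""
    lp_base = torch.log_softmax(base(ids).logits[:, -1], dim=-1)
    lp_rlvr = torch.log_softmax(rlvr(ids).logits[:, -1], dim=-1)
    return lp_base + alpha * (lp_rlvr - lp_base)
```

Under this reading, decoding samples from `softmax(extrapolated_next_logits(ids))` at each step, and tokens with large positive $\Delta\log p$ (those the RLVR update promoted most) are pushed further up, while the bulk of tokens with near-zero $\Delta\log p$ are left essentially unchanged, consistent with the sparsity the abstract describes.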