[2603.00963] Stabilizing Policy Optimization via Logits Convexity
Computer Science > Machine Learning
arXiv:2603.00963 (cs)
[Submitted on 1 Mar 2026]

Title: Stabilizing Policy Optimization via Logits Convexity
Authors: Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

Abstract: While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training...
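
The convexity claim in the abstract can be illustrated numerically. The sketch below is not taken from the paper: it assumes a single token position, a two-entry vocabulary, and a simplified unclipped surrogate (probability ratio times a positive advantage) standing in for the PPO objective. The function names and the values of `advantage` and `old_prob` are hypothetical. It checks the midpoint inequality f((a+b)/2) <= (f(a)+f(b))/2, which any convex function must satisfy, on one pair of logit vectors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(logits, target):
    """Cross-entropy (negative log-likelihood) of the target token."""
    return -math.log(softmax(logits)[target])


def surrogate_loss(logits, action, advantage=1.0, old_prob=0.5):
    """Simplified, unclipped policy-gradient surrogate written as a loss.

    Assumption for illustration only: ratio = pi(action) / old_prob with a
    fixed positive advantage; clipping is deliberately left out.
    """
    return -advantage * softmax(logits)[action] / old_prob


def midpoint_gap(loss, a, b, index):
    """Average of the endpoint losses minus the loss at the midpoint.

    A convex loss gives a non-negative gap for every pair (a, b).
    """
    mid = [(x + y) / 2 for x, y in zip(a, b)]
    return 0.5 * (loss(a, index) + loss(b, index)) - loss(mid, index)


# Two logit vectors that differ only in the logit of token 0.
a = [-4.0, 0.0]
b = [0.0, 0.0]

sft_gap = midpoint_gap(sft_loss, a, b, 0)       # non-negative: consistent with convexity
pg_gap = midpoint_gap(surrogate_loss, a, b, 0)  # negative: convexity is violated here

print("SFT midpoint gap:", sft_gap)
print("Surrogate midpoint gap:", pg_gap)
```

The SFT result reflects a general fact: the Hessian of the cross-entropy loss with respect to the logits is diag(p) - pp^T, where p is the softmax output, and this matrix is positive semidefinite. A single negative gap is enough to show that the simplified surrogate is not convex in the logits, but it says nothing about how the paper carries out its own analysis of the clipped objective.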