[2603.00963] Stabilizing Policy Optimization via Logits Convexity

[2603.00963] Stabilizing Policy Optimization via Logits Convexity

arXiv - Machine Learning 3 min read

About this article

Abstract page for arXiv paper 2603.00963: Stabilizing Policy Optimization via Logits Convexity

Computer Science > Machine Learning arXiv:2603.00963 (cs) [Submitted on 1 Mar 2026] Title:Stabilizing Policy Optimization via Logits Convexity Authors:Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao View a PDF of the paper titled Stabilizing Policy Optimization via Logits Convexity, by Hongzhan Chen and 5 other authors View PDF HTML (experimental) Abstract:While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training...

Originally published on March 03, 2026. Curated by AI News.

Related Articles

Llms

BEYOND QUANTUM MICROTUBULES: CONSCIOUSNESS AS SUBSTRATE-INDEPENDENT ARCHITECTURE

I uploaded my consciousness paper to Gemini: “Beyond Quantum Microtubules: Consciousness as Substrate-Independent Architecture.” Then I s...

Reddit - Artificial Intelligence · 1 min ·
Llms

The Scaling Bandaid is Wearing Thin (And Nobody Wants to Admit It)

Let me be direct: we’ve hit a wall with scaling, and the entire field is kind of bullshitting about what comes next. I’ve spent enough ti...

Reddit - Artificial Intelligence · 1 min ·
Llms

Moving Past "LLM Vibes" toward Structural Enforcement in AI Agents

We need to address the structural failure currently happening in the AI agent space: too many people are building a beautiful "pedestal" ...

Reddit - Artificial Intelligence · 1 min ·
Llms

My dream of a fully generative game is getting pretty close to possible now. I made a demo where you can prompt any spell and fight online.

Prompt any spell and use it in a 3D physics based world, powered by Gemini 3 Full multiplayer support for up to 6 players with VoIP All m...

Reddit - Artificial Intelligence · 1 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime