[2601.12415] Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
Summary
This paper introduces Orthogonalized Policy Optimization (OPO), an RLHF method that decouples the geometry used to sample and weight training signals from the geometry used to penalize deviation from the reference policy, improving training stability and performance.
Why It Matters
Existing alignment methods use a single divergence (typically KL) both to weight samples and to shape optimization curvature, so tuning one property inevitably perturbs the other. By decoupling these two roles, the paper targets a concrete source of instability in training large language models, with implications for building more reliable AI systems.
Key Takeaways
- Decoupling sampling and optimization geometries can reduce systematic instability in reinforcement learning.
- Orthogonalized Policy Optimization (OPO) offers a closed-form solution with improved gradient dynamics.
- The proposed method is compatible with existing large language model training pipelines.
Computer Science > Machine Learning
arXiv:2601.12415 (cs)
[Submitted on 18 Jan 2026 (v1), last revised 14 Feb 2026 (this version, v3)]
Title: Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
Authors: Wang Zixian
Abstract: Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants, each motivated by different derivations. In this work, we argue that this diversity obscures a simpler underlying structure. At a fundamental level, alignment objectives involve two independent design choices: (i) how training signals are sampled and weighted, and (ii) how deviations from a reference policy are geometrically penalized. Existing methods typically entangle these choices through a single divergence, most commonly the Kullback-Leibler divergence. We show that this entanglement is not merely a modeling convenience but a source of systematic instability. When the same divergence simultaneously determines sample weighting and optimization curvature, adjusting one aspect, such as exploration strength, inevitably alters the other, such as gradient geometry. This coupling is particularly problematic in preference-based reinforcement learning, where advantage signals are unbounded and high...
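The coupling the abstract describes can be made concrete in a toy setting. The sketch below (an illustration of the general KL-entanglement problem, not the paper's actual OPO construction; the decoupled variant and its `beta_sample`/`beta_penalty` parameters are hypothetical) shows that in the standard KL-regularized optimum, pi*(a) proportional to ref(a) * exp(r(a)/beta), a single temperature beta controls both how sharply rewards are weighted and how strongly the policy is pulled toward the reference, so the two effects cannot be tuned independently:

```python
import numpy as np

# Toy setting: discrete policy over 5 actions with rewards r
# and a uniform reference policy.
r = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
ref = np.full(5, 0.2)

def coupled_target(beta):
    """KL-regularized optimum: pi*(a) ~ ref(a) * exp(r(a) / beta).
    The same beta sets BOTH the reward weighting exp(r/beta)
    and the strength of the pull toward the reference."""
    w = ref * np.exp(r / beta)
    return w / w.sum()

def decoupled_target(beta_sample, beta_penalty):
    """Hypothetical decoupled variant (illustrative only, not OPO):
    one temperature shapes the reward weighting, a separate strength
    blends the result back toward the reference policy."""
    w = ref * np.exp(r / beta_sample)
    pi = w / w.sum()
    lam = beta_penalty / (1.0 + beta_penalty)  # simple convex mix
    return (1.0 - lam) * pi + lam * ref

# Coupled: sharpening the weighting (small beta) necessarily
# weakens the pull toward ref; relaxing it relaxes both at once.
sharp = coupled_target(0.1)
mild = coupled_target(1.0)

# Decoupled: keep aggressive weighting yet restore a strong ref pull.
mixed = decoupled_target(0.1, 2.0)
```

Printing `sharp` and `mild` shows the first policy concentrating almost all mass on the highest-reward action while the second stays spread out, whereas `mixed` keeps the sharp ranking but remains close to the reference, which is the independence of the two design choices that the paper argues for.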