[2602.21269] Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
Summary
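The paper introduces Group Orthogonalized Policy Optimization (GOPO), an algorithm for aligning large language models that works in the Hilbert space L2(pi_k) of functions square-integrable with respect to the reference policy, rather than on the probability simplex, with the aim of more stable optimization.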
Why It Matters
GOPO targets what the paper identifies as a limitation of optimizing directly on the probability simplex: the objective inherits the exponential curvature of Kullback-Leibler divergence. By maintaining stable gradient dynamics and preserving entropy, the method could lead to better performance in AI applications, particularly natural language processing and reasoning tasks.
Key Takeaways
- GOPO utilizes Hilbert space geometry for improved policy optimization.
- The algorithm reduces the simplex constraint to a linear orthogonality condition, <v, 1> = 0.
- GOPO achieves competitive generalization on mathematical reasoning benchmarks.
- It maintains stable gradient dynamics and entropy preservation.
- The method avoids heuristic clipping, enhancing performance in challenging scenarios.
Computer Science > Machine Learning
arXiv:2602.21269 (cs) [Submitted on 24 Feb 2026]
Title: Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
Authors: Wang Zixian
Abstract: We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition <v, 1> = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v >= -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group s...
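The bounded projection described in the abstract can be illustrated on a small finite action set. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes v is the relative change in density, so the updated policy is pi_k * (1 + v) (a reading inferred from the constraints <v, 1> = 0 and v >= -1, not stated in the abstract), and it assumes the inner product is the pi_k-weighted sum. The function name bounded_projection and the bisection search for the multiplier are illustrative choices; the paper's own closed-form threshold is not reproduced here.

```python
def bounded_projection(g, pi, mu, tol=1e-12, max_iter=200):
    """Maximize J(v) = sum_i pi_i * (g_i * v_i - (mu / 2) * v_i**2)
    subject to sum_i pi_i * v_i = 0 and v_i >= -1.

    The KKT conditions give v_i = max((g_i - lam) / mu, -1), where the
    scalar lam enforces the orthogonality condition. Here lam is found
    by bisection (an assumption of this sketch, not the paper's method).
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    if not g or len(g) != len(pi):
        raise ValueError("g and pi must be non-empty and equally long")
    if min(pi) < 0 or abs(sum(pi) - 1.0) > 1e-9:
        raise ValueError("pi must be a probability vector")

    def clipped(lam):
        # Unconstrained stationary point, clipped at the boundary v = -1.
        return [max((gi - lam) / mu, -1.0) for gi in g]

    def mean_v(lam):
        # <v, 1> under the reference policy; nonincreasing in lam.
        return sum(p * vi for p, vi in zip(pi, clipped(lam)))

    # mean_v(min(g)) >= 0 and mean_v(max(g) + mu) = -1, so a root lies between.
    lo, hi = min(g), max(g) + mu
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mean_v(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return clipped(lam), lam


# Hypothetical group of four sampled actions with uniform reference weights.
g = [1.0, 0.5, 0.0, -10.0]
pi = [0.25, 0.25, 0.25, 0.25]
v, lam = bounded_projection(g, pi, mu=1.0)
new_pi = [p * (1.0 + vi) for p, vi in zip(pi, v)]
print("v      =", v)
print("new pi =", new_pi)
```

In this toy example the last action, whose signal g is far below the others, is clipped to v = -1 and so receives zero probability under the updated policy, while the remaining three actions share the mass. Without the bound, the plain projection (g - mean(g)) / mu would assign that action v = -7.875, which would imply a negative probability.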