[2601.08427] Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

arXiv - Machine Learning · March 03, 2026 · 4 min read

About this article

Abstract page for arXiv paper 2601.08427: Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

Computer Science > Computation and Language

arXiv:2601.08427 (cs)

Submitted on 13 Jan 2026 (v1), last revised 1 Mar 2026 (this version, v2)

Title: Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

Authors: Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu

Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency.

To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers.

In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical project...
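The abstract is cut off before it specifies IRCE, so the following is only an illustrative sketch of the general idea it describes, not the authors' algorithm. It assumes three things that the excerpt does not confirm: that "spherical projection" means L2-normalizing each terminal-token representation, that the centroid is refined by re-weighting points according to their cosine similarity to the current estimate, and that the reward is the final cosine similarity. The function name `robust_centroid_rewards` and the parameters `num_iters` and `temperature` are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def robust_centroid_rewards(hidden_states, num_iters=5, temperature=0.1):
    """Hypothetical sketch, NOT the paper's IRCE.

    hidden_states: one terminal-token vector per sampled trajectory.
    Returns (centroid, rewards), where each reward is the cosine
    similarity between a trajectory's vector and the centroid.
    """
    # Assumed "spherical projection": discard magnitude, keep direction.
    points = [normalize(h) for h in hidden_states]
    dim = len(points[0])
    weights = [1.0] * len(points)
    centroid = None
    for _ in range(num_iters):
        # Weighted mean direction, re-projected onto the sphere.
        mean = [sum(w * p[i] for w, p in zip(weights, points)) for i in range(dim)]
        centroid = normalize(mean)
        sims = [dot(p, centroid) for p in points]
        # Assumed robust step: down-weight points far from the centroid
        # so scattered outliers have little influence on the next estimate.
        max_sim = max(sims)
        weights = [math.exp((s - max_sim) / temperature) for s in sims]
    rewards = [dot(p, centroid) for p in points]
    return centroid, rewards


# Toy group: three vectors pointing roughly the same way (with different
# magnitudes) and one vector pointing elsewhere.
states = [[1.0, 0.1], [2.0, 0.1], [0.9, 0.0], [0.0, 3.0]]
centroid, rewards = robust_centroid_rewards(states)
```

In this toy group the three aligned vectors receive rewards close to 1 regardless of their magnitudes, while the fourth vector receives a much lower reward. Whether the paper's rewards behave this way on real model representations is something only the full text can confirm.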

Originally published on March 03, 2026. Curated by AI News.

