[2509.22611] Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Computer Science > Machine Learning
arXiv:2509.22611 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results ident...
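The core change described in the abstract, swapping the group-mean baseline for a group-wise K-quantile baseline, can be illustrated with a minimal sketch. This is not the authors' code; the function name and the choice of a lower empirical quantile are assumptions made for illustration, using binary verifiable rewards (1 = correct, 0 = incorrect) within one group of responses to the same query.

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Sketch of Quantile Advantage Estimation (not the paper's implementation).

    Replaces the group-mean baseline of GRPO/DAPO-style value-free RL with a
    group-wise K-quantile baseline: advantage_i = r_i - quantile_K(rewards).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Lower empirical quantile (an assumption): with binary rewards this makes
    # the baseline 0 on hard queries and 1 on easy ones, giving the two-regime
    # gate the abstract describes.
    baseline = np.quantile(rewards, k, method="lower")
    return rewards - baseline

# Hard query (success rate p = 0.2 <= 1 - K): the rare success is reinforced,
# failures get zero advantage.
print(qae_advantages([1, 0, 0, 0, 0]))
# Easy query (p = 0.8 > 1 - K): only the remaining failure gets a (negative)
# advantage; successes get zero.
print(qae_advantages([1, 1, 1, 1, 0]))
```

With K tuned this way, most responses in a group receive exactly zero advantage, which is the sparsified credit assignment the abstract reports.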