[2509.07430] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward


arXiv - Machine Learning


Computer Science > Machine Learning — arXiv:2509.07430 (cs)

[Submitted on 9 Sep 2025 (v1), last revised 3 Mar 2026 (this version, v4)]

Title: The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

Authors: Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We ...
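The mode-seeking behavior the abstract refers to can be illustrated with a toy numerical sketch (this is not the paper's method, just a standard property of the two KL directions): on a hypothetical four-answer distribution where the reference model spreads probability over two correct answers and the fine-tuned policy has collapsed onto one, the forward KL penalizes the dropped mode far more heavily than the reverse KL does.

```python
import numpy as np

# Hypothetical 4-answer distribution (toy example, not from the paper).
# The reference model keeps two high-probability correct answers;
# the RLVR-tuned policy has collapsed onto a single mode.
ref    = np.array([0.48, 0.48, 0.02, 0.02])  # diverse base model
policy = np.array([0.94, 0.02, 0.02, 0.02])  # collapsed policy

def kl(p, q):
    """KL(p || q) in nats, for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

reverse_kl = kl(policy, ref)  # mode-seeking: ignores mass the policy dropped
forward_kl = kl(ref, policy)  # mass-covering: punishes the dropped mode

print(f"reverse KL(policy || ref) = {reverse_kl:.3f}")
print(f"forward KL(ref || policy) = {forward_kl:.3f}")
```

Because the reverse KL weights each log-ratio by the *policy's* probability, the answer the policy abandoned contributes almost nothing to the penalty, so a regularizer in that direction gives little protection against diversity collapse; the forward direction, weighted by the reference, charges the policy roughly `0.48 * log(24)` nats for the same dropped mode.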

Originally published on March 04, 2026. Curated by AI News.

