[2509.07430] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Computer Science > Machine Learning
arXiv:2509.07430 (cs)
[Submitted on 9 Sep 2025 (v1), last revised 3 Mar 2026 (this version, v4)]

Title: The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Authors: Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi

Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives -- both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely -- lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We ...
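For context, the abstract's contrast between divergence choices can be made concrete with a standard formulation of the RLVR objective. The sketch below is illustrative background rather than the paper's own objective, and the symbols r(x, y) (verifiable reward), beta (penalty weight), and \pi_{\mathrm{ref}} (the pre-RL reference policy) are assumptions of this note, not taken from the abstract:

$$
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\;
\beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
\Big],
$$

where the penalty is the reverse KL,

$$
D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
= \mathbb{E}_{y \sim \pi_\theta}\big[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\big].
$$

Because the expectation is taken under \pi_\theta, this term remains small even when \pi_\theta collapses onto a single high-reward mode of \pi_{\mathrm{ref}}; this is the mode-seeking behavior the abstract argues accelerates diversity loss. The mass-covering alternative reverses the direction of the divergence,

$$
D_{\mathrm{KL}}\big(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\big)
= \mathbb{E}_{y \sim \pi_{\mathrm{ref}}}\big[\log \pi_{\mathrm{ref}}(y \mid x) - \log \pi_\theta(y \mid x)\big],
$$

and penalizes the policy for assigning vanishing probability anywhere the reference distribution has mass.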