[2602.13103] R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

arXiv - Machine Learning

Summary

The paper presents R-Diverse, a method that addresses the Diversity Illusion in self-play training of large language models (LLMs): training signals that appear diverse but collapse into recurring underlying patterns, causing early reasoning gains to degrade over iterations.

Why It Matters

As LLMs become increasingly integral to AI applications, ensuring their training methods yield consistent improvements is crucial. R-Diverse addresses a significant challenge in self-play training, potentially leading to more robust AI systems capable of sustained performance across various tasks.

Key Takeaways

  • R-Diverse identifies and mitigates Diversity Illusion in LLM training.
  • Introduces Memory-Augmented Penalty (MAP) to enhance training diversity.
  • Skill-Aware Measurement (SAM) evaluates reasoning skills rather than superficial question variation.
  • Demonstrated sustained performance improvements across multiple benchmarks.
  • Code availability encourages further research and application in the field.
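The Memory-Augmented Penalty mentioned above can be illustrated with a minimal sketch: keep a memory bank that persists across self-play iterations and score each new question by its maximum similarity to anything already stored, so recycled questions receive a high penalty. This is a hypothetical illustration, not the paper's implementation; the `embed`, `cosine`, and `MemoryBank` names are invented here, and the bag-of-words embedding is a toy stand-in for a learned encoder.

```python
# Hypothetical sketch of a Memory-Augmented Penalty (MAP): a persistent
# memory bank discourages the Challenger from recycling questions across
# iterations, unlike a penalty enforced only within a single batch.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryBank:
    """Persists across iterations; a within-batch penalty would be reset."""
    def __init__(self):
        self.entries = []

    def penalty(self, question: str) -> float:
        """Max similarity to any stored question: high => likely a recycle."""
        emb = embed(question)
        return max((cosine(emb, e) for e in self.entries), default=0.0)

    def add(self, question: str) -> None:
        self.entries.append(embed(question))

bank = MemoryBank()
bank.add("compute the sum of the first n odd numbers")
fresh = "prove that a graph with n nodes and n edges has a cycle"
recycled = "compute the sum of the first n odd numbers"
print(bank.penalty(fresh) < bank.penalty(recycled))  # True
```

A genuinely new question scores low against the bank, while a near-duplicate of an earlier iteration's question scores near 1 and can be down-weighted or filtered.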

Computer Science > Machine Learning

arXiv:2602.13103 (cs) [Submitted on 13 Feb 2026]

Title: R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training

Authors: Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang

Abstract: Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reas...
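The Skill-Aware Measurement idea from the abstract can also be sketched: score a batch of questions by the distinct reasoning skills they exercise rather than by how much their surface text varies. This is an assumed illustration, not the paper's method; the `SKILL_KEYWORDS` table and `tag_skills` keyword lookup are invented stand-ins for whatever skill classifier the authors use.

```python
# Hypothetical sketch of Skill-Aware Measurement (SAM): a batch that varies
# only on the surface (all algebra) should score lower than a batch that
# exercises several distinct reasoning skills.
SKILL_KEYWORDS = {
    "algebra": ["solve", "equation"],
    "geometry": ["triangle", "angle"],
    "counting": ["how many", "count"],
}

def tag_skills(question: str) -> set:
    """Toy keyword-based skill tagger (stand-in for a real classifier)."""
    q = question.lower()
    tags = {skill for skill, kws in SKILL_KEYWORDS.items()
            if any(kw in q for kw in kws)}
    return tags or {"unknown"}

def skill_diversity(questions: list) -> float:
    """Distinct skills exercised, normalized by batch size."""
    skills = set().union(*(tag_skills(q) for q in questions))
    return len(skills) / len(questions)

# Surface-diverse but skill-identical: every question is algebra.
surface_only = [
    "Solve the equation x + 3 = 7.",
    "Solve the equation 2y - 1 = 9.",
    "Solve the equation z / 4 = 2.",
]
# Genuinely diverse: three different skills.
skill_diverse = [
    "Solve the equation x + 3 = 7.",
    "Find the missing angle of the triangle.",
    "How many ways can 3 books be arranged?",
]
print(skill_diversity(surface_only) < skill_diversity(skill_diverse))  # True
```

The first batch would look diverse under a surface metric (all three question strings differ) yet covers one skill; the skill-aware score separates the two cases.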
