[2509.15194] Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

arXiv - Machine Learning · 4 min read

Summary

The paper presents EVOL-RL, a label-free framework for evolving language models that balances majority-driven selection (for stability) with novelty-driven variation (for exploration), improving both performance and generalization.

Why It Matters

As language models increasingly require self-improvement mechanisms, EVOL-RL addresses a core limitation of current label-dependent methods. The framework preserves the diversity and reasoning capabilities of self-trained models, which matters for real-world deployments where labeled data is scarce.

Key Takeaways

  • EVOL-RL retains the majority-voted answer as a stability anchor while promoting exploration through novelty-aware rewards (see the sketch after this list).
  • The framework effectively prevents diversity collapse in language models, improving both in-domain and out-of-domain performance.
  • Evaluations show clear gains over baseline self-improvement methods.
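
A minimal sketch of the selection half, assuming final answers have already been parsed out of each sampled completion (`majority_vote` is an illustrative helper, not code from the paper):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer among responses sampled
    concurrently for one prompt. EVOL-RL keeps this majority answer
    as a label-free anchor that drives selection."""
    counts = Counter(a.strip() for a in answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical example: five sampled completions for one prompt.
print(majority_vote(["42", "42", "17", "42", "17"]))  # -> "42"
```

The variation side, the novelty-aware reward, is sketched after the abstract below.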

Computer Science > Machine Learning
arXiv:2509.15194 (cs) [Submitted on 18 Sep 2025 (v1), last revised 18 Feb 2026 (this version, v3)]

Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu

Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-explor...
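
The excerpt ends before the exact reward is spelled out, but the selection-plus-variation split it describes can be sketched in a few lines of Python. Everything below is illustrative rather than the paper's implementation: Jaccard token overlap stands in for whatever reasoning-dissimilarity measure EVOL-RL actually uses, the additive combination with weight `alpha` is an assumption, and `novelty_score` / `evol_rl_reward` are hypothetical names.

```python
from collections import Counter

def novelty_score(idx: int, reasonings: list[str]) -> float:
    """Novelty of one reasoning trace relative to its siblings:
    1 minus the mean Jaccard token overlap with the other responses
    sampled for the same prompt (an illustrative proxy measure)."""
    tokens = [set(r.lower().split()) for r in reasonings]
    me = tokens[idx]
    sims = []
    for j, other in enumerate(tokens):
        if j == idx:
            continue
        union = me | other
        sims.append(len(me & other) / len(union) if union else 1.0)
    return 1.0 - sum(sims) / len(sims) if sims else 0.0

def evol_rl_reward(answer: str, anchor: str,
                   novelty: float, alpha: float = 0.5) -> float:
    """Selection + variation: reward agreement with the majority-voted
    anchor, plus an alpha-weighted bonus for novel reasoning."""
    selection = 1.0 if answer == anchor else 0.0
    return selection + alpha * novelty

# Hypothetical batch: three sampled responses to one prompt.
reasonings = ["add then divide", "add then divide", "use a geometric trick"]
answers = ["42", "42", "17"]
anchor = Counter(answers).most_common(1)[0][0]  # majority answer: "42"
rewards = [evol_rl_reward(a, anchor, novelty_score(i, reasonings))
           for i, a in enumerate(answers)]
```

In training, a shaped reward like this would feed a standard policy-gradient update over the sampled group; the sketch only illustrates how a stabilizing majority anchor and a variation-promoting novelty bonus can coexist in one label-free signal.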
