[2509.15194] Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Summary
The paper presents EVOL-RL, a label-free framework for evolving language models that balances majority-driven selection (stability) with novelty-driven variation (exploration) to improve performance and generalization without labeled data or external judges.
Why It Matters
As language models increasingly require self-improvement mechanisms, EVOL-RL offers a significant advancement by addressing the limitations of current label-dependent methods. This framework not only enhances the diversity and reasoning capabilities of models but also has implications for their deployment in real-world applications where labeled data may be scarce.
Key Takeaways
- EVOL-RL utilizes a majority-voted answer for stability while promoting exploration through novelty-aware rewards.
- The framework effectively prevents diversity collapse in language models, improving both in-domain and out-of-domain performance.
- Evaluation results show significant performance improvements over baseline models, indicating the effectiveness of the proposed method.
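Per the takeaways above, EVOL-RL anchors training on the majority-voted answer among concurrently sampled solutions. A minimal sketch of that selection step, assuming final answers have already been extracted from each sampled response (the helper name and tie-breaking are assumptions, not the paper's implementation):

```python
from collections import Counter

def majority_anchor(answers):
    """Return the majority-voted answer among sampled final answers.

    Hypothetical helper: EVOL-RL keeps the majority answer as a
    stability anchor; answer extraction and tie-breaking here are
    assumptions for illustration.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. five sampled answers to one prompt
samples = ["42", "42", "17", "42", "17"]
print(majority_anchor(samples))  # prints "42"
```

In EVOL-RL this anchor supplies the stability signal; the variation side comes from the separate novelty-aware reward described in the abstract.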
arXiv Details
Computer Science > Machine Learning — arXiv:2509.15194 (cs)
Submitted on 18 Sep 2025 (v1); last revised 18 Feb 2026 (v3)
Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-explor...
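The abstract describes a novelty-aware reward that scores each sampled solution by how different its reasoning is from the other concurrently generated responses. A sketch of one way to compute such a score, assuming a simple bag-of-words cosine similarity over reasoning traces (the paper's actual representation and scoring function may differ; `novelty_rewards` is a hypothetical name):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_rewards(traces):
    """Score each reasoning trace by dissimilarity from its peers.

    Sketch under assumptions: EVOL-RL scores novelty against the
    other concurrently generated responses; the bag-of-words cosine
    here is a stand-in for whatever representation the authors use.
    A trace identical to the rest scores low; an unusual one scores
    high, promoting variation.
    """
    vecs = [Counter(t.split()) for t in traces]
    rewards = []
    for i, v in enumerate(vecs):
        sims = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        rewards.append(1.0 - mean_sim)  # higher = more novel
    return rewards
```

Combined with the majority-vote anchor, this kind of score would reward solutions that agree on the answer while taking distinct reasoning paths, which is the stated mechanism for preventing diversity collapse.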