[2509.15194] Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Summary
The paper presents EVOL-RL, a label-free framework for evolving language models that balances majority-driven selection (stability) with novelty-driven variation (exploration) to improve performance and generalization without labeled data or external judges.
Why It Matters
As language models increasingly require self-improvement mechanisms, EVOL-RL offers a significant advancement by addressing the limitations of current label-dependent methods. This framework not only enhances the diversity and reasoning capabilities of models but also has implications for their deployment in real-world applications where labeled data may be scarce.
Key Takeaways
- EVOL-RL utilizes a majority-voted answer for stability while promoting exploration through novelty-aware rewards.
- The framework effectively prevents diversity collapse in language models, improving both in-domain and out-of-domain performance.
- Evaluation results show significant performance improvements over baseline models, indicating the effectiveness of the proposed method.
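Per the takeaways above, EVOL-RL anchors training on the majority-voted answer among concurrently sampled solutions. A minimal sketch of that selection step, assuming final answers have already been extracted from each sampled response (the helper name and tie-breaking are assumptions, not the paper's implementation):

```python
from collections import Counter

def majority_anchor(answers):
    """Return the majority-voted answer among sampled final answers.

    Hypothetical helper: EVOL-RL keeps the majority answer as a
    stability anchor; answer extraction and tie-breaking here are
    assumptions for illustration.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. five sampled answers to one prompt
samples = ["42", "42", "17", "42", "17"]
print(majority_anchor(samples))  # prints "42"
```

In EVOL-RL this anchor supplies the stability signal; the variation side comes from the separate novelty-aware reward described in the abstract.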
arXiv Details
Computer Science > Machine Learning — arXiv:2509.15194 (cs)
Submitted on 18 Sep 2025 (v1); last revised 18 Feb 2026 (v3)
Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-explor...
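The abstract describes a novelty-aware reward that scores each sampled solution by how different its reasoning is from the other concurrently generated responses. A sketch of one way to compute such a score, assuming a simple bag-of-words cosine similarity over reasoning traces (the paper's actual representation and scoring function may differ; `novelty_rewards` is a hypothetical name):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_rewards(traces):
    """Score each reasoning trace by dissimilarity from its peers.

    Sketch under assumptions: EVOL-RL scores novelty against the
    other concurrently generated responses; the bag-of-words cosine
    here is a stand-in for whatever representation the authors use.
    A trace identical to the rest scores low; an unusual one scores
    high, promoting variation.
    """
    vecs = [Counter(t.split()) for t in traces]
    rewards = []
    for i, v in enumerate(vecs):
        sims = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        rewards.append(1.0 - mean_sim)  # higher = more novel
    return rewards
```

Combined with the majority-vote anchor, this kind of score would reward solutions that agree on the answer while taking distinct reasoning paths, which is the stated mechanism for preventing diversity collapse.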