[2602.14697] Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs
Summary
The paper proposes Evolutionary System Prompt Learning (E-SPL) to enhance reinforcement learning in large language models (LLMs) by evolving a population of system prompts alongside model weights, improving performance on reasoning and agentic tasks.
Why It Matters
This research is significant as it addresses the challenge of enhancing LLMs' self-improvement capabilities. By integrating evolutionary strategies with reinforcement learning, it offers a novel approach to boost sample efficiency and generalization, which are critical for advancing AI applications.
Key Takeaways
- E-SPL combines reinforcement learning with evolutionary strategies for LLMs.
- The method jointly improves model contexts (system prompts) and model weights.
- E-SPL enhances performance in reasoning tasks and generalization.
- The approach shows significant gains in sample efficiency.
- E-SPL outperforms traditional reflective prompt evolution methods.
Paper Details
arXiv:2602.14697 (cs) [Submitted on 16 Feb 2026]
Title: Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs
Authors: Lunjun Zhang, Ryan Chen, Bradly C. Stadie
Abstract: Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME $\rightarrow$ BeyondAIME) generalization se...
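To make the abstract's loop concrete, here is a minimal sketch of one E-SPL iteration. It is illustrative only, not the authors' implementation: the paper's TrueSkill ratings are approximated by a simple Elo-style pairwise update, the LLM-driven mutation and crossover operators are replaced with trivial string-manipulation stand-ins, and the RL weight update is noted as a comment rather than implemented. All function names and the constant `K` are hypothetical.

```python
import random

K = 32  # Elo step size (assumption; the paper uses TrueSkill, not Elo)

def elo_update(ratings, scores):
    """Update each prompt's rating from pairwise comparisons of batch scores."""
    new = dict(ratings)
    names = list(ratings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
            outcome_a = 1.0 if scores[a] > scores[b] else (0.0 if scores[a] < scores[b] else 0.5)
            delta = K * (outcome_a - expected_a)
            new[a] += delta
            new[b] -= delta
    return new

def mutate(prompt):
    """Stand-in for LLM-driven mutation of a system prompt."""
    return prompt + " Think step by step."

def crossover(p1, p2):
    """Stand-in for LLM-driven crossover of two system prompts."""
    return p1 + " " + p2

def espl_iteration(population, ratings, rollout_score, rng):
    """One E-SPL step: score each prompt, update ratings, evolve the population."""
    # Run rollouts with each system prompt (in the paper, in parallel).
    scores = {p: rollout_score(p) for p in population}
    # Update ratings from relative performance within this batch.
    ratings = elo_update(ratings, scores)
    # (RL updates to model weights, conditioned on each prompt, would happen here.)
    # Evolutionary selection: keep the top-rated half, then mutate and cross over.
    survivors = sorted(population, key=lambda p: ratings[p], reverse=True)
    top = survivors[: max(2, len(survivors) // 2)]
    children = [mutate(rng.choice(top)), crossover(top[0], top[1])]
    new_population = top + children
    ratings.update({c: 1000.0 for c in children if c not in ratings})
    return new_population, ratings
```

Under this sketch, a prompt that consistently outscores its batch peers climbs in rating and survives selection, while its offspring enter the pool at a neutral rating, which is the same "evaluate, rate, select, vary" shape the abstract describes.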