[2602.21158] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Summary
The paper presents SELAUR, a reinforcement learning framework that improves LLM agents by integrating the model's intrinsic uncertainty into reward design, yielding more efficient exploration and more stable learning.
Why It Matters
As LLMs are increasingly used in decision-making, understanding and incorporating uncertainty can lead to more effective learning strategies. This research addresses a critical gap in reward design, potentially enhancing the performance of LLMs in complex tasks.
Key Takeaways
- SELAUR incorporates uncertainty into reward design for LLMs.
- The framework improves exploration efficiency and learning stability.
- Experiments show significant success rate improvements over strong baselines.
- Ablation studies confirm the benefits of uncertainty signals.
- The approach is applicable to multi-step decision-making tasks.
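The three uncertainty metrics named in the paper (entropy, least confidence, and margin) are standard quantities computed from a token's next-token distribution. As a rough sketch, they could be combined as follows; the equal-weight average and the normalization choices here are illustrative assumptions, not the paper's exact combination rule.

```python
import numpy as np

def token_uncertainty(logits):
    """Combine entropy-, least-confidence-, and margin-based
    uncertainty for a single token's logits (all mapped to [0, 1]).

    The simple average at the end is an assumption for illustration;
    SELAUR's actual aggregation may weight the signals differently."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Entropy, normalized by its maximum value log|V|.
    entropy = -np.sum(probs * np.log(probs + 1e-12)) / np.log(len(probs))
    # Least confidence: 1 minus the probability of the top token.
    least_conf = 1.0 - probs.max()
    # Margin: a small gap between the top-2 probabilities means
    # high uncertainty, so invert the gap.
    top2 = np.sort(probs)[-2:]
    margin = 1.0 - (top2[1] - top2[0])
    return (entropy + least_conf + margin) / 3.0
```

A uniform distribution scores near 1 (maximal uncertainty), while a sharply peaked one scores near 0, which is the behavior a dense confidence-aligned reward signal would rely on.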
Computer Science > Machine Learning
arXiv:2602.21158 (cs) [Submitted on 24 Feb 2026]
Title: SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
Abstract: Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently im...
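The abstract's failure-aware reward reshaping can be pictured as scaling a step's reward by the model's uncertainty at that step. The sketch below is a hypothetical, simplified form: the multiplicative/additive shaping and the `beta` coefficient are assumptions for illustration, not the paper's actual equations.

```python
def reshape_step_reward(step_reward, uncertainty, trajectory_failed, beta=0.1):
    """Illustrative failure-aware reward reshaping.

    Assumption: on failed trajectories, penalties at high-uncertainty
    steps are dampened, so the agent is punished less for exploratory
    mistakes than for confident ones; on successes, confident steps
    receive a small bonus. `beta` is a hypothetical shaping weight.
    `uncertainty` is a token- or step-level score in [0, 1]."""
    if trajectory_failed:
        # Dampen the (typically negative) reward where the model was unsure.
        return step_reward * (1.0 - beta * uncertainty)
    # Reward confident behavior on successful trajectories.
    return step_reward + beta * (1.0 - uncertainty)
```

Under this shaping, a confident wrong step (low uncertainty on a failed trajectory) keeps nearly its full penalty, while an uncertain wrong step is penalized less, which matches the stated goal of extracting learning signal from failed trajectories.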