[2602.22296] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Summary
The paper presents UpSkill, a method that increases response diversity in large language models (LLMs) via Mutual Information Skill Learning, improving multi-attempt performance (pass@k) without sacrificing single-attempt accuracy (pass@1).
Why It Matters
Standard reinforcement learning pipelines that optimize single-attempt accuracy can collapse an LLM onto one strategy, so repeated sampling yields near-identical answers. UpSkill targets this trade-off directly, optimizing response diversity while maintaining correctness, which matters for multi-attempt workflows such as best-of-k sampling and can lead to more robust systems.
Key Takeaways
- UpSkill utilizes Mutual Information Skill Learning to enhance response diversity in LLMs.
- The method improves multi-attempt metrics, yielding mean gains of roughly 3% in pass@k on the stronger base models without degrading pass@1.
- Empirical and theoretical evidence supports the link between mutual information objectives and performance improvements.
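The pass@k metric referenced in the takeaways above is standard in code and math evaluation: the probability that at least one of k sampled attempts is correct. A common way to compute it (this is the widely used unbiased estimator, not something specified by this paper) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total attempts sampled per problem
    c: number of correct attempts among them
    k: attempts the user is allowed

    Returns the probability that at least one of k attempts drawn
    without replacement from the n samples is correct:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 attempts of which 1 is correct, pass@1 is 0.5, and with any correct attempt present, pass@n is 1.0.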
Computer Science > Machine Learning
arXiv:2602.22296 [cs], submitted 25 Feb 2026
Title: UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to the skill variable z. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass...
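The abstract does not spell out the reward's exact form, but MISL methods typically maximize a variational lower bound on the mutual information I(trajectory; z) between a sampled skill variable z and the generated trajectory, using a learned discriminator q(z | prefix). A minimal sketch of that generic per-token bonus (an illustration of the MISL family, not the paper's own implementation; the function name and inputs are hypothetical) is:

```python
import math

def mi_token_reward(q_z_given_prefix: list[float], z: int,
                    p_z: list[float]) -> float:
    """DIAYN-style per-token diversity bonus (hypothetical sketch):

        r_t = log q(z | tokens up to t) - log p(z)

    q_z_given_prefix: discriminator's posterior over skills given the
                      token prefix (assumed provided by a learned model)
    z:                index of the skill the trajectory was conditioned on
    p_z:              prior over skills (often uniform)

    Positive when the prefix already reveals which skill produced it,
    i.e. when trajectories are specific to z.
    """
    return math.log(q_z_given_prefix[z]) - math.log(p_z[z])
```

With a uniform prior over four skills, a prefix the discriminator confidently attributes to the right skill (say q(z|prefix) = 0.7) earns a positive bonus, while an uninformative prefix (q = 0.25) earns zero, so the policy is pushed toward trajectories that are distinguishable across skills.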