[2602.22296] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

arXiv - AI 3 min read Article

Summary

The paper presents UpSkill, a method that enhances response diversity in large language models (LLMs) through Mutual Information Skill Learning, improving multi-attempt performance without sacrificing accuracy.

Why It Matters

As LLMs are increasingly used in various applications, ensuring diverse and accurate responses is crucial. UpSkill addresses the challenge of optimizing response diversity while maintaining correctness, which can lead to more robust AI systems and better user experiences.

Key Takeaways

  • UpSkill utilizes Mutual Information Skill Learning to enhance response diversity in LLMs.
  • The method improves multi-attempt metrics, showing gains in pass@k accuracy without degrading pass@1.
  • Empirical and theoretical evidence supports the link between mutual information objectives and performance improvements.
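The pass@k metric referenced in the takeaways is commonly computed with the standard unbiased estimator (popularized by the Codex evaluation work, not specific to this paper): given n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts drawn from n samples (c of them correct) is correct.

    Computed as 1 - C(n-c, k) / C(n, k), expanded into a numerically
    stable product to avoid large binomial coefficients.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some correct attempt is guaranteed.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with 2 samples of which 1 is correct, a single attempt succeeds half the time (`pass_at_k(2, 1, 1)` is 0.5), while two attempts always succeed (`pass_at_k(2, 1, 2)` is 1.0).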

Computer Science > Machine Learning · arXiv:2602.22296 (cs) · Submitted on 25 Feb 2026

Title: UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass...
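The abstract describes a token-level MI reward inside GRPO that ties each trajectory to a conditioning skill z. A common way to instantiate such a reward in MISL-style methods is r_t = log q(z | prefix_t) - log p(z), where q is a learned skill discriminator and p(z) a uniform prior. This is an illustrative sketch under those assumptions, not the paper's exact parameterization:

```python
import numpy as np

def mi_skill_reward(disc_logits: np.ndarray, z: int, num_skills: int) -> np.ndarray:
    """Token-level MI reward sketch: r_t = log q(z | prefix_t) - log p(z).

    disc_logits: (T, num_skills) logits of a (hypothetical) learned skill
        discriminator evaluated on each trajectory prefix.
    z: index of the skill the rollout was conditioned on.
    Assumes a uniform skill prior p(z) = 1 / num_skills, so
    -log p(z) = log(num_skills).
    """
    # Numerically stable log-softmax over the skill dimension.
    shifted = disc_logits - disc_logits.max(axis=-1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_q[:, z] + np.log(num_skills)  # (T,) per-token reward
```

The reward is zero when the discriminator is at chance (uninformative trajectory) and approaches log(num_skills) when the prefix uniquely identifies z, which is what pushes rollouts conditioned on different skills apart.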

Related Articles

You can now use ChatGPT with Apple’s CarPlay | The Verge

ChatGPT is now accessible from your CarPlay dashboard if you have iOS 26.4 or newer and the latest version of the ChatGPT app.

The Verge - AI · 3 min ·
LLMs

Have Companies Begun Adopting Claude Co-Work at an Enterprise Level?

Hi Guys, My company is considering purchasing the Claude Enterprise plan. The main two constraints are: - Being able to block usage of Cl...

Reddit - Artificial Intelligence · 1 min ·
LLMs

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully ope...

Reddit - Artificial Intelligence · 1 min ·
LLMs

[D] The problem with comparing AI memory system benchmarks — different evaluation methods make scores meaningless

I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison...

Reddit - Machine Learning · 1 min ·