[2602.22296] UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

arXiv - AI 3 min read Article

Summary

The paper presents UpSkill, a method that enhances response diversity in large language models (LLMs) through Mutual Information Skill Learning, improving multi-attempt performance without sacrificing accuracy.

Why It Matters

As LLMs are increasingly used in various applications, ensuring diverse and accurate responses is crucial. UpSkill addresses the challenge of optimizing response diversity while maintaining correctness, which can lead to more robust AI systems and better user experiences.

Key Takeaways

  • UpSkill utilizes Mutual Information Skill Learning to enhance response diversity in LLMs.
  • The method improves multi-attempt metrics, showing gains in pass@k accuracy without degrading pass@1.
  • Empirical and theoretical evidence supports the link between mutual information objectives and performance improvements.
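The pass@k metric referenced in the takeaways is commonly computed with the standard unbiased estimator (popularized by the Codex evaluation work, not specific to this paper): given n sampled attempts of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts drawn from n samples (c of them correct) is correct.

    Computed as 1 - C(n-c, k) / C(n, k), expanded into a numerically
    stable product to avoid large binomial coefficients.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some correct attempt is guaranteed.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with 2 samples of which 1 is correct, a single attempt succeeds half the time (`pass_at_k(2, 1, 1)` is 0.5), while two attempts always succeed (`pass_at_k(2, 1, 2)` is 1.0).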

Computer Science > Machine Learning · arXiv:2602.22296 (cs) · Submitted on 25 Feb 2026

Title: UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs
Authors: Devan Shah, Owen Yang, Daniel Yang, Chongyi Zheng, Benjamin Eysenbach

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs to optimize pass@k correctness. We propose a novel reward, implemented within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to z. Experiments on GSM8K with three open-weight models (Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B) show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass...
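The abstract describes a token-level MI reward inside GRPO that ties each trajectory to a conditioning skill z. A common way to instantiate such a reward in MISL-style methods is r_t = log q(z | prefix_t) - log p(z), where q is a learned skill discriminator and p(z) a uniform prior. This is an illustrative sketch under those assumptions, not the paper's exact parameterization:

```python
import numpy as np

def mi_skill_reward(disc_logits: np.ndarray, z: int, num_skills: int) -> np.ndarray:
    """Token-level MI reward sketch: r_t = log q(z | prefix_t) - log p(z).

    disc_logits: (T, num_skills) logits of a (hypothetical) learned skill
        discriminator evaluated on each trajectory prefix.
    z: index of the skill the rollout was conditioned on.
    Assumes a uniform skill prior p(z) = 1 / num_skills, so
    -log p(z) = log(num_skills).
    """
    # Numerically stable log-softmax over the skill dimension.
    shifted = disc_logits - disc_logits.max(axis=-1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_q[:, z] + np.log(num_skills)  # (T,) per-token reward
```

The reward is zero when the discriminator is at chance (uninformative trajectory) and approaches log(num_skills) when the prefix uniquely identifies z, which is what pushes rollouts conditioned on different skills apart.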

Related Articles

You can now use ChatGPT with Apple’s CarPlay | The Verge

ChatGPT is now accessible from your CarPlay dashboard if you have iOS 26.4 or newer and the latest version of the ChatGPT app.

The Verge - AI · 3 min ·
LLMs

Have Companies Begun Adopting Claude Co-Work at an Enterprise Level?

Hi Guys, My company is considering purchasing the Claude Enterprise plan. The main two constraints are: - Being able to block usage of Cl...

Reddit - Artificial Intelligence · 1 min ·
LLMs

What I learned about multi-agent coordination running 9 specialized Claude agents

I've been experimenting with multi-agent AI systems and ended up building something more ambitious than I originally planned: a fully ope...

Reddit - Artificial Intelligence · 1 min ·
LLMs

[D] The problem with comparing AI memory system benchmarks — different evaluation methods make scores meaningless

I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison...

Reddit - Machine Learning · 1 min ·