[2602.21492] GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Summary
The paper presents GradAlign, a method for selecting training problems in reinforcement learning for large language models: it prioritizes problems whose policy gradients align with gradients computed on a small, trusted validation set.
Why It Matters
As reinforcement learning becomes increasingly central to post-training large language models, the quality of the training problems matters: rollouts come from an evolving policy, so simple heuristic filters can admit incorrect or low-utility data. By aligning per-problem gradients with validation gradients, GradAlign could lead to more efficient training and better model performance under RL's non-stationary dynamics.
Key Takeaways
- GradAlign improves data selection for LLMs in reinforcement learning.
- The method uses gradient alignment to enhance training stability.
- Evaluated across three challenging data regimes (unreliable reward signals, distribution imbalance, and a low-utility training corpus), it consistently outperforms existing baselines.
- Addresses the issue of non-stationarity in reinforcement learning.
- The authors release their implementation for further research and application.
Computer Science > Machine Learning
arXiv:2602.21492 (cs)
[Submitted on 25 Feb 2026]

Title: GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Authors: Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang

Abstract: Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient...
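The selection rule the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-problem policy gradients and the validation gradient are already available as flat vectors, and uses cosine similarity as the alignment score; the function names and interface are hypothetical.

```python
import numpy as np

def cosine_sim(u, v, eps=1e-8):
    """Cosine similarity between two flat gradient vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def select_by_gradient_alignment(problem_grads, val_grad, k):
    """Score each candidate training problem by how well its estimated
    policy gradient aligns with the gradient on a trusted validation set,
    then return the indices of the top-k problems and all scores.
    (Hypothetical sketch; the paper's actual method may differ.)"""
    scores = [cosine_sim(g, val_grad) for g in problem_grads]
    order = np.argsort(scores)[::-1]  # highest alignment first
    return [int(i) for i in order[:k]], scores
```

Because rollouts come from an evolving policy, re-scoring candidates periodically as training proceeds would yield the adaptive curriculum the abstract mentions, rather than a one-shot filter.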