[2602.13773] On Representation Redundancy in Large-Scale Instruction Tuning Data Selection
Summary
This paper examines representation redundancy in the semantic embeddings used for large-scale instruction-tuning data selection and proposes Compressed Representation Data Selection (CRDS), a framework for improving the quality of selected training data.
Why It Matters
As large language models (LLMs) become increasingly prevalent, ensuring high-quality training data is essential for their performance. This research addresses a critical gap in data selection methods, potentially leading to more efficient and effective model training.
Key Takeaways
- Models trained on smaller, high-quality datasets can outperform those trained on larger but noisy corpora.
- The proposed Compressed Representation Data Selection (CRDS) framework reduces redundancy in semantic embeddings.
- The CRDS-W variant achieves significant performance improvements using only 3.5% of the data.
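To make the redundancy-reduction idea concrete, here is a minimal sketch of a generic embedding-based subset selector (greedy farthest-point in cosine distance). The budget, scoring rule, and greedy strategy are illustrative assumptions for demonstration, not the paper's actual selection algorithm:

```python
import numpy as np

def select_diverse(embeddings, budget, seed=0):
    """Greedy farthest-point selection: repeatedly pick the example least
    similar (in cosine terms) to everything already selected. A generic
    redundancy-reducing baseline, not the paper's exact method."""
    # Normalize rows so dot products equal cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(x)))]
    # Track each point's maximum similarity to the chosen set.
    max_sim = x @ x[chosen[0]]
    for _ in range(budget - 1):
        nxt = int(np.argmin(max_sim))  # farthest from the selected set
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, x @ x[nxt])
    return chosen
```

With a budget of 3.5% of the pool, such a selector keeps examples spread out in embedding space instead of clustering on near-duplicate instructions.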
arXiv:2602.13773 (cs.LG) — Computer Science > Machine Learning
Submitted on 14 Feb 2026
Authors: Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang
Abstract: Data quality is a crucial factor in large language model training. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-b...
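The two variants described in the abstract can be sketched in outline. Below is a minimal NumPy illustration of (a) a Rademacher random projection applied per hidden layer and concatenated, in the spirit of CRDS-R, and (b) PCA whitening as a stand-in for the whitening-based reduction in CRDS-W. The dimensions, layer choices, and whitening flavor are assumptions; the paper's exact design is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_projection(hidden_states, k):
    """Project each layer's (n_samples, d) hidden representation down to
    k dimensions with a Rademacher (+1/-1) random matrix, then concatenate
    across layers. A sketch of the CRDS-R idea."""
    projected = []
    for h in hidden_states:
        d = h.shape[1]
        # Entries are +1 or -1 with equal probability, scaled by 1/sqrt(k)
        # so expected squared norms are approximately preserved.
        R = rng.choice([-1.0, 1.0], size=(d, k)) / np.sqrt(k)
        projected.append(h @ R)
    return np.concatenate(projected, axis=1)  # (n_samples, k * n_layers)

def whiten(embeddings, eps=1e-5):
    """PCA whitening: decorrelate embedding dimensions and equalize their
    variances, a common choice for the whitening step in CRDS-W-style
    pipelines."""
    x = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = x.T @ x / (len(x) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate onto principal axes, then rescale each axis to unit variance.
    return x @ eigvecs / np.sqrt(eigvals + eps)
```

Both operations reduce or rebalance the embedding space before similarity-based selection, which is how the framework counteracts the redundancy of raw encoder representations.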