[2603.04981] Rethinking Representativeness and Diversity in Dynamic Data Selection
Computer Science > Artificial Intelligence
arXiv:2603.04981 (cs)
[Submitted on 5 Mar 2026]
Title: Rethinking Representativeness and Diversity in Dynamic Data Selection
Authors: Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia

Abstract: Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics. Second, we realize process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias. Thi...
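
The first two components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring formula or the form of the Usage-Frequency Penalty, so everything below is an assumption made for illustration. In particular, the function names, the activation threshold, the "sum of frequencies of active factors" score, and the `1 / (1 + penalty * usage)` discount are all hypothetical, and rare-factor sampling is not covered. The input `activations` stands in for sparse autoencoder unit activations, one row per sample and one column per factor.

```python
import numpy as np


def factor_frequencies(activations, threshold=0.0):
    """Fraction of samples in which each sparse unit (factor) is active.

    Assumption: a unit counts as active when its activation exceeds `threshold`.
    """
    return (activations > threshold).mean(axis=0)


def representativeness_scores(activations, threshold=0.0):
    """Hypothetical score: sum of dataset-level frequencies of a sample's active factors.

    Samples that activate frequent factors receive higher scores.
    """
    active = (activations > threshold).astype(float)
    freq = active.mean(axis=0)
    return active @ freq


def penalized_scores(scores, usage_counts, penalty=1.0):
    """Hypothetical usage discount: samples selected more often are scored lower."""
    return scores / (1.0 + penalty * np.asarray(usage_counts, dtype=float))


def select_top_k(scores, k):
    """Indices of the k highest scores; ties are broken by original order."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:k]


def run_selection(activations, k, rounds, penalty=1.0):
    """Repeat selection for several rounds, updating usage counts each time."""
    base = representativeness_scores(activations)
    usage = np.zeros(activations.shape[0])
    history = []
    for _ in range(rounds):
        chosen = select_top_k(penalized_scores(base, usage, penalty), k)
        usage[chosen] += 1  # record that these samples were used this round
        history.append(chosen.tolist())
    return history
```

With a toy matrix of four samples and three factors, calling `run_selection(activations, k=2, rounds=2)` shows the intended rotation effect: samples chosen in the first round are discounted in the second, so a previously unselected sample can enter the subset.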