[2602.10388] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Summary
The paper introduces Feature Activation Coverage (FAC), a metric that measures data diversity in an interpretable feature space of large language models (LLMs), and builds on it with FAC Synthesis, a framework that generates diverse post-training data to improve downstream performance across a range of tasks.
Why It Matters
As LLMs become increasingly integral to AI applications, knowing how to synthesize diverse training data effectively is crucial for improving their performance. Existing text-based diversity metrics capture linguistic variation but correlate only weakly with the task-relevant features that drive downstream performance; this work addresses that gap with a feature-space approach that could lead to more robust and capable models.
Key Takeaways
- Introduces Feature Activation Coverage (FAC) for measuring data diversity.
- Presents a framework for synthesizing data that improves model performance.
- Demonstrates effectiveness across tasks including instruction following, toxicity detection, reward modeling, and behavior steering.
- Identifies a shared feature space across different model families for knowledge transfer.
- Provides a practical methodology for data-centric optimization of LLMs.
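The summary does not spell out how FAC is computed. As a rough sketch, under the assumption that it counts the fraction of sparse-autoencoder (SAE) features activated by at least one sample in a dataset, it might look like this (the function name, threshold parameter, and exact definition are illustrative, not the paper's):

```python
import numpy as np

def feature_activation_coverage(activations: np.ndarray, threshold: float = 0.0) -> float:
    """Hypothetical FAC: fraction of SAE features activated by at least one sample.

    activations: (num_samples, num_features) SAE feature activations for a dataset.
    threshold: minimum activation for a feature to count as "covered".
    """
    covered = (activations > threshold).any(axis=0)  # per feature: fired anywhere?
    return float(covered.mean())

# Toy example: 3 samples, 4 SAE features; features 0, 1, and 3 fire somewhere.
acts = np.array([
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.2],
    [0.1, 0.0, 0.0, 0.0],
])
print(feature_activation_coverage(acts))  # 3 of 4 features covered -> 0.75
```

Under this reading, a dataset is "diverse" to the extent that its samples collectively light up more of the SAE's feature dictionary.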
Computer Science > Computation and Language
arXiv:2602.10388 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 12 Feb 2026 (this version, v2)]
Title: Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Authors: Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu
Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable fe...
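The synthesis loop the abstract describes, identifying features the seed dataset never activates and then generating samples that target them, can be sketched as follows. The helper names, threshold, and feature labels are hypothetical; in the actual framework the generation step would prompt an LLM rather than emit strings:

```python
import numpy as np

def find_missing_features(seed_acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Indices of SAE features never activated above threshold by the seed dataset."""
    covered = (seed_acts > threshold).any(axis=0)
    return np.flatnonzero(~covered)

def synthesize_for_features(missing: np.ndarray, feature_labels: dict) -> list:
    """Hypothetical generation step: build one targeting prompt per missing feature.

    In FAC Synthesis this is where an LLM would generate samples that explicitly
    reflect each missing feature; here we only construct the prompts.
    """
    return [f"Write an example that exhibits: {feature_labels[i]}" for i in missing]

# Toy seed set: 2 samples over 3 SAE features; feature 1 never fires.
seed = np.array([[0.7, 0.0, 0.0],
                 [0.0, 0.0, 0.4]])
labels = {0: "polite refusals", 1: "numeric reasoning", 2: "code explanations"}
missing = find_missing_features(seed)
prompts = synthesize_for_features(missing, labels)
print(prompts)  # ['Write an example that exhibits: numeric reasoning']
```

Adding the generated samples back to the seed set and re-measuring FAC would close the loop: coverage rises as previously missing features become activated.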