[2602.10388] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

arXiv - AI · 4 min read

Summary

The paper introduces a novel metric, Feature Activation Coverage (FAC), to measure data diversity in large language models (LLMs) and presents a framework for synthesizing diverse data to enhance model performance across various tasks.

Why It Matters

As LLMs become increasingly integral to AI applications, understanding how to effectively synthesize diverse training data is crucial for improving their performance. This research offers a new approach that could lead to more robust and capable models, addressing a significant gap in current methodologies.

Key Takeaways

  • Introduces Feature Activation Coverage (FAC) for measuring data diversity.
  • Presents a framework for synthesizing data that improves model performance.
  • Demonstrates effectiveness across various tasks like instruction following and toxicity detection.
  • Identifies a shared feature space across different model families for knowledge transfer.
  • Provides a practical methodology for data-centric optimization of LLMs.
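As a concrete illustration of the metric named above, here is a minimal sketch of how a coverage-style diversity score over sparse-autoencoder (SAE) features could be computed. The exact definition of FAC is not given in this summary, so the "fraction of features that fire on at least one sample" formulation, the threshold, and the function name are assumptions for illustration only.

```python
import numpy as np

def feature_activation_coverage(activations: np.ndarray, threshold: float = 0.0) -> float:
    """Hypothetical sketch of Feature Activation Coverage (FAC).

    `activations` is an (n_samples, n_features) matrix of SAE feature
    activations for a dataset. Here FAC is taken to be the fraction of
    features that exceed `threshold` on at least one sample -- an
    assumption, since the paper's exact definition is not quoted above.
    """
    fired = (activations > threshold).any(axis=0)  # per-feature: fired anywhere?
    return float(fired.mean())

# Toy example: 4 samples, 5 SAE features; features 0, 2, and 3 fire somewhere.
acts = np.array([
    [0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0],
])
print(feature_activation_coverage(acts))  # → 0.6
```

Under this reading, adding samples that activate the two silent features would push the score toward 1.0, which is the intuition behind diversity-driven synthesis.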

arXiv:2602.10388 (cs) · Computer Science > Computation and Language
Submitted on 11 Feb 2026 (v1), last revised 12 Feb 2026 (this version, v2)

Title: Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Authors: Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu

Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across different model families, enabling knowledge transfer.
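The two-step pipeline the abstract describes (identify features the seed data misses, then generate samples that express them) can be sketched as follows. The helper names, the feature-label lookup, and the `generate_fn` stand-in for an LLM call are all hypothetical; the paper's actual prompting and SAE details are not given in this summary.

```python
import numpy as np

def missing_features(activations: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Indices of SAE features that never fire on the seed dataset."""
    fired = (activations > threshold).any(axis=0)
    return np.flatnonzero(~fired)

def fac_synthesis_round(seed_acts, feature_labels, generate_fn):
    """One hypothetical round of FAC-style synthesis: find uncovered
    features, then ask a generator for a sample expressing each one.
    `generate_fn` stands in for an LLM call and is an assumption here."""
    targets = missing_features(seed_acts)
    return [generate_fn(feature_labels[i]) for i in targets]

# Toy run with a stub generator: feature 1 ("legal jargon") never fires.
acts = np.array([[1.0, 0.0, 0.0],
                 [0.7, 0.0, 0.4]])
labels = ["polite refusals", "legal jargon", "code comments"]
new_samples = fac_synthesis_round(acts, labels, lambda lbl: f"Write text featuring {lbl}.")
print(new_samples)  # → ['Write text featuring legal jargon.']
```

In a real loop, the generated samples would be re-encoded with the SAE and the round repeated until coverage stops improving; that stopping criterion is likewise an assumption.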

Related Articles

  • Attention Is All You Need, But All You Can't Afford | Hybrid Attention · Reddit - Artificial Intelligence · 1 min
    Repo: https://codeberg.org/JohannaJuntos/Sisyphus I've been building a small Rust-focused language model from scratch in PyTorch. Not a f...
  • The “Agony” of ChatGPT: Would You Let AI Write Your Wedding Speech? · AI Tools & Products · 12 min
  • Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute · AI Tools & Products · 3 min
  • How I use Claude for strategy, Gemini for research and ChatGPT for 'the grind' · AI Tools & Products · 9 min
