[2502.14560] Less is More: Improving LLM Alignment via Preference Data Selection
Summary
This article discusses a novel approach to improving large language model (LLM) alignment through effective preference data selection, enhancing data efficiency and model performance.
Why It Matters
As LLMs become increasingly integral to AI applications, optimizing their alignment with human preferences is crucial. This research highlights the importance of data selection in enhancing model performance, offering a potential pathway to more efficient and effective AI systems.
Key Takeaways
- Direct Preference Optimization (DPO) can be enhanced through improved data selection.
- A novel margin-maximization principle addresses parameter shrinkage caused by noisy data.
- Using only 10% of the Ultrafeedback dataset yields 3% to 8% improvements on the AlpacaEval2 benchmark across various Llama, Mistral, and Qwen models.
- The proposed Bayesian Aggregation approach unifies multiple margin sources (external and implicit) into a single, more reliable preference probability.
- The findings suggest high redundancy in data construction methods, indicating potential for more efficient training.
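The margin-maximization idea in the takeaways can be illustrated with a minimal sketch: score each preference pair by the gap between the chosen and rejected responses' reward scores, then keep only the highest-margin fraction. The function name, data layout, and scoring rule below are illustrative assumptions, not the paper's exact procedure.

```python
def select_by_margin(dataset, keep_frac=0.1):
    """Keep the fraction of preference pairs with the largest reward margin.

    Each item is (prompt, chosen, rejected, r_chosen, r_rejected); the
    margin r_chosen - r_rejected is the selection score. Small or negative
    margins suggest noisy or mislabeled pairs, which the margin-maximization
    principle filters out. (Illustrative sketch only; the paper's actual
    curation rule may differ.)
    """
    scored = sorted(dataset, key=lambda x: x[3] - x[4], reverse=True)
    k = max(1, int(len(scored) * keep_frac))
    return scored[:k]

# Toy example: five preference pairs with external reward scores.
pairs = [
    ("p1", "a", "b", 2.0, 1.9),   # tiny margin: likely noisy
    ("p2", "a", "b", 3.0, 0.5),   # large margin: confident label
    ("p3", "a", "b", 1.0, 1.2),   # negative margin: label may be flipped
    ("p4", "a", "b", 2.5, 1.0),
    ("p5", "a", "b", 0.8, 0.7),
]
kept = select_by_margin(pairs, keep_frac=0.4)  # keeps p2 and p4
```

In this toy run, only the two confident pairs survive; in the paper's setting, keeping roughly the top 10% of Ultrafeedback plays the analogous role.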
Computer Science > Machine Learning, arXiv:2502.14560 (cs)
[Submitted on 20 Feb 2025 (v1), last revised 15 Feb 2026 (this version, v4)]
Title: Less is More: Improving LLM Alignment via Preference Data Selection
Authors: Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He
Abstract: Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3...
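The Bayesian Aggregation step described in the abstract (fusing external reward-model margins and the implicit DPO margin into one preference probability) can be sketched with a simple naive-Bayes style posterior. The likelihood model below (independent Bradley-Terry sources with a uniform prior) is an assumption for illustration; the paper's exact aggregation formula may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate_preference(margins, temp=1.0):
    """Fuse margins from several reward sources into one preference
    probability.

    Each margin m_i yields p_i = sigmoid(m_i / temp), source i's
    probability that the chosen response is truly preferred. Treating
    the sources as conditionally independent with a uniform prior gives
    a normalized posterior over the two hypotheses (preferred vs. not).
    Illustrative stand-in for the paper's Bayesian Aggregation.
    """
    log_pref = sum(math.log(sigmoid(m / temp)) for m in margins)
    log_anti = sum(math.log(1.0 - sigmoid(m / temp)) for m in margins)
    return 1.0 / (1.0 + math.exp(log_anti - log_pref))

# Two external reward models agree strongly; the implicit margin is unsure.
p = aggregate_preference([2.0, 1.5, 0.1])
```

A single zero margin gives probability 0.5 (no evidence either way), while agreeing sources push the posterior toward certainty, which is the denoising effect the takeaways attribute to combining margin sources.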