[2502.14560] Less is More: Improving LLM Alignment via Preference Data Selection

arXiv - AI · 4 min read

Summary

This article summarizes a method for improving large language model (LLM) alignment by selecting preference data more carefully, showing that a small, well-chosen subset can improve both data efficiency and model performance.

Why It Matters

As LLMs become increasingly integral to AI applications, optimizing their alignment with human preferences is crucial. This research highlights the importance of data selection in enhancing model performance, offering a potential pathway to more efficient and effective AI systems.

Key Takeaways

  • Direct Preference Optimization (DPO) can be enhanced through improved data selection.
  • A novel margin-maximization principle addresses parameter shrinkage caused by noisy data.
  • Using only 10% of the Ultrafeedback dataset yields 3% to 8% improvements on the AlpacaEval2 benchmark across various Llama, Mistral, and Qwen models.
  • The proposed Bayesian Aggregation approach unifies multiple margin sources (external and implicit reward models) into a single preference probability.
  • The findings suggest high redundancy in data construction methods, indicating potential for more efficient training.
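The margin-maximization idea in the takeaways can be read as a simple filter: score each preference pair by its reward margin (reward of the chosen response minus reward of the rejected one) and keep only the highest-margin fraction. The sketch below is an illustration of that idea with a hypothetical `select_by_margin` helper, not the paper's exact selection criterion.

```python
import numpy as np

def select_by_margin(margins, keep_frac=0.10):
    """Keep the top fraction of preference pairs ranked by reward margin.

    margins: per-pair reward margins, r(chosen) - r(rejected).
    keep_frac: fraction of the dataset to retain (the paper reports strong
    results with roughly 10% of Ultrafeedback).

    Illustrative sketch only; the paper's curation rule may differ.
    """
    margins = np.asarray(margins, dtype=float)
    k = max(1, int(len(margins) * keep_frac))
    # Indices of the k largest margins: the most confidently labeled pairs.
    top = np.argsort(margins)[::-1][:k]
    return np.sort(top)

# Toy example: 10 pairs, keep the top 20% by margin.
idx = select_by_margin(
    [0.1, 2.0, -0.5, 1.2, 0.3, 0.0, 1.8, -1.0, 0.4, 0.9],
    keep_frac=0.2,
)
print(idx)  # -> [1 6]
```

The intuition is that low- or negative-margin pairs are the ones most likely to carry label noise, which is what drives the parameter-shrinkage problem the paper identifies.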

Computer Science > Machine Learning — arXiv:2502.14560 (cs)
[Submitted on 20 Feb 2025 (v1), last revised 15 Feb 2026 (this version, v4)]

Title: Less is More: Improving LLM Alignment via Preference Data Selection
Authors: Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He

Abstract: Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10% of the Ultrafeedback dataset, our approach achieves 3% to 8% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\...
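One way to read the Bayesian Aggregation step described in the abstract is as fusing per-source Bradley-Terry probabilities into a single preference probability. The sketch below assumes independent margin sources and a uniform prior over which response is preferred — a naive-Bayes simplification for illustration, not necessarily the paper's exact formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate_preference(margins):
    """Fuse several margin sources into one preference probability.

    Each margin m_i implies a Bradley-Terry probability sigma(m_i) that the
    chosen response beats the rejected one. Assuming independent sources and
    a uniform prior (an illustrative assumption), a naive-Bayes combination
    multiplies the per-source likelihoods for "chosen wins" and "rejected
    wins", then normalizes.
    """
    log_win = sum(math.log(sigmoid(m)) for m in margins)
    log_lose = sum(math.log(1.0 - sigmoid(m)) for m in margins)
    # Normalize in log space for numerical stability.
    top = max(log_win, log_lose)
    w = math.exp(log_win - top)
    l = math.exp(log_lose - top)
    return w / (w + l)

# Three sources that all favor the chosen response reinforce each other,
# yielding a fused probability higher than any single source alone.
p = aggregate_preference([1.0, 0.5, 2.0])
print(p)
```

A single source with zero margin gives exactly 0.5, and disagreeing sources pull the fused probability back toward it, which is the noise-mitigation behavior the aggregation is meant to provide.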

Related Articles

Google Maps can now write captions for your photos using AI | TechCrunch

Gemini can now create captions when users are looking to share a photo or video.

TechCrunch - AI · 4 min

ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving


Reddit - Machine Learning · 1 min

Stop Overcomplicating AI Workflows. This Is the Simple Framework

I’ve been working on building an agentic AI workflow system for business use cases and one thing became very clear very quickly. This is ...

Reddit - Artificial Intelligence · 1 min

Lemonade 10.1 released for latest improvements for local LLMs on AMD GPUs & NPUs


Reddit - Artificial Intelligence · 1 min
