[2511.00040] Semi-Supervised Preference Optimization with Limited Feedback
Summary
This paper introduces Semi-Supervised Preference Optimization (SSPO), which reduces the need for extensive labeled feedback when aligning language models by learning from a large pool of unpaired samples alongside a small set of pairwise preference labels.
Why It Matters
The research addresses a significant challenge in machine learning: the resource burden of acquiring labeled preference data. By demonstrating that SSPO can achieve strong performance with only a minimal number of labeled samples, this work has implications for making preference optimization more accessible and efficient in AI applications.
Key Takeaways
- SSPO learns from a small number of pairwise preference labels and a large pool of unpaired samples.
- The study proves the existence of an optimal reward threshold that separates winning from losing responses with high probability, enabling principled pseudo-labeling.
- SSPO shows improved data efficiency, outperforming traditional methods with less labeled data.
- The method maintains human alignment while significantly reducing acquisition costs.
- Extensive experiments validate SSPO's effectiveness across various datasets.
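The threshold-based pseudo-labeling idea above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function name, the toy reward function, and the threshold value are all hypothetical, and a real system would use a learned reward model.

```python
def pseudo_label(unpaired, reward_fn, tau):
    """Split unpaired responses into pseudo-winners and pseudo-losers.

    unpaired:  list of (prompt, response) tuples with no preference label
    reward_fn: callable scoring a (prompt, response) pair
    tau:       reward threshold separating winners from losers
               (the paper proves an optimal such threshold exists)
    """
    winners, losers = [], []
    for prompt, response in unpaired:
        if reward_fn(prompt, response) >= tau:
            winners.append((prompt, response))
        else:
            losers.append((prompt, response))
    return winners, losers

# Toy usage: a stand-in reward (word count) and threshold tau = 3.
winners, losers = pseudo_label(
    [("q", "yes"), ("q", "a much longer answer")],
    lambda p, r: len(r.split()),
    3,
)
```

Responses scoring at or above the threshold are treated as preferred; the resulting pseudo-pairs can then be fed into a standard preference-optimization objective alongside the genuinely labeled pairs.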
Computer Science > Machine Learning
arXiv:2511.00040 (cs)
[Submitted on 28 Oct 2025 (v1), last revised 19 Feb 2026 (this version, v3)]

Title: Semi-Supervised Preference Optimization with Limited Feedback
Authors: Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, incurring considerable resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO), in which the goal is to learn simultaneously from a small number of pairwise preference labels and a large pool of unpaired samples. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with M...
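The abstract does not show SSPO's exact training objective, but pseudo-labeled (winner, loser) pairs of the kind it describes could plug into a standard preference-optimization loss. As a reference point only, here is the well-known DPO loss for a single pair, assuming per-sequence log-probabilities under the policy and a frozen reference model are available:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_*_policy: log-prob of the winner/loser response under the policy
    logp_*_ref:    log-prob of the same responses under a frozen reference
    beta:          strength of the implicit KL constraint
    """
    # Margin between the winner's and loser's implicit rewards.
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # -log sigmoid(margin): small when the policy prefers the winner.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical log-probabilities on both sides the margin is zero and the loss is log 2, shrinking as the policy assigns relatively more probability to the pseudo-winner.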