[2602.19041] Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
Summary
This article summarizes a paper that addresses intransitive preferences in multi-objective preference fine-tuning (PFT) via a game-theoretic solution concept, the Maximum Entropy Blackwell Winner (MaxEntBW), and introduces the PROSPER algorithm to compute it efficiently.
Why It Matters
Intransitive (i.e., cyclic) preferences break a core assumption of standard preference fine-tuning: that a single optimal policy exists. This is especially relevant when fine-tuning large language models against multiple objectives, where scalarizing objectives into one score can itself induce cycles. By defining and efficiently computing a well-posed solution under intransitivity, this research could improve the robustness of preference-based training pipelines.
Key Takeaways
- Intransitive preferences complicate the identification of optimal policies in multi-objective PFT.
- The MaxEntBW solution provides a well-defined approach to manage these preferences.
- The PROSPER algorithm efficiently computes solutions without scalarization, enhancing performance in fine-tuning tasks.
- Empirical results show PROSPER outperforms existing methods in instruction following and chat benchmarks.
- The research contributes to the development of more robust AI systems capable of handling complex preference structures.
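To make the first takeaway concrete, here is a minimal, purely illustrative sketch (not from the paper; the responses and objective names are hypothetical) of how aggregating several perfectly transitive per-objective rankings by majority vote can produce a Condorcet-style cycle, leaving no single "best" response:

```python
# Illustrative only: three candidate responses ranked under three
# hypothetical objectives. Each per-objective ranking is transitive,
# yet the majority-vote aggregate preference is cyclic.
rankings = {
    "helpfulness":  ["A", "B", "C"],  # best to worst
    "harmlessness": ["B", "C", "A"],
    "conciseness":  ["C", "A", "B"],
}

def prefers(x, y):
    """Majority vote across objectives: does response x beat response y?"""
    wins = sum(r.index(x) < r.index(y) for r in rankings.values())
    return wins > len(rankings) / 2

# Pairwise majority preference is cyclic: A beats B, B beats C, C beats A,
# so no response is undominated and argmax-style selection is ill-defined.
print(prefers("A", "B"), prefers("B", "C"), prefers("C", "A"))  # True True True
```

This is exactly the failure mode the paper's takeaways describe: each objective is well-behaved in isolation, but the aggregate preference has no maximum.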
Computer Science > Machine Learning
arXiv:2602.19041 (cs) [Submitted on 22 Feb 2026]
Title: Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
Authors: Jiahao Zhang, Lujing Zhang, Keltin Grimes, Zhuohao Yu, Gokul Swamy, Zhiwei Steven Wu
Abstract: A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language mo...
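The abstract's move from an ill-defined argmax to a game-theoretic solution concept has a classical analogue (this sketch is an analogy, not the paper's MaxEntBW construction; the payoff matrix below is hypothetical): under a rock-paper-scissors-style cyclic preference, no pure choice is optimal, but a mixed (randomized) policy can break even against every alternative, and among such equilibria the uniform one has maximum entropy:

```python
from fractions import Fraction

# Skew-symmetric preference matrix for a three-way cycle among responses
# (+1 means the row response is preferred to the column response).
P = [[ 0,  1, -1],
     [-1,  0,  1],
     [ 1, -1,  0]]

def expected_margin(strategy, j):
    """Expected preference margin of a mixed strategy against pure response j."""
    return sum(p * P[i][j] for i, p in enumerate(strategy))

# The uniform mixture ties (margin 0) against every pure response, so it is
# an equilibrium of this cyclic game -- and the maximum-entropy one.
uniform = [Fraction(1, 3)] * 3
print([expected_margin(uniform, j) for j in range(3)])  # [0, 0, 0]
```

The design intuition this illustrates: once preferences can cycle, "the winner" is naturally a distribution over responses rather than a single response, which is why a solution concept like MaxEntBW targets policies rather than pointwise maximizers.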