[2603.20453] Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
Computer Science > Machine Learning

arXiv:2603.20453 (cs)

[Submitted on 20 Mar 2026]

Title: Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Authors: Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami

Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust, with unavoidable additive dependence on $\omega$, when imperfection is large.
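As a reading aid, here is a minimal formalization of the per-source imperfection budget described in the abstract, assuming episode $k$ presents a trajectory pair $(\tau_k^0, \tau_k^1)$, with $\mathbb{P}_m$ denoting source $m$'s labeling probability and $\mathbb{P}^\star$ the ideal preference oracle; the notation is an assumption for illustration, and the paper's own definition may differ in detail:

\[
\sum_{k=1}^{K} \Bigl| \mathbb{P}_m\bigl(\tau_k^0 \succ \tau_k^1\bigr) - \mathbb{P}^\star\bigl(\tau_k^0 \succ \tau_k^1\bigr) \Bigr| \;\le\; \omega
\qquad \text{for every source } m \in \{1,\dots,M\}.
\]

Read against the stated bound $\tilde{O}(\sqrt{K/M}+\omega)$, such a budget makes the two regimes explicit: when $\omega$ is on the order of $\sqrt{K/M}$ or smaller, the $\sqrt{K/M}$ term dominates and aggregating the $M$ sources yields the statistical gain, while for larger $\omega$ the additive $\omega$ term reflects the unavoidable cost of persistently imperfect feedback.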