[2602.21765] Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
Summary
This paper studies the generalization of Reinforcement Learning from Human Feedback (RLHF) under reward shift and clipped KL regularization, deriving generalization bounds and discussing their practical implications.
Why It Matters
Understanding the generalization of RLHF is crucial for improving the alignment and adaptability of large language models. This research addresses gaps in the theoretical framework, particularly regarding how reward shifts and regularization techniques affect model performance, which is vital for developing more robust AI systems.
Key Takeaways
- The paper develops a generalization theory for RLHF that accounts for reward shifts and clipped KL regularization.
- Generalization error in RLHF arises from sampling errors, reward shifts, and KL clipping errors.
- Practical implications include determining optimal KL clipping thresholds and budget allocation for prompts and rollouts.
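To make the "clipped KL regularisation" idea concrete, here is a minimal sketch of estimating a KL regularizer from sampled log-probability ratios and clipping each per-sample term for stability. This is an illustrative construction, not the paper's exact estimator: the function name, the symmetric clipping threshold, and the use of the simple log-ratio (k1) Monte-Carlo estimator are all assumptions for this example.

```python
import numpy as np

def clipped_kl_estimate(logp_current, logp_ref, clip=10.0):
    """Monte-Carlo estimate of KL(current || ref) from per-sample log-probs
    of rollouts drawn from the current policy.

    Each per-sample log-ratio is clipped to [-clip, clip] before averaging;
    the clipping stabilises training but biases the estimate, which is the
    kind of "KL clipping error" the paper's bounds account for.
    """
    ratios = logp_current - logp_ref            # log pi_current(y|x) - log pi_ref(y|x)
    per_sample = np.clip(ratios, -clip, clip)   # clip each term (illustrative choice)
    return per_sample.mean()
```

With a loose threshold the estimator is just the mean log-ratio; tightening the threshold trades variance for the clipping bias described in the paper.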
arXiv:2602.21765 (cs) [Submitted on 25 Feb 2026]
Title: Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
Authors: Kenton Tang, Yuzhu Chen, Fengxiang He
Abstract: Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward can shift and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies, while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, introducing an additional error into RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, modelled as an Ornstein-Uhlenbeck process. Th...
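The abstract's second special case models SGD training as an Ornstein-Uhlenbeck (OU) process. As a rough illustration of that modelling assumption (not the paper's analysis), the sketch below runs an Euler-Maruyama discretisation of the one-dimensional OU SDE dθ = -aθ dt + σ dW, which is a standard continuous-time approximation of SGD near a quadratic minimum; the function name and parameters are invented for this example.

```python
import numpy as np

def simulate_ou_sgd(theta0, drift, noise_scale, lr, steps, rng):
    """Euler-Maruyama simulation of the OU SDE  d(theta) = -drift*theta dt + noise_scale dW.

    With step size `lr` this mirrors noisy SGD on a quadratic loss:
    a deterministic pull toward the minimum plus Gaussian gradient noise.
    """
    theta = theta0
    traj = [theta]
    for _ in range(steps):
        # mean-reverting drift term + sqrt(dt)-scaled Gaussian noise increment
        theta = theta - lr * drift * theta + noise_scale * np.sqrt(lr) * rng.normal()
        traj.append(theta)
    return np.array(traj)
```

With the noise switched off the iterate contracts geometrically toward zero; with noise on, it fluctuates around the minimum with a stationary variance of roughly noise_scale**2 / (2 * drift), which is the regime such analyses study.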