[2602.10067] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Summary
The paper introduces RLFR (Reinforcement Learning from Feature Rewards), an approach that uses a model's internal features as reward signals for reinforcement learning on open-ended tasks, with hallucination reduction in language models as the target behavior.
Why It Matters
This research addresses the critical issue of hallucinations in language models by proposing a scalable supervision method that leverages interpretability. By improving the reliability of model outputs, it has implications for natural language processing applications and AI safety.
Key Takeaways
- Introduces features as rewards for scalable supervision in AI.
- Develops a reinforcement learning pipeline to reduce hallucinations.
- Demonstrates a 58% reduction in hallucination likelihood in tested models.
- Uses a novel probing framework to identify candidate hallucinated claims, which the model then learns to correct.
- Highlights the importance of interpretability in learning open-ended tasks.
arXiv Details
Computer Science > Machine Learning, arXiv:2602.10067 (cs)
Submitted on 10 Feb 2026 (v1); last revised 18 Feb 2026 (this version, v3)
Title: Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Authors: Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana
Abstract: Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared…
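The core mechanism described in the abstract (turning a probe over internal features into an RL reward) can be sketched in a few lines. Everything here is a hypothetical stand-in, not the paper's implementation: the probe weights would in practice be trained on labeled activations, and `feature_reward` assumes one hidden state has been extracted per candidate claim in a completion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "factuality" probe over hidden states.
# In practice w and b would come from fitting the probe on
# activations labeled factual / hallucinated; here they are
# random stand-ins for illustration.
HIDDEN_DIM = 64
w = rng.normal(size=HIDDEN_DIM)
b = 0.0

def probe_score(hidden_state: np.ndarray) -> float:
    """Sigmoid of the probe logit: a score in (0, 1) read as
    the probe's confidence that the claim is factual."""
    z = float(hidden_state @ w + b)
    return 1.0 / (1.0 + np.exp(-z))

def feature_reward(claim_states: np.ndarray, threshold: float = 0.5) -> float:
    """Reward a completion by the fraction of its claims the probe
    judges factual (one hidden state per extracted claim)."""
    scores = [probe_score(h) for h in claim_states]
    return float(np.mean([s >= threshold for s in scores]))

# Toy usage: three claim-level hidden states for one completion.
states = rng.normal(size=(3, HIDDEN_DIM))
reward = feature_reward(states)
assert 0.0 <= reward <= 1.0
```

In an RL loop, `feature_reward` would stand in for a learned reward model: the policy generates a completion, claims are extracted, their hidden states are scored, and the scalar reward drives the policy update. The thresholded-fraction reward shape is one simple choice; a mean of raw probe scores would also fit the framing.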