[2604.07484] ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training


arXiv - Machine Learning

About this article


Computer Science > Artificial Intelligence
arXiv:2604.07484 (cs)
[Submitted on 8 Apr 2026 (v1), last revised 20 Apr 2026 (this version, v2)]

Title: ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
Authors: Yu Liang, Liangxin Liu, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Daiting Shi

Abstract: Generative reward models (GRMs) have emerged as a promising approach to aligning Large Language Models (LLMs) with human preferences, offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates a Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, yielding more stable model optimization. It further introduces a Consistency-Aware Critique Reward, which assesses semantic consistency across multiple critiques and allocates fine-grained, differentiated rewards. Experiments on five benchmark datasets across...
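The abstract does not spell out how the two consistency rewards are computed; the paper's exact formulation may differ. As a rough intuition, a consistency-aware pseudo-label can be sketched as a majority vote over repeated GRM judgments, gated by an agreement threshold, while critique consistency can be approximated by pairwise similarity across sampled critiques. The sketch below is purely illustrative: the function names, the agreement threshold, and the use of Jaccard token overlap as a stand-in for semantic similarity are all assumptions, not the authors' method.

```python
from collections import Counter

def answer_pseudo_label(votes, min_agreement=0.75):
    """Illustrative consistency-gated pseudo-label (not the paper's exact rule).
    votes: preference labels (e.g. "A"/"B") sampled repeatedly from a GRM.
    Returns (label, agreement); label is None when votes are too inconsistent,
    so the unreliable example would be dropped from self-training."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)
    return (label if agreement >= min_agreement else None), agreement

def critique_consistency(critiques):
    """Mean pairwise Jaccard similarity over token sets: a crude stand-in
    for the semantic consistency of multiple sampled critiques."""
    sets = [set(c.lower().split()) for c in critiques]
    if len(sets) < 2:
        return 1.0
    sims = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            sims.append(len(sets[i] & sets[j]) / len(union) if union else 1.0)
    return sum(sims) / len(sims)
```

Under this sketch, a sample whose votes split 3-to-1 just clears the default gate, while a 2-to-2 split yields no pseudo-label; the critique score could then scale the reward assigned to each critique.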

Originally published on April 21, 2026. Curated by AI News.

Related Articles

- Project Idea: Dream display project. 3 LLMs spitball the idea, tech specs, and programs needed. (Reddit - Artificial Intelligence)
- [2604.07562] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs (arXiv - Machine Learning)
- [2603.05863] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning (arXiv - Machine Learning)
- [2601.21278] GeoRC: A Benchmark for Geolocation Reasoning Chains (arXiv - Machine Learning)
