[2602.15849] Preference Optimization for Review Question Generation Improves Writing Quality
Summary
This article covers IntelliReward, a novel reward model for scoring peer-review questions, which is used to train IntelliAsk, a question-generation model aligned with human preferences; the resulting system shows measurable improvements on reasoning and writing benchmarks.
Why It Matters
The research addresses a critical gap in peer review processes, where existing models often produce superficial questions. By improving the quality of generated questions, this work has implications for enhancing academic writing and peer review standards, ultimately benefiting the research community.
Key Takeaways
- IntelliReward, a new reward model for scoring review questions, is used to train the IntelliAsk question generator.
- IntelliReward outperforms API-based SFT baselines at predicting expert human preferences.
- Significant gains in reasoning and writing benchmarks were observed.
- The implementation and annotations are publicly available for further research.
- Quality of review questions correlates with broader writing capabilities.
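Reward models of this kind are typically trained to rank a preferred (substantive) question above a rejected (superficial) one. A minimal sketch of the standard Bradley-Terry pairwise objective is shown below; the paper's exact training objective is not specified in this summary, so treat this as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_chosen: torch.Tensor,
                             r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss commonly used for reward-model training:
    pushes the score of the human-preferred question above the rejected one.
    (Illustrative sketch; not necessarily the paper's exact objective.)"""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores for two preference pairs (chosen vs. rejected questions)
r_good = torch.tensor([2.0, 1.5])
r_bad = torch.tensor([0.5, 0.2])
loss = pairwise_preference_loss(r_good, r_bad)
```

The loss shrinks toward zero as the margin between chosen and rejected scores grows, which is why a well-trained reward model can then serve as the signal for policy optimization.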
Computer Science > Computation and Language
arXiv:2602.15849 (cs) [Submitted on 23 Jan 2026]
Title: Preference Optimization for Review Question Generation Improves Writing Quality
Authors: Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari
Abstract: Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50% of their question tokens from a paper's first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such...
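The abstract describes the reward model's architecture only at a high level: a frozen autoregressive LLM whose final 50 token hidden states feed trainable transformer layers that produce a preference score. A minimal sketch of such a head is below; the layer sizes, pooling choice, and module names are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Hypothetical IntelliReward-style head: trainable transformer layers
    applied to the last k hidden states of a frozen LLM, pooled to a scalar
    score. Dimensions and pooling are illustrative assumptions."""

    def __init__(self, hidden_dim=1024, num_heads=8, num_layers=2, k_last=50):
        super().__init__()
        self.k_last = k_last
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the frozen LLM
        x = hidden_states[:, -self.k_last:, :]   # keep only the final 50 states
        x = self.encoder(x)                      # trainable multi-head attention
        return self.score(x.mean(dim=1)).squeeze(-1)  # one scalar per sequence

# Stand-in for frozen-LLM hidden states (batch=2, seq=120, dim=1024)
states = torch.randn(2, 120, 1024)
rewards = RewardHead()(states)
print(rewards.shape)  # torch.Size([2])
```

Freezing the base LLM and training only this small head keeps the preference model cheap to fit while still conditioning on rich contextual representations.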