[2603.21016] Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Computer Science > Computation and Language
arXiv:2603.21016 (cs)
[Submitted on 22 Mar 2026]
Title: Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO
Authors: Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He
Abstract: Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven b...
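The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names, the agreement-fraction form of the consistency term, the `weight` parameter, and the choice to subtract only the pooled mean (without any variance normalization) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def consistency_bonus(canonical_choices, weight=0.5):
    """Hypothetical consistency-aware term (the form is an assumption).

    canonical_choices holds, for each permutation of one question, the
    option the model selected after mapping its label back to the
    underlying option content. Each permutation earns `weight` times the
    fraction of the group that selected the same underlying option.
    """
    counts = Counter(canonical_choices)
    n = len(canonical_choices)
    return [weight * counts[choice] / n for choice in canonical_choices]


def cross_permutation_advantages(rewards_by_permutation):
    """Advantage of each response relative to the mean reward pooled over
    all permutations of the same instance.

    rewards_by_permutation is a list with one entry per permutation; each
    entry is a list of rewards, one per sampled response.
    """
    pooled = [r for group in rewards_by_permutation for r in group]
    baseline = mean(pooled)
    return [[r - baseline for r in group] for group in rewards_by_permutation]


# Illustrative data: three permutations of one question, two samples each.
rewards = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
advantages = cross_permutation_advantages(rewards)
```

In this sketch a response is credited only when it beats the average over every ordering of the question, so a model that is correct only when the right option appears in a favoured position earns negative advantages on the other orderings. A combined reward could add the output of `consistency_bonus` to a correctness reward before the advantages are computed; that combination is likewise an assumption.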