[2603.24124] The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Computer Science > Machine Learning

arXiv:2603.24124 (cs) [Submitted on 25 Mar 2026]

Title: The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation
Authors: Mingyi Liu

Abstract: RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows a 1.0% single-cluster rate vs. 28.5% for the instruct model (p < 10^{-6}). A training-stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals that alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset vali...
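The single-cluster rate (SCR) reported in the abstract can be sketched as follows: sample several responses per question, group them into semantic clusters via a pairwise equivalence relation, and count the fraction of questions whose samples collapse into one cluster. This is a minimal illustration, not the paper's implementation; the paper clusters with NLI-based bidirectional entailment (DeBERTa), whereas the stand-in equivalence below is a toy exact-match check.

```python
from itertools import combinations

def count_clusters(responses, equivalent):
    """Count semantic clusters via union-find over pairwise equivalence."""
    parent = list(range(len(responses)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(responses)), 2):
        if equivalent(responses[i], responses[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
    return len({find(i) for i in range(len(responses))})

def single_cluster_rate(samples_per_question, equivalent):
    """Fraction of questions whose samples form exactly one cluster."""
    flags = [count_clusters(s, equivalent) == 1 for s in samples_per_question]
    return sum(flags) / len(flags)

# Toy equivalence: case-insensitive exact match. The paper instead uses an
# NLI model to test bidirectional entailment between response pairs.
eq = lambda a, b: a.strip().lower() == b.strip().lower()

samples = [
    ["Paris", "paris", "Paris "],  # homogenized: one cluster
    ["Yes", "No", "Yes"],          # diverse: two clusters
]
print(single_cluster_rate(samples, eq))  # -> 0.5
```

On homogenized questions all samples land in one cluster, so any uncertainty score derived from sample disagreement is constant, which is why sampling-based methods degrade to AUROC 0.500 there.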