[2604.02485] Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Computer Science > Computation and Language
arXiv:2604.02485 (cs) [Submitted on 2 Apr 2026]

Title: Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Authors: Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

Abstract: Confirmation bias, the tendency to seek evidence that supports rather than challenges one's beliefs, hinders reasoning. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop in which it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples that confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counterexamples) developed for humans. We find that prompting LLMs with such instructions consistently decreases confirmation bias, improving rule-discovery rates from 42% to 56% on average. Lastly, we mitigate con...
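The interactive feedback loop described in the abstract can be sketched as follows. This is a minimal illustration of the rule-discovery (Wason-style 2-4-6) protocol, not the paper's actual implementation: the hidden rule, the agent's proposal policy, and all function names below are assumptions chosen for demonstration.

```python
# Hypothetical sketch of the rule-discovery loop from the abstract.
# The hidden rule and the agent policy here are illustrative assumptions.

def hidden_rule(triple):
    # Example hidden rule: any strictly increasing sequence.
    a, b, c = triple
    return a < b < c

def run_episode(propose, guess_rule, n_rounds=5):
    """Feedback loop: the agent (1) proposes a triple, (2) receives
    yes/no feedback on the hidden rule, then (3) guesses the rule."""
    history = []
    for _ in range(n_rounds):
        triple = propose(history)        # step (1): propose a triple
        feedback = hidden_rule(triple)   # step (2): oracle feedback
        history.append((triple, feedback))
    return guess_rule(history)           # step (3): final rule guess

# A biased agent that only tests instances confirming its hypothesis
# "each number doubles the previous" (e.g., 1,2,4 then 2,4,8, ...).
def confirming_propose(history):
    k = len(history) + 1
    return (k, 2 * k, 4 * k)

def doubling_guess(history):
    # Every confirming probe received positive feedback, so the agent
    # keeps its initial hypothesis, which is narrower than the true rule.
    return "each number is double the previous"

print(run_episode(confirming_propose, doubling_guess))
```

Because every doubling triple is also strictly increasing, the biased agent never observes disconfirming evidence and converges on an overly narrow rule, which mirrors the failure mode the paper measures; a falsifying agent would also probe triples that violate its hypothesis (e.g., a decreasing triple) to test its boundaries.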