[2605.07186] The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
Computer Science > Computation and Language
arXiv:2605.07186 (cs)
[Submitted on 8 May 2026]
Title: The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
Authors: Zekai Tong, Ruiyao Xu, Aryan Shrivastava, Chenhao Tan, Ari Holtzman

Abstract: Existing Large Language Model (LLM) benchmarks focus primarily on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects LLMs' ability to detect targeted information. When whitespace characters are inserted within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve as the insertion rate increases. We refer to this curve as the Text Uncanny Valley. To explain this observation, we propose a mode-transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting ...
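The word-boundary corruption described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the function name `insert_whitespace` and the choice to insert a space after each in-word character independently with probability `rate` are assumptions for demonstration.

```python
import random

def insert_whitespace(text: str, rate: float, seed: int = 0) -> str:
    """Fragment words by inserting a space after each in-word character
    with probability `rate` (a sketch of the whitespace-insertion
    perturbation; the paper's exact procedure may differ)."""
    rng = random.Random(seed)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Insert only inside a word: current and next char must be non-space.
        if ch != " " and nxt not in ("", " ") and rng.random() < rate:
            out.append(" ")
    return "".join(out)

print(insert_whitespace("the quick brown fox", 0.0))  # unchanged text
print(insert_whitespace("the quick brown fox", 1.0))  # fully fragmented
```

At `rate=0.0` the text is unchanged (the word-level regime); at `rate=1.0` every word is split into single characters (the character-level regime); intermediate rates produce the irregular fragmentation where the abstract locates the valley.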