[2504.04372] Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models
Computer Science > Software Engineering
arXiv:2504.04372 (cs)
[Submitted on 6 Apr 2025 (v1), last revised 5 Mar 2026 (this version, v4)]

Title: Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models
Authors: Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar

Abstract: Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a model's ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from semantic program reasoning. Meanwhile, traditional FL benchmarks such as Defects4J and BugsInPy are either not scalable or obsolete, as their datasets have become part of LLM training data, leading to biased results. This paper presents the first large-scale empirical investigation into the robustness of LLMs' fault localizability. Inspired by mutation testing, we develop an end-to-end evaluation framework that addresses key limitations in existing LLM evaluation, including data contamination, scalability, automation, and extensibility. Using real-...
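To make the abstract's mutation-testing premise concrete, the sketch below shows the basic idea: inject a small syntactic change (a "mutant") into known-correct code, yielding a fresh bug at a known location that cannot have appeared in any training corpus. This is only an illustration of the general technique, not the authors' actual pipeline; the names `RelationalMutator` and `mutate` are hypothetical.

```python
import ast

class RelationalMutator(ast.NodeTransformer):
    """Flip every '<' comparison to '<=' (a classic relational-operator mutation)."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op
                    for op in node.ops]
        return node

def mutate(source: str) -> str:
    """Return the source with all '<' comparisons mutated to '<='.

    The mutated line becomes the ground-truth fault location against
    which an LLM's localization answer can be scored.
    """
    tree = RelationalMutator().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

# Example: an off-by-one bug is planted in an otherwise correct bounds check.
print(mutate("def in_bounds(i, n):\n    return 0 <= i and i < n"))
```

Because mutants are generated mechanically from arbitrary up-to-date codebases, this style of evaluation sidesteps the data-contamination and scalability problems the abstract attributes to fixed benchmarks like Defects4J and BugsInPy.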