[2603.00539] Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement
Computer Science > Software Engineering
arXiv:2603.00539 (cs)
[Submitted on 28 Feb 2026]

Title: Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement
Authors: Haolin Jin, Huaming Chen

Abstract: Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether a code implementation satisfies the task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against a given task description, which usually takes the form of a natural language specification. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, using widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompts, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the rel...
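As a rough illustration of the evaluation setup the abstract describes, the sketch below measures how often an LLM judge rejects code that is already known to be correct, under a terse prompt versus a more detailed one that also demands an explanation and a proposed correction. This is a minimal sketch under stated assumptions, not the paper's released code: `Task`, `ask_llm`, the prompt templates, and `misjudgment_rate` are all hypothetical names introduced here, and the LLM call is left as a pluggable callable so the harness stays self-contained.

```python
# Hypothetical sketch (not the paper's artifact): estimate how often an LLM
# judge flags a known-correct implementation as non-conformant.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    requirement: str  # natural language specification of the task
    solution: str     # reference implementation known to satisfy it


# Terse prompt: ask only for a verdict.
TERSE_PROMPT = (
    "Requirement:\n{req}\n\nCode:\n{code}\n\n"
    "Does the code satisfy the requirement? Answer YES or NO."
)

# Detailed prompt, in the style the paper reports as *more* error-prone:
# it additionally asks for an explanation and a proposed correction.
DETAILED_PROMPT = (
    "Requirement:\n{req}\n\nCode:\n{code}\n\n"
    "Does the code satisfy the requirement? Answer YES or NO, explain your "
    "reasoning, and if NO, propose a corrected version."
)


def misjudgment_rate(tasks: list[Task],
                     ask_llm: Callable[[str], str],
                     template: str) -> float:
    """Fraction of known-correct solutions that the judge rejects."""
    wrong = 0
    for t in tasks:
        reply = ask_llm(template.format(req=t.requirement, code=t.solution))
        # Treat any reply not starting with YES as a rejection of correct code.
        if not reply.strip().upper().startswith("YES"):
            wrong += 1
    return wrong / len(tasks)


if __name__ == "__main__":
    # Stub judge for a dry run; swap in a real chat-completion call to
    # approximate the paper's setup.
    def stub_llm(prompt: str) -> str:
        return "NO, the code seems to miss an edge case."  # overcorrecting judge

    tasks = [Task("Return the sum of a list of numbers.",
                  "def total(xs):\n    return sum(xs)")]
    print("terse:", misjudgment_rate(tasks, stub_llm, TERSE_PROMPT))
    print("detailed:", misjudgment_rate(tasks, stub_llm, DETAILED_PROMPT))
```

Because every solution in `tasks` is correct by construction, any non-YES verdict counts as a misjudgment; comparing the rate under `TERSE_PROMPT` against `DETAILED_PROMPT` mirrors the abstract's finding that explanation-and-correction prompts raise the misjudgment rate.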