[2603.27745] Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
Computer Science > Software Engineering
arXiv:2603.27745 (cs)
[Submitted on 29 Mar 2026]

Title: Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
Authors: Haichao Zhu, Qian Zhang, Jiyuan Wang, Zhaorui Yang, Yuxin Qiu

Abstract: AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and Qwen families in both direct-inference and agent-based settings. Current AI coding systems remain far from robust: on average, configurati...
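
Illustrative sketch (not from the paper): the abstract describes a harness that pairs functional tests with a structural oracle returning an interpretable diagnosis. The Python below shows one way such a pairing could look, assuming a single hypothetical maintainability constraint ("this module must not import a forbidden dependency"). The names `Verdict`, `structural_oracle`, `evaluate`, and the two sample edits are all invented for illustration and do not reflect NITR's actual probes, oracles, or API.

```python
import ast
from dataclasses import dataclass


@dataclass
class Verdict:
    """Outcome of one hypothetical probe: behavior plus structure."""
    functional_ok: bool
    structural_ok: bool
    diagnosis: str

    @property
    def passed(self) -> bool:
        # An edit passes only if it is both correct and structurally sound.
        return self.functional_ok and self.structural_ok


def imported_modules(source: str) -> set[str]:
    """Collect top-level module names imported anywhere in the source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def structural_oracle(source: str, forbidden: set[str]) -> tuple[bool, str]:
    """Assumed constraint: the edited module must not import forbidden modules."""
    hits = sorted(imported_modules(source) & forbidden)
    if hits:
        return False, "forbidden import(s): " + ", ".join(hits)
    return True, "no forbidden imports"


def evaluate(source: str, functional_test, forbidden: set[str]) -> Verdict:
    """Run the functional test and the structural oracle on one edit."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # load the edited module into a scratch namespace
        functional_ok = bool(functional_test(namespace))
    except Exception:
        functional_ok = False
    structural_ok, diagnosis = structural_oracle(source, forbidden)
    return Verdict(functional_ok, structural_ok, diagnosis)


# Two hypothetical edits with identical behavior but different structure.
COUPLED_EDIT = "import sqlite3\n\ndef total(prices):\n    return sum(prices)\n"
CLEAN_EDIT = "def total(prices):\n    return sum(prices)\n"


def check_total(namespace) -> bool:
    return namespace["total"]([1, 2]) == 3


if __name__ == "__main__":
    for label, edit in [("coupled", COUPLED_EDIT), ("clean", CLEAN_EDIT)]:
        verdict = evaluate(edit, check_total, forbidden={"sqlite3"})
        print(label, verdict.passed, verdict.diagnosis)
```

Both sample edits satisfy the functional test, but only the second also satisfies the structural check. This is the kind of gap between behavioral correctness and maintainability that the abstract says existing evaluations overlook.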