[2603.23443] Evaluating LLM-Based Test Generation Under Software Evolution
Computer Science > Software Engineering
arXiv:2603.23443 (cs)
[Submitted on 24 Mar 2026]

Title: Evaluating LLM-Based Test Generation Under Software Evolution
Authors: Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

Abstract: Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of fa...