[2509.22237] FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Summary
The paper introduces FeatBench, a new benchmark for evaluating feature-level code generation in Large Language Models (LLMs), addressing limitations of existing benchmarks.
Why It Matters
FeatBench aims to enhance the evaluation of LLMs in realistic software development scenarios by providing task inputs without code hints and employing an evolving data pipeline. This is significant for improving the reliability of code generation tools, which are increasingly used in software engineering.
Key Takeaways
- FeatBench introduces realistic task inputs devoid of code hints.
- The benchmark employs an automated pipeline to mitigate data contamination.
- Initial results show the benchmark is challenging for current LLMs, with a maximum resolved rate of only 29.94%.
- The study reveals a tendency toward aggressive implementation strategies that introduce regressions in existing functionality.
- All resources related to FeatBench are made available for community research.
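The evolving-data idea can be sketched as a small collection pipeline: harvest recently merged feature pull requests from a repository and strip code hints so the task input is pure natural language. The sketch below is a hypothetical illustration under assumed conventions, not the paper's actual pipeline; the `PullRequest` schema, the keyword filter, and the code-hint regex are all assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class PullRequest:
    """Hypothetical record for a merged PR (not FeatBench's real schema)."""
    title: str
    description: str
    merged: bool

# Heuristics assumed for illustration only.
FEATURE_KEYWORDS = ("add", "feature", "support", "implement")
CODE_HINT_PATTERN = re.compile(r"`[^`]+`|def \w+\(|class \w+")

def is_feature_pr(pr: PullRequest) -> bool:
    """Keep merged PRs whose titles suggest a new feature."""
    return pr.merged and any(k in pr.title.lower() for k in FEATURE_KEYWORDS)

def strip_code_hints(text: str) -> str:
    """Drop inline code spans and signatures so the task stays natural language."""
    return CODE_HINT_PATTERN.sub("", text).strip()

def build_tasks(prs: list[PullRequest]) -> list[str]:
    """Turn recent merged feature PRs into natural-language task inputs."""
    return [strip_code_hints(pr.description) for pr in prs if is_feature_pr(pr)]

prs = [
    PullRequest("Add CSV export support",
                "Allow users to export reports as CSV via `export_csv()`.", True),
    PullRequest("Fix typo in README", "Corrects a spelling mistake.", True),
    PullRequest("Implement dark mode", "Users want a dark theme toggle.", False),
]
tasks = build_tasks(prs)
print(tasks)  # only the merged feature PR survives, with the code hint stripped
```

Because the pipeline runs against the latest repositories each time, every benchmark refresh draws on PRs that postdate model training cutoffs, which is what mitigates contamination.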
Paper Details
Computer Science > Computation and Language, arXiv:2509.22237 (cs)
Submitted on 26 Sep 2025 (v1), last revised 18 Feb 2026 (this version, v2)
Title: FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Authors: Haorui Chen, Chengze Li, Jia Li
Abstract: Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge. Existing feature-level benchmarks generally suffer from two primary limitations: unrealistic task inputs enriched with code hints and significant data leakage risks due to their static nature. To address these limitations, we propose a new benchmark, FeatBench, which introduces the following advances: (1) Realistic Task Inputs. Task inputs consist solely of natural language requirements, strictly devoid of code hints (e.g., function signatures). This format mirrors realistic software development by requiring agents to independently bridge the gap between abstract user intent and concrete code changes. (2) Evolving Data. FeatBench employs a fully automated pipeline to construct new benchmark versions from the latest repositories, effectively mitigating data contamination. The initia...