[2603.26567] Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Computer Science > Software Engineering
arXiv:2603.26567 (cs)
[Submitted on 27 Mar 2026]

Title: Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
Authors: Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan, Chris Brown

Abstract: Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural sig...
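The abstract mentions retrieval-augmented generation with file-level retrieval over a repository. As an illustration only (the paper's actual retrieval method is not specified here), a minimal sketch of file-level retrieval might rank repository files against a developer question and feed the top matches into the prompt; the toy bag-of-words scoring and the file names below are assumptions, not the authors' implementation:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_files(question: str, files: dict[str, str], k: int = 2) -> list[str]:
    """Return the k repository files most similar to the question."""
    q = Counter(question.lower().split())
    ranked = sorted(
        files,
        key=lambda path: cosine(q, Counter(files[path].lower().split())),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical repository contents for illustration.
repo = {
    "src/Auth.java": "class Auth { void login(String user) { /* token check */ } }",
    "src/Db.java": "class Db { void connect() { /* jdbc connection */ } }",
    "README.md": "Build with maven. Run tests with mvn test.",
}
print(retrieve_files("How does login token check work?", repo, k=1))
```

A real pipeline would typically replace the bag-of-words scoring with dense embeddings and concatenate the retrieved file contents into the LLM prompt; the graph-based variant described in the abstract would additionally encode structural dependencies between files.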