[2505.12185] EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming
Summary
The paper introduces EVALOOOP, a framework for assessing the robustness of large language models (LLMs) in programming tasks through self-consistency, arguing that it offers a fairer and more realistic assessment than traditional adversarial-attack methods.
Why It Matters
As LLMs become integral to software development, ensuring their reliability is crucial. EVALOOOP addresses the limitations of adversarial-attack evaluations, providing a measure of robustness that better reflects real-world coding scenarios, where a model's subsequent inputs are often generated by the model itself.
Key Takeaways
- EVALOOOP evaluates LLM robustness using a self-consistency framework.
- It introduces the Average Sustainable Loops (ASL) metric for quantifying robustness.
- The framework reveals that robustness does not always correlate with initial performance.
- EVALOOOP was tested on 96 LLMs, showing significant drops in accuracy across iterations.
- This approach offers a unified metric for assessing semantic integrity in coding tasks.
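The ASL metric named above can be illustrated with a minimal sketch. This is an assumed formulation, not the paper's exact definition: each task yields the number of self-consistency loops a model sustained before its regenerated code stopped working, and ASL is the mean over tasks.

```python
def average_sustainable_loops(loops_per_task):
    """Average Sustainable Loops (ASL): the mean number of
    self-consistency iterations a model sustains across tasks
    before its output loses functional correctness.

    `loops_per_task` is a non-empty list of per-task loop counts.
    """
    return sum(loops_per_task) / len(loops_per_task)
```

For example, a model that survived 3, 5, and 4 loops on three tasks would score an ASL of 4.0; higher values indicate greater robustness under this sketch.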
Computer Science > Software Engineering
arXiv:2505.12185 (cs)
[Submitted on 18 May 2025 (v1), last revised 15 Feb 2026 (this version, v5)]
Title: EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming
Authors: Sen Fang, Weiyuan Ding, Mengshi Zhang, Zihao Chen, Bowen Xu
Abstract: Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more critically, they operate solely through external perturbations, failing to capture the intrinsic stability essential for autonomous coding agents where subsequent inputs are endogenously generated by the model itself. We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization). EVALOOOP establishes a self-contained feedback loop where an LLM iteratively transforms between code and natural language until functi...
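The feedback loop the abstract describes can be sketched as follows. This is a hedged illustration, not the paper's implementation: `generate_code`, `summarize_code`, and `passes_tests` are hypothetical callables standing in for the LLM's two dual tasks and the task's functional test suite.

```python
def sustainable_loops(task_description, generate_code, summarize_code,
                      passes_tests, max_loops=10):
    """Count how many code <-> natural-language round trips a model
    sustains before the regenerated code fails the task's tests.

    Each iteration generates code from the current description, checks
    functional correctness, then summarizes the code back into natural
    language, which becomes the next iteration's input.
    """
    description = task_description
    for loop in range(max_loops):
        code = generate_code(description)   # natural language -> code
        if not passes_tests(code):
            return loop                     # functionality broke here
        description = summarize_code(code)  # code -> natural language
    return max_loops                        # survived every iteration
```

A robust model keeps the task's semantics intact across round trips and sustains many loops; a brittle one drifts semantically and fails early, which is what the per-task counts feeding the ASL metric capture.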