[2505.12185] EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming

arXiv - Machine Learning · 4 min read

Summary

The paper introduces EVALOOOP, a framework for assessing the robustness of large language models (LLMs) on programming tasks through self-consistency, arguing it provides a fairer assessment than traditional adversarial-attack methods.

Why It Matters

As LLMs become integral to software development, ensuring their reliability is crucial. EVALOOOP addresses limitations of current evaluation methods, providing a measure of robustness that better reflects real-world coding scenarios, where a model's own outputs feed its subsequent inputs.

Key Takeaways

  • EVALOOOP evaluates LLM robustness using a self-consistency framework.
  • It introduces the Average Sustainable Loops (ASL) metric for quantifying robustness.
  • The framework reveals that robustness does not always correlate with initial performance.
  • EVALOOOP was tested on 96 LLMs, showing significant drops in accuracy across iterations.
  • This approach offers a unified metric for assessing semantic integrity in coding tasks.
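The summary does not spell out how the Average Sustainable Loops (ASL) metric is computed. A plausible reading, based on the abstract's loop description, is the mean number of code↔summary round trips a model sustains per task before its regenerated code first fails the task's functional tests. A minimal sketch under that assumption (the paper's exact definition may differ):

```python
def average_sustainable_loops(loop_counts):
    """Average number of self-consistency loops sustained per task.

    loop_counts: one integer per benchmark task, counting the full
    code <-> summary round trips completed before the regenerated code
    first failed that task's tests (assumed reading of ASL, not the
    paper's verbatim definition).
    """
    if not loop_counts:
        raise ValueError("need at least one task")
    return sum(loop_counts) / len(loop_counts)

# Example: three tasks sustained 5, 2, and 8 loops respectively.
print(average_sustainable_loops([5, 2, 8]))  # → 5.0
```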

Computer Science > Software Engineering
arXiv:2505.12185 (cs)
[Submitted on 18 May 2025 (v1), last revised 15 Feb 2026 (this version, v5)]

Title: EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming
Authors: Sen Fang, Weiyuan Ding, Mengshi Zhang, Zihao Chen, Bowen Xu

Abstract: Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness assessment: they demonstrate contradictory evaluation outcomes where different attack strategies tend to favor different models, and more critically, they operate solely through external perturbations, failing to capture the intrinsic stability essential for autonomous coding agents where subsequent inputs are endogenously generated by the model itself. We introduce EVALOOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, leveraging the natural duality inherent in software engineering tasks (e.g., code generation and code summarization). EVALOOOP establishes a self-contained feedback loop where an LLM iteratively transforms between code and natural language until functi...
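The self-contained feedback loop described in the abstract alternates between code summarization and code generation until the regenerated code loses functional correctness. A schematic sketch of that loop, where `summarize`, `generate`, and `passes_tests` are hypothetical stand-ins for LLM calls and the task's functional check (not the paper's actual API):

```python
def evaluation_loop(initial_code, summarize, generate, passes_tests, max_loops=10):
    """Iterate code -> natural-language summary -> regenerated code,
    counting how many full round trips the model sustains before the
    regenerated code first fails the task's tests (schematic sketch)."""
    code = initial_code
    sustained = 0
    for _ in range(max_loops):
        summary = summarize(code)   # code summarization step
        code = generate(summary)    # code generation step
        if not passes_tests(code):  # functional correctness check
            break
        sustained += 1
    return sustained

# Toy demo: an identity "model" that preserves the code exactly,
# so every round trip stays functionally correct.
loops = evaluation_loop(
    "def add(a, b): return a + b",
    summarize=lambda c: c,
    generate=lambda s: s,
    passes_tests=lambda c: "return a + b" in c,
    max_loops=3,
)
print(loops)  # → 3
```

A real run would replace the lambdas with prompted LLM calls and a unit-test harness; the per-task `sustained` counts would then feed a metric such as ASL.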
