[2602.17183] Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Summary
This article examines the robustness and reasoning fidelity of large language models (LLMs) in long-context code question answering, finding substantial performance drops when answer formats change, questions are open-ended, or adversarially irrelevant context is present.
Why It Matters
As LLMs are increasingly used in software engineering, understanding their limitations in reasoning over long code contexts is crucial for improving their reliability. By isolating the conditions under which models fail, this study can inform future research and development in AI-assisted coding tools.
Key Takeaways
- LLMs show substantial performance drops in long-context code question answering.
- The study introduces new datasets for evaluating LLMs in COBOL and Java.
- Performance is particularly brittle when irrelevant cues are present.
- Current long-context evaluations may not adequately assess LLM capabilities.
- Findings can guide improvements in AI-assisted software engineering tools.
Computer Science > Software Engineering
arXiv:2602.17183 (cs) [Submitted on 19 Feb 2026]

Title: Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Authors: Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

Abstract: Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending the LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions, and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
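The shuffled multiple-choice ablation described in the abstract can be sketched in a few lines. The helper below is a hypothetical illustration (the paper's actual harness is not published in this summary): it permutes the option list with a fixed seed and tracks where the correct answer lands, so that a model's accuracy before and after shuffling can be compared. A gap between the two scores suggests the model keys on option position rather than option content.

```python
import random

def shuffle_options(options, correct_index, seed=0):
    """Permute a multiple-choice option list, returning the shuffled list
    and the new index of the correct answer.

    Hypothetical helper for a position-sensitivity ablation: evaluate the
    model on the original and shuffled variants and compare accuracy.
    """
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[j] is the original index now sitting at position j,
    # so the correct answer moved to the position holding correct_index.
    new_correct = order.index(correct_index)
    return shuffled, new_correct

# Example: a FIFO structure is a queue (index 1 in the original list).
options = ["a stack", "a queue", "a heap", "a trie"]
shuffled, new_idx = shuffle_options(options, correct_index=1, seed=7)
assert shuffled[new_idx] == "a queue"  # correct answer tracked through the shuffle
```

A needle-in-a-haystack variant would follow the same pattern: hold the question fixed while splicing the relevant code region among distractor files, then measure how accuracy degrades as irrelevant context grows.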