[2602.14188] GPT-5 vs Other LLMs in Long Short-Context Performance
Summary
This paper evaluates the performance of GPT-5 and other LLMs on long short-context tasks, revealing a significant gap between the models' theoretical context-window capacity and their practical performance, especially on tasks that require comprehensive understanding of many details.
Why It Matters
Understanding the limitations and strengths of LLMs like GPT-5 is crucial for developers and researchers working on applications that require processing large volumes of contextual data. This research highlights the need for metrics beyond accuracy to assess model performance effectively.
Key Takeaways
- GPT-5 shows high precision in depression detection despite accuracy drops in long contexts.
- All evaluated models experience significant performance degradation with input volumes over 5K posts.
- The 'lost in the middle' problem has been largely resolved in newer models.
- The study emphasizes the importance of using diverse metrics for evaluating LLM performance.
- The findings are relevant for applications requiring robust understanding of extensive data.
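The takeaways distinguish precision from accuracy: a model can post a high accuracy score while its detection behavior is still weak or still strong, depending on class balance. A toy sketch (hypothetical numbers, not from the paper) shows why a single accuracy figure can mislead on an imbalanced task like depression detection, where positive posts are rare:

```python
# Toy illustration of why accuracy alone can mislead on imbalanced data.
# All numbers below are hypothetical and not taken from the paper.

def accuracy(y_true, y_pred):
    # Fraction of all predictions that are correct.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    # Of the posts flagged positive, how many truly are positive.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(y_true, y_pred):
    # Of the truly positive posts, how many the model found.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# 100 posts, only 10 truly positive; the model flags just 4, all correct.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 4 + [0] * 6 + [0] * 90

print(accuracy(y_true, y_pred))   # 0.94 -- looks strong
print(precision(y_true, y_pred))  # 1.0  -- every flagged post is correct
print(recall(y_true, y_pred))     # 0.4  -- yet most positives are missed
```

This is why reporting precision and recall alongside accuracy, as the study advocates, gives a far more honest picture of a model's behavior on rare-class detection.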
Computer Science > Computation and Language
arXiv:2602.14188 (cs) [Submitted on 15 Feb 2026]

Title: GPT-5 vs Other LLMs in Long Short-Context Performance
Authors: Nima Esmi (1, 2), Maryam Nezhad-Moghaddam (3), Fatemeh Borhani (3), Asadollah Shahbahrami (2, 3), Amin Daemdoost (3), Georgi Gaydadjiev (4)
Affiliations: (1) Bernoulli Institute, RUG, Groningen, Netherlands; (2) ISRC, Khazar University, Baku, Azerbaijan; (3) Department of Computer Engineering, University of Guilan, Rasht, Iran; (4) QCE Department, TU Delft, Delft, Netherlands

Abstract: With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the inp...