[2602.17316] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Summary
This paper investigates how meaning-preserving lexical and syntactic variations in prompts affect the evaluation of Large Language Models (LLMs), revealing significant shifts in both absolute scores and relative rankings across models.
Why It Matters
Understanding how sensitive LLM evaluations are to lexical and syntactic changes is crucial for developing more robust AI systems. This research highlights a limitation of current evaluation benchmarks: because scores shift under meaning-preserving rephrasings, benchmarks may not accurately reflect true model capabilities, a concern for both researchers and developers.
Key Takeaways
- Lexical perturbations significantly degrade LLM performance across tasks.
- Syntactic changes have variable effects, sometimes improving results.
- Model robustness does not correlate with model size; sensitivity appears to be task-dependent.
- Current evaluation benchmarks may not effectively measure true linguistic competence.
- Robustness testing should be integrated into standard LLM evaluation practices.
Paper Details
Subject: Computer Science > Computation and Language, arXiv:2602.17316 (cs)
Submitted on 19 Feb 2026
Title: Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Authors: Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser
Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks...
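To make the lexical pipeline concrete, here is a minimal sketch of meaning-preserving synonym substitution. It is not the paper's implementation: the synonym table, function name, and punctuation handling are hypothetical illustrations, whereas the paper uses a full linguistically principled pipeline.

```python
# Hypothetical, hand-curated synonym table; the paper's pipeline draws on
# linguistically principled synonym substitution instead.
SYNONYMS = {
    "big": "large",
    "quick": "rapid",
    "choose": "select",
}

def perturb_lexical(prompt: str) -> str:
    """Return a truth-conditionally equivalent variant of the prompt
    by swapping known words for synonyms."""
    out = []
    for tok in prompt.split():
        core = tok.strip(".,?!").lower()
        if core in SYNONYMS:
            replacement = SYNONYMS[core]
            # Preserve capitalization and trailing punctuation.
            if tok[0].isupper():
                replacement = replacement.capitalize()
            trailing = tok[len(tok.rstrip(".,?!")):]
            out.append(replacement + trailing)
        else:
            out.append(tok)
    return " ".join(out)

print(perturb_lexical("Choose the big option."))
# A benchmark harness would score a model on both the original and the
# perturbed prompt and compare the results.
```

Because the perturbation preserves truth conditions, any score gap between the original and perturbed prompt measures surface-form sensitivity rather than a change in task difficulty.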