[2602.17316] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

arXiv - AI · 3 min read

Summary

This paper investigates how meaning-preserving lexical and syntactic variations affect the evaluation of Large Language Models (LLMs), showing that such rewrites can substantially shift both absolute scores and relative model rankings.

Why It Matters

Understanding how sensitive LLM evaluations are to lexical and syntactic changes is crucial for building more robust AI systems. The research highlights a limitation of current benchmarks: if scores shift when meaning does not, the benchmarks may not reflect true model capabilities, a point of direct relevance to AI researchers and developers.

Key Takeaways

  • Lexical perturbations significantly degrade LLM performance across tasks.
  • Syntactic changes have variable effects, sometimes improving results.
  • Model robustness does not correlate with model size; robustness instead appears to depend on the task.
  • Current evaluation benchmarks may not effectively measure true linguistic competence.
  • Robustness testing should be integrated into standard LLM evaluation practices (see the paired-comparison sketch after this list).
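
To make the last takeaway concrete, here is a minimal sketch of paired robustness scoring: each benchmark item is scored once on its original prompt and once on a meaning-preserving rewrite, and a paired bootstrap estimates how reliably the average change is negative. This is an illustration under assumed conventions (per-item 0/1 correctness, a hypothetical `paired_bootstrap_delta` helper), not the paper's actual evaluation harness.

```python
import random

def paired_bootstrap_delta(orig_scores, pert_scores, n_resamples=10_000, seed=0):
    """Return the observed mean score change (perturbed minus original) and
    the share of bootstrap resamples in which that mean change is >= 0."""
    assert len(orig_scores) == len(pert_scores), "scores must be paired per item"
    rng = random.Random(seed)
    deltas = [p - o for o, p in zip(orig_scores, pert_scores)]
    n = len(deltas)
    observed = sum(deltas) / n
    nonnegative = 0
    for _ in range(n_resamples):
        resample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n >= 0:
            nonnegative += 1
    return observed, nonnegative / n_resamples

# Toy per-item correctness (1 = right, 0 = wrong) on original vs. perturbed prompts.
orig = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
pert = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
mean_delta, share_nonneg = paired_bootstrap_delta(orig, pert)
print(f"mean accuracy change: {mean_delta:+.2f}; bootstrap share >= 0: {share_nonneg:.3f}")
```

The paired design matters here: because every perturbed prompt is compared against its own original, item difficulty cancels out and the statistic isolates the effect of the rewrite itself.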

Computer Science > Computation and Language
arXiv:2602.17316 (cs) · Submitted on 19 Feb 2026
Title: Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation
Authors: Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks...
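
For intuition about the lexical pipeline the abstract mentions, the sketch below performs naive WordNet-based synonym substitution. It is only a rough stand-in: the authors describe their pipeline as linguistically principled, and NLTK/WordNet, the `substitute_synonyms` helper, and the swap limit are assumptions of this example, not details taken from the paper.

```python
# Requires: pip install nltk, then `python -m nltk.downloader wordnet`
from nltk.corpus import wordnet as wn

def substitute_synonyms(tokens, max_swaps=2):
    """Replace up to `max_swaps` tokens with a WordNet synonym, keeping
    each original token whenever no distinct synonym is available."""
    out, swaps = [], 0
    for tok in tokens:
        if swaps < max_swaps:
            candidates = {
                lemma.name().replace("_", " ")
                for synset in wn.synsets(tok)
                for lemma in synset.lemmas()
            } - {tok}
            if candidates:
                out.append(sorted(candidates)[0])  # deterministic choice
                swaps += 1
                continue
        out.append(tok)
    return out

print(" ".join(substitute_synonyms("the quick brown fox jumps".split())))
```

A genuinely meaning-preserving pipeline would also filter candidates by part of speech and word sense; unconstrained WordNet lemmas often shift meaning, which is exactly the kind of confound the paper's controlled perturbations are designed to avoid.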

Related Articles

LLMs

A robot car with a Claude AI brain started a YouTube vlog about its own existence

Not a demo reel. Not a tutorial. A robot narrating its own experience — debugging, falling off shelves, questioning its identity. First-p...

Reddit - Artificial Intelligence · 1 min
LLMs

Study: LLMs Able to De-Anonymize User Accounts on Reddit, Hacker News & Other "Pseudonymous" Platforms; Report Co-Author Expands, Advises

Advice from the study's co-author: "Be aware that it’s not any single post that identifies you, but the combination of small details acro...

Reddit - Artificial Intelligence · 1 min
LLMs

do you guys actually trust AI tools with your data?

idk if it’s just me but lately i’ve been thinking about how casually we use stuff like chatgpt and claude for everything like coding, ran...

Reddit - Artificial Intelligence · 1 min
LLMs

[P] Remote sensing foundation models made easy to use.

This project enables the idea of tasking remote sensing models to acquire embeddings like we task satellites to acquire data! https://git...

Reddit - Machine Learning · 1 min