[2504.17311] FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Summary
FLUKE introduces a novel framework for evaluating the robustness of NLP models through controlled linguistic variations, revealing task-dependent vulnerabilities in model performance.
Why It Matters
Understanding model robustness is crucial in NLP, especially as reliance on large language models grows. FLUKE's insights into linguistic variations and their impact on model performance can guide future research and development, supporting more reliable AI applications.
Key Takeaways
- FLUKE assesses model robustness through systematic linguistic variations.
- The impact of linguistic changes is highly task-dependent.
- LLMs show brittleness to natural modifications, especially syntax and style changes.
- Scaling models improves robustness only for surface-level modifications.
- Robustness to linguistic features does not correlate with generation capabilities.
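The core idea behind FLUKE-style evaluation — apply a controlled, minimal linguistic variation to each test input and check whether the model's prediction changes — can be sketched in a few lines. This is an illustrative toy, not FLUKE's actual pipeline: the paper uses LLM-generated, human-validated modifications across many linguistic levels, whereas the perturbation and classifier below are simplified stand-ins.

```python
# Sketch of perturbation-based robustness evaluation in the spirit of FLUKE:
# perturb each input and measure how often the model's label flips.
# The typo perturbation and the toy "model" are illustrative assumptions,
# not the paper's LLM-generated modifications.

def swap_adjacent_chars(text: str, index: int = 1) -> str:
    """Orthographic perturbation: swap two adjacent characters (a typo)."""
    if len(text) <= index + 1:
        return text
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

def toy_sentiment_model(text: str) -> str:
    """Stand-in classifier keyed on an exact lexical match (hence brittle)."""
    return "positive" if "great" in text.lower() else "negative"

def flip_rate(model, inputs, perturb) -> float:
    """Fraction of inputs whose predicted label changes under perturbation."""
    flips = sum(model(x) != model(perturb(x)) for x in inputs)
    return flips / len(inputs)

if __name__ == "__main__":
    inputs = ["great movie", "terrible plot", "great acting overall"]
    rate = flip_rate(toy_sentiment_model, inputs,
                     lambda s: swap_adjacent_chars(s, 1))
    print(f"prediction flip rate under typo perturbation: {rate:.2f}")
```

A real evaluation would swap in the actual model under test and a bank of linguistically grounded perturbations (orthography, syntax, style, dialect), then compare flip rates across tasks — which is where the paper's task-dependence finding emerges.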
Computer Science > Computation and Language
arXiv:2504.17311 (cs)
[Submitted on 24 Apr 2025 (v1), last revised 20 Feb 2026 (this version, v3)]
Title: FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation
Authors: Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
Abstract: We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels -- from orthography to dialect and style -- and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness on some tasks compared to base models, and scaling improving robustness only for surface-level modifications; (3) mod...