[2511.13494] Language-Guided Invariance Probing of Vision-Language Models
Summary
This article introduces Language-Guided Invariance Probing (LGIP), a benchmark for evaluating the robustness of vision-language models (VLMs) against linguistic perturbations.
Why It Matters
Understanding how VLMs respond to linguistic variations is crucial for improving their reliability in real-world applications. LGIP offers a new diagnostic tool to assess linguistic robustness, which is often overlooked by traditional accuracy metrics.
Key Takeaways
- LGIP measures invariance to paraphrases and sensitivity to semantic changes in image-text matching.
- Across the nine VLMs evaluated, the benchmark reveals clear performance disparities, highlighting each model's strengths and weaknesses.
- EVA02-CLIP and large OpenCLIP variants demonstrate favorable invariance-sensitivity balance.
- Standard retrieval metrics largely miss these robustness failures, motivating new evaluation methods such as LGIP.
- The findings can guide future research in enhancing VLMs' linguistic capabilities.
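To make the invariance and sensitivity notions above concrete, here is a minimal sketch of the two core statistics, assuming a generic `score(image, caption)` image-text matching function (e.g. a CLIP-style cosine similarity). The function names and signatures are illustrative, not LGIP's actual implementation:

```python
import statistics

def invariance_error(score, image, caption, paraphrases):
    """Spread of matching scores across meaning-preserving paraphrases.
    A linguistically robust model should score every paraphrase close
    to the original caption, giving a variance near zero."""
    scores = [score(image, caption)] + [score(image, p) for p in paraphrases]
    return statistics.pvariance(scores)

def sensitivity_gap(score, image, caption, flipped):
    """Difference between the original caption's score and its
    meaning-changing flip. A positive gap is the desired behavior;
    a negative gap means the model prefers the flipped caption."""
    return score(image, caption) - score(image, flipped)
```

Averaging these per-image quantities over the dataset yields benchmark-level summaries; a positive-rate statistic would then be the fraction of examples with a positive gap.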
arXiv:2511.13494 [cs.CV] (Submitted on 17 Nov 2025)
Author: Jae Joong Lee
Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics.
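The rule-based flips described in the abstract (altering object category, color, or count) can be sketched as simple word-substitution rules. The substitution tables and helper below are illustrative assumptions, not LGIP's actual rule set:

```python
import re

# Illustrative substitution tables; LGIP's actual rules may differ.
OBJECT_SWAPS = {"dog": "cat", "car": "bicycle"}
COLOR_SWAPS = {"red": "blue", "black": "white"}
COUNT_SWAPS = {"two": "three", "one": "four"}

def flip_caption(caption, table):
    """Apply the first matching rule to produce a meaning-changing flip.
    Word boundaries prevent partial-word matches (e.g. 'red' in 'bored')."""
    for src, dst in table.items():
        pattern = r"\b" + re.escape(src) + r"\b"
        if re.search(pattern, caption):
            return re.sub(pattern, dst, caption, count=1)
    return None  # no rule applies; this caption yields no flip
```

For example, applying `COLOR_SWAPS` to "a dog on a red couch" yields "a dog on a blue couch", a caption that no longer matches the image and should score lower under a robust model.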