[2602.18094] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models
Summary
The paper introduces OODBench, a benchmark for evaluating large vision-language models' performance on out-of-distribution (OOD) data, highlighting significant performance gaps and proposing a new assessment metric.
Why It Matters
As AI systems are increasingly deployed in real-world scenarios, understanding their performance on OOD data is crucial for safety and reliability. OODBench addresses a gap in existing benchmarks, providing a framework for evaluating and improving VLMs in diverse conditions.
Key Takeaways
- OODBench offers a new benchmark to evaluate VLMs on OOD data.
- Current VLMs show notable performance degradation when faced with OOD instances.
- The benchmark includes 40K OOD instance-category pairs for comprehensive assessment.
- An automated assessment metric is proposed to evaluate VLM responses across varying question difficulties.
- The findings aim to guide future research in OOD data acquisition and evaluation.
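To make the evaluation setup concrete, the following is a minimal sketch of scoring model responses over instance-category pairs with difficulty-weighted accuracy. The data schema, field names, and weighting scheme are illustrative assumptions for exposition; they are not OODBench's actual format or its proposed metric.

```python
from dataclasses import dataclass

@dataclass
class OODInstance:
    """One benchmark item: an image paired with its ground-truth OOD category.

    (Hypothetical schema; OODBench's real format may differ.)
    """
    image_id: str
    category: str    # ground-truth OOD category label
    difficulty: str  # e.g. "easy" or "hard"

def score_responses(instances, responses, weights=None):
    """Weighted accuracy: harder questions contribute more to the score.

    The weighting values are an illustrative choice, not the paper's metric.
    """
    if weights is None:
        weights = {"easy": 1.0, "hard": 2.0}
    total = correct = 0.0
    for inst in instances:
        w = weights[inst.difficulty]
        total += w
        answer = responses.get(inst.image_id, "")
        if answer.strip().lower() == inst.category.lower():
            correct += w
    return correct / total if total else 0.0

# Toy usage with mocked model responses
instances = [
    OODInstance("img_0", "axolotl", "easy"),
    OODInstance("img_1", "trilobite", "hard"),
    OODInstance("img_2", "quokka", "hard"),
]
responses = {"img_0": "Axolotl", "img_1": "trilobite", "img_2": "wombat"}
print(score_responses(instances, responses))  # → 0.6
```

A real harness would additionally parse free-form VLM answers against category labels, which is presumably where the paper's automated assessment metric does its work.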
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18094 (cs)
[Submitted on 20 Feb 2026]
Authors: Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen
Abstract: Existing Vision-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs on OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the und...