[2602.18094] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

arXiv - AI

Summary

The paper introduces OODBench, a benchmark for evaluating the performance of large vision-language models (VLMs) on out-of-distribution (OOD) data. It highlights significant performance gaps in current models and proposes a new automated assessment metric.

Why It Matters

As AI systems are increasingly deployed in real-world scenarios, understanding their performance on OOD data is crucial for safety and reliability. OODBench addresses a gap in existing benchmarks, providing a framework for evaluating and improving VLMs in diverse conditions.

Key Takeaways

  • OODBench offers a new benchmark to evaluate VLMs on OOD data.
  • Current VLMs show notable performance degradation when faced with OOD instances.
  • The benchmark includes 40K OOD instance-category pairs for comprehensive assessment.
  • An automated assessment metric is proposed to evaluate VLM responses across varying question difficulties.
  • The findings aim to guide future research in OOD data acquisition and evaluation.
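To make the evaluation setup concrete, here is a minimal, hypothetical sketch of scoring a model over instance-level OOD instance-category pairs. The pair format, the `query_model` callable, and the exact-match scoring rule are all assumptions for illustration, not the paper's actual metric:

```python
# Hypothetical sketch: scoring a VLM on OOD instance-category pairs.
# The data format and exact-match scoring rule are assumptions,
# not taken from the OODBench paper.

def evaluate_ood_pairs(pairs, query_model):
    """pairs: list of (image_id, true_category) tuples.
    query_model: callable returning the model's predicted
    category string for a given image id."""
    correct = 0
    for image_id, true_category in pairs:
        prediction = query_model(image_id)
        # Normalize whitespace and case before comparing.
        if prediction.strip().lower() == true_category.strip().lower():
            correct += 1
    return correct / len(pairs) if pairs else 0.0

# Usage with a stub model that always answers "unknown":
pairs = [("img_001", "axolotl"), ("img_002", "unknown")]
accuracy = evaluate_ood_pairs(pairs, lambda _id: "unknown")
```

In practice, a benchmark of this kind would replace the exact-match check with the paper's proposed automated metric, which accounts for question difficulty.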

Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18094 (cs) [Submitted on 20 Feb 2026]

Title: OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Authors: Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen

Abstract: Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the und...

