[2602.18729] MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

arXiv - AI · 4 min read

Summary

The paper presents MiSCHiEF, a benchmark for evaluating fine-grained image-caption alignment in safety and cultural contexts, and highlights the limitations of current vision-language models on this task.

Why It Matters

As vision-language models are increasingly used in sensitive applications, understanding their limitations in fine-grained image-caption alignment is crucial. MiSCHiEF provides a structured way to assess these models, particularly in scenarios where misinterpretations can have significant real-world consequences, thereby contributing to advancements in AI safety and cultural sensitivity.

Key Takeaways

  • MiSCHiEF introduces datasets for evaluating image-caption alignment in safety and cultural contexts.
  • Models are better at confirming correct image-caption pairs than at rejecting incorrect ones, revealing an asymmetry in fine-grained discrimination.
  • Persistent modality misalignment challenges highlight the difficulty of achieving precise cross-modal grounding.
  • The benchmark emphasizes the importance of subtle distinctions in socially critical applications.
  • Results suggest a need for enhanced training methods for vision-language models.

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.18729 (cs) · Submitted on 21 Feb 2026

Title: MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Authors: Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Minh Anh, Shivank Garg, Kevin Zhu, Vasu Sharma

Abstract: Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at co...
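The contrastive pair design described above (two minimally differing images paired with two minimally differing captions) can be scored in the style of other minimal-pair VLM benchmarks. As a minimal sketch, assume a hypothetical 2x2 matrix `sim[i][j]` of model alignment scores for image `i` with caption `j`, where `i == j` is the correct pairing; the exact metric MiSCHiEF uses may differ:

```python
def minimal_pair_scores(sim):
    """Score one minimal pair from a 2x2 similarity matrix.

    sim[i][j] is a hypothetical model alignment score for image i
    with caption j; caption j is correct for image i when i == j.
    Returns (text_ok, image_ok, group_ok), in the style of
    Winoground-like contrastive metrics.
    """
    # text score: for each image, its correct caption outscores the wrong one
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # image score: for each caption, its correct image outscores the wrong one
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # group score: the model resolves the full minimal pair
    return text_ok, image_ok, text_ok and image_ok

# hypothetical scores: correct pairs win, but only by a narrow margin
sim = [[0.82, 0.79],
       [0.75, 0.81]]
print(minimal_pair_scores(sim))  # (True, True, True)
```

A model that confirms correct pairs but fails to reject incorrect ones, the asymmetry the paper reports, would pass some of these comparisons while failing the group score.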

Related Articles

Llms

What does Gemini think of you?

I noticed that Gemini was referring back to a lot of queries I've made in the past and was using that knowledge to drive follow-up prompt...

Reddit - Artificial Intelligence · 1 min ·
Llms

This app helps you see what LLMs you can run on your hardware


Reddit - Artificial Intelligence · 1 min ·
Llms

TRACER: Learn-to-Defer for LLM Classification with Formal Teacher-Agreement Guarantees

I'm releasing TRACER (Trace-Based Adaptive Cost-Efficient Routing), a library for learning cost-efficient routing policies from LLM trace...

Reddit - Machine Learning · 1 min ·
Llms

Mistral AI raises $830M in debt to set up a data center near Paris | TechCrunch

Mistral aims to start operating the data center by the second quarter of 2026.

TechCrunch - AI · 4 min ·