[2510.00041] Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.00041 (cs)
[Submitted on 27 Sep 2025 (v1), last revised 28 Feb 2026 (this version, v2)]

Title: Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

Authors: Yuchen Song, Andong Chen, Wenxin Zhu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao

Abstract: Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images, each of which typically depicts a single culture, making these benchmarks relatively easy for MLLMs. To address these limitations, we propose C$^3$B (Comics Cross-Cultural Benchmark), a novel multicultural, multitask, and multilingual benchmark for cultural awareness. C$^3$B comprises over 2000 images and over 18000 QA pairs, organized into three tasks of progressive difficulty: from basic visual recognition, to higher-level cultural conflict understanding, and finally to cultural content generation. We evaluated 11 open-source MLLMs and found a significant gap between MLLM and human performance. The gap demonstrates that C$^...