[2507.19634] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Summary
The MCIF benchmark introduces a novel framework for evaluating multimodal crosslingual instruction-following in large language models, addressing gaps left by existing benchmarks.
Why It Matters
As multimodal large language models (MLLMs) evolve, comprehensive evaluation across languages and modalities is crucial for their development. MCIF provides a structured approach to assess these models' capabilities, fostering advancements in AI that can understand and process diverse inputs more effectively.
Key Takeaways
- MCIF is the first crosslingual benchmark for MLLMs based on scientific talks.
- It evaluates instruction following across four macro-tasks: recognition, translation, question answering, and summarization.
- The benchmark covers three modalities (speech, vision, text) and four languages (English, German, Italian, Chinese).
- An evaluation of 23 models reveals shared shortcomings across languages and modalities, pointing to areas for future improvement.
- MCIF is released under a CC-BY 4.0 license to promote open research.
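The takeaways above describe an evaluation grid spanning four macro-tasks, three modalities, and four languages. As a minimal sketch of how large that grid is, the snippet below enumerates every (task, modality, language) combination; the labels are illustrative shorthand, not MCIF's actual dataset identifiers.

```python
from itertools import product

# Benchmark dimensions as described in the takeaways above.
# These labels are illustrative, not MCIF's actual field names.
TASKS = ["recognition", "translation", "question_answering", "summarization"]
MODALITIES = ["speech", "vision", "text"]
LANGUAGES = ["en", "de", "it", "zh"]

def evaluation_grid():
    """Enumerate every (task, modality, language) condition a model
    would face in a full crosslingual, multimodal sweep."""
    return list(product(TASKS, MODALITIES, LANGUAGES))

grid = evaluation_grid()
print(len(grid))  # 4 tasks x 3 modalities x 4 languages = 48 conditions
```

In practice, not every cell is meaningful for every task (e.g., recognition is tied to the source language), but the sketch conveys why joint coverage of these axes is what sets MCIF apart from single-modality, English-only benchmarks.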
Paper Details
Subject: Computer Science > Computation and Language (cs.CL), arXiv:2507.19634
Submitted on 25 Jul 2025 (v1); last revised 19 Feb 2026 (this version, v3)
Title: MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Abstract: Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over dif...