[2512.05959] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Computer Science > Computation and Language
arXiv:2512.05959 (cs)
[Submitted on 5 Dec 2025 (v1), last revised 22 Mar 2026 (this version, v2)]

Title: M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Authors: David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata

Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark spanning 42 languages, 56 regional dialects and registers, and 189 countries, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to ...
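The retrieval-augmented VQA setup the abstract describes can be illustrated with a minimal sketch: retrieve the documents most relevant to a question from a controlled corpus, then hand them to a VLM together with the image. The bag-of-words retriever, the example documents, and all function names below are hypothetical simplifications, not the paper's method; a real system would use multilingual dense embeddings over millions of documents.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by lexical overlap with the question and return the top-k.
    q = Counter(question.lower().split())
    scored = sorted(
        docs,
        key=lambda d: cosine(q, Counter(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, image_ref: str, docs: list[str]) -> str:
    # Assemble the augmented prompt handed to the VLM alongside the image.
    context = "\n".join(f"- {d}" for d in docs)
    return f"[image: {image_ref}]\nContext:\n{context}\nQuestion: {question}\nAnswer:"

# Toy multilingual-culture corpus (illustrative only).
docs = [
    "Batik is a traditional Indonesian textile art using wax-resist dyeing.",
    "The Eiffel Tower is an iron lattice tower in Paris, France.",
    "Hanbok is traditional Korean attire worn on festive occasions.",
]
question = "What textile art tradition is shown in this Indonesian image?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, "img_0001.jpg", top)
```

In this sketch the Indonesian textile question retrieves the batik document, and the prompt interleaves the image reference, retrieved context, and question, mirroring how retrieved documents augment a VLM's input at inference time.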