[2602.13758] OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
Summary
The paper introduces OmniScience, a large-scale multi-modal dataset designed to enhance scientific image understanding in AI models, addressing limitations in current datasets.
Why It Matters
OmniScience significantly improves the training of multi-modal large language models by providing a comprehensive dataset that covers various scientific disciplines. This advancement is crucial for enhancing AI's ability to interpret complex scientific imagery, which is essential for research and innovation in fields reliant on visual data.
Key Takeaways
- OmniScience comprises 1.5 million figure-caption-context triplets spanning more than 10 major scientific disciplines.
- The dataset significantly improves image-text multi-modal similarity scores.
- A dynamic model-routing re-captioning pipeline enhances the quality of image captions.
- The proposed caption QA protocol serves as an effective evaluation tool for visual understanding.
- Models fine-tuned on OmniScience show substantial performance gains in visual comprehension tasks.
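To make the core data unit concrete, here is a minimal sketch of what one figure-caption-context triplet could look like as a training record. The field names and the fused-text format are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one OmniScience-style triplet; the field
# names below are assumptions for illustration, not the released schema.
@dataclass
class FigureTriplet:
    figure_path: str   # path to the figure image
    caption: str       # original (or re-generated) figure caption
    context: str       # in-text passage referencing the figure
    discipline: str    # one of the 10+ scientific disciplines

def to_training_text(t: FigureTriplet) -> str:
    """Fuse caption and context into a single dense description target."""
    return f"[{t.discipline}] {t.caption} Context: {t.context}"

sample = FigureTriplet(
    figure_path="figures/fig3.png",
    caption="XRD patterns of the synthesized samples.",
    context="Figure 3 shows characteristic peaks at 2θ = 26.5°.",
    discipline="materials science",
)
print(to_training_text(sample))
```

Pairing the caption with its in-text context is what distinguishes a triplet from an ordinary image-caption pair: the surrounding prose supplies the semantic grounding that bare captions often lack.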
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13758 (cs) [Submitted on 14 Feb 2026]
Title: OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
Authors: Haoyi Tao, Chaozheng Huang, Nan Wang, Han Lyu, Linfeng Zhang, Guolin Ke, Xi Fang
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong performance on natural image understanding, yet exhibit limited capability in interpreting scientific images, including but not limited to schematic diagrams, experimental characterizations, and analytical charts. This limitation is particularly pronounced in open-source MLLMs. The gap largely stems from existing datasets' limited domain coverage, coarse structural annotations, and weak semantic grounding. We introduce OmniScience, a large-scale, high-fidelity multi-modal dataset comprising 1.5 million figure-caption-context triplets spanning more than 10 major scientific disciplines. To obtain image-caption data with higher information density and accuracy for multi-modal large-model training, we develop a dynamic model-routing re-captioning pipeline that leverages state-of-the-art multi-modal large language models to generate dense, self-contained descriptions by jointly synthesizing visual features, original figure captions, and corresponding in-text re...
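The abstract's "dynamic model-routing" idea can be sketched as a dispatch step that sends each figure to a captioning backend chosen at runtime. The routing key (figure type) and the two stand-in captioners below are illustrative assumptions; the paper's actual pipeline routes among state-of-the-art MLLMs by criteria it defines.

```python
from typing import Callable

# Stand-in captioners; in the real pipeline these would be MLLM calls.
def caption_with_model_a(figure: dict) -> str:
    return f"Dense description of chart: {figure['caption']}"

def caption_with_model_b(figure: dict) -> str:
    return f"Dense description of diagram: {figure['caption']}"

# Routing table: figure type -> captioning backend (keys are assumptions).
ROUTES: dict[str, Callable[[dict], str]] = {
    "chart": caption_with_model_a,
    "diagram": caption_with_model_b,
}

def recaption(figure: dict) -> str:
    """Dispatch a figure to the matching captioner, defaulting to model A."""
    handler = ROUTES.get(figure["type"], caption_with_model_a)
    return handler(figure)

fig = {"type": "diagram", "caption": "Reactor schematic"}
print(recaption(fig))  # dispatches to the diagram captioner
```

The design point is that no single model captions every figure: routing lets the pipeline match each figure category to whichever backend describes it best, which is one plausible reading of "dynamic model-routing" here.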