[2603.12249] SciMDR: Advancing Scientific Multimodal Document Reasoning
Computer Science > Computation and Language

arXiv:2603.12249 (cs) [Submitted on 12 Mar 2026 (v1), last revised 29 Apr 2026 (this version, v2)]

Title: SciMDR: Advancing Scientific Multimodal Document Reasoning

Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan

Abstract: Constructing scientific multimodal document-reasoning datasets for foundation-model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs with reasoning over focused document segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark for evaluating multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly on tasks requiring complex docu...
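The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustrative skeleton, not the paper's released code: every function and field name here is a hypothetical stand-in, and the LLM-based QA synthesis step is replaced by a trivial placeholder.

```python
# Hedged sketch of the synthesize-and-reground pipeline (assumed structure;
# names and data layout are illustrative, not from the paper).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    reasoning: str
    segment_id: int  # the focused segment this pair was synthesized from

def synthesize_qa(segments):
    """Stage 1: Claim-Centric QA Synthesis.
    Generates faithful, isolated QA pairs with reasoning over focused
    segments. A real system would call an LLM here; this placeholder
    just echoes the segment text."""
    return [
        QAPair(
            question=f"What does segment {i} claim?",
            answer=seg,
            reasoning=f"Stated directly in segment {i}.",
            segment_id=i,
        )
        for i, seg in enumerate(segments)
    ]

def reground(pairs, document):
    """Stage 2: Document-Scale Regrounding.
    Programmatically re-embeds each isolated pair into a full-document
    task, so answering requires locating evidence in the whole paper."""
    return [
        {
            "context": document,  # full paper, not just the source segment
            "question": p.question,
            "answer": p.answer,
            "reasoning": p.reasoning,
        }
        for p in pairs
    ]

# Toy usage: two focused segments standing in for paper chunks.
segments = ["Claim A about Figure 1.", "Claim B about Table 2."]
tasks = reground(synthesize_qa(segments), document=" ".join(segments))
```

The key design point the sketch tries to capture is the separation of concerns: faithfulness is enforced in stage 1, where each pair is tied to a small segment, while realism is restored in stage 2 by swapping the segment-level context for the full document.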