[2511.16216] FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis
Computer Science > Artificial Intelligence
arXiv:2511.16216 (cs)
[Submitted on 20 Nov 2025 (v1), last revised 30 Mar 2026 (this version, v2)]
Title: FlipVQA: Scaling Multi-modal Instruction Tuning via Textbook-to-Knowledge Synthesis
Authors: Zhen Hao Wong, Jingwen Deng, Yuzhao Wang, Wenkai Yu, Jihao Huang, Runming He, Chengyu Shen, Hao Liang, Wentao Zhang

Abstract: Textbooks are among the richest repositories of human-verified reasoning knowledge, yet their complex layouts, with multi-column typesetting, cross-page question--answer separation, and interleaved figures, make automated extraction of structured QA and VQA pairs extremely challenging. Existing alternatives either synthesize data from scratch, which lacks authentic problem contexts, or rely on costly expert annotation that cannot scale. We propose $\textbf{FlipVQA-Miner}$, an automated pipeline that resolves long-range logical dependencies and cross-page discontinuities in OCR-parsed documents, recovering coherent question--answer--figure associations even when answers reside in separate companion volumes. A subsequent multi-stage curation pipeline transforms these raw extractions into AI-ready supervision signals. Using FlipVQA-Miner, we construct $\textbf{FlipVQA-83K}$, comprising 83K QA and VQA pairs spanning 11 academic disciplines...
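To make the cross-page association problem concrete, the sketch below pairs OCR-extracted question blocks with answer blocks that may appear many pages later (e.g. in a solutions appendix), keyed on a shared problem number. This is only a minimal illustration of the matching step the abstract describes, not the authors' implementation; the `Block` structure, the "question"/"answer" labels, and the pre-parsed problem numbers are all assumptions standing in for whatever the FlipVQA-Miner layout parser actually produces.

```python
from dataclasses import dataclass

@dataclass
class Block:
    page: int    # page the block was extracted from
    kind: str    # assumed label from an upstream layout parser: "question" or "answer"
    number: str  # problem number parsed from the block text, e.g. "3.1"
    text: str

def pair_across_pages(blocks: list[Block]) -> list[tuple[Block, Block]]:
    """Pair each question with the answer carrying the same problem number,
    even when the answer lives pages away or in a companion volume."""
    # Index answers by problem number; keep the first occurrence on duplicates.
    answers: dict[str, Block] = {}
    for b in blocks:
        if b.kind == "answer":
            answers.setdefault(b.number, b)
    # Walk questions in document order and look up their distant answers.
    return [(b, answers[b.number])
            for b in blocks
            if b.kind == "question" and b.number in answers]

if __name__ == "__main__":
    blocks = [
        Block(page=12, kind="question", number="3.1", text="Compute the integral ..."),
        Block(page=12, kind="question", number="3.2", text="Prove that ..."),
        Block(page=250, kind="answer", number="3.1", text="By substitution ..."),
        Block(page=251, kind="answer", number="3.2", text="Assume the contrary ..."),
    ]
    for q, a in pair_across_pages(blocks):
        print(f"Q{q.number} (p.{q.page}) -> A (p.{a.page})")
```

In practice the paper's pipeline must also handle the harder cases this sketch ignores, such as ambiguous numbering schemes, answers split across pages, and attaching interleaved figures to the right question.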