[2506.09427] A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2506.09427 (cs)
[Submitted on 11 Jun 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Authors: Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang

Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset featuring: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; and (3) rich instructional diversity, ensured through well-designed question templates grounded in human preferences and covering a 3,500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs for interactive image-text generation. To evaluate these capabilities, we propose SynJudge,...