[2604.04172] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
Computer Science > Computer Vision and Pattern Recognition

arXiv:2604.04172 (cs) [Submitted on 5 Apr 2026]

Title: GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

Authors: Yaohan Guan, Pristina Wang, Najim Dehak, Alan Yuille, Jieneng Chen, Daniel Khashabi

Abstract: In many science papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, and often require significant effort and iteration by human authors to get right, highlighting the difficulty of scientific visual communication. Motivated by this observation, we introduce GENFIG1, a benchmark for generative AI models (e.g., vision-language models). GENFIG1 evaluates a model's ability to produce a figure that clearly expresses and motivates the central idea of a paper, given the paper's title, abstract, introduction, and figure caption as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task demands text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and...