[2601.08026] FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Summary
The paper presents FigEx2, a framework for detecting and captioning panels in scientific compound figures, enhancing understanding and accessibility of complex data visualizations.
Why It Matters
FigEx2 addresses a significant gap in scientific communication by improving the clarity and detail of figure captions, which are often inadequate. This advancement can facilitate better comprehension of research findings across disciplines, particularly in fields like physics and chemistry where visual data is prevalent.
Key Takeaways
- FigEx2 localizes panels and generates detailed captions from compound figures.
- Introduces a noise-aware gated fusion module to enhance captioning accuracy.
- Combines supervised and reinforcement learning for optimized performance.
- Achieves high detection accuracy and outperforms existing models in key metrics.
- Demonstrates strong zero-shot transferability to new scientific domains.
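The "noise-aware gated fusion" named above is described only at a high level; the paper's exact architecture is not reproduced here. As a rough illustration of the idea, the sketch below gates each caption token on the visual context so that noisy tokens are down-weighted before pooling. All names, shapes, and the sigmoid-gate formulation are assumptions for illustration only.

```python
import numpy as np

def noise_aware_gated_fusion(text_tokens, visual_feat, W_g, b_g):
    """Illustrative token-level gated fusion (not the paper's exact module).

    text_tokens: (T, d) token embeddings from a caption encoder
    visual_feat: (d,)   pooled visual embedding of the compound figure
    W_g: (2*d, 1), b_g: (1,)  assumed gate parameters
    Returns a single (d,) fused query vector.
    """
    T, _ = text_tokens.shape
    # Condition each token's gate on both the token and the visual context.
    paired = np.concatenate(
        [text_tokens, np.tile(visual_feat, (T, 1))], axis=1)          # (T, 2d)
    gates = 1.0 / (1.0 + np.exp(-(paired @ W_g + b_g)))               # (T, 1)
    # Down-weight noisy tokens, then gate-weighted mean-pool.
    fused = (gates * text_tokens).sum(axis=0) / (gates.sum() + 1e-8)  # (d,)
    return fused
```

A stabilized query vector of this kind would then condition the panel detector, making detection less sensitive to phrasing variation in open-ended captions.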
Computer Science > Computer Vision and Pattern Recognition
arXiv:2601.08026 (cs)
[Submitted on 12 Jan 2026 (v1), last revised 25 Feb 2026 (this version, v3)]
Authors: Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang
Abstract: Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or provide only figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), using CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Exp...
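The abstract says the RL stage combines a CLIP-based alignment reward with a BERTScore-based semantic reward, without giving the combination rule. A minimal sketch, assuming a simple convex weighting (the weight `alpha` and the ranking use are assumptions, not the paper's specification):

```python
def staged_reward(clip_align, bert_f1, alpha=0.5):
    """Assumed convex combination of two caption rewards.

    clip_align: CLIP image-text alignment score for a candidate caption
    bert_f1:    BERTScore F1 between candidate and reference caption
    alpha:      assumed mixing weight (the paper's value is not given here)
    """
    return alpha * clip_align + (1 - alpha) * bert_f1

def best_candidate(candidates, alpha=0.5):
    """Rank (caption, clip_align, bert_f1) tuples by the combined reward."""
    return max(candidates,
               key=lambda c: staged_reward(c[1], c[2], alpha))[0]

# Equal weighting of a 0.8 alignment score and a 0.6 BERTScore F1:
staged_reward(0.8, 0.6)  # → 0.7
```

In an RL fine-tuning loop, a scalar reward of this shape would score sampled captions against both the figure (via CLIP) and reference text (via BERTScore), pushing the captioner toward outputs consistent with both modalities.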