[2510.22391] Top-Down Semantic Refinement for Image Captioning
Summary
This paper introduces Top-Down Semantic Refinement (TDSR) for image captioning, addressing the limitations of Vision-Language Models (VLMs) in generating coherent and detailed descriptions through a novel planning approach.
Why It Matters
The research highlights a significant advancement in image captioning by redefining the process as a hierarchical refinement problem. This approach not only improves narrative coherence but also enhances the ability of VLMs to generate detailed and contextually accurate captions, which is crucial for applications in AI-driven content creation and accessibility.
Key Takeaways
- TDSR redefines image captioning as a goal-oriented hierarchical refinement problem.
- The proposed Monte Carlo Tree Search (MCTS) algorithm significantly reduces computational costs.
- TDSR enhances existing VLMs' performance in fine-grained description and hallucination suppression.
- The framework is adaptable, with an early stopping mechanism based on image complexity.
- Extensive experiments show TDSR achieves state-of-the-art results across multiple benchmarks.
Computer Science > Computer Vision and Pattern Recognition arXiv:2510.22391 (cs) [Submitted on 25 Oct 2025 (v1), last revised 16 Feb 2026 (this version, v2)] Title:Top-Down Semantic Refinement for Image Captioning Authors:Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Chengpei Tang, Keze Wang View a PDF of the paper titled Top-Down Semantic Refinement for Image Captioning, by Jusheng Zhang and 5 other authors View PDF HTML (experimental) Abstract:Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VL...