[2602.21655] CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Summary
The paper introduces CCCaption, a dual-reward reinforcement learning framework designed to enhance image captioning by optimizing for completeness and correctness, addressing limitations of human-annotated captions.
Why It Matters
Image captioning is crucial for vision-language understanding, yet current methods often rely on subjective human annotations. CCCaption offers a systematic approach to improve caption quality through objective metrics, potentially advancing applications in AI and accessibility.
Key Takeaways
- CCCaption optimizes image captions for both completeness and correctness.
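- The framework uses diverse large vision-language models (LVLMs) to break each image into a set of visual queries, and a dynamic query sampling strategy to improve training efficiency.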
- Penalties are applied for hallucinations in captions to ensure factual accuracy.
- Extensive experiments demonstrate consistent improvements over traditional methods.
- This approach provides a pathway to move beyond reliance on human annotations.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.21655 (cs)
[Submitted on 25 Feb 2026]
Title: CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Authors: Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
Abstract: Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency...
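The completeness reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `completeness_reward` and `keyword_judge` and the `sample_size` parameter are hypothetical, uniform random sampling stands in for the paper's dynamic query sampling strategy (whose details are not given here), and a simple substring check stands in for whatever judge decides whether a caption answers a query.

```python
import random
from typing import Callable, Optional, Sequence


def completeness_reward(
    caption: str,
    queries: Sequence[str],
    answers_query: Callable[[str, str], bool],
    sample_size: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the fraction of (sampled) visual queries the caption answers.

    Assumption: the reward is a plain ratio in [0, 1]; the paper may
    weight or normalize queries differently.
    """
    if not queries:
        return 0.0
    selected = list(queries)
    # Stand-in for dynamic query sampling: score only a random subset
    # of the queries to reduce the cost of each reward evaluation.
    if sample_size is not None and sample_size < len(selected):
        rng = rng or random.Random()
        selected = rng.sample(selected, sample_size)
    answered = sum(1 for q in selected if answers_query(caption, q))
    return answered / len(selected)


def keyword_judge(caption: str, query: str) -> bool:
    """Toy judge (assumption): a query counts as answered if its
    keyword appears in the caption, ignoring case."""
    return query.lower() in caption.lower()


# Hypothetical visual queries for one image, reduced to keywords.
queries = ["dog", "frisbee", "grass", "fence"]
caption = "A dog catches a frisbee on the grass."

reward = completeness_reward(caption, queries, keyword_judge)
print(reward)  # 3 of 4 queries answered -> 0.75
```

Passing the judge in as a function keeps the reward logic separate from how answers are checked, so the substring test could be swapped for a model-based check without changing the scoring code.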