[2602.07605] Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning


About this article


Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.07605 (cs) [Submitted on 7 Feb 2026 (v1), last revised 27 Apr 2026 (this version, v3)]

Title: Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Authors: Hulingxiao He, Zijun Geng, Yuxin Peng

Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world...
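The abstract describes a four-part rationale structure for the CoT supervised fine-tuning stage. Below is a minimal sketch of what one training record could look like under a simple JSON-style layout; the paper does not specify its data format, so the field names, helper function, and example content here are illustrative assumptions, not the authors' implementation.

```python
# Sketch only (not from the paper): one plausible record layout for an FGVR CoT
# SFT example with the four rationale steps named in the abstract.
from dataclasses import dataclass, asdict
import json


@dataclass
class FGVRCoTRecord:
    image_path: str                      # path to the fine-grained image
    question: str                        # recognition prompt shown to the MLLM
    visual_analysis: str                 # step 1: discriminative visual cues
    candidate_subcategories: list[str]   # step 2: plausible fine-grained labels
    comparison: str                      # step 3: contrast candidates against the cues
    prediction: str                      # step 4: final sub-category answer


def to_sft_target(rec: FGVRCoTRecord) -> str:
    """Serialize the four-part rationale into a single CoT target string."""
    return (
        f"Visual analysis: {rec.visual_analysis}\n"
        f"Candidate sub-categories: {', '.join(rec.candidate_subcategories)}\n"
        f"Comparison: {rec.comparison}\n"
        f"Prediction: {rec.prediction}"
    )


if __name__ == "__main__":
    example = FGVRCoTRecord(
        image_path="images/sparrow_0001.jpg",
        question="Which bird species is shown in the image?",
        visual_analysis="Small bird with a streaked brown back, grey crown stripe, and conical bill.",
        candidate_subcategories=["House Sparrow", "Chipping Sparrow", "Song Sparrow"],
        comparison="The plain grey breast and rufous crown fit Chipping Sparrow better than the heavily streaked Song Sparrow.",
        prediction="Chipping Sparrow",
    )
    print(json.dumps(asdict(example), indent=2))
    print(to_sft_target(example))
```

Structuring each rationale as separate fields, then flattening them into one target string, would let the same data serve both supervised fine-tuning and later step-level inspection, though the paper's actual pipeline may differ.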

Originally published on April 29, 2026. Curated by AI News.

Related Articles

[2604.16909] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

[2604.07802] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

[2602.07096] RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

[2601.22246] MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models