[2602.07605] Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

arXiv - AI April 29, 2026 4 min read

About this article

Abstract page for arXiv paper 2602.07605: Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Computer Science > Computer Vision and Pattern Recognition

arXiv:2602.07605 (cs)
[Submitted on 7 Feb 2026 (v1), last revised 27 Apr 2026 (this version, v3)]

Title: Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Authors: Hulingxiao He, Zijun Geng, Yuxin Peng

Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated for discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transition the model into a strong open-world...
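
The abstract names four rationale stages: visual analysis, candidate sub-categories, comparison, and prediction. As a rough illustration only, the Python sketch below shows one way such a four-stage rationale could be held and rendered as text. The class name CoTRationale, the function format_rationale, the field labels, and the bird example are all hypothetical and invented here; the abstract does not describe the paper's actual data format or prompts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoTRationale:
    """Hypothetical container for one four-stage FGVR rationale."""
    visual_analysis: str
    candidate_subcategories: List[str]
    comparison: str
    prediction: str


def format_rationale(r: CoTRationale) -> str:
    """Render the four stages in the order the abstract lists them."""
    candidates = ", ".join(r.candidate_subcategories)
    return "\n".join([
        f"Visual analysis: {r.visual_analysis}",
        f"Candidate sub-categories: {candidates}",
        f"Comparison: {r.comparison}",
        f"Prediction: {r.prediction}",
    ])


# Made-up example content, not taken from the paper's dataset.
example = CoTRationale(
    visual_analysis="Small bird, yellow body, black cap, black wings with white bars.",
    candidate_subcategories=["American Goldfinch", "Wilson's Warbler"],
    comparison="White wing bars and a conical bill fit a goldfinch, not a warbler.",
    prediction="American Goldfinch",
)
print(format_rationale(example))
```

Keeping the stages as separate fields, rather than one free-form string, would make it easy to check that every sample contains all four parts before it is used for fine-tuning; whether the authors do anything similar is not stated in the text above.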

Originally published on April 29, 2026. Curated by AI News.

Related Articles

[2604.16909] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

[2604.07802] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

[2602.07096] RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

[2601.22246] MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models