[2602.14482] TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
Summary
The paper presents TikArt, an aperture-guided agent for fine-grained visual reasoning in multimodal large language models, enhancing decision-making through a structured observation process.
Why It Matters
TikArt addresses a common failure mode of multimodal models: key evidence in tiny objects, cluttered regions, or subtle markings is lost under a single global image encoding. By letting the model actively crop and segment regions of interest, the approach could improve fine-grained visual question answering and related computer vision applications, making it relevant for researchers and developers working on multimodal reasoning.
Key Takeaways
- Introduces TikArt, an agent that enhances visual reasoning through aperture-guided observation.
- Utilizes a Think-Aperture-Observe loop to improve decision-making in complex visual environments.
- Demonstrates superior performance on multiple benchmark datasets compared to existing models.
- Employs a reinforcement learning algorithm (AGRPO) to optimize reasoning policies effectively.
- Provides interpretable results through visual trajectories, aiding in understanding model behavior.
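The summary names AGRPO as a GRPO-style reinforcement learning algorithm. The AGRPO-specific reward shaping is not detailed here, but the GRPO family it extends scores each rollout's reward relative to its sampled group. A minimal sketch of that group-relative advantage computation (function name and epsilon are illustrative, not from the paper):

```python
# Group-relative advantage in the GRPO style: each rollout's reward is
# normalised against the mean and standard deviation of its own group.
# This is the standard baseline AGRPO builds on, not AGRPO itself.
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per rollout, normalised within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a group of rollouts with rewards `[1.0, 0.0, 1.0, 0.0]` yields positive advantages for the two successful rollouts and negative ones for the failures, so the policy gradient pushes toward the better trajectories without a learned value function.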
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14482 (cs) [Submitted on 16 Feb 2026]
Title: TikArt: Aperture-Guided Observation for Fine-Grained Visual Reasoning via Reinforcement Learning
Authors: Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao
Abstract: We address fine-grained visual reasoning in multimodal large language models (MLLMs), where key evidence may reside in tiny objects, cluttered regions, or subtle markings that are lost under a single global image encoding. We introduce TikArt (Thinking Aperture), an aperture-guided agent that casts multi-step vision-language reasoning as a decision process over regions of interest. TikArt follows a Think-Aperture-Observe loop, alternating between language generation and two aperture actions: Zoom extracts rectangular crops, while Segment invokes SAM2 to obtain mask-based crops for irregular targets. After every action, the model must produce an explicit observation, turning local visual cues into persistent linguistic memory. Built on Qwen3-VL-8B, TikArt optimizes its reasoning policy with AGRPO, a GRPO-style reinforcement learning algorithm with a two-stage curriculum: it warms up segmentation actions and then jointly optimizes visual math, fine-grained VQA, and segmentation, using rewards that c...
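The abstract's Think-Aperture-Observe loop can be sketched as a small agent loop: the policy emits a thought and an aperture action, the environment executes the crop, and an explicit observation string is appended to the transcript as persistent linguistic memory. All names below (`run_episode`, `Step`, the stub `zoom`/`segment` helpers, and the policy signature) are illustrative assumptions; the paper's actual interfaces, and its use of the real SAM2 model, are not reproduced here.

```python
# Hypothetical sketch of the Think-Aperture-Observe loop. Images are plain
# nested lists of pixels; zoom/segment are stand-ins for the paper's tools.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str       # language-generation phase ("Think")
    action: str        # "zoom", "segment", or "answer" ("Aperture")
    observation: str = ""  # explicit textual memory ("Observe")

def zoom(image, box):
    """Stand-in for rectangular cropping: return the sub-grid in `box`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def segment(image, point):
    """Stand-in for a SAM2-style mask crop; trivially returns one pixel."""
    y, x = point
    return [[image[y][x]]]

def run_episode(image, policy, max_steps=4):
    """Alternate thinking with aperture actions; every action must be
    followed by an observation that persists in the transcript."""
    transcript = []
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        step = Step(thought, action)
        if action == "zoom":
            crop = zoom(image, arg)
            step.observation = f"crop of size {len(crop)}x{len(crop[0])}"
        elif action == "segment":
            crop = segment(image, arg)
            step.observation = f"mask crop around {arg}"
        transcript.append(step)
        if action == "answer":
            break
    return transcript
```

A toy policy that zooms once and then answers produces a two-step transcript whose first step carries the crop observation; in the paper, this transcript is what the reinforcement learning objective scores.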