[2510.13232] What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.13232 (cs)
[Submitted on 15 Oct 2025 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Authors: Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim

Abstract: State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and...
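The token-merging idea described in the abstract can be sketched as follows. This is an illustrative re-implementation, not the paper's actual NegToMe module: the cue list, the fixed merge span, and mean-pooling of embeddings are all assumptions made for the sake of the example. The point it demonstrates is that a negation cue is fused with the attribute tokens it modifies, so the phrase survives as one unit instead of a lone, easily-dropped "not" token.

```python
import numpy as np

# Illustrative cue list; the real module would detect cues more robustly.
NEGATION_CUES = {"no", "not", "without", "never"}

def merge_negation_phrases(tokens, embeddings, span=2):
    """Merge each negation cue with the next `span` tokens into a single
    mean-pooled phrase embedding, keeping polarity attached to the
    attribute it modifies rather than as a fragile standalone token."""
    merged_tokens, merged_embs = [], []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES:
            j = min(i + 1 + span, len(tokens))
            merged_tokens.append(" ".join(tokens[i:j]))
            merged_embs.append(np.mean(embeddings[i:j], axis=0))
            i = j
        else:
            merged_tokens.append(tokens[i])
            merged_embs.append(embeddings[i])
            i += 1
    return merged_tokens, np.stack(merged_embs)

# Toy query: "a dog not wearing collar" with random 8-dim token embeddings.
tokens = ["a", "dog", "not", "wearing", "collar"]
embs = np.random.default_rng(0).normal(size=(5, 8))
new_tokens, new_embs = merge_negation_phrases(tokens, embs)
print(new_tokens)  # ['a', 'dog', 'not wearing collar']
```

After merging, downstream attention operates on the phrase "not wearing collar" as one semantic unit, which is the structural property the abstract argues prevents affirmative bias at the input level.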