[2603.19531] dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.19531 (cs)
[Submitted on 19 Mar 2026]

Title: dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3
Authors: Saikat Dutta, Biplab Banerjee, Hamid Rezatofighi

Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations, learned through global contrastive objectives, remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending this http URL into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from a ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality...
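For readers unfamiliar with the "image-text similarity maps" mentioned in the abstract, the following is a minimal sketch of the generic idea: each patch feature is compared with each class text embedding by cosine similarity, and each patch takes the best-scoring class. This is not the dinov3.seg architecture, which the abstract does not specify in detail. The function names (`l2_normalize`, `similarity_map`, `segment`) and the toy 2-dimensional features are assumptions for illustration only, and plain Python lists stand in for the tensors a real pipeline would use.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_map(patch_feats, text_embeds):
    """Cosine similarity between every patch and every class text embedding.

    patch_feats: H x W grid of D-dimensional patch features (hypothetical input).
    text_embeds: list of C D-dimensional text embeddings, one per class name.
    Returns an H x W x C nested list of similarities.
    """
    texts = [l2_normalize(t) for t in text_embeds]
    sim = []
    for row in patch_feats:
        sim_row = []
        for patch in row:
            p = l2_normalize(patch)
            sim_row.append([sum(a * b for a, b in zip(p, t)) for t in texts])
        sim.append(sim_row)
    return sim


def segment(patch_feats, text_embeds):
    """Assign each patch the index of its most similar class."""
    sim = similarity_map(patch_feats, text_embeds)
    return [[max(range(len(s)), key=s.__getitem__) for s in row] for row in sim]


# Toy example: a 2 x 2 patch grid with 2-dimensional features and two classes.
patches = [[[1.0, 0.1], [0.9, 0.2]],
           [[0.1, 1.0], [0.2, 0.8]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
labels = segment(patches, texts)
print(labels)  # [[0, 0], [1, 1]]
```

In this toy grid the top row aligns with class 0 and the bottom row with class 1. The resulting label map has patch resolution rather than pixel resolution, which is one reason such maps usually need the kind of adaptation or refinement the abstract refers to.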