[2412.00364] LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Summary
The paper presents LMSeg, a novel approach for open-vocabulary semantic segmentation that enhances visual and linguistic feature alignment using large-scale models, achieving state-of-the-art performance on key benchmarks.
Why It Matters
This research addresses two limitations of existing open-vocabulary segmentation methods: text prompts built from short, fixed templates that fail to capture comprehensive object attributes, and visual features that are strong at the image level but weak at the pixel level. By enriching language prompts and improving fine-grained visual representation, it highlights the potential of large-scale models to advance semantic segmentation, a core task across computer vision applications.
Key Takeaways
- LMSeg improves open-vocabulary semantic segmentation by leveraging large-scale models.
- The method enhances visual feature extraction and linguistic prompt generation.
- State-of-the-art performance is achieved across major segmentation benchmarks.
- The approach addresses limitations of existing vision-language models like CLIP.
- The code for LMSeg will be made available for further research and application.
Paper Details
Computer Science > Computer Vision and Pattern Recognition, arXiv:2412.00364 (cs)
Submitted on 30 Nov 2024 (v1); last revised 18 Feb 2026 (this version, v2)
Title: LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
Abstract: It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LL...
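To make the abstract's critique concrete, here is a minimal stdlib-only sketch, not the paper's implementation, of the fixed-template prompting used by CLIP-style open-vocabulary methods, contrasted with hypothetical LLM-enriched class descriptions. The toy vectors stand in for CLIP text and pixel embeddings; the `ENRICHED` descriptions and all function names are illustrative assumptions, not from LMSeg.

```python
import math

# Classic CLIP-style fixed templates (the pattern the abstract critiques).
TEMPLATES = ["a photo of a {}", "a photo of the {}"]

# Hypothetical LLM-enriched descriptions with richer object attributes
# (illustrative stand-ins for what an LLM might generate).
ENRICHED = {
    "cat": "a cat, a small furry animal with whiskers, pointed ears and a tail",
    "sofa": "a sofa, an upholstered seat with armrests, cushions and a backrest",
}

def fixed_prompts(cls: str) -> list[str]:
    """Expand a bare class name through the fixed templates."""
    return [t.format(cls) for t in TEMPLATES]

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify_pixel(pixel_feat: list[float],
                   class_embeds: dict[str, list[float]]) -> str:
    """Assign a pixel feature to the class with the most similar text embedding."""
    return max(class_embeds, key=lambda c: cosine(pixel_feat, class_embeds[c]))

# Toy embeddings stand in for the outputs of CLIP's text and vision encoders.
class_embeds = {"cat": [0.9, 0.1, 0.0], "sofa": [0.1, 0.8, 0.2]}
pixel = [0.85, 0.15, 0.05]

print(fixed_prompts("cat")[0])              # a photo of a cat
print(classify_pixel(pixel, class_embeds))  # cat
```

Per-pixel classification against text embeddings is what makes pixel-level representation quality matter: a text embedding built from a richer description gives each class a more discriminative target for this similarity comparison.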