[2602.14134] DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors
Summary
The paper introduces DenseMLLM, a multimodal large language model that performs dense prediction tasks such as semantic segmentation and depth estimation without complex, task-specific decoders, achieving competitive performance across a range of dense prediction and vision-language benchmarks.
Why It Matters
This research challenges the conventional approach of using specialized architectures for dense prediction tasks in multimodal models. By demonstrating that standard MLLMs can effectively handle these tasks, it opens new avenues for simplifying model design and enhancing practical applications in computer vision.
Key Takeaways
- DenseMLLM eliminates the need for task-specific decoders in multimodal models.
- The model achieves competitive results in dense prediction tasks.
- A novel vision token supervision strategy handles multiple labels and tasks.
- This approach reduces architectural complexity while maintaining performance.
- The findings could influence future designs of general-purpose AI models.
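To make the decoder-free idea concrete, below is a minimal sketch of what per-token dense supervision could look like: each vision token stands in for an image patch, and its output logits are trained directly against that patch's label instead of being passed through a segmentation decoder. All function names, shapes, and the majority-label downsampling are illustrative assumptions, not the paper's actual training recipe.

```python
import math

def patch_majority_labels(seg_mask, patch):
    """Downsample a dense label map to one label per vision token
    by taking the majority class inside each patch (an assumed
    labeling scheme, not necessarily the paper's)."""
    h, w = len(seg_mask), len(seg_mask[0])
    labels = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            counts = {}
            for di in range(patch):
                for dj in range(patch):
                    c = seg_mask[i + di][j + dj]
                    counts[c] = counts.get(c, 0) + 1
            labels.append(max(counts, key=counts.get))
    return labels

def token_ce_loss(token_logits, token_labels):
    """Mean cross-entropy between each vision token's class logits
    and its patch label -- no task-specific decoder involved."""
    total = 0.0
    for logits, y in zip(token_logits, token_labels):
        m = max(logits)  # log-sum-exp with max-shift for stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[y]
    return total / len(token_labels)

# Toy example: a 4x4 segmentation mask, 2x2 patches -> 4 tokens, 3 classes.
mask = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 1, 1],
        [2, 2, 1, 1]]
labels = patch_majority_labels(mask, 2)   # one label per token: [0, 1, 2, 1]
logits = [[2.0, 0.1, 0.1],                # each token favors its true class
          [0.1, 2.0, 0.1],
          [0.1, 0.1, 2.0],
          [0.1, 2.0, 0.1]]
loss = token_ce_loss(logits, labels)      # small, since predictions match
```

The point of the sketch is the supervision path: dense labels are collapsed to the vision-token grid and the loss is applied to standard token outputs, which is what lets the generalist architecture stay unchanged.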
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.14134 (cs) [Submitted on 15 Feb 2026]
Title: DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors
Authors: Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception wit...