[2411.11706] MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Summary
The paper presents MC-LLaVA, a multi-concept personalization paradigm for vision-language models (VLMs). It handles multiple user-provided concepts jointly during both training and inference, a setting that earlier single-concept personalization methods do not address.
Why It Matters
Existing VLM personalization work mainly targets single concepts and neglects how multiple concepts coexist and interact, which limits real-world applicability. By supporting multi-concept personalization, MC-LLaVA makes VLMs more useful as user assistants in scenes that involve several user-provided concepts at once.
Key Takeaways
- MC-LLaVA uses a multi-concept instruction tuning strategy that integrates multiple concepts in a single training step.
- A personalized textual prompt initializes concept tokens from visual token information, which reduces training costs.
- An auxiliary loss is introduced to improve the effectiveness of personalized prompts.
- A high-quality dataset featuring diverse multi-concept scenarios is contributed.
- Comprehensive experiments show significant improvements in multi-concept personalized responses.
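The idea of initializing concept tokens from visual token information can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `init_concept_tokens`, the contiguous grouping, and the plain averaging are all assumptions made for illustration, and the visual tokens are assumed to be already projected into the language model's embedding space.

```python
from statistics import fmean


def init_concept_tokens(visual_tokens, num_tokens):
    """Derive starting embeddings for learnable concept tokens.

    Hypothetical helper, not taken from the paper: it splits the visual
    tokens of a concept into `num_tokens` contiguous groups and averages
    each group, so the concept tokens start from visual information
    instead of a random initialization.
    """
    n = len(visual_tokens)
    if not 1 <= num_tokens <= n:
        raise ValueError("num_tokens must be between 1 and len(visual_tokens)")

    # Integer boundaries that split n tokens into num_tokens non-empty groups.
    bounds = [i * n // num_tokens for i in range(num_tokens + 1)]
    groups = [visual_tokens[bounds[i]:bounds[i + 1]] for i in range(num_tokens)]

    # Average each group dimension by dimension.
    return [[fmean(column) for column in zip(*group)] for group in groups]


# Four toy visual tokens of dimension 2, reduced to two concept tokens.
toy_visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(init_concept_tokens(toy_visual_tokens, 2))  # [[2.0, 3.0], [6.0, 7.0]]
```

In a real training setup the returned vectors would seed the embeddings of newly added concept tokens, which are then tuned further. Any pooling scheme could stand in for the simple averaging used here.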
Computer Science > Computer Vision and Pattern Recognition
arXiv:2411.11706 (cs)
[Submitted on 18 Nov 2024 (v1), last revised 18 Feb 2026 (this version, v4)]
Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we inco...