[2602.21397] MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
Summary
The paper presents MMLoP, a framework for efficient vision-language adaptation via low-rank prompting that achieves high accuracy with significantly fewer trainable parameters than existing methods.
Why It Matters
MMLoP addresses the challenge of adapting vision-language models without extensive parameter tuning, making it a notable advance in parameter-efficient machine learning. Its efficiency could broaden AI applications where resources are constrained, such as mobile devices or real-time systems.
Key Takeaways
- MMLoP achieves efficient vision-language adaptation with only 11.5K trainable parameters.
- The framework employs low-rank factorization to regularize prompts and mitigate overfitting.
- It introduces additional components, including a self-regulating consistency loss and uniform drift correction, to improve performance.
- MMLoP outperforms many existing methods in the accuracy-efficiency tradeoff across multiple datasets.
- The approach is particularly beneficial for few-shot learning scenarios.
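The low-rank idea in the takeaways above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt length, embedding dimension, rank, and initialization scale are all assumptions chosen for readability.

```python
import numpy as np

def low_rank_prompt(n_tokens, d_model, rank, rng):
    """Parameterize a prompt P (n_tokens x d_model) as A @ B, where
    A is (n_tokens x rank) and B is (rank x d_model). Only A and B are
    trainable, so the parameter count drops from n_tokens * d_model
    to rank * (n_tokens + d_model)."""
    A = rng.standard_normal((n_tokens, rank)) * 0.02
    B = rng.standard_normal((rank, d_model)) * 0.02
    return A, B

rng = np.random.default_rng(0)
n_tokens, d_model, rank = 4, 512, 2   # illustrative shapes
A, B = low_rank_prompt(n_tokens, d_model, rank, rng)
P = A @ B  # effective prompt tokens fed into a transformer layer

full_params = n_tokens * d_model              # dense prompt: 2048 parameters
lowrank_params = rank * (n_tokens + d_model)  # factorized: 1032 parameters
print(P.shape, full_params, lowrank_params)
```

Because the rank constrains the prompt matrix to a low-dimensional subspace, this factorization also acts as the implicit regularizer against few-shot overfitting that the paper describes.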
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.21397 (cs)
[Submitted on 24 Feb 2026]
Title: MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
Authors: Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Abstract: Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose MMLoP (Multi-Modal Low-Rank Prompting), a framework that achieves deep multi-modal prompting with only 11.5K trainable parameters, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three comple...
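The abstract's point that deep multi-modal prompting normally costs millions of parameters, while factorization keeps it small, can be seen with a back-of-the-envelope count. All shapes below (number of prompted layers, prompt length, encoder widths, rank) are illustrative assumptions, not the paper's actual configuration, so the totals will not match the reported 11.5K exactly.

```python
def prompt_params(n_layers, n_tokens, d_model, rank=None):
    """Trainable parameters for prompts inserted at n_layers layers.
    rank=None -> dense prompts (n_tokens * d_model per layer);
    otherwise each prompt is factorized as (n_tokens x rank)(rank x d_model)."""
    per_layer = (n_tokens * d_model if rank is None
                 else rank * (n_tokens + d_model))
    return n_layers * per_layer

# Illustrative CLIP-like widths: 768 for the vision encoder, 512 for text.
dense = prompt_params(12, 4, 768) + prompt_params(12, 4, 512)
lowrank = prompt_params(12, 4, 768, rank=1) + prompt_params(12, 4, 512, rank=1)
print(dense, lowrank)  # dense prompting is ~4x larger even at this small scale
```

Longer prompts or wider encoders widen the gap further, since the dense count scales with the product n_tokens * d_model while the factorized count scales with their sum.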