[2602.12510] Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Summary
The Visual RAG Toolkit scales multi-vector visual retrieval with a training-free pooling method and a multi-stage search process, cutting the cost of indexing and search.
Why It Matters
Multi-vector visual retrievers are accurate but scale poorly, because each page yields thousands of vectors. This toolkit targets that bottleneck, making such retrieval practical for practitioners with modest hardware while maintaining accuracy.
Key Takeaways
- The Visual RAG Toolkit reduces stored vectors per page from thousands to dozens, which yields a quadratic reduction in vector-to-vector comparisons.
- It employs training-free pooling and multi-stage retrieval to maintain accuracy while improving throughput.
- The toolkit includes robust preprocessing features, facilitating easier integration into existing workflows.
- Performance is optimized for common retrieval cutoffs, lowering hardware barriers for users.
- The approach is validated through experiments, demonstrating minimal degradation in retrieval quality.
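To make the vector-count reduction in the takeaways concrete, here is a small arithmetic sketch. The 32x32 patch grid, the 8x8 tile size, and both helper names are illustrative assumptions, not figures or functions from the paper. Pooling non-overlapping square tiles shrinks the per-page vector count by the square of the tile side, which is one possible reading of the "quadratic" wording; that reading is an assumption, not something the source spells out.

```python
def pooled_vector_count(grid_h: int, grid_w: int, tile: int) -> int:
    """Vectors left after pooling non-overlapping tile x tile blocks of a patch grid."""
    return (grid_h // tile) * (grid_w // tile)


def maxsim_comparisons(query_tokens: int, vectors_per_page: int, pages: int) -> int:
    """Dot products needed to score every page with late interaction (MaxSim)."""
    return query_tokens * vectors_per_page * pages


# Hypothetical numbers: a 32x32 patch grid pooled with 8x8 tiles.
full = 32 * 32
pooled = pooled_vector_count(32, 32, 8)
print(full, pooled, full // pooled)  # 1024 16 64
```

With these assumed sizes, a 20-token query over 1,000 pages needs 20,480,000 dot products against full embeddings and 320,000 against pooled ones.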
Computer Science > Information Retrieval
arXiv:2602.12510 (cs)
[Submitted on 13 Feb 2026]
Title: Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Authors: Ara Yeroyan
Abstract: Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves ...
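The pipeline the abstract describes (pool patch embeddings into a few tile vectors, generate candidates cheaply, then rerank with exact MaxSim) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the toolkit's API: the function names, the non-overlapping mean pooling, the row-major patch layout, and the use of MaxSim over pooled tiles for the first stage are all hypothetical choices. The sliding-window variant and the global representation are not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def pool_tiles(patches, grid_h, grid_w, tile):
    """Mean-pool a row-major (grid_h * grid_w, dim) patch matrix into tile vectors.

    Assumes grid_h and grid_w are divisible by tile (non-overlapping tiles).
    """
    dim = patches.shape[1]
    grid = patches.reshape(grid_h // tile, tile, grid_w // tile, tile, dim)
    tiles = grid.mean(axis=(1, 3)).reshape(-1, dim)
    return l2_normalize(tiles)


def maxsim(query, doc):
    """Late-interaction score: best document match per query vector, summed."""
    return float((query @ doc.T).max(axis=1).sum())


def two_stage_search(query, pages, grid_h, grid_w, tile, prefetch_k, top_k):
    """Stage 1: rank pages by MaxSim over pooled tiles. Stage 2: exact rerank."""
    # A real index would compute and store the pooled vectors once, at indexing time.
    pooled = [pool_tiles(p, grid_h, grid_w, tile) for p in pages]
    coarse = np.array([maxsim(query, t) for t in pooled])
    candidates = np.argsort(-coarse)[:prefetch_k]
    # Only the prefetched candidates are scored against full patch embeddings.
    exact = sorted(((maxsim(query, pages[i]), int(i)) for i in candidates), reverse=True)
    return [(page_id, score) for score, page_id in exact[:top_k]]
```

The `prefetch_k` parameter controls the trade-off in this sketch: a larger value sends more pages to the exact reranker, which costs more but lowers the chance that the coarse stage drops a relevant page.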