[2510.06820] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Summary
The paper presents EDJE, an Efficient Discriminative Joint Encoder designed for vision-language reranking. By precomputing visual tokens offline, EDJE preserves strong retrieval performance while drastically reducing storage and online compute.
Why It Matters
As multimodal retrieval systems grow in importance, efficient models like EDJE address a key limitation of existing joint-encoder approaches: an expensive visual feature-extraction stage that prevents deployment at scale. Removing this bottleneck makes joint-encoder reranking practical for large-scale vision-language retrieval.
Key Takeaways
- EDJE precomputes vision tokens offline, optimizing retrieval processes.
- The model achieves high throughput, processing 50k image-text pairs per second.
- Storage requirements are drastically reduced to 49kB per image, enabling scalability.
- EDJE matches or exceeds the performance of prior models on standard datasets.
- This advancement facilitates faster and more efficient multimodal retrieval applications.
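The precompute-then-compress pipeline in the takeaways above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the token counts, dimensions, single-head attention, and function names are all hypothetical assumptions, and the real adapter and joint encoder are learned transformer components.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_vision_tokens(vision_tokens, queries):
    """Attention-based adapter (sketch): k learned queries attend over
    the precomputed vision tokens, yielding k compressed tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ vision_tokens.T / np.sqrt(d))  # (k, n)
    return attn @ vision_tokens                             # (k, d)

rng = np.random.default_rng(0)
n, k, d = 196, 8, 64  # hypothetical: n vision tokens -> k compressed, dim d

# Offline: vision tokens are precomputed once per image and stored
# (random stand-ins here; a real system would run a vision backbone).
vision_tokens = rng.standard_normal((n, d))
learned_queries = rng.standard_normal((k, d))
compressed = compress_vision_tokens(vision_tokens, learned_queries)

# Online: the compact joint encoder sees only the k compressed visual
# tokens concatenated with the text tokens of the query.
text_tokens = rng.standard_normal((12, d))
joint_input = np.concatenate([compressed, text_tokens], axis=0)
print(compressed.shape, joint_input.shape)  # (8, 64) (20, 64)
```

The point of the sketch is the asymmetry: the expensive per-image work happens once offline, while the online reranker only processes a short sequence of k + text tokens per image-text pair.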
Computer Science > Computer Vision and Pattern Recognition
arXiv:2510.06820 (cs)
[Submitted on 8 Oct 2025 (v1), last revised 22 Feb 2026 (this version, v2)]
Authors: Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin

Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero...