[2602.09448] The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training

Summary

The paper studies multi-query synthesis for dense retriever training and formalizes the Complexity-Diversity Principle (CDP): query quality benefits in-domain retrieval, while query diversity benefits out-of-domain and multi-hop retrieval, with the optimal balance determined by query complexity.

Why It Matters

Understanding the CDP gives researchers and practitioners a concrete rule for dense retriever training: favor query quality for in-domain performance and query diversity for out-of-domain and multi-hop retrieval, with query complexity indicating which to emphasize. This turns an otherwise ad hoc synthetic-data choice into a measurable decision.

Key Takeaways

  • The Complexity-Diversity Principle (CDP) suggests that query complexity influences the need for diversity in retrieval training.
  • Diversity benefits retrieval performance, especially in out-of-domain scenarios and multi-hop retrieval tasks.
  • Concrete thresholds tie diversity to query complexity, measured in content words (CW): with CW > 10, use diverse queries; with CW < 7, avoid them.
  • Experiments across 31 datasets validate the CDP's effectiveness in enhancing retrieval outcomes.
  • CW-weighted training can improve out-of-domain performance even with single-query data.
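The threshold rule in the takeaways can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tokenizer-free content-word heuristic and the small stopword set below are assumptions for demonstration; the paper's exact CW measure may differ.

```python
# Illustrative sketch of the CDP threshold rule (CW > 10: diversify;
# CW < 7: don't). The stopword list and whitespace tokenization are
# simplifying assumptions, not the paper's exact CW definition.

STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
    "at", "to", "for", "and", "or", "but", "with", "by", "from",
    "that", "this", "it", "as", "be", "do", "does", "did", "not",
}

def content_words(query: str) -> int:
    """Count tokens that are not stopwords (a rough proxy for CW)."""
    return sum(1 for tok in query.lower().split() if tok not in STOPWORDS)

def use_diverse_queries(query: str) -> "bool | None":
    """Apply the CDP thresholds.

    Returns True (use diversity) when CW > 10, False (avoid it) when
    CW < 7, and None in the 7-10 range, where the paper's thresholds
    give no rule.
    """
    cw = content_words(query)
    if cw > 10:
        return True
    if cw < 7:
        return False
    return None
```

A simple factoid query such as "what is the capital of france" falls below the lower threshold, while a long multi-hop question typically clears the upper one.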

Computer Science > Information Retrieval
arXiv:2602.09448 (cs) [Submitted on 10 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
Authors: Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto

Abstract: Prior synthetic query generation for dense retrieval produces one query per document, focusing on quality. We systematically study multi-query synthesis, discovering a quality-diversity trade-off: quality benefits in-domain, diversity benefits out-of-domain (OOD). Experiments on 31 datasets show diversity especially benefits multi-hop retrieval. Analysis reveals the diversity benefit correlates with query complexity ($r \geq 0.95$), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides thresholds (CW $>$ 10: use diversity; CW $<$ 7: avoid it) and enables CW-weighted training that improves OOD even with single-query data.

Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as: arXiv:2602.09448 [cs.IR] (arXiv:2602.09448v2 [cs.IR] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.09448
Submitted by: Xincan Feng
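The abstract mentions CW-weighted training improving OOD performance even with single-query data. The paper's exact weighting scheme is not given here, so the sketch below is purely hypothetical: it upweights the per-example loss of more complex queries with a linear ramp between the two CDP thresholds. The weight range and the ramp itself are illustrative assumptions.

```python
# Hypothetical CW-weighted training sketch: scale each example's loss
# by its query's content-word count. The linear ramp between the CDP
# thresholds (7 and 10) and the [0.5, 1.5] weight range are assumptions
# for illustration, not the paper's scheme.

def cw_weight(cw: int, lo: int = 7, hi: int = 10) -> float:
    """Map a content-word count to a weight in [0.5, 1.5], clipping
    outside the CDP thresholds (assumed mapping)."""
    t = (min(max(cw, lo), hi) - lo) / (hi - lo)  # 0..1 between thresholds
    return 0.5 + t

def weighted_mean_loss(losses: "list[float]", cws: "list[int]") -> float:
    """Weighted average of per-example losses by query complexity."""
    weights = [cw_weight(c) for c in cws]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

In a real training loop the per-example contrastive losses would come from the retriever's scoring function; here they are just floats to keep the weighting logic self-contained.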
