[2602.09448] The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
Summary
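The paper studies multi-query synthesis for dense retriever training and finds a quality-diversity trade-off: query quality helps in-domain retrieval, while query diversity helps out-of-domain (OOD) retrieval. It formalizes this as the Complexity-Diversity Principle (CDP), under which query complexity determines the optimal amount of diversity.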
Why It Matters
Understanding the CDP is crucial for enhancing the effectiveness of dense retrievers in information retrieval. By balancing query quality and diversity, researchers and practitioners can optimize retrieval systems, particularly in multi-hop scenarios, leading to better performance in real-world applications.
Key Takeaways
- The Complexity-Diversity Principle (CDP) states that query complexity, measured by content words (CW), determines the optimal diversity of training queries.
- Diversity benefits retrieval performance, especially in out-of-domain scenarios and multi-hop retrieval tasks.
- The CDP provides thresholds based on query complexity: use diversity when CW > 10 and avoid it when CW < 7.
- Experiments across 31 datasets validate the CDP's effectiveness in enhancing retrieval outcomes.
- CW-weighted training can improve out-of-domain performance even with single-query data.
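The threshold rule in the takeaways can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the abstract does not say how content words (CW) are counted, so the stopword list, `count_content_words`, and `diversity_recommendation` below are hypothetical stand-ins, and the behavior for CW between 7 and 10 is left unspecified because the abstract gives no guidance there.

```python
import re

# Assumption: a small hand-picked stopword list approximates "function words".
# The paper's actual content-word definition is not given in the abstract.
STOPWORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "from", "by", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "been", "do", "does",
    "did", "what", "which", "who", "whom", "when", "where", "why", "how",
    "that", "this", "it", "its", "as", "has", "have", "had",
})


def count_content_words(query: str) -> int:
    """Count lowercase alphanumeric tokens that are not in STOPWORDS."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return sum(1 for token in tokens if token not in STOPWORDS)


def diversity_recommendation(cw: int) -> str:
    """Apply the thresholds quoted in the abstract (CW > 10, CW < 7)."""
    if cw > 10:
        return "use diversity"
    if cw < 7:
        return "avoid diversity"
    # The abstract states no rule for 7 <= CW <= 10.
    return "unspecified"


query = "What is the capital of France?"
cw = count_content_words(query)  # "capital" and "france" remain after filtering
print(cw, diversity_recommendation(cw))  # prints: 2 avoid diversity
```

In practice one would likely average CW over a dataset's queries before applying the rule, and replace the stopword heuristic with a part-of-speech tagger; both choices are assumptions here rather than details taken from the paper.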
Computer Science > Information Retrieval
arXiv:2602.09448 (cs)
[Submitted on 10 Feb 2026 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: The Wisdom of Many Queries: Complexity-Diversity Principle for Dense Retriever Training
Authors: Xincan Feng, Noriki Nishida, Yusuke Sakai, Yuji Matsumoto
Abstract: Prior synthetic query generation for dense retrieval produces one query per document, focusing on quality. We systematically study multi-query synthesis, discovering a quality-diversity trade-off: quality benefits in-domain, diversity benefits out-of-domain (OOD). Experiments on 31 datasets show diversity especially benefits multi-hop retrieval. Analysis reveals diversity benefit correlates with query complexity ($r \geq 0.95$), measured by content words (CW). We formalize this as the Complexity-Diversity Principle (CDP): query complexity determines optimal diversity. CDP provides thresholds (CW $> 10$: use diversity; CW $< 7$: avoid it) and enables CW-weighted training that improves OOD even with single-query data.
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as: arXiv:2602.09448 [cs.IR] (or arXiv:2602.09448v2 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.09448
Submission history From: Xincan Feng [view emai...