[2602.17327] WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

arXiv - AI · 4 min read

Summary

WebFAQ 2.0 introduces a multilingual QA dataset with 198 million FAQ-based question-answer pairs across 108 languages, enhancing multilingual coverage and training resources for dense retrieval systems.

Why It Matters

This dataset significantly improves the resources available for training multilingual information retrieval systems. By addressing community feedback and expanding its scope, WebFAQ 2.0 supports advancements in AI-driven language processing and retrieval technologies, which are essential for developing more effective and inclusive AI applications.

Key Takeaways

  • WebFAQ 2.0 contains 198 million FAQ-based QA pairs across 108 languages.
  • The dataset includes a hard negatives dataset for improved training of dense retrievers.
  • A novel data collection strategy enhances multilingual coverage and context richness.
  • The resource supports two fine-tuning strategies for dense retrievers: Contrastive Learning and Knowledge Distillation.
  • WebFAQ 2.0 is part of a long-term effort with continuous updates through the Open Web Index.
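The first of the two fine-tuning strategies, contrastive learning with MultipleNegativesRanking loss, treats every other passage in a batch as a negative for a given query. The sketch below is an illustrative NumPy re-implementation of that objective, not the authors' training code; the function name, the scale factor of 20, and the toy embeddings are assumptions for the demo:

```python
import numpy as np

def mnr_loss(query_emb, passage_emb, scale=20.0):
    """MultipleNegativesRanking-style loss: for each query, its paired passage
    is the positive and every other passage in the batch is a negative.
    Embeddings are L2-normalized, so the score matrix holds scaled cosines."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                   # (B, B) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (the true pairing) as the target class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
aligned = mnr_loss(q, q)        # positives on the diagonal: low loss
shuffled = mnr_loss(q, q[::-1]) # positives misaligned: higher loss
```

Mined hard negatives, such as the 200 scored negatives per query that WebFAQ 2.0 ships, would be appended as extra columns of the score matrix rather than relying on in-batch negatives alone.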

Computer Science > Information Retrieval
arXiv:2602.17327 (cs) [Submitted on 19 Feb 2026]

Title: WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Authors: Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović, Michael Granitzer

Abstract: We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and grows the number of bilingually aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss...
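The second strategy, knowledge distillation, typically regresses the student bi-encoder's score margin onto the teacher cross-encoder's margin for the same (query, positive, negative) triple, which is exactly what the per-query cross-encoder scores in the hard-negatives release enable. The truncated abstract does not name the distillation loss, so the margin-MSE objective below is an assumed, illustrative choice, with made-up score values:

```python
import numpy as np

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Margin-MSE distillation: push the student's margin between positive
    and negative scores toward the teacher cross-encoder's margin."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return np.mean((student_margin - teacher_margin) ** 2)

# Toy batch of two triples: student already matches the teacher's margins
pos = np.array([3.0, 2.5])
neg = np.array([1.0, 0.5])
perfect = margin_mse_loss(pos, neg, pos, neg)          # 0.0
off = margin_mse_loss(pos, neg, pos + 1.0, neg)        # positive loss
```

Because only the margin matters, the student need not reproduce the teacher's absolute score scale, which is one reason this style of distillation pairs well with precomputed cross-encoder scores like those shipped with WebFAQ 2.0.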
