[2602.17327] WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval

arXiv - AI · 4 min read

Summary

WebFAQ 2.0 introduces a multilingual QA dataset with 198 million FAQ-based question-answer pairs across 108 languages, enhancing multilingual coverage and training resources for dense retrieval systems.

Why It Matters

This dataset significantly improves the resources available for training multilingual information retrieval systems. By addressing community feedback and expanding its scope, WebFAQ 2.0 supports advancements in AI-driven language processing and retrieval technologies, which are essential for developing more effective and inclusive AI applications.

Key Takeaways

  • WebFAQ 2.0 contains 198 million FAQ-based QA pairs across 108 languages.
  • The dataset includes a hard negatives dataset for improved training of dense retrievers.
  • A novel data collection strategy enhances multilingual coverage and context richness.
  • The resource supports two fine-tuning strategies for dense retrievers: Contrastive Learning and Knowledge Distillation.
  • WebFAQ 2.0 is part of a long-term effort with continuous updates through the Open Web Index.
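The first of the two fine-tuning strategies, contrastive learning with MultipleNegativesRanking loss, treats every other passage in a batch as a negative for a given query. The sketch below is an illustrative NumPy re-implementation of that objective, not the authors' training code; the function name, the scale factor of 20, and the toy embeddings are assumptions for the demo:

```python
import numpy as np

def mnr_loss(query_emb, passage_emb, scale=20.0):
    """MultipleNegativesRanking-style loss: for each query, its paired passage
    is the positive and every other passage in the batch is a negative.
    Embeddings are L2-normalized, so the score matrix holds scaled cosines."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)                   # (B, B) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # Cross-entropy with the diagonal (the true pairing) as the target class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
aligned = mnr_loss(q, q)        # positives on the diagonal: low loss
shuffled = mnr_loss(q, q[::-1]) # positives misaligned: higher loss
```

Mined hard negatives, such as the 200 scored negatives per query that WebFAQ 2.0 ships, would be appended as extra columns of the score matrix rather than relying on in-batch negatives alone.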

Computer Science > Information Retrieval
arXiv:2602.17327 (cs) [Submitted on 19 Feb 2026]

Title: WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Authors: Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović, Michael Granitzer

Abstract: We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and grows the number of bilingually aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss...
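The second strategy, knowledge distillation, typically regresses the student bi-encoder's score margin onto the teacher cross-encoder's margin for the same (query, positive, negative) triple, which is exactly what the per-query cross-encoder scores in the hard-negatives release enable. The truncated abstract does not name the distillation loss, so the margin-MSE objective below is an assumed, illustrative choice, with made-up score values:

```python
import numpy as np

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Margin-MSE distillation: push the student's margin between positive
    and negative scores toward the teacher cross-encoder's margin."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return np.mean((student_margin - teacher_margin) ** 2)

# Toy batch of two triples: student already matches the teacher's margins
pos = np.array([3.0, 2.5])
neg = np.array([1.0, 0.5])
perfect = margin_mse_loss(pos, neg, pos, neg)          # 0.0
off = margin_mse_loss(pos, neg, pos + 1.0, neg)        # positive loss
```

Because only the margin matters, the student need not reproduce the teacher's absolute score scale, which is one reason this style of distillation pairs well with precomputed cross-encoder scores like those shipped with WebFAQ 2.0.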
