[2602.11151] Diffusion-Pretrained Dense and Contextual Embeddings

arXiv - Machine Learning

Summary

The paper introduces pplx-embed, a family of multilingual embedding models built on a diffusion-pretrained language-model backbone, delivering strong retrieval performance across a range of benchmarks.

Why It Matters

This research matters because it tackles context preservation in long documents and improves retrieval efficiency, both crucial for large-scale search applications. The contextual variant sets new records on the ConTEB benchmark, marking measurable progress in information retrieval and natural language processing.

Key Takeaways

  • Introduction of pplx-embed models for multilingual embeddings.
  • Models utilize multi-stage contrastive learning on a diffusion-pretrained backbone.
  • Demonstrated strong performance on multiple retrieval benchmarks.
  • pplx-embed-context-v1 sets new records on the ConTEB benchmark.
  • Effective in real-world scenarios with large-scale web data.
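The summary's training recipe is not spelled out here, but contrastive learning for embedding models is typically an InfoNCE-style objective over in-batch negatives. The sketch below is an illustrative assumption, not the paper's actual loss: each query's positive passage sits at the same batch index, and all other passages in the batch serve as negatives.

```python
# Hedged sketch: InfoNCE over in-batch negatives, a common contrastive
# objective for embedding models. The temperature value and batch setup
# are illustrative, not taken from the paper.
import numpy as np

def info_nce(queries: np.ndarray, passages: np.ndarray,
             temperature: float = 0.05) -> float:
    """Mean cross-entropy where each query's positive is the same-index passage."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives on the diagonal

rng = np.random.default_rng(1)
q = rng.normal(size=(4, 8))
matched = info_nce(q, q.copy())        # aligned pairs: low loss
mismatched = info_nce(q, q[::-1].copy())  # shuffled pairs: high loss
print(matched, mismatched)
```

In a multi-stage setup, later stages usually reuse the same loss on harder negatives and cleaner data; those details are not given in this summary.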

Computer Science > Machine Learning

arXiv:2602.11151 (cs) — Submitted on 11 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)

Title: Diffusion-Pretrained Dense and Contextual Embeddings
Authors: Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov

Abstract: In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, focusing on real-world, large-scale search scenarios constructed from ...
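The late chunking strategy the abstract describes can be sketched as follows: rather than embedding each chunk independently, the full document is encoded once so every token embedding carries document-wide bidirectional context, and chunk embeddings are then mean-pooled from spans of those token embeddings. The encoder below is a random stand-in, not pplx-embed; only the pooling flow is meant to be illustrative.

```python
# Hedged sketch of late chunking with mean pooling. encode_tokens is a
# placeholder for a bidirectional encoder (the real model would produce
# context-aware token embeddings for the whole document at once).
import numpy as np

rng = np.random.default_rng(0)

def encode_tokens(num_tokens: int, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: one embedding vector per token of the document."""
    return rng.normal(size=(num_tokens, dim))

def late_chunk(token_embs: np.ndarray,
               spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextual token embeddings over each chunk span."""
    return np.stack([token_embs[start:end].mean(axis=0)
                     for start, end in spans])

token_embs = encode_tokens(12)          # encode the whole document once
spans = [(0, 4), (4, 8), (8, 12)]       # then split into three chunks
chunks = late_chunk(token_embs, spans)
print(chunks.shape)                     # one embedding per chunk
```

Because pooling happens after full-document encoding, each chunk embedding reflects context from outside its own span, which is the property the paper credits for preserving global context in long documents.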
