[2602.11151] Diffusion-Pretrained Dense and Contextual Embeddings

arXiv - Machine Learning

Summary

The paper introduces pplx-embed, a family of multilingual embedding models built on a diffusion-pretrained language-model backbone, delivering strong retrieval performance across a range of benchmarks.

Why It Matters

This research matters because it tackles context preservation in long documents and improves retrieval efficiency, both crucial for large-scale search applications. The contextual variant sets new records on the ConTEB benchmark, marking measurable progress in information retrieval and natural language processing.

Key Takeaways

  • Introduction of pplx-embed models for multilingual embeddings.
  • Models utilize multi-stage contrastive learning on a diffusion-pretrained backbone.
  • Demonstrated strong performance on multiple retrieval benchmarks.
  • pplx-embed-context-v1 sets new records on the ConTEB benchmark.
  • Effective in real-world scenarios with large-scale web data.
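The summary's training recipe is not spelled out here, but contrastive learning for embedding models is typically an InfoNCE-style objective over in-batch negatives. The sketch below is an illustrative assumption, not the paper's actual loss: each query's positive passage sits at the same batch index, and all other passages in the batch serve as negatives.

```python
# Hedged sketch: InfoNCE over in-batch negatives, a common contrastive
# objective for embedding models. The temperature value and batch setup
# are illustrative, not taken from the paper.
import numpy as np

def info_nce(queries: np.ndarray, passages: np.ndarray,
             temperature: float = 0.05) -> float:
    """Mean cross-entropy where each query's positive is the same-index passage."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives on the diagonal

rng = np.random.default_rng(1)
q = rng.normal(size=(4, 8))
matched = info_nce(q, q.copy())        # aligned pairs: low loss
mismatched = info_nce(q, q[::-1].copy())  # shuffled pairs: high loss
print(matched, mismatched)
```

In a multi-stage setup, later stages usually reuse the same loss on harder negatives and cleaner data; those details are not given in this summary.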

Computer Science > Machine Learning

arXiv:2602.11151 (cs) — Submitted on 11 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)

Title: Diffusion-Pretrained Dense and Contextual Embeddings
Authors: Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov

Abstract: In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, focusing on real-world, large-scale search scenarios constructed from ...
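The late chunking strategy the abstract describes can be sketched as follows: rather than embedding each chunk independently, the full document is encoded once so every token embedding carries document-wide bidirectional context, and chunk embeddings are then mean-pooled from spans of those token embeddings. The encoder below is a random stand-in, not pplx-embed; only the pooling flow is meant to be illustrative.

```python
# Hedged sketch of late chunking with mean pooling. encode_tokens is a
# placeholder for a bidirectional encoder (the real model would produce
# context-aware token embeddings for the whole document at once).
import numpy as np

rng = np.random.default_rng(0)

def encode_tokens(num_tokens: int, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: one embedding vector per token of the document."""
    return rng.normal(size=(num_tokens, dim))

def late_chunk(token_embs: np.ndarray,
               spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextual token embeddings over each chunk span."""
    return np.stack([token_embs[start:end].mean(axis=0)
                     for start, end in spans])

token_embs = encode_tokens(12)          # encode the whole document once
spans = [(0, 4), (4, 8), (8, 12)]       # then split into three chunks
chunks = late_chunk(token_embs, spans)
print(chunks.shape)                     # one embedding per chunk
```

Because pooling happens after full-document encoding, each chunk embedding reflects context from outside its own span, which is the property the paper credits for preserving global context in long documents.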
