[2604.06829] WRAP++: Web discoveRy Amplified Pretraining
Computer Science > Computation and Language

arXiv:2604.06829 (cs) [Submitted on 8 Apr 2026]

Title: WRAP++: Web discoveRy Amplified Pretraining
Authors: Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang

Abstract: Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs, including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, ...
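The dual-link and co-mention motifs described in the abstract can be made concrete with a small sketch. The paper's actual pipeline, function names, and thresholds are not shown on this page; the snippet below is a hypothetical illustration that assumes the hyperlink graph is available as a mapping from each page title to the set of titles it links to.

```python
# Hypothetical sketch of hyperlink-based document-pair discovery in the
# spirit of WRAP++'s dual-link and co-mention motifs. All names and the
# max_fanin threshold are assumptions, not the paper's implementation.
from itertools import combinations
from collections import defaultdict

def discover_pairs(outlinks: dict[str, set[str]], max_fanin: int = 50):
    """Return (dual_links, co_mentions) as sets of sorted title pairs.

    outlinks maps a page title to the set of page titles it links to.
    """
    # Dual-links: A links to B and B links back to A -- a strong signal
    # that the two documents describe related entities.
    dual_links = {
        tuple(sorted((a, b)))
        for a, targets in outlinks.items()
        for b in targets
        if a in outlinks.get(b, set())
    }

    # Co-mentions: two pages that both link to the same third page.
    # Invert the link graph, then pair up each page's in-neighbours.
    inlinks: defaultdict[str, set[str]] = defaultdict(set)
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].add(src)

    co_mentions = {
        pair
        for sources in inlinks.values()
        if len(sources) <= max_fanin  # skip hub pages (see note below)
        for pair in combinations(sorted(sources), 2)
    }

    return dual_links, co_mentions
```

The fan-in cap reflects the combinatorial growth the abstract mentions: a hub page linked from n documents yields n(n-1)/2 co-mention pairs, so without some cap (whether the paper uses one, and at what value, is unknown), generic hubs would dominate the discovered pairs with low-signal relationships.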