[2601.15298] Embedding Retrofitting: Data Engineering for better RAG
Summary
This paper examines embedding retrofitting, a technique that adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval, and argues that data quality largely determines whether retrofitting helps or hurts.
Why It Matters
The study highlights how the quality of knowledge graphs significantly impacts the effectiveness of embedding retrofitting. By addressing data quality issues, researchers can enhance retrieval accuracy in natural language processing tasks, which is crucial for developing more reliable AI systems.
Key Takeaways
- Embedding retrofitting adjusts word vectors to improve retrieval.
- Data quality, particularly from knowledge graphs, is critical for success.
- Preprocessing can significantly enhance the performance of retrofitting techniques.
- Noisy knowledge graphs cause statistically significant degradation (−3.5% to −5.2%) across all retrofitting techniques tested.
- Quantitative synthesis questions benefit the most from improved retrofitting.
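To make the first takeaway concrete, here is a minimal sketch of retrofitting in its classic iterative-averaging form; the paper's EWMA variant is not described in this summary and is not reproduced here. Each vector is repeatedly pulled toward the mean of its knowledge-graph neighbors while staying close to its pre-trained value. All names below are illustrative.

```python
def retrofit(vectors, graph, alpha=1.0, beta=1.0, iters=10):
    """Iteratively blend each pre-trained vector with its graph neighbors.

    vectors: {word: [float, ...]}   pre-trained embeddings (held fixed)
    graph:   {word: set of words}   knowledge-graph edges
    alpha:   weight tying a word to its original vector
    beta:    weight tying a word to each neighbor's current vector
    """
    q = {w: list(v) for w, v in vectors.items()}  # retrofitted copies
    for _ in range(iters):
        for w, nbrs in graph.items():
            nbrs = [n for n in nbrs if n in q]
            if w not in q or not nbrs:
                continue  # isolated or out-of-vocabulary words stay put
            denom = alpha + beta * len(nbrs)
            # Weighted average of the original vector and neighbor vectors.
            q[w] = [
                (alpha * vectors[w][d] + beta * sum(q[n][d] for n in nbrs)) / denom
                for d in range(len(q[w]))
            ]
    return q
```

With `alpha = beta = 1` and a single edge, the fixed point blends two parts original vector with one part neighbor. This also shows the failure mode the paper measures: a spurious edge pulls a vector toward an unrelated neighbor with exactly the same force as a genuine one.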
Computer Science > Computation and Language (arXiv:2601.15298 [cs])

This paper has been withdrawn by Anantha Sharma.
[Submitted on 6 Jan 2026 (v1), last revised 17 Feb 2026 (this version, v2)]

Title: Embedding Retrofitting: Data Engineering for better RAG
Authors: Anantha Sharma

Abstract: Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation ($-3.5\%$ to $-5.2\%$, $p<0.05$). After preprocessing, EWMA retrofitting achieves a $+6.2\%$ improvement ($p=0.0348$), with benefits concentrated in quantitative synthesis questions ($+33.8\%$ average). The gap between clean and noisy preprocessing (a swing of more than 10%) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
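The abstract attributes the degradation to hashtag annotations that inflate knowledge-graph density. A minimal illustrative cleanup, assuming tags follow the common `#word` pattern, would strip them before graph construction so that documents sharing only an annotation tag do not become spuriously connected. The paper's actual preprocessing pipeline is not specified in this summary; this function is a hypothetical sketch.

```python
import re


def strip_hashtag_artifacts(text):
    """Remove '#tag' tokens and collapse the whitespace they leave behind."""
    # Drop hashtag annotations so co-occurrence-based graph construction
    # does not create edges between otherwise unrelated passages.
    text = re.sub(r"#\w+", "", text)
    # Collapse runs of spaces produced by the removal, then trim the ends.
    return re.sub(r"\s{2,}", " ", text).strip()
```

Running this over a corpus before extracting entities and edges is one way to realize the "clean" condition the abstract contrasts with the noisy one.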