[2603.22213] SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Computer Science > Machine Learning

arXiv:2603.22213 (cs)

[Submitted on 23 Mar 2026]

Title: SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu

Abstract: While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, its advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope ...
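To make the approach concrete, here is a minimal Python sketch of prompt-engineered augmentation in the spirit of SPA. The prompt templates, the `complete` client, and the `samples_per_doc` parameter are illustrative placeholders, not the paper's actual prompts or pipeline: the core idea is simply to apply a small, fixed set of hand-designed prompts to each source document many times to produce large-scale synthetic training data.

```python
import random

# Illustrative templates only; the paper's carefully designed prompts differ.
PROMPT_TEMPLATES = [
    "Rewrite the following passage in your own words:\n\n{doc}",
    "Write three question-answer pairs grounded in this passage:\n\n{doc}",
    "Summarize the key facts in this passage as bullet points:\n\n{doc}",
]

def complete(prompt: str) -> str:
    """Placeholder for an LLM call; plug in any text-generation client here."""
    raise NotImplementedError("connect your model client")

def augment(corpus: list[str], samples_per_doc: int = 8) -> list[str]:
    """Generate synthetic text by repeatedly applying a small set of
    hand-designed prompts to each source document."""
    synthetic = []
    for doc in corpus:
        for _ in range(samples_per_doc):
            template = random.choice(PROMPT_TEMPLATES)
            synthetic.append(complete(template.format(doc=doc)))
    return synthetic
```

The resulting synthetic corpus would then be used for continued training of the target model; per the abstract, scaling this straightforward generation loop is what makes the baseline hard to beat.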