[2602.07298] Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation
Summary
This paper presents a framework for generating high-quality synthetic data that establishes scaling laws for large language models (LLMs) in recommendation systems, demonstrating substantial downstream gains over models trained on raw user interaction data.
Why It Matters
The research addresses a critical gap in developing LLMs for recommendation by introducing a method for generating synthetic data that sidesteps the noise, bias, and incompleteness of raw user interaction data. With predictable scaling laws, practitioners can forecast the returns on additional model and data scale, improving resource allocation for real-world recommender deployments.
Key Takeaways
- Introduces a framework for generating high-quality synthetic data for LLMs.
- Demonstrates a 130% improvement in recall@100 (SASRec) for models trained on synthetic data compared to real data.
- Establishes the first robust power-law scaling for LLMs in the recommendation domain.
- Shifts focus from data deficiencies to leveraging structured information.
- Provides empirical evidence for predictable perplexity reduction across synthetic data modalities.
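The "robust power-law scaling" in the takeaways means that loss (or perplexity) falls as a power of scale, which appears as a straight line in log-log space. As a minimal sketch (with made-up numbers, not the paper's data), the exponent of such a law can be recovered by a linear fit on log-transformed values:

```python
import numpy as np

# Hypothetical (scale, loss) pairs following a power law L(N) = a * N^(-b).
# The constants a = 3.0 and b = 0.05 are illustrative, not from the paper.
scales = np.array([1e7, 1e8, 1e9, 1e10])
losses = 3.0 * scales ** -0.05

# A power law is linear in log-log space: log L = log a - b * log N,
# so a degree-1 polynomial fit recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(scales), np.log(losses), 1)
b = -slope           # scaling exponent
a = np.exp(intercept)  # prefactor

print(f"a = {a:.3f}, b = {b:.3f}")  # recovers a ≈ 3.0, b ≈ 0.05
```

In practice one would fit measured validation losses at several training scales and check how well the straight-line (power-law) model explains them before extrapolating.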
Computer Science > Information Retrieval
arXiv:2602.07298 (cs)
[Submitted on 7 Feb 2026 (v1), last revised 12 Feb 2026 (this version, v2)]
Authors: Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Xiangjun Fan, Hong Yan
Abstract: Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ($+130\%$ on recall@100 for SASRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for learning generalizable user preference patterns...
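The recall@100 figure quoted in the abstract is a standard top-k retrieval metric: the fraction of a user's held-out relevant items that appear in the model's top-k recommendations. A minimal sketch (with hypothetical item IDs, not the paper's evaluation code):

```python
def recall_at_k(recommended, relevant, k=100):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    top_k = set(recommended[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Toy example: ranked recommendations vs. a user's held-out interactions.
recs = [5, 3, 9, 1, 7]   # model's ranked list, best first
held_out = [3, 7, 8]     # items the user actually interacted with

score = recall_at_k(recs, held_out, k=5)  # 2 of 3 held-out items retrieved
print(score)
```

A "+130% on recall@100" result means the synthetic-data-trained model's recall@100 is 2.3 times that of the real-data baseline, not an absolute 130-point gain.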