[2602.22743] Generative Data Transformation: From Mixed to Unified Data
Summary
The paper presents Taesar, a data-centric framework that aims to improve recommendation model performance under data sparsity and cold start by encoding cross-domain context into target-domain sequences.
Why It Matters
As recommendation systems increasingly draw on data from multiple domains, Taesar moves the integration work into the data itself rather than into complex, customized model architectures. Because standard models can then be trained on the regenerated data, this could make cross-domain recommendation simpler to build and less demanding on computational resources.
Key Takeaways
- Taesar employs a contrastive decoding mechanism for better data integration.
- The framework improves model performance by addressing data sparsity and cold start issues.
- Taesar outperforms existing model-centric solutions in generating enriched datasets.
- It allows standard models to learn intricate dependencies without complex architectures.
- The code for Taesar is publicly available, promoting further research and application.
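To make the first takeaway concrete, here is a minimal sketch of contrastive decoding in its generic form, not Taesar's actual implementation. It assumes two next-item distributions, one from a model that sees cross-domain context and one from a model that does not, and prefers items whose probability rises most when the context is present. The function names, item IDs, probabilities, and the alpha and beta parameters are all hypothetical.

```python
import math


def contrastive_scores(p_context, p_baseline, alpha=1.0, beta=0.1):
    """Score candidate items by generic contrastive decoding.

    p_context:  dict item -> probability from the model that sees extra context
    p_baseline: dict item -> probability from the model that does not
    alpha:      weight on the baseline penalty (hypothetical default)
    beta:       plausibility cutoff, relative to the top context probability
    """
    cutoff = beta * max(p_context.values())
    scores = {}
    for item, p_ctx in p_context.items():
        if p_ctx < cutoff:
            # Implausible under the context model: drop it, so that a tiny
            # baseline probability cannot inflate its score.
            continue
        scores[item] = math.log(p_ctx) - alpha * math.log(p_baseline[item])
    return scores


def pick_next_item(p_context, p_baseline, alpha=1.0, beta=0.1):
    """Return the item with the highest contrastive score."""
    scores = contrastive_scores(p_context, p_baseline, alpha, beta)
    return max(scores, key=scores.get)


# Toy distributions with made-up numbers, for illustration only.
p_context = {"book_a": 0.50, "book_b": 0.30, "book_c": 0.19, "book_d": 0.01}
p_baseline = {"book_a": 0.60, "book_b": 0.10, "book_c": 0.299, "book_d": 0.001}

print(pick_next_item(p_context, p_baseline))
```

In this toy example, book_b wins because the context triples its probability, while book_d is filtered out by the plausibility cutoff despite its large ratio.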
Computer Science > Artificial Intelligence
arXiv:2602.22743 (cs)
[Submitted on 26 Feb 2026]
Title: Generative Data Transformation: From Mixed to Unified Data
Authors: Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, Enhong Chen
Abstract: The performance of a recommendation model is intrinsically tied to the quality, volume, and relevance of its training data. To address common challenges like data sparsity and cold start, recent research has leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. The prevailing model-centric paradigm, which relies on complex, customized architectures, struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose Taesar, a data-centric framework for Target-AlignEd SequentiAl Regeneration, which employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences. It employs contrastive decoding to encode cross-dom...