[2603.00889] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Computer Science > Computation and Language
arXiv:2603.00889 (cs)
[Submitted on 1 Mar 2026]

Title: CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning
Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng

Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with the detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics and rarely extend to broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it pr...