[2508.20373] NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Summary
The paper presents NPG-Muse, a novel approach to enhance long chain-of-thought reasoning in large language models using NP-hard graph problems as a synthetic training corpus.
Why It Matters
As large language models (LLMs) become integral to AI applications, enhancing their reasoning capabilities is crucial. This research introduces a scalable method to improve reasoning depth and efficiency, addressing the limitations of current training datasets and methods. The findings could significantly impact AI development and applications in complex reasoning tasks.
Key Takeaways
- NPG-Muse utilizes NP-hard graph problems to train LLMs for improved reasoning.
- A two-stage post-training framework enhances reasoning depth and efficiency.
- The approach demonstrates superior performance in various reasoning benchmarks.
- This method provides a scalable alternative to costly human-curated datasets.
- NPG-Muse-7B outperforms larger models in specific reasoning tasks.
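The first post-training stage relies on rejection sampling: candidate chain-of-thought traces are kept only when their final answer passes a verifier. The paper does not publish its sampling code, so the following is a minimal sketch under assumptions; `sample_cot` stands in for a hypothetical model rollout, and each instance is assumed to carry a known ground-truth optimum against which answers are checked.

```python
def verify_answer(instance, answer):
    """Check a candidate final answer against the instance's label.

    Placeholder verifier: we assume each synthetic instance stores the
    known optimal value under the key "optimum".
    """
    return answer == instance["optimum"]

def rejection_sample_sft(instances, sample_cot, k=8):
    """Build an SFT dataset by keeping at most one verified trace per instance.

    sample_cot(instance) -> (trace, answer) is a hypothetical model rollout;
    k bounds the number of rollouts attempted per instance.
    """
    dataset = []
    for inst in instances:
        for _ in range(k):
            trace, answer = sample_cot(inst)
            if verify_answer(inst, answer):
                # Only verified long-CoT traces enter the SFT corpus.
                dataset.append({"prompt": inst["prompt"], "completion": trace})
                break
    return dataset
```

Instances whose rollouts never verify within `k` attempts are simply dropped, which is what makes the resulting corpus high-quality without human curation.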
Paper Details
arXiv:2508.20373 (cs), Computer Science > Computation and Language
Submitted on 28 Aug 2025 (v1); last revised 17 Feb 2026 (this version, v2)
Title: NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Authors: Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Chenyi Zi, Chen Zhang, Jia Li
Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are the core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long-CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. The resulting NP...
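A practical appeal of NP-hard graph problems as a synthetic corpus is that instances can be generated cheaply and, at small sizes, labeled exactly by brute force. The paper's exact problem set and generation pipeline are not given here, so this is an illustrative sketch using minimum vertex cover, a classic NP-hard problem, with hypothetical helper names.

```python
import itertools
import random

def random_graph(n, p, seed=0):
    """Sample an Erdos-Renyi style edge list on n vertices with edge prob p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < p]

def min_vertex_cover_size(n, edges):
    """Exact label by brute force: the smallest k such that some k-subset
    of vertices touches every edge. Feasible only for small n, which is
    enough to label synthetic training instances."""
    for k in range(n + 1):
        for subset in itertools.combinations(range(n), k):
            chosen = set(subset)
            if all(u in chosen or v in chosen for (u, v) in edges):
                return k
    return n
```

Pairing each generated graph with its exact optimum gives a verifiable target, which is what both the rejection-sampling filter in SFT and the reward signal in RL would check answers against.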