[2508.20373] NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems


Summary

The paper presents NPG-Muse, an approach that strengthens long chain-of-thought reasoning in large language models by using NP-hard graph problems as a synthetic training corpus.

Why It Matters

As large language models (LLMs) become integral to AI applications, enhancing their reasoning capabilities is crucial. This research introduces a scalable method to improve reasoning depth and efficiency, addressing the cost and curation bottlenecks of current training datasets. The findings could significantly impact AI development and applications in complex reasoning tasks.

Key Takeaways

  • NPG-Muse utilizes NP-hard graph problems to train LLMs for improved reasoning.
  • A two-stage post-training framework enhances reasoning depth and efficiency.
  • The approach demonstrates superior performance in various reasoning benchmarks.
  • This method provides a scalable alternative to costly human-curated datasets.
  • NPG-Muse-7B outperforms larger models in specific reasoning tasks.
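A property that makes NP-hard graph problems attractive as training signal is that candidate solutions are cheap to check even though finding optimal ones is hard, so correctness can serve as an automatic reward. The paper's code is not shown here; the following is a minimal illustrative sketch (the function names and the choice of minimum vertex cover are my own) of generating a synthetic instance and verifying a candidate answer in polynomial time:

```python
import random

def random_graph(n, p, seed=0):
    """Generate an Erdos-Renyi random graph as an undirected edge list."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

def is_vertex_cover(edges, cover):
    """Check that every edge touches the cover.

    Verification is O(|E|) even though finding a *minimum* cover is NP-hard,
    which is what makes such instances usable as an automatic reward signal.
    """
    cover = set(cover)
    return all(u in cover or v in cover for (u, v) in edges)

edges = random_graph(8, 0.4, seed=42)
# The trivial cover (all vertices) always verifies; a model is rewarded
# for producing a valid cover, and a finer reward can favor smaller ones.
assert is_vertex_cover(edges, range(8))
```

In this framing, a graded reward (e.g., valid cover yes/no plus a bonus for smaller covers) matches the "fine-grained reward design" the abstract describes, though the paper's exact reward is not reproduced here.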

Computer Science > Computation and Language

arXiv:2508.20373 (cs). Submitted on 28 Aug 2025 (v1), last revised 17 Feb 2026 (this version, v2).

Title: NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems

Authors: Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li, Qifan Zhang, Chenyi Zi, Chen Zhang, Jia Li

Abstract: Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, as they inherently require deep reasoning, extensive exploration, and reflective strategies, which are the core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long-CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. The resulting NP...
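The first stage, SFT on rejection-sampled instances, amounts to keeping only reasoning traces whose final answer verifies against the instance. A hedged sketch of that filtering loop, assuming hypothetical `sample_fn` (a model call producing candidate traces) and `verify_fn` (an answer checker), neither of which is from the paper:

```python
def rejection_sample(instances, sample_fn, verify_fn, k=4):
    """Build an SFT dataset from verified reasoning traces.

    sample_fn(instance) -> iterable of candidate traces, each ending in a
    parsable answer; verify_fn(instance, trace) -> bool. Both are
    hypothetical stand-ins for a model call and a polynomial-time checker.
    """
    dataset = []
    for inst in instances:
        for trace in sample_fn(inst):
            if verify_fn(inst, trace):
                dataset.append((inst, trace))
                break  # keep at most one verified trace per instance
    return dataset
```

The design point is that the checker, not a human annotator, supplies the quality filter, which is what lets the corpus scale beyond human-curated math and code data.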
