[2602.11937] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration


Computer Science > Machine Learning

arXiv:2602.11937 (cs) [Submitted on 12 Feb 2026 (v1), last revised 26 Mar 2026 (this version, v2)]

Title: Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

Authors: Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwary, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Abstract: Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8xH100 node we achieve 1.63x and 1.22x ...
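One of the components named in the abstract, FP8 KV-cache quantization with calibrated scales, can be sketched in miniature. This is an illustrative simulation only, not the paper's implementation: the amax-based calibration rule, the crude E4M3 emulation, and all function names are assumptions made here for the sketch.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def calibrate_scale(kv_samples: np.ndarray) -> float:
    """Calibrated per-tensor scale: map the observed absolute max onto FP8's range.
    (Assumed calibration rule; the paper does not specify one here.)"""
    amax = float(np.abs(kv_samples).max())
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_fp8(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate an FP8 E4M3 round-trip: scale into range, clip, snap to a
    ~3-mantissa-bit value grid, and re-scale. Real kernels would store the
    8-bit codes; this only models the representable values."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    mant_bits = 3
    # Exponent of each value (0 where y == 0, guarded by `where`).
    exp = np.floor(np.log2(np.abs(y), where=np.abs(y) > 0, out=np.zeros_like(y)))
    step = 2.0 ** (exp - mant_bits)        # spacing of the FP8 grid at that exponent
    y = np.where(y == 0, 0.0, np.round(y / step) * step)
    return y * scale

# Toy KV-cache tensor: (heads, sequence, head_dim)
rng = np.random.default_rng(0)
kv = rng.normal(scale=0.5, size=(4, 128, 64)).astype(np.float32)
scale = calibrate_scale(kv)
kv_q = quantize_fp8(kv, scale)
rel_err = np.abs(kv_q - kv).mean() / np.abs(kv).mean()
```

With ~3 mantissa bits, the mean relative round-trip error stays in the low percent range, which is why a well-calibrated per-tensor scale is usually enough for KV-cache storage even though the cache halves in size versus FP16.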

Originally published on March 30, 2026. Curated by AI News.
