[2602.11937] Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration
Computer Science > Machine Learning
arXiv:2602.11937 (cs)
[Submitted on 12 Feb 2026 (v1), last revised 26 Mar 2026 (this version, v2)]

Title: Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration

Authors: Akhiad Bercovich, Nir Ailon, Vladimir Anisimov, Tomer Asida, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Roi Koren, Itay Levy, Zach Moshe, Pavlo Molchanov, Najeeb Nabwani, Mostofa Patwary, Omri Puny, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

Abstract: Reasoning-focused LLMs improve answer quality by generating longer reasoning traces, but the additional tokens dramatically increase serving cost, motivating inference optimization. We extend and apply Puzzle, a post-training neural architecture search (NAS) framework, to gpt-oss-120B to produce gpt-oss-puzzle-88B, a deployment-optimized derivative. Our approach combines heterogeneous MoE expert pruning, selective replacement of full-context attention with window attention, FP8 KV-cache quantization with calibrated scales, and post-training reinforcement learning to recover accuracy, while maintaining low generation length. In terms of per-token speeds, on an 8xH100 node we achieve 1.63X and 1.22X ...
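One of the listed techniques, FP8 KV-cache quantization with calibrated scales, can be illustrated with a minimal sketch: a per-tensor scale is derived from the maximum absolute value seen on calibration data, and cached keys/values are then mapped into the FP8 E4M3 dynamic range. The function names and the pure-Python FP8 simulation below are illustrative assumptions, not the paper's implementation; real deployments use hardware FP8 types rather than this software approximation, which only mimics E4M3's 4 significant bits and ±448 range.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def calibrate_scale(samples, fp8_max=FP8_E4M3_MAX):
    """Per-tensor scale so the calibration amax maps to the FP8 max."""
    amax = max(abs(x) for x in samples)
    return amax / fp8_max


def fp8_round(x):
    """Round to 4 significant binary digits, approximating E4M3 precision."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16        # keep implicit bit + 3 mantissa bits
    return math.ldexp(m, e)


def quantize(x, scale, fp8_max=FP8_E4M3_MAX):
    """Scale into FP8 range, clamp, and round to FP8-like precision."""
    q = max(-fp8_max, min(fp8_max, x / scale))
    return fp8_round(q)


def dequantize(q, scale):
    return q * scale


# Calibrate on sample KV activations, then round-trip a value.
scale = calibrate_scale([-3.5, 2.0, 1.1])   # amax 3.5 -> scale 1/128
restored = dequantize(quantize(3.3, scale), scale)
print(restored)  # close to 3.3, with a small quantization error
```

Values at or near the calibration amax survive the round trip almost exactly, while intermediate values pick up a small rounding error from the reduced mantissa, which is the trade-off calibrated scaling manages.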