[2603.02599] SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
Computer Science > Artificial Intelligence
arXiv:2603.02599 (cs)
[Submitted on 3 Mar 2026]

Title: SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving

Authors: Sunghyeon Woo, Ahreum Seo, Jaegwang Lee, Jaeeun Kil, Hanbae Seo, Joonghoon Kim, Baeseong Park, Se Jung Kwon, Dongsoo Lee

Abstract: In multi-model LLM serving, decode execution remains inefficient due to model-specific resource partitioning: since cross-model batching is not possible, memory-bound decoding often suffers from severe GPU underutilization, especially under skewed workloads. We propose Shared Use of Next-token Prediction (SUN), the first approach that enables cross-model sharing of decode execution in disaggregated multi-LLM serving. SUN decomposes a decoder-only Transformer into a prefill module and a decode module, and fine-tunes only the task-specific prefill module, enabling a frozen decode module to be shared across models. This design enables a model-agnostic decode routing policy that balances decode requests across shared workers to maximize utilization. Across diverse tasks and model families, SUN achieves accuracy comparable to full fine-tuning while maintaining system throughput with fewer decode workers. In particular, SUN improves throughput per GPU by up to 2.0x over conventiona...