[2604.00499] Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions
Computer Science > Machine Learning
arXiv:2604.00499 (cs)
[Submitted on 1 Apr 2026]

Title: Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Authors: Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

Abstract: To schedule LLM inference, the shortest job first (SJF) principle is favorable because it prioritizes requests with short output lengths, which avoids head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and is determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabi...
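
To illustrate the head-of-line blocking that motivates SJF, here is a minimal sketch that compares mean completion time under first-come-first-served and shortest-job-first ordering. It is not the paper's method: it does not implement TIE, whose definition is cut off in the abstract above. The sketch assumes a single server that processes one request at a time, a cost of one step per output token, and output lengths that are known exactly, which is the idealization the abstract argues against. The request lengths and the helper name `mean_completion_time` are hypothetical.

```python
from itertools import accumulate


def mean_completion_time(lengths):
    """Mean completion time when jobs run one at a time in the given order."""
    # The i-th job finishes once it and every job ahead of it have run.
    finish_times = list(accumulate(lengths))
    return sum(finish_times) / len(finish_times)


# Hypothetical output lengths (in tokens) for four queued requests.
# The long request arrives first and blocks the short ones behind it.
arrival_order = [900, 20, 50, 30]

fcfs = mean_completion_time(arrival_order)          # first come, first served
sjf = mean_completion_time(sorted(arrival_order))   # shortest job first

print(f"FCFS mean completion: {fcfs:.1f} steps")
print(f"SJF  mean completion: {sjf:.1f} steps")
```

With these made-up lengths, running the 900-token request first delays every other request, while sorting by length lets the three short requests finish early. In practice the lengths are not known in advance, which is why the ordering key has to come from a prediction.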