[2604.00499] Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

arXiv - Machine Learning April 02, 2026 4 min read

About this article

Abstract page for arXiv paper 2604.00499: Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Computer Science > Machine Learning

arXiv:2604.00499 (cs) [Submitted on 1 Apr 2026]

Title: Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Authors: Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, Jiawei Jiang

Abstract: To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabi...
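The two ideas the abstract leans on, heavy-tailed output lengths and SJF ordering, can be illustrated with a small sketch. It is not the paper's method and does not implement TIE, whose definition is cut off above. The parameters `mu`, `sigma`, `nu`, the `max_len` cap, and the helper names are all assumptions chosen for illustration. Requests are also assumed to run one at a time at one time unit per token, with true lengths known in advance, whereas a real serving system batches requests and only has predictions.

```python
import math
import random

MAX_LEN = 4096  # assumed generation cap, not taken from the paper


def sample_log_t(rng, mu, sigma, nu, max_len=MAX_LEN):
    """Draw one output length as exp(mu + sigma * T), with T ~ Student-t(nu).

    T is built from a standard normal and a chi-square variate; the
    exponent is clamped so the result never exceeds max_len.
    """
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    t = z / math.sqrt(v / nu)
    exponent = min(mu + sigma * t, math.log(max_len))
    return max(1, round(math.exp(exponent)))


def mean_completion_time(lengths):
    """Mean completion time when jobs run serially in the given order."""
    clock = 0
    total = 0
    for n in lengths:
        clock += n      # this job finishes once all earlier jobs and itself are done
        total += clock
    return total / len(lengths)


rng = random.Random(0)
# Hypothetical parameters: median length near exp(5), heavy tail from nu = 3.
lengths = [sample_log_t(rng, mu=5.0, sigma=0.8, nu=3.0) for _ in range(200)]

fcfs = mean_completion_time(lengths)          # arrival order
sjf = mean_completion_time(sorted(lengths))   # shortest job first

print(f"longest request: {max(lengths)} tokens")
print(f"mean completion time, FCFS: {fcfs:.1f}")
print(f"mean completion time, SJF:  {sjf:.1f}")
```

Under these assumptions, sorting by length can never increase the mean completion time, because a long request placed early delays everything queued behind it. That delay is the head-of-line blocking the abstract refers to. The occasional very long sample from the tail is what a single point estimate would fail to anticipate.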

Originally published on April 02, 2026. Curated by AI News.

