[2504.09775] Understanding and Optimizing Multi-Stage AI Inference Pipelines
Computer Science > Hardware Architecture
arXiv:2504.09775 (cs)
[Submitted on 14 Apr 2025 (v1), last revised 20 Mar 2026 (this version, v5)]

Title: Understanding and Optimizing Multi-Stage AI Inference Pipelines
Authors: Abhimanyu Rajeshkumar Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna

Abstract: The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi-step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions. To address this gap, we introduce MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. MIST sup...
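To illustrate the kind of multi-stage request flow the abstract describes, here is a minimal sketch of an end-to-end latency model in which a request passes through an ordered list of stages (RAG, KV retrieval, prefill, decode) with decode cost scaling per generated token. The stage names mirror the abstract, but the `Request` class, `simulate_latency` function, and all latency numbers are hypothetical illustrations, not MIST's actual API or measurements.

```python
from dataclasses import dataclass

# Hypothetical per-stage latencies in seconds; illustrative only,
# NOT taken from the paper.
STAGE_LATENCY = {
    "rag": 0.030,          # retrieval over an external corpus
    "kv_retrieval": 0.010, # fetching a cached KV prefix
    "prefill": 0.050,      # processing the full prompt
    "decode": 0.002,       # per generated token
}

@dataclass
class Request:
    stages: list[str]      # ordered stage names for this request
    decode_tokens: int = 0 # number of tokens to generate

def simulate_latency(req: Request) -> float:
    """Sum stage latencies; decode scales with the token count."""
    total = 0.0
    for stage in req.stages:
        if stage == "decode":
            total += STAGE_LATENCY["decode"] * req.decode_tokens
        else:
            total += STAGE_LATENCY[stage]
    return total

# A RAG-augmented request generating 100 tokens:
req = Request(stages=["rag", "prefill", "decode"], decode_tokens=100)
print(round(simulate_latency(req), 3))  # 0.03 + 0.05 + 0.2 = 0.28
```

In a real heterogeneous deployment, each stage would additionally map to a different engine (GPU, ASIC, CPU, or memory-centric hardware), which is exactly the dimension the paper's simulator is built to explore.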