[2604.09107] TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2604.09107 (cs)
[Submitted on 10 Apr 2026]

Title: TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

Authors: Chenhao Ye, Huaizheng Zhang, Mingcong Han, Baoquan Zhong, Xiang Li, Qixiang Chen, Xinyi Zhang, Weidong Zhang, Kaihua Jiang, Wang Zhang, He Sun, Wencong Xiao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Abstract: Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized trans...
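
The reference-oriented idea described in the abstract can be illustrated with a small sketch. This is not TensorHub's API or implementation: the names `Worker`, `ReferenceStore`, `publish`, `retire`, and `fetch` are hypothetical, plain Python lists stand in for GPU tensors, and a method call stands in for a network or GPU-to-GPU transfer. The only point it shows is that the store keeps references to workers holding a version, not copies of the weights, and serves reads from those workers.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical inference worker holding one weight version in memory."""
    name: str
    version: int
    weights: dict[str, list[float]]

    def read(self, tensor_name: str) -> list[float]:
        # Stands in for a real transfer from this worker's GPU memory.
        return self.weights[tensor_name]


@dataclass
class ReferenceStore:
    """Tracks which workers hold each version; stores no weights itself."""
    holders: dict[int, list[Worker]] = field(default_factory=dict)

    def publish(self, worker: Worker) -> None:
        # Record that this worker currently holds its version.
        self.holders.setdefault(worker.version, []).append(worker)

    def retire(self, worker: Worker) -> None:
        # Forget a worker that left the cluster or moved to another version.
        workers = self.holders.get(worker.version, [])
        if worker in workers:
            workers.remove(worker)
        if not workers:
            self.holders.pop(worker.version, None)

    def fetch(self, version: int, tensor_name: str) -> list[float]:
        # Serve the read from a live holder rather than from a stored copy.
        workers = self.holders.get(version)
        if not workers:
            raise KeyError(f"no live worker holds version {version}")
        return workers[0].read(tensor_name)


store = ReferenceStore()
w0 = Worker("w0", version=1, weights={"layer0": [1.0, 2.0]})
store.publish(w0)
print(store.fetch(1, "layer0"))
```

In this toy version a version becomes unreadable once its last holder retires, which is the kind of availability question a real system built on this abstraction would have to handle; how TensorHub does so is not stated in the text above.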