[2602.07263] tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models
Summary
The paper introduces tLoRA, a framework for efficiently training many LoRA adapters of large language models concurrently on shared infrastructure, improving both training throughput and GPU utilization.
Why It Matters
As the demand for fine-tuning large language models grows, efficient resource management during training becomes critical. tLoRA addresses the challenges of concurrent LoRA jobs, enhancing training efficiency and reducing completion times, which is vital for researchers and practitioners in machine learning.
Key Takeaways
- tLoRA enables efficient batch training of multiple LoRA jobs.
- It improves training throughput by 1.2–1.8x and shortens job completion time by 2.3–5.4x.
- The framework utilizes a fused LoRA kernel for optimized computation.
- An adaptive scheduler maximizes resource sharing and throughput.
- tLoRA enhances GPU utilization by 37%, making it a valuable tool for large-scale training.
Paper Details
Computer Science > Machine Learning, arXiv:2602.07263 (cs). Submitted on 6 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2).
Authors: Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai
Abstract
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-...
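To make the co-location idea concrete, the following is a minimal NumPy sketch of batching heterogeneous LoRA jobs over one frozen base weight. This is an illustration of the general multi-LoRA batching pattern, not tLoRA's actual kernel: all names (`fused_forward`, the job dictionaries, the chosen ranks and batch sizes) are hypothetical, and the real system fuses this computation on the GPU with rank-aware scheduling rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 64
W = rng.standard_normal((d_in, d_out)) * 0.02  # frozen base weight, shared by all jobs

# Three concurrent LoRA jobs with heterogeneous ranks and batch sizes.
# Each job i owns a low-rank pair (A_i, B_i); its update is x @ A_i @ B_i.
jobs = []
for rank, batch in [(4, 8), (8, 4), (16, 2)]:
    jobs.append({
        "A": rng.standard_normal((d_in, rank)) * 0.02,  # trainable down-projection
        "B": np.zeros((rank, d_out)),                   # trainable up-projection, init 0
        "x": rng.standard_normal((batch, d_in)),        # this job's input batch
    })

def fused_forward(jobs, W):
    """Run the shared base matmul once for all jobs, then apply each
    job's low-rank update to its own slice of the concatenated batch."""
    xs = np.concatenate([j["x"] for j in jobs], axis=0)
    base = xs @ W  # one base-model matmul instead of one per job
    outs, start = [], 0
    for j in jobs:
        end = start + j["x"].shape[0]
        outs.append(base[start:end] + (j["x"] @ j["A"]) @ j["B"])
        start = end
    return outs

outs = fused_forward(jobs, W)
print([o.shape for o in outs])  # one output tensor per job
```

The point of the sketch is the cost structure: the expensive base-model computation is amortized across all co-located jobs, while each adapter's low-rank work scales only with its own rank and batch size, which is why heterogeneous ranks make scheduling nontrivial.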