[2602.13069] Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning
Summary
The paper presents Memory-Efficient Structured Backpropagation (MeSP), an approach to on-device fine-tuning of large language models (LLMs) that substantially reduces memory usage while computing gradients mathematically identical to those of standard backpropagation.
Why It Matters
As on-device AI applications grow, efficient fine-tuning methods are crucial for enabling personalization without compromising performance. MeSP addresses memory constraints in mobile devices, making advanced AI more accessible and practical.
Key Takeaways
- MeSP achieves a 49% average reduction in memory usage compared to MeBP on Qwen2.5 models (0.5B--3B).
- The method computes mathematically identical gradients while reducing peak memory from 361MB to 136MB on Qwen2.5-0.5B.
- MeSP enables fine-tuning scenarios previously infeasible on memory-constrained devices.
- The paper shows that MeZO's low-memory gradient estimates have near-zero correlation with true gradients (cosine similarity $\approx$0.001), explaining its slow convergence.
- MeSP leverages LoRA's low-rank structure for efficient computation.
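The low-rank structure behind these takeaways can be sketched with the standard LoRA gradient formulas (notation here is illustrative, not necessarily the paper's: $s$ is the LoRA scaling factor, $g = \partial L / \partial y$ the upstream gradient). For a LoRA branch $y = xW_0 + s\,(xA)B$ with $A \in \mathbb{R}^{d_{in} \times r}$ and $B \in \mathbb{R}^{r \times d_{out}}$, the parameter gradients are $\partial L / \partial B = s\,(xA)^\top g$ and $\partial L / \partial A = s\,x^\top (g B^\top)$. Only the first formula uses $h = xA$, and since $x$ must be saved anyway for $\partial L / \partial A$, $h$ can be recomputed from $x$ during the backward pass at $O(n\,d_{in}\,r)$ cost instead of being cached, which is cheap because $r \ll d_{in}$.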
Computer Science > Machine Learning
arXiv:2602.13069 (cs)
[Submitted on 13 Feb 2026]
Title: Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning
Authors: Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee
Abstract: On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6--12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49\% average memory reduction compared to MeBP on Qwen2.5 models (0.5B--3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx$0.001), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on m...
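The recompute-in-backward idea from the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shapes, the scaling name `s`, and the surrogate loss are assumptions made for the example. It derives the LoRA gradients by hand, saving only `x` (not `h = x @ A`) from the forward pass, and checks them against finite differences.

```python
import numpy as np

# Minimal sketch of the recompute-h-in-backward idea for a LoRA branch
# y = s * (x @ A) @ B. Shapes and the scaling name `s` are illustrative,
# not taken from the paper.
rng = np.random.default_rng(0)
n, d_in, d_out, r, s = 3, 32, 16, 4, 2.0   # rank r << d_in

x = rng.standard_normal((n, d_in))          # saved activation (needed anyway for dA)
A = rng.standard_normal((d_in, r)) * 0.1    # LoRA down-projection
B = rng.standard_normal((r, d_out)) * 0.1   # LoRA up-projection
g = rng.standard_normal((n, d_out))         # upstream gradient dL/dy

def loss(A_, B_):
    # Surrogate scalar loss L = <y, g> over the LoRA branch only.
    return float(np.sum(s * (x @ A_) @ B_ * g))

# Manually derived backward pass: only x was saved during forward;
# h = x @ A is recomputed here at O(n * d_in * r) cost (cheap for small r).
h = x @ A
dB = s * h.T @ g               # dL/dB, shape (r, d_out); the only term needing h
dA = s * x.T @ (g @ B.T)       # dL/dA, shape (d_in, r); needs x, not h

# Central finite differences on single entries as a correctness check.
eps = 1e-6
Ap, Am = A.copy(), A.copy()
Ap[5, 2] += eps; Am[5, 2] -= eps
num_dA = (loss(Ap, B) - loss(Am, B)) / (2 * eps)

Bp, Bm = B.copy(), B.copy()
Bp[1, 7] += eps; Bm[1, 7] -= eps
num_dB = (loss(A, Bp) - loss(A, Bm)) / (2 * eps)
```

Because the recomputed `h` is bit-for-bit the same matrix product the forward pass produced, the resulting gradients match standard backpropagation exactly; the memory saved is the cached `(n, r)` activation per LoRA layer.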