[2508.19073] CARMA: Collocation-Aware Resource Manager

arXiv - Machine Learning

Summary

CARMA is a collocation-aware resource manager designed to optimize GPU utilization for deep learning workloads while mitigating risks of out-of-memory crashes and performance interference.

Why It Matters

As deep learning tasks increasingly rely on GPU resources, efficient management of these resources is critical for improving performance and energy efficiency. CARMA addresses common challenges in GPU utilization, making it relevant for researchers and practitioners in the field of distributed computing and machine learning.

Key Takeaways

  • CARMA enhances GPU utilization by 54% through informed collocation decisions.
  • It reduces out-of-memory crashes and performance interference among tasks.
  • The system achieves a 35% reduction in end-to-end execution time for deep learning workloads.
  • Energy consumption is decreased by approximately 15% with CARMA's optimizations.
  • Fine-grained monitoring and task placement policies are key features of CARMA.
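The monitoring-driven placement described above can be sketched as a filtering pass over per-GPU bookkeeping data: discard GPUs where collocation risks an OOM or excessive interference, then place the task on the least-loaded survivor. All names, thresholds, and the memory margin below are illustrative assumptions, not CARMA's actual policy:

```python
from dataclasses import dataclass

@dataclass
class GpuStats:
    """Per-GPU bookkeeping snapshot (illustrative fields)."""
    gpu_id: int
    mem_total_mb: int
    mem_used_mb: int
    sm_util_pct: float  # streaming-multiprocessor utilization

def pick_gpu(gpus, task_mem_mb, util_cap_pct=80.0, mem_margin_mb=1024):
    """Filter high-risk GPUs, then pick the least-utilized survivor.

    A GPU is 'high risk' if it is already above the utilization cap
    (interference risk) or if the task's estimated memory need plus a
    safety margin exceeds its free memory (OOM risk).
    """
    candidates = []
    for g in gpus:
        free_mb = g.mem_total_mb - g.mem_used_mb
        if g.sm_util_pct >= util_cap_pct:
            continue  # cap utilization to limit interference
        if free_mb < task_mem_mb + mem_margin_mb:
            continue  # not enough headroom: likely OOM on collocation
        candidates.append(g)
    if not candidates:
        return None  # no safe collocation target; queue the task instead
    return min(candidates, key=lambda g: g.sm_util_pct).gpu_id

gpus = [
    GpuStats(0, 40960, 39000, 30.0),  # nearly full memory -> filtered
    GpuStats(1, 40960, 10000, 95.0),  # saturated SMs -> filtered
    GpuStats(2, 40960, 12000, 40.0),  # safe candidate
]
print(pick_gpu(gpus, task_mem_mb=8000))  # -> 2
```

Returning `None` rather than forcing a placement is what lets a manager trade some utilization for robustness: a task with no safe target waits instead of crashing a co-runner.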

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2508.19073 (cs) [Submitted on 26 Aug 2025 (v1), last revised 23 Feb 2026 (this version, v3)]

Title: CARMA: Collocation-Aware Resource Manager
Authors: Ehsan Yousefzadeh-Asl-Miandoab, Florina M. Ciorba, Pınar Tözün

Abstract: GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization, but it introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager for the server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs that crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more effici...
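The recovery mechanism (item 4 in the abstract) amounts to detecting a crashed job and relaunching it once memory pressure has eased. A minimal sketch, assuming failures surface as nonzero exit codes; the function name and retry policy are illustrative, not CARMA's actual implementation:

```python
import subprocess
import sys
import time

def run_with_oom_retry(cmd, max_retries=2, backoff_s=5.0):
    """Relaunch a job that exits nonzero, e.g. after an OOM crash.

    Illustrative only: a real resource manager would inspect CUDA
    error codes or logs to distinguish OOMs from other failures;
    here any nonzero exit triggers a bounded retry with a backoff.
    """
    for attempt in range(max_retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_retries:
            time.sleep(backoff_s)  # let freed GPU memory settle
    return False

# Hypothetical usage: relaunch a short-lived stand-in for a training job.
ok = run_with_oom_retry([sys.executable, "-c", "pass"], backoff_s=0)
print(ok)  # -> True
```

Bounding the retries matters: if a task's memory estimate is simply too large for any collocation target, endless relaunching would waste the throughput the collocation was meant to gain.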

