[2602.21144] Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

arXiv - Machine Learning 4 min read Article

Summary

This paper explores tensor parallelism for scaling selective state-space models (SSMs) across multiple GPUs, addressing the memory capacity, bandwidth, and latency limits of single-GPU inference to improve efficiency in large language models.

Why It Matters

As large language models increasingly rely on selective state-space models, optimizing their performance on multi-GPU setups is crucial for handling long-context workloads. This research presents innovative solutions to improve throughput and efficiency, making it relevant for developers and researchers in AI and machine learning.

Key Takeaways

  • Tensor parallelism can significantly enhance the inference performance of selective state-space models on multiple GPUs.
  • The proposed methods improve throughput by 1.6-4.0x depending on the number of GPUs used.
  • Quantized AllReduce techniques further optimize synchronization bandwidth, boosting performance by an additional 10-18%.
  • The study evaluates multiple SSM-based large language models, demonstrating the versatility of the approach.
  • Addressing engineering challenges in SSMs is essential for efficient multi-GPU execution.
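The quantized-AllReduce takeaway can be illustrated with a minimal single-process sketch. The paper's exact quantization scheme is not described here; the symmetric per-tensor int8 quantizer and the four simulated ranks below are assumptions for illustration only:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization to int8 (illustrative, not the paper's scheme).
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def quantized_allreduce(shards):
    # Each rank quantizes its partial result before communication, cutting
    # synchronization bandwidth roughly 4x versus fp32; here the reduction
    # is simulated as a local sum over the per-rank shards.
    deq = [dequantize(*quantize_int8(s)) for s in shards]
    return np.sum(deq, axis=0)

# Compare against a full-precision AllReduce on 4 simulated ranks.
rng = np.random.default_rng(0)
shards = [rng.standard_normal(1024).astype(np.float32) for _ in range(4)]
exact = np.sum(shards, axis=0)
approx = quantized_allreduce(shards)
err = np.max(np.abs(exact - approx))
print(f"max abs error: {err:.4f}")
```

The bandwidth saving comes at the cost of a small, bounded quantization error per rank, which is the trade-off any quantized collective has to manage.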

Computer Science > Distributed, Parallel, and Cluster Computing — arXiv:2602.21144 (cs) [Submitted on 24 Feb 2026]

Title: Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Authors: Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi

Abstract: Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We eva...
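The locality point in the abstract can be sketched with the standard column/row split that tensor parallelism applies to a pair of projections: the first weight is split column-wise and the second row-wise, so each rank's slice of the inner dimension (and any per-channel recurrent state living on it) stays local, with a single AllReduce at the end. The shapes and the numpy simulation of ranks below are illustrative assumptions, not the paper's exact mixer partitioning:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_inner, ranks = 64, 256, 4

x = rng.standard_normal((8, d_model))          # a small batch of token vectors
W_in = rng.standard_normal((d_model, d_inner))  # up-projection
W_out = rng.standard_normal((d_inner, d_model)) # down-projection

# Reference: single-GPU forward through the two projections.
ref = (x @ W_in) @ W_out

# Tensor parallelism: split W_in column-wise and W_out row-wise so each
# rank owns a contiguous slice of the inner dimension; only one AllReduce
# (simulated here as a sum over per-rank partials) sits in the critical path.
cols = np.split(W_in, ranks, axis=1)
rows = np.split(W_out, ranks, axis=0)
partials = [(x @ cols[r]) @ rows[r] for r in range(ranks)]
tp_out = np.sum(partials, axis=0)  # the AllReduce

print(np.allclose(ref, tp_out))
```

The paper's contribution is harder than this sketch because the SSM mixer packs several parameter groups into one tensor and interleaves a sequence-wise recurrence, but the same principle applies: choose the split so per-rank state updates never need cross-GPU traffic.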

Related Articles

Llms

[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive Auditing

Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviors 10-13% of the tim...

Reddit - Machine Learning · 1 min ·
Llms

[P] Dante-2B: I'm training a 2.1B bilingual fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.

The problem If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an after...

Reddit - Machine Learning · 1 min ·
Llms

I have been coding for 11 years and I caught myself completely unable to debug a problem without AI assistance last month. That scared me more than anything I have seen in this industry.

I want to be honest about something that happened to me because I think it is more common than people admit. Last month I hit a bug in a ...

Reddit - Artificial Intelligence · 1 min ·
Llms

OpenClaw security checklist: practical safeguards for AI agents

Here is one of the better-quality guides on ensuring safety when deploying OpenClaw: https://chatgptguide.ai/openclaw-security-checkl...

Reddit - Artificial Intelligence · 1 min ·