[D] - 1M tokens/second serving Qwen 3.5 27B on B200 GPUs, benchmark results and findings
About this article
Wrote up the process of pushing Qwen 3.5 27B (dense, FP8) to 1.1M total tok/s on 96 B200 GPUs with vLLM v0.18.0. DP=8 gave nearly 4x the throughput of TP=8; the model is too small for tensor parallelism to help on B200s. MTP-1 mattered more than anything else (GPU utilization was 0% without it). MTP-5 crashed with cudaErrorIllegalAddress. Scaling efficiency was 97.1% at 8 nodes and 96.5% at 12. TPOT stayed flat at ~46ms regardless of node count. Inference Gateway (KV-cache-aware routing) added ~35% overhead vs Clus...
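For anyone wanting to sanity-check the scaling numbers, here is a rough sketch of the arithmetic. It assumes the usual definition of scaling efficiency (measured throughput divided by node count times single-node throughput); the post does not say how the figures were actually computed, and the function names are made up for illustration. The 1.1M figure is rounded, so anything derived from it is approximate.

```python
def scaling_efficiency(total_tok_s: float, nodes: int, single_node_tok_s: float) -> float:
    """Measured throughput relative to perfect linear scaling (assumed definition)."""
    return total_tok_s / (nodes * single_node_tok_s)


def implied_single_node_throughput(total_tok_s: float, nodes: int, efficiency: float) -> float:
    """Back out the single-node baseline implied by a total and an efficiency."""
    return total_tok_s / (nodes * efficiency)


if __name__ == "__main__":
    # Figures quoted in the post: ~1.1M total tok/s on 96 GPUs, 96.5% efficiency at 12 nodes.
    total = 1.1e6
    nodes = 12
    gpus = 96

    baseline = implied_single_node_throughput(total, nodes, 0.965)
    print(f"implied single-node baseline: {baseline:,.0f} tok/s")
    print(f"average per GPU: {total / gpus:,.0f} tok/s")
    print(f"round-trip efficiency: {scaling_efficiency(total, nodes, baseline):.3f}")
```

Under those assumptions the implied single-node baseline comes out at roughly 95k tok/s, which is a derived estimate and not a number reported in the post.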