[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
cuBLAS dispatches an inefficient kernel for every batched FP32 GEMM workload I tried, from 256×256 up to 8192×8192×8, reaching only ~40% of the available compute on RTX GPUs. I tested on an RTX 5090, but likely all non-Pro RTX GPUs are affected. This is with the latest CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03; previous versions are even worse. I wrote a simple yet efficient kernel and compared it to cuBLAS across a variety of workloads.

[Chart: Batched perf vs. cuBLAS on the 5090 (>100% means my kernel is faster)]
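As a minimal sketch of where a "~40% of peak" figure comes from: divide the GEMM's arithmetic work (2·M·N·K·batch FLOPs) by the kernel's elapsed time, then by the GPU's peak FP32 throughput. The problem size, elapsed time, and the ~104.8 TFLOPS peak used below are illustrative assumptions, not the author's measurements:

```python
def gemm_flops(m, n, k, batch=1):
    """FLOPs for a batched GEMM: each output element needs k multiply-adds."""
    return 2 * m * n * k * batch

# Illustrative assumptions (not measured values from the post):
peak_tflops = 104.8   # approximate FP32 peak of an RTX 5090-class GPU
elapsed_s = 0.0262    # hypothetical kernel time for the batch

flops = gemm_flops(4096, 4096, 4096, batch=8)
achieved_tflops = flops / elapsed_s / 1e12
utilization = achieved_tflops / peak_tflops
print(f"{achieved_tflops:.1f} TFLOPS, {utilization:.0%} of peak")
```

The same arithmetic, run against cuBLAS timings and against a custom kernel's timings, yields the relative numbers in the chart.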