[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX
About this article
New blog post by Daniel Vega-Myhre (Meta/PyTorch) walking through GEMM kernel design for FP8, with deep dives into the constraints and design challenges MXFP8 introduces.

Link: https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html

Original Tweet: https://x.com/vega_myhre/status/2038293614204445039

Additional resources:
MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-to...
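For context on what makes MXFP8 different from plain FP8: in the OCP microscaling (MX) formats, every block of 32 consecutive elements shares a single power-of-two scale stored as an E8M0 byte, with the elements themselves in FP8 (e.g. E4M3, max finite value 448). Below is a minimal quantization sketch, not code from the linked post; it assumes the OCP MX v1.0 scale recipe (scale exponent = floor(log2(block amax)) - 8 for E4M3), and the kernel name, data layout, and launch shape are made up for illustration:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda_fp8.h>

// One thread handles one 32-element block: find the block amax, derive the
// shared power-of-two scale (stored as a biased E8M0 byte), then quantize
// each element to FP8 E4M3 against that scale.
__global__ void mxfp8_quantize(const float* in, __nv_fp8_e4m3* out,
                               unsigned char* scales, int num_blocks) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= num_blocks) return;

    float amax = 0.f;
    for (int i = 0; i < 32; ++i)
        amax = fmaxf(amax, fabsf(in[b * 32 + i]));

    // E4M3's largest binade has exponent 8 (448 = 1.75 * 2^8), so the shared
    // scale 2^e places the block amax near the top of the representable range.
    int e = (amax > 0.f) ? (int)floorf(log2f(amax)) - 8 : -127;
    e = max(-127, min(127, e));
    scales[b] = (unsigned char)(e + 127);  // biased E8M0 encoding

    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32; ++i)
        out[b * 32 + i] = __nv_fp8_e4m3(in[b * 32 + i] * inv_scale);
}

int main() {
    const int num_blocks = 2, n = num_blocks * 32;
    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.001f * (i + 1);  // toy data

    float* d_in; __nv_fp8_e4m3* d_out; unsigned char* d_scales;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(__nv_fp8_e4m3));
    cudaMalloc(&d_scales, num_blocks);
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    mxfp8_quantize<<<1, 32>>>(d_in, d_out, d_scales, num_blocks);

    std::vector<unsigned char> h_scales(num_blocks);
    cudaMemcpy(h_scales.data(), d_scales, num_blocks, cudaMemcpyDeviceToHost);
    for (int b = 0; b < num_blocks; ++b)
        printf("block %d: scale = 2^%d\n", b, (int)h_scales[b] - 127);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_scales);
    return 0;
}
```

These per-32-element scale factors are the crux: an MXFP8 GEMM has to feed the tensor cores both the FP8 operand tiles and the matching E8M0 scales, which is the kind of layout and pipelining constraint the linked post digs into.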