Easily Build and Share ROCm Kernels with Hugging Face
Published November 17, 2025 · Abdennacer Badaoui, Daniel Huang, ColorsWind, Zesen Liu

Introduction

Custom kernels are the backbone of high-performance deep learning, enabling GPU operations tailored precisely to your workload, whether that's image processing, tensor transformations, or other compute-heavy tasks. But compiling these kernels for the right architectures, wiring up all the build flags, and integrating them cleanly into PyTorch extensions can quickly become a mess of CMake/Nix configuration, compiler errors, and ABI issues, which is no fun. Hugging Face's kernels library makes it easy to build these kernels (with kernel-builder) and share them with the kernels-community, with support for multiple GPU and accelerator backends, including CUDA, ROCm, Metal, and XPU. This ensures your kernels are fast, portable, and seamlessly integrated with PyTorch.

In this guide, we focus exclusively on ROCm-compatible kernels and show how to build, test, and share them using kernels. You'll learn how to create kernels that run efficiently on AMD GPUs, along with best practices for reproducibility, packaging, and deployment.

This ROCm-specific walkthrough is a streamlined version of the original kernel-builder guide. If you're looking for the broader CUDA-focused version, you can find it here: A Guide to Building and Scaling Production-Ready...
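To give a concrete feel for the consumer side of this workflow before we dive into building, here is a minimal sketch of loading a pre-built kernel from the Hub with the kernels library. The repository kernels-community/activation and its gelu_fast entry point follow the library's README example; substitute whichever kernel you actually want to use.

```python
import torch
from kernels import get_kernel

# Fetch a pre-built kernel from the Hugging Face Hub; the binary matching
# your platform (e.g. a ROCm build on an AMD GPU) is resolved automatically.
activation = get_kernel("kernels-community/activation")

# Note: PyTorch's ROCm builds expose AMD GPUs through the "cuda" device string.
x = torch.randn((16, 16), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
```

Because the right binary is picked at load time, the same call works unchanged on ROCm and CUDA hosts, which is exactly the portability the rest of this guide builds toward.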