Custom Kernels for All from Codex and Claude
About this article
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Back to Articles Custom Kernels for All from Codex and Claude Published February 13, 2026 Update on GitHub Upvote 36 +30 ben burtenshaw burtenshaw Follow Sayak Paul sayakpaul Follow Aritra Roy Gosthipaty ariG23498 Follow shaun smith evalstate Follow tl;dr: We built an agent skill that teaches coding agents how to write production CUDA kernels. Then we pointed Claude and Codex at two real targets: a diffusers pipeline and a transformers model. The agents produced working kernels for both, with correct PyTorch bindings and benchmarks, end to end. Writing CUDA kernels is hard. Writing CUDA kernels that correctly integrate with transformers and diffusers is harder. There are architecture-specific memory access patterns, vectorization strategies, warp shuffle reductions, and a dozen integration pitfalls that trip up even experienced developers. It is exactly the kind of specialized, high-stakes problem where agent skills shine. We gave coding agents the domain knowledge they need, like which GPU architecture to target, how to structure a kernel-builder project, when to use shared memory versus registers, and how to write PyTorch bindings. The agents did the rest. If you have used the LLM training skill or read We Got Claude to Teach Open Models, the pattern will feel familiar: package domain expertise into a skill, point the agent at a problem, and let it work. Why a skill for kernels? The Kernel Hub solved the distribution of custom hardware kernels. You can load pre-compiled ...