[2602.14293] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning
Summary
KernelBlaster introduces a framework for optimizing CUDA code across GPU architectures using Memory-Augmented In-Context Reinforcement Learning, achieving speedups of up to 2.50x by letting agents accumulate and reuse knowledge from prior optimization attempts.
Why It Matters
As GPU architectures evolve, optimizing CUDA code becomes increasingly complex: peak performance requires exploring a large, hardware-specific optimization space. KernelBlaster addresses this challenge by enabling agents to learn from past experiences rather than restarting the search for each new task, making CUDA optimization more efficient for developers and researchers targeting modern GPUs.
Key Takeaways
- KernelBlaster enhances CUDA optimization through a Memory-Augmented In-Context Reinforcement Learning framework.
- The framework allows agents to accumulate knowledge, improving decision-making for future tasks.
- KernelBlaster achieves speedups of up to 2.50x compared to traditional methods.
- It provides an open-source solution, including a test harness and evaluation pipeline.
- The approach addresses the limitations of traditional compilers and LLM fine-tuning.
Computer Science > Machine Learning
arXiv:2602.14293 (cs)
[Submitted on 15 Feb 2026]
Title: KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning
Authors: Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis
Abstract: Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas fine-tuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based...
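To make the core idea concrete, here is a minimal sketch of a retrievable knowledge base feeding an in-context optimization step. All names (`KnowledgeBase`, `add`, `retrieve`, `optimize`, the pattern strings) are hypothetical illustrations, not KernelBlaster's actual API; the paper's Persistent CUDA Knowledge Base and MAIC-RL loop are far more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Hypothetical persistent store of optimization experiences:
    (kernel pattern, transformation tried, measured speedup)."""
    entries: list = field(default_factory=list)

    def add(self, pattern: str, transformation: str, speedup: float) -> None:
        # Each finished task deposits its outcome for reuse on future tasks.
        self.entries.append(
            {"pattern": pattern, "transformation": transformation, "speedup": speedup}
        )

    def retrieve(self, pattern: str, top_k: int = 3) -> list:
        # Naive retrieval: exact pattern match, ranked by observed speedup.
        # A real system would use semantic retrieval over profiling context.
        matches = [e for e in self.entries if e["pattern"] == pattern]
        return sorted(matches, key=lambda e: e["speedup"], reverse=True)[:top_k]

def optimize(pattern: str, kb: KnowledgeBase) -> str:
    """In-context step: retrieved experiences would be inserted into the
    LLM agent's prompt; here we simply return the best-known transformation,
    falling back to the unmodified kernel when nothing is known."""
    hints = kb.retrieve(pattern)
    return hints[0]["transformation"] if hints else "baseline"
```

Usage: after recording that warp-shuffle reductions outperformed shared-memory tiling on earlier tasks, `optimize("reduction", kb)` would surface `"warp_shuffle"` first, while an unseen pattern such as `"gemm"` falls back to `"baseline"`. This illustrates how accumulated experience steers sampling away from previously unproductive transformations.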