[2602.14293] KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning


Summary

KernelBlaster introduces a framework for optimizing CUDA code across GPU architectures using Memory-Augmented In-Context Reinforcement Learning, achieving speedups of up to 2.50x over baseline methods.

Why It Matters

As GPU architectures evolve, optimizing CUDA code becomes increasingly complex. KernelBlaster addresses this challenge by enabling agents to learn from past experiences, enhancing the efficiency of CUDA optimization processes. This innovation is crucial for developers and researchers aiming to leverage GPU capabilities effectively.

Key Takeaways

  • KernelBlaster enhances CUDA optimization through a Memory-Augmented In-Context Reinforcement Learning framework.
  • The framework allows agents to accumulate knowledge, improving decision-making for future tasks.
  • KernelBlaster achieves speedups of up to 2.50x compared to traditional methods.
  • It provides an open-source solution, including a test harness and evaluation pipeline.
  • The approach sidesteps the fixed heuristics of traditional compilers and the cost of fine-tuning LLMs.
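The knowledge-accumulation idea in the takeaways above can be sketched as a small retrieval loop. This is an illustrative toy, not the paper's implementation: the class and field names (`CudaKnowledgeBase`, `Entry`) are assumptions, and real retrieval would likely use embeddings rather than word overlap.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    task: str       # e.g. "softmax reduction"
    insight: str    # lesson learned from a past optimization attempt
    speedup: float  # measured speedup that attempt achieved

@dataclass
class CudaKnowledgeBase:
    entries: list = field(default_factory=list)

    def store(self, task: str, insight: str, speedup: float) -> None:
        # Accumulate experience from each finished optimization task.
        self.entries.append(Entry(task, insight, speedup))

    def retrieve(self, task: str, k: int = 2) -> list:
        # Toy retrieval: rank past entries by word overlap with the new
        # task description, breaking ties by measured speedup.
        words = set(task.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: (len(words & set(e.task.lower().split())), e.speedup),
            reverse=True,
        )
        return scored[:k]

kb = CudaKnowledgeBase()
kb.store("matmul tiling", "use shared-memory tiles sized to the SM's capacity", 2.1)
kb.store("softmax reduction", "warp shuffle beats shared-memory reduction", 1.6)

# On a new task, the most relevant past lessons are injected into the prompt.
hints = kb.retrieve("fused softmax kernel", k=1)
print(hints[0].insight)
```

The point of the sketch is the cross-task transfer: insights stored while optimizing one kernel are retrieved when a related kernel shows up later, so the agent does not restart its search from scratch.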

Computer Science > Machine Learning

arXiv:2602.14293 (cs) · Submitted on 15 Feb 2026

Title: KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Authors: Kris Shengjun Dong, Sahil Modi, Dima Nikiforov, Sana Damani, Edward Lin, Siva Kumar Sastry Hari, Christos Kozyrakis

Abstract: Optimizing CUDA code across multiple generations of GPU architectures is challenging, as achieving peak performance requires an extensive exploration of an increasingly complex, hardware-specific optimization space. Traditional compilers are constrained by fixed heuristics, whereas fine-tuning Large Language Models (LLMs) can be expensive. However, agentic workflows for CUDA code optimization have limited ability to aggregate knowledge from prior exploration, leading to biased sampling and suboptimal solutions. We propose KernelBlaster, a Memory-Augmented In-context Reinforcement Learning (MAIC-RL) framework designed to improve CUDA optimization search capabilities of LLM-based GPU coding agents. KernelBlaster enables agents to learn from experience and make systematically informed decisions on future tasks by accumulating knowledge into a retrievable Persistent CUDA Knowledge Base. We propose a novel profile-guided, textual-gradient-based...
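The "profile-guided, textual-gradient" idea mentioned at the end of the abstract can be illustrated as turning profiler counters into natural-language critique that steers the next LLM rewrite. The sketch below is an assumption about how such a step might look; the metric names (`achieved_occupancy`, `gld_efficiency`) echo common CUDA profiler counters, and the thresholds and feedback strings are invented for illustration.

```python
def textual_gradient(metrics: dict) -> str:
    """Render raw profile counters as actionable textual feedback
    (a 'textual gradient') for the next code-rewriting step."""
    feedback = []
    if metrics.get("achieved_occupancy", 1.0) < 0.5:
        feedback.append("occupancy is low; reduce registers per thread or shrink the block size")
    if metrics.get("gld_efficiency", 1.0) < 0.8:
        feedback.append("global loads are uncoalesced; make adjacent threads read adjacent addresses")
    if metrics.get("shared_bank_conflicts", 0) > 0:
        feedback.append("shared-memory bank conflicts detected; pad the tile width by one element")
    return "; ".join(feedback) or "profile looks healthy; try more aggressive unrolling"

# Counters as a profiler such as Nsight Compute might report them
# for one candidate kernel (values invented for the example).
profile = {"achieved_occupancy": 0.31, "gld_efficiency": 0.92, "shared_bank_conflicts": 12}
critique = textual_gradient(profile)
print(critique)
```

In a full loop, this critique would be appended to the agent's prompt alongside retrieved knowledge-base entries, and the resulting insight stored back after the rewritten kernel is re-profiled.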
