[2509.04575] Bootstrapping Task Spaces for Self-Improvement
Summary
This article summarizes Exploratory Iteration (ExIt), a family of autocurriculum reinforcement-learning methods that trains LLMs to self-improve over multiple steps at inference time while training only on the most informative single-step iterations.
Why It Matters
The research addresses a significant challenge in reinforcement learning: enabling agents to self-improve at inference time without assuming a fixed maximum iteration depth during training, which can be both costly and arbitrary. By introducing ExIt, the authors provide a framework that could lead to more efficient learning in a range of domains, enhancing the capabilities of AI systems.
Key Takeaways
- ExIt allows agents to perform multi-step self-improvement at inference-time.
- The method selectively samples informative task histories to create new training instances.
- ExIt can enhance task diversity through explicit exploration mechanisms.
- Demonstrated effectiveness across various domains, including math and tool-use tasks.
- Trained policies can continue improving at inference-time iteration depths beyond those encountered during training.
Paper Details
Computer Science > Machine Learning — arXiv:2509.04575 (cs)
Submitted on 4 Sep 2025 (v1); last revised 22 Feb 2026 (this version, v3)
Title: Bootstrapping Task Spaces for Self-Improvement
Authors: Minqi Jiang, Andrei Lupu, Yoram Bachrach
Abstract: Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that...
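The abstract's core mechanism — treating informative intermediate histories as new task instances — can be sketched as a toy task buffer. This is a minimal illustrative sketch, not the paper's implementation: the class name `ExItBuffer`, the variance-based informativeness proxy, and all thresholds are assumptions made for the example, and the actual selection criterion and policy update in the paper are not reproduced here.

```python
import random


def informativeness(rewards):
    """Toy learnability proxy: variance of rewards across sampled attempts.

    High variance suggests the task is neither trivially solved nor
    hopeless, so iterating from it may be informative for training.
    """
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)


class ExItBuffer:
    """Toy buffer of self-iteration task instances (hypothetical sketch).

    Each entry is a starting point for one more self-improvement step:
    either an original root task or an intermediate, partial history
    promoted to a task instance in its own right.
    """

    def __init__(self, root_tasks, capacity=100):
        self.tasks = list(root_tasks)
        self.capacity = capacity

    def sample(self):
        # Draw the next starting point for a single-step iteration.
        return random.choice(self.tasks)

    def maybe_add(self, history, score, threshold=0.1):
        # Promote a partial history to a new task instance only when the
        # single-step iteration from it looked informative enough.
        if score > threshold and len(self.tasks) < self.capacity:
            self.tasks.append(history)


# Usage sketch: sample a task, score several single-step attempts,
# and (maybe) grow the task space with the resulting partial history.
buf = ExItBuffer(root_tasks=["root-task"])
task = buf.sample()
attempt_rewards = [0.0, 1.0, 1.0]  # rewards from sampled attempts
buf.maybe_add((task, "partial-history"), informativeness(attempt_rewards))
```

The design point the sketch highlights is that training episodes stay single-step: depth accumulates only through the buffer, since each promoted history already carries its predecessors, which is how multi-step inference-time behavior can emerge without a fixed training-time iteration depth.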