[2512.00672] ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Summary
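The paper presents ML-Tool-Bench, a benchmark for evaluating tool-augmented planning in machine learning tasks. It pairs a curated set of 61 specialized tools with 15 tabular ML challenges from Kaggle. Existing tool-use benchmarks focus on tool selection and argument extraction; ML-Tool-Bench instead targets the multi-step planning that ML agents need, with the aim of improving agent performance.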
Why It Matters
End-to-end data science workflows require an agent to sequence data analysis, feature engineering, model selection, and hyperparameter optimization. A benchmark that measures this planning ability gives researchers a structured way to compare and improve tool-augmented ML agents, and supports more reliable and efficient ML applications.
Key Takeaways
- Introduces ML-Tool-Bench, a benchmark for tool-augmented ML agents.
- Addresses shortcomings in existing tool-use evaluations for complex ML tasks.
- Proposes methods to improve planning and execution of ML workflows.
- Demonstrates significant performance improvements using structured feedback.
- Provides a foundation for future research in autonomous ML agents.
Computer Science > Machine Learning
arXiv:2512.00672 (cs)
[Submitted on 29 Nov 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Authors: Yaswanth Chittepu, Raghavendra Addanki, Tung Mai, Anup Rao, Branislav Kveton
Abstract: The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML Agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating an in-memory named object management, allowi...
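The in-memory named object management mentioned in the abstract is not detailed in this excerpt. A minimal sketch of the general idea, assuming tools read and write Python objects by name instead of passing their contents through the prompt, might look like the following. Every name here (`ObjectStore`, `call_tool`, and the two toy tools) is hypothetical and not taken from the paper.

```python
from typing import Any, Callable, Dict, List


class ObjectStore:
    """Hypothetical in-memory store that maps names to Python objects."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def put(self, name: str, obj: Any) -> None:
        self._objects[name] = obj

    def get(self, name: str) -> Any:
        if name not in self._objects:
            raise KeyError(f"no object named {name!r}")
        return self._objects[name]

    def names(self) -> List[str]:
        return sorted(self._objects)


# Two toy tools (illustrative only, not part of the benchmark's tool set).
def drop_missing(rows: List[dict]) -> List[dict]:
    """Keep only rows in which no value is None."""
    return [r for r in rows if all(v is not None for v in r.values())]


def column_mean(rows: List[dict], column: str) -> float:
    """Return the arithmetic mean of one column."""
    values = [r[column] for r in rows]
    return sum(values) / len(values)


TOOLS: Dict[str, Callable[..., Any]] = {
    "drop_missing": drop_missing,
    "column_mean": column_mean,
}


def call_tool(store: ObjectStore, tool: str, inputs: Dict[str, str],
              output: str, **params: Any) -> str:
    """Resolve named inputs, run the tool, and save the result under `output`.

    `inputs` maps each tool argument to the name of a stored object, so a
    planner only ever handles short names, never the objects themselves.
    """
    resolved = {arg: store.get(name) for arg, name in inputs.items()}
    result = TOOLS[tool](**resolved, **params)
    store.put(output, result)
    return output


store = ObjectStore()
store.put("raw", [{"age": 30}, {"age": None}, {"age": 50}])
call_tool(store, "drop_missing", {"rows": "raw"}, "clean")
call_tool(store, "column_mean", {"rows": "clean"}, "mean_age", column="age")
print(store.names(), store.get("mean_age"))
```

In this sketch each step refers to earlier results by name (`raw`, `clean`, `mean_age`), so a multi-step plan can chain tools without serializing intermediate data into text.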