[2602.24288] DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Computer Science > Artificial Intelligence
arXiv:2602.24288 (cs)
[Submitted on 27 Feb 2026]

Title: DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan

Abstract: The fast-growing demand for Large Language Models (LLMs) that can tackle complex multi-step data science tasks creates a pressing need for accurate benchmarking. Existing benchmarks have two major gaps: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially on machine learning modeling tasks. Using DARE-bench t...
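The abstract's central design point is that every task is scored against verifiable ground truth rather than a human or LLM judge. The paper's actual harness is not shown here; the following is only a minimal sketch of what such a check could look like, assuming a hypothetical task record with a `metric` name, a `higher_is_better` flag, a pass/fail `threshold`, and a held-out label file (all field names are illustrative, not from the paper):

```python
import json
import numpy as np

# Hypothetical task record; every field name here is an assumption,
# not taken from the DARE-bench paper.
task = {
    "task_id": "kaggle-derived-0001",
    "metric": "accuracy",           # an objective metric, not a judge
    "higher_is_better": True,
    "threshold": 0.85,              # pass/fail cutoff for this task
    "ground_truth_path": "y_true.json",
}

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true == y_pred))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

METRICS = {"accuracy": accuracy, "rmse": rmse}

def score_submission(task: dict, y_pred: list) -> dict:
    """Score a model's predictions against stored ground-truth labels.

    The metric is computed directly on held-out labels, so the result
    is deterministic and reproducible: no human or LLM judge in the loop.
    """
    with open(task["ground_truth_path"]) as f:
        y_true = np.asarray(json.load(f))
    score = METRICS[task["metric"]](y_true, np.asarray(y_pred))
    passed = (score >= task["threshold"] if task["higher_is_better"]
              else score <= task["threshold"])
    return {"task_id": task["task_id"], "score": score, "passed": passed}
```

Because the pass/fail decision reduces to comparing a computed metric against stored labels, any two runs over the same submission produce identical results, which is what makes the evaluation objective and reproducible in the sense the abstract describes.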