AI benchmarks are broken. Here’s what we need instead.
One-off tests don’t measure AI’s true impact. We’re better off shifting to more human-centered, context-specific methods.
For decades, artificial intelligence has been evaluated through the question of whether machines outperform humans. From chess to advanced math, from coding to essay writing, the performance of AI models and applications is tested against that of individual humans completing tasks. This framing is seductive: An AI vs. human comparison on isolated problems with clear right or wrong answers is easy to standardize, compare, and optimize. It generates rankings and headlines. But there’s a problem: AI is almost never used in the way it is benchmarked.

Although researchers and industry have started to improve benchmarking by moving beyond static tests to more dynamic evaluation methods, these innovations resolve only part of the issue. That’s because they still evaluate AI’s performance outside the human teams and organizational workflows where its real-world performance ultimately unfolds.

While AI is evaluated at the task level in a vacuum, it is used in messy, complex environments where it usually interacts with more than one person. Its performance (or lack thereof) emerges only over extended periods of use. This misalignment leaves us misunderstanding AI’s capabilities, overlooking systemic risks, and misjudging its economic and social consequences.

To mitigate this, it’s time to shift from narrow methods to benchmarks that assess how AI systems perform over longer time horizons within human teams, workflows, and organizations. I have studied real-world AI deployment ...