LLMs, Machine Learning, AI Agents

ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

Reddit - Machine Learning April 14, 2026 1 min read

About this article

We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

- The best model (Claude Sonnet 4.6) achieves a success rate of only 33.3%.
- GLM-5 (Zhipu AI) comes second at 24.2%, surprisingly strong for a text-only model.
- Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder.
- No model exceeds 50...
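To put the reported percentages in terms of task counts, here is a minimal sketch of the arithmetic. It assumes each of the 153 tasks is scored pass/fail and weighted equally, which the summary above does not confirm; the function names are hypothetical and only the task total and the two success rates come from the post.

```python
TOTAL_TASKS = 153  # task count stated in the post


def implied_successes(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of passed tasks consistent with a reported rate.

    Assumption: equal-weight, pass/fail scoring (not confirmed by the post).
    """
    return round(rate_percent / 100 * total)


def rate_from_count(passed: int, total: int = TOTAL_TASKS) -> float:
    """Success rate in percent, rounded to one decimal like the reported figures."""
    return round(100 * passed / total, 1)


# Success rates as reported in the summary.
reported = {"Claude Sonnet 4.6": 33.3, "GLM-5": 24.2}

for model, rate in reported.items():
    passed = implied_successes(rate)
    print(f"{model}: ~{passed}/{TOTAL_TASKS} tasks ({rate_from_count(passed)}%)")
```

Under that assumption, 33.3% corresponds to about 51 of 153 tasks and 24.2% to about 37, and both counts round back to the reported rates.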

Originally published on April 14, 2026. Curated by AI News.

LLMs

How much are you actually spending on AI APIs? I built an OpenSource router to cut that.

I've been working on Manifest, an open-source AI cost optimization tool. The idea is simple: instead of sending every request to the same...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

LLMs

Claude Code Degradation: An interesting and novel find

As many of you have likely seen, the Claude Code community newswire has been ablaze with Claude Code being quite degraded lately, startin...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

LLMs

I built a tool that turns repeated file reads into 13-token references. My AI Coding sessions now use 86% fewer tokens on file-heavy tasks based on mathematics and research. [P]

I got tired of watching Claude Code re-read the same files over and over. A 2,000-token file read 5 times = 10,000 tokens gone. So I buil...

Reddit - Machine Learning · 1 min · about 2 hours ago

LLMs

Claude Launched routines in Claude Code.

