ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]
We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:
- The best model (Claude Sonnet 4.6) achieves a success rate of only 33.3%.
- GLM-5 (Zhipu AI) comes second at 24.2%, which is surprisingly strong for a text-only model.
- Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder.
- No model exceeds 50...
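To make the headline percentages concrete, here is a minimal sketch that converts them into approximate task counts. It assumes each success rate is a plain fraction of all 153 tasks, with every task weighted equally; the post does not state the raw counts, so the function name and the resulting numbers are illustrative only.

```python
TOTAL_TASKS = 153  # task count stated in the post

# Success rates reported in the post, in percent.
reported_rates = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
}


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage.

    Assumption: the rate is successes / total with equal task weights.
    """
    return round(rate_percent / 100 * total)


implied_counts = {
    model: implied_task_count(rate) for model, rate in reported_rates.items()
}

for model, count in implied_counts.items():
    print(f"{model}: about {count} of {TOTAL_TASKS} tasks")
```

Under that assumption, the two reported rates correspond to roughly 51 and 37 completed tasks out of 153.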