Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
A blog post by IBM Research on Hugging Face
Published April 15, 2026

Authors: Ankita Naik (ankita-naik), danish, Ben (Ben871), Anupama Murthi (anupamamurthi), IBM Research

VAKRA Dataset | Leaderboard | Release Blog | GitHub | Submit to Leaderboard

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment in which agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3- to 7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. As can be seen below, models perform poorly on VAKRA. In this blog, we provide additional details about the tasks in VAKRA and present an analysis of the failure modes we observed across tasks.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Fig 1: Representative examples of each capability in the VAKRA benchmark

Capabil...