Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
A blog post by IBM Research on Hugging Face
Published April 15, 2026

Authors: Ankita Naik (ankita-naik), danish, Ben (Ben871), Anupama Murthi (anupamamurthi), IBM Research

VAKRA Dataset | Leaderboard | Release Blog | GitHub | Submit to Leaderboard

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment in which agents interact with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3- to 7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints. As can be seen below, models perform poorly on VAKRA. In this blog, we provide additional details about the tasks in VAKRA and present an analysis of the failure modes we observed across tasks.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Fig 1: Representative examples of each capability in the VAKRA benchmark

Capabil...