Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

Hugging Face Blog 15 min read

About this article

A Blog post by IBM Research on Hugging Face

By Ankita Naik, danish, Ben, and Anupama Murthi (IBM Research). Published April 15, 2026.

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3-7 step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

As can be seen below, models perform poorly on VAKRA. In this blog, we include additional dataset details about the tasks in VAKRA and present an analysis of the failure modes we observed on different tasks.

Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

Fig 1: Representative examples of each capability in the VAKRA benchmark

Capabil...

Originally published on April 15, 2026. Curated by AI News.
