VAKRA: A Hard Look at Where AI Agents Actually Fail

IBM Research just dropped VAKRA, and if you’re building AI agents for anything beyond chatbot demos, you should pay attention.

VAKRA stands for… honestly I don’t care what it stands for. What matters is what it does: it puts AI agents through actual enterprise workflows, not the sanitized, single-step toy problems most benchmarks use. The results are sobering.

What VAKRA Actually Tests

The benchmark throws agents into an environment with over 8,000 locally hosted APIs spanning 62 domains, backed by real databases and document collections. Tasks require 3-7 step reasoning chains that mix API calls with document retrieval. No handholding, no simplified tool sets.

There are four capability areas, each designed to stress different weaknesses. Two of them are covered here:

API Chaining with Business Intelligence APIs (2,077 test instances, 54 domains): Agents need to chain 1-12 tool calls across structured data manipulation APIs. Think filtering football team stats across multiple columns to find “FC Barcelona” – but with real database schemas and no shortcuts.

The catch? Agents must call get_data(tool_universe_id=id) first to initialize the data source. The call returns a lightweight preview, not the full dataset; the full data stays server-side to avoid bloated MCP transfers. Smart design, but it means agents can’t just dump everything into context and call it a day.
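To make the initialize-then-query pattern concrete, here is a minimal sketch of how such a server could behave. Only the get_data(tool_universe_id=...) call comes from the benchmark description; the MockDataServer class, its load and filter_rows methods, and the shape of the preview are my own assumptions for illustration, not VAKRA’s actual implementation.

```python
from typing import Any


class MockDataServer:
    """Toy stand-in for a server that keeps full datasets on its own side."""

    def __init__(self) -> None:
        # Hypothetical storage: tool_universe_id -> list of row dicts.
        self._datasets: dict[int, list[dict[str, Any]]] = {}
        self._initialized: set[int] = set()

    def load(self, tool_universe_id: int, rows: list[dict[str, Any]]) -> None:
        """Hypothetical helper to seed the mock with data."""
        self._datasets[tool_universe_id] = rows

    def get_data(self, tool_universe_id: int, preview_rows: int = 2) -> dict[str, Any]:
        """Initialize the data source and return only a small preview."""
        rows = self._datasets[tool_universe_id]
        self._initialized.add(tool_universe_id)
        return {
            "columns": list(rows[0].keys()) if rows else [],
            "num_rows": len(rows),
            "preview": rows[:preview_rows],  # the full rows never leave the server
        }

    def filter_rows(self, tool_universe_id: int, column: str, value: Any) -> list[dict[str, Any]]:
        """Hypothetical follow-up tool; refuses to run before get_data."""
        if tool_universe_id not in self._initialized:
            raise RuntimeError("call get_data(tool_universe_id=...) first")
        return [r for r in self._datasets[tool_universe_id] if r.get(column) == value]


server = MockDataServer()
server.load(7, [
    {"team": "FC Barcelona", "goals": 3},
    {"team": "Real Madrid", "goals": 2},
    {"team": "FC Barcelona", "goals": 1},
])
preview = server.get_data(tool_universe_id=7)
matches = server.filter_rows(7, "team", "FC Barcelona")
```

The point of the sketch: the agent only ever sees the preview and the results of narrow follow-up calls, so it has to plan its queries instead of reading the whole table.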

Tool Selection with Dashboard APIs (1,597 instances, 17 domains): This is where things get nasty. Each domain has 6 to 328 tools available (average 116). Agents must pick the right one from a massive set, and OpenAI’s API spec caps tool lists at 128. So if you’re building on GPT-4, or on any hosted model with a similar limit, you’re already hitting a wall before the agent even starts thinking.

The APIs here are REST endpoints wrapped by MCP servers. They’re query-aligned and encapsulate most computation, so picking wrong means the whole chain collapses.
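The arithmetic behind that wall is simple enough to write down. This is a rough sketch using the domain sizes quoted above (6, 116, and 328 tools) and the 128-tool cap; the function names are hypothetical.

```python
MAX_TOOLS_PER_REQUEST = 128  # the cap cited above


def tools_hidden_by_cap(num_tools: int, cap: int = MAX_TOOLS_PER_REQUEST) -> int:
    """How many tools cannot be listed in a single request."""
    return max(0, num_tools - cap)


def requests_needed(num_tools: int, cap: int = MAX_TOOLS_PER_REQUEST) -> int:
    """How many requests it would take to show every tool at least once."""
    return -(-num_tools // cap)  # ceiling division


for size in (6, 116, 328):  # smallest, average, and largest domain
    print(f"{size} tools: {tools_hidden_by_cap(size)} hidden, "
          f"{requests_needed(size)} request(s)")
# 6 tools: 0 hidden, 1 request(s)
# 116 tools: 0 hidden, 1 request(s)
# 328 tools: 200 hidden, 3 request(s)
```

So the average domain squeaks under the limit, but in the largest one 200 of the 328 tools are invisible in any single request.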

Where Models Actually Fail

The paper doesn’t sugarcoat this. Models fail hard on VAKRA, and the failure modes are instructive:

1. Tool selection paralysis. Given 100+ tools, agents either pick randomly or default to the first vaguely relevant one. There’s no systematic narrowing down. This isn’t surprising – LLMs weren’t designed for combinatorial selection problems – but it’s a real blocker for enterprise deployment.

2. Reasoning chain breakage. Multi-step workflows require maintaining state across calls. Models lose track of intermediate results, forget what they’ve already retrieved, or hallucinate data that doesn’t exist. The preview mechanism in VAKRA actually makes this worse because agents can’t just dump everything into their context window.

3. API spec limitations. The 128-tool cap is a hard constraint for anyone calling models through OpenAI’s API. VAKRA’s larger domains exceed it (up to 328 tools), meaning agents literally can’t see all available tools in one request. This isn’t a model failure – it’s an infrastructure failure that kills usability.

4. Compositional reasoning gaps. The benchmark specifically tests combining structured API calls with unstructured document retrieval. Models that excel at one often fail at the other. Few can do both in a single workflow without breaking.
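For the second failure mode in the list above, one low-tech mitigation is to keep intermediate results in an explicit ledger outside the model’s context, so a later step can only use values that were actually recorded. The sketch below is my own illustration, not something the VAKRA paper proposes; StepLedger and the step names are hypothetical.

```python
from typing import Any


class StepLedger:
    """Records the result of each step so later steps cannot invent values."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def record(self, step: str, value: Any) -> None:
        # Refuse silent overwrites: a repeated step name usually means lost state.
        if step in self._results:
            raise ValueError(f"step {step!r} already recorded")
        self._results[step] = value

    def get(self, step: str) -> Any:
        # A missing step is an error, never a guess.
        if step not in self._results:
            raise KeyError(f"no recorded result for step {step!r}")
        return self._results[step]


ledger = StepLedger()
ledger.record("barcelona_row_count", 2)      # result of an earlier tool call
count = ledger.get("barcelona_row_count")    # a later step reads it back
```

Asking for a step that was never recorded raises KeyError, which is the behavior you want from an agent harness: a loud failure instead of a hallucinated number.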

Why This Matters More Than Your Average Benchmark

Most agent benchmarks are basically “can this model answer a question about a Wikipedia article?” VAKRA is “can this model navigate a real enterprise data ecosystem, make decisions under tool constraints, and recover from its own mistakes?”

That’s a fundamentally harder problem, and the results show we’re not there yet. The VAKRA team has been upfront about this – they’re not selling a solution, they’re exposing the problem.

What I’d Like to See Next

The dataset and leaderboard are open source, which is great. But what I really want is:

Analysis of which model families handle specific failure modes better. Is Claude better at tool selection? Is GPT-4 better at reasoning chains? The current data doesn’t break this down cleanly.
Work on adaptive tool selection strategies. If we can’t fit all tools in context, maybe agents need hierarchical selection or retrieval-augmented tool discovery.
More focus on error recovery. Current benchmarks mostly measure one-shot success. Real agents need to detect when they’ve gone wrong and backtrack.
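To show what retrieval-augmented tool discovery from the list above might look like, here is a minimal sketch that shortlists tools by word overlap between the query and each tool’s name and description. It is a stand-in under stated assumptions: a real system would more likely use embeddings, and the shortlist_tools function and the example tool names are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: dict[str, str], k: int = 128) -> list[str]:
    """Return up to k tool names ranked by token overlap with the query.

    `tools` maps tool name -> description. Ties are broken alphabetically
    so the ranking is deterministic.
    """
    query_tokens = tokenize(query)
    scored = [
        (len(query_tokens & tokenize(f"{name} {description}")), name)
        for name, description in tools.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


tools = {
    "get_team_stats": "Return season statistics for a football team",
    "list_invoices": "List invoices for a customer account",
    "get_weather": "Return the weather forecast for a city",
}
best = shortlist_tools("football team statistics for FC Barcelona", tools, k=1)
```

With k set to the provider’s tool cap, the model only ever sees a shortlist that fits in one request. The obvious risk is that the right tool gets filtered out before the model can pick it, which is exactly the trade-off I’d like to see measured.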

VAKRA is a wake-up call for anyone building production agents. The demos are impressive, but the benchmarks tell a different story. We’ve got a long way to go before these systems handle real enterprise complexity reliably.

If you want to torture-test your own models on VAKRA, I’d recommend it – the results might surprise you.
