By Troels Marstrand
April 01, 2025
There is a lot of talk about agentic AI, but I don’t really get it. It is a new wrapper around LLMs (such as OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini) with access to a series of APIs. This is not some new breakthrough or “smarter” way of getting things done. I would argue it is quite the opposite.
In fact, research has shown that for some systems, 3 out of 4 attempts to solve a task result in failure. LLMs suffer increasingly worse performance the further you get from their training distribution, that is, the types of text and examples they have seen before. They also respond quite badly to very vague instructions. To further exacerbate the problem, you have no way of knowing when something is too far from the training distribution (because you don’t have access to it), and you don’t know when something is too vague.
Now you hook this up to a series of APIs and hope the LLM will query the correct one, with the correct parameters, based on the context of the input. If something goes wrong, you have to troubleshoot and do standard bug fixing. However, it becomes really difficult to retrace the steps that led to the failure, as LLM outputs are stochastic and will likely change each time you call the model with the same parameters (unless temperature is set to 0).
You now have a system where you don’t know when things will go wrong, and when they do, you have little to no ability to recreate the error and do standard bug fixing. Oh, and I forgot to mention: what do you do when a new release of the underlying LLM comes out? How will that affect your system?
Now – let us say you have squared all of these issues away and your LLM gets it right 99% of the time. That is pretty good for an AI model. But what you implemented was a multi-agent system, as that is all the rage. That means the 1% error rate now compounds across the different agents, and in a simple linear system with around 5-6 agents your overall error rate will quickly rise from 1% to roughly 5%. Imagine running this at any kind of scale, like processing applications for tax deductions.
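The compounding above is just independent probabilities multiplied out. A quick sketch, assuming each agent in the chain must succeed independently for the task to succeed:

```python
def pipeline_success_rate(per_agent_success: float, n_agents: int) -> float:
    """Overall success rate of a linear pipeline where every one of
    n_agents must independently succeed for the task to succeed."""
    return per_agent_success ** n_agents

# A 99%-reliable agent looks great in isolation...
print(f"{pipeline_success_rate(0.99, 1):.1%}")  # 99.0%
# ...but chain five or six of them and roughly 5% of tasks fail end to end.
print(f"{pipeline_success_rate(0.99, 5):.1%}")  # 95.1%
print(f"{pipeline_success_rate(0.99, 6):.1%}")  # 94.1%
```

The independence assumption is generous: if agents feed each other faulty intermediate outputs, real pipelines can degrade faster than this.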
The thoughts above are not merely a rant – a recent paper out of UC Berkeley by Cemri et al. shows that some multi-agent systems have a success rate as low as 25%. The paper is well worth a read and points to three categories of failures for such systems:
Poor specification: here the agents simply don’t understand the task at hand, or choose to go off on their own (37.17% of all failures)
Inter-agent misalignment: failure to communicate, withholding information, and all the usual things you see even when humans try to collaborate (31.41% of all failures)
Task verification: stopping before the job is done, or not verifying that it was actually done (31.41% of all failures)
What is interesting is that quantifying these failure points required a full trace of the conversation history and task execution across the multi-agent system. That is akin to debugging a modern tech stack by looking only at print statements, without a proper testing, logging, or debugging system. This is not a viable approach to verifying and testing at scale.
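To make the tracing point concrete, here is a minimal sketch of the kind of structured logging such systems would need as a baseline. The `call_fn` argument is a hypothetical placeholder for whatever actually invokes the LLM; nothing here assumes a particular framework:

```python
import time
import uuid


def traced_call(agent_name, prompt, call_fn, trace):
    """Record every agent invocation (input, output, timing, status) so a
    failed run can at least be inspected after the fact. `call_fn` is a
    hypothetical stand-in for the real LLM call."""
    entry = {
        "id": str(uuid.uuid4()),
        "agent": agent_name,
        "prompt": prompt,
        "started": time.time(),
    }
    try:
        entry["output"] = call_fn(prompt)
        entry["status"] = "ok"
    except Exception as exc:
        entry["output"] = None
        entry["status"] = f"error: {exc}"
    entry["finished"] = time.time()
    trace.append(entry)
    return entry["output"]


# Usage: a fake agent, so the sketch runs without any API access.
trace = []
traced_call("planner", "Split the task into steps", lambda p: "step 1, step 2", trace)
print(trace[0]["agent"], trace[0]["status"])
```

Even this much only gives you a post-mortem record; it does not solve the deeper problem that re-running the trace may produce different outputs.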
I won’t be holding my breath for agents and multi-agent systems as the big thing in 2025; they are simply not ready for primetime yet. There will be a lot of hype and buzz, but the fact of the matter is that the tooling to build such systems doesn’t exist yet. We need to figure out how to do both unit and integration tests of these systems, and how to track bugs across multiple agents. That is really difficult when the agent is a random process and each new run will produce slightly different results. In a sense, we would need a toolkit that can measure boundary conditions for when specific prompts and outputs are within the tolerance range of the agents to produce the same output.
I am sure 2025 will be the year where we see some embarrassing multi-agent failures, and a decent amount of investor capital will be burned on chasing this a bit too early. At this stage, AI agents are more akin to pageants: all pomp and no substance.
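One crude version of such a tolerance check is simply to re-run the same prompt many times and measure how often the outputs agree. A minimal sketch, where `call_fn` is again a hypothetical stand-in for the real agent call:

```python
import random
from collections import Counter


def within_tolerance(call_fn, prompt, runs=20, min_agreement=0.9):
    """Re-run the same prompt `runs` times and check whether the most
    common output appears at least `min_agreement` of the time - a crude
    stand-in for the stability testing the post argues is missing."""
    outputs = Counter(call_fn(prompt) for _ in range(runs))
    top_output, top_count = outputs.most_common(1)[0]
    return top_count / runs >= min_agreement, top_output


# A fake "agent" that answers wrongly about 5% of the time.
rng = random.Random(42)
flaky = lambda p: "42" if rng.random() > 0.05 else "banana"
stable, answer = within_tolerance(flaky, "What is 6 * 7?")
print(stable, answer)
```

Real stability testing would be harder than this: free-text outputs rarely match exactly, so agreement would need semantic comparison rather than string equality.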
We work with a select number of leaders who are serious about winning. If that’s you, let’s talk.