Why do Multi-Agent Systems Fail? (Berkeley Paper)

Written by
Audrey Miles
Updated on: July 1, 2025
Recommendation

This paper explores the challenges and limitations of multi-agent systems (MAS) and reveals the underlying reasons for their limited performance improvement.

Core content:
1. The gap between the potential and actual performance of MAS in complex task processing
2. Analysis of challenges and failure modes that hinder the effectiveness of MAS
3. Structural defect identification and suggestions for improving MAS design principles

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

Background

  1. Research Questions
  • The paper asks why multi-agent LLM systems (MAS) show little performance improvement over single-agent frameworks. Despite their promise for complex, multi-step tasks and dynamic interaction with different environments, MAS show only marginal gains in accuracy or performance on popular benchmarks.
  • Research Difficulties
    • The difficulties include: comprehensively analyzing the challenges that hinder MAS effectiveness; identifying the multiple failure modes that cause MAS to fail; and proposing effective measures to improve MAS performance and reliability.
  • Related Work
    • On agent system challenges, existing studies have proposed solutions for specific problems, such as addressing long-horizon web navigation by introducing workflow memory, but these works neither fully explain why MAS fail nor propose strategies that generalize across domains.
    • In terms of agent system design principles, some studies have highlighted the challenges of building robust agent systems and proposed new strategies, but these studies mainly focus on single-agent design and lack research on the comprehensive failure modes of MAS.
    • In terms of fault classification in LLM systems, there is limited research specifically on the failure modes of LLM agents. This study fills this gap and provides a pioneering study of MAS failure modes.

    Overall conclusion

    The results show that MAS failures stem not only from the limitations of LLMs, but more importantly from structural defects in MAS design. Future research should focus on better design principles and organizational structures for MAS to increase their reliability and performance. The main contributions of the paper include:

    1. We introduce MASFT, the first empirically grounded taxonomy of MAS failures, which provides a structured framework for understanding and mitigating them.
    2. A scalable LLM-as-a-judge evaluation process is developed to analyze new MAS performance and diagnose failure modes.
    3. Intervention studies targeting agent norms, conversation management, and verification strategies achieved some improvements but highlighted the need for structural MAS redesign.
    4. The relevant data sets and tools have been open-sourced to facilitate further research on MAS.

    Research Methods

    This paper proposes MASFT, a Multi-Agent System Failure Taxonomy, to diagnose why multi-agent large language model systems fail. Specifically:

    1. Data collection and analysis
    • We adopt theoretical sampling to select diverse multi-agent systems and task sets, collecting more than 150 conversation traces from five popular open-source MAS for analysis, averaging more than 15,000 lines of text per trace.
  • Failure mode identification and taxonomy construction
    • Data were collected and analyzed iteratively through theoretical sampling, open coding, constant comparative analysis, memoing, and theorizing, identifying 14 distinct failure modes.
    • The failure modes cluster into three main failure categories:
    • Specification and system design failures: disobeying task specifications (15.2%), disobeying role specifications (1.57%), step repetition (11.5%), loss of conversation history (2.36%), and unawareness of termination conditions (6.54%).
    • Inter-agent misalignment: conversation reset (5.50%), failure to ask for clarification (2.09%), task derailment (5.50%), information withholding (6.02%), ignoring other agents' input (4.71%), and reasoning-action mismatch (7.59%).
    • Task verification and termination: premature termination (9.16%), no or incomplete verification (3.2%), and incorrect verification (3.3%).
    • Three expert annotators independently labeled 15 traces, reaching inter-annotator agreement with a Cohen's kappa of 0.88, and iteratively refined the failure modes and categories into the structured MASFT taxonomy.
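Cohen's kappa, the agreement statistic cited above, corrects raw annotator agreement for agreement expected by chance. A minimal sketch of the computation (the example trace labels are hypothetical, not from the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the two annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical per-trace failure-mode labels from two annotators.
a = ["task_spec", "premature_term", "task_spec", "derailment", "task_spec"]
b = ["task_spec", "premature_term", "task_spec", "task_spec", "task_spec"]
print(round(cohens_kappa(a, b), 2))  # → 0.58
```

A kappa of 0.88 on 15 traces, as reported, indicates near-perfect agreement under the usual Landis-Koch interpretation.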
  • Automated evaluation pipeline development
    • We introduce a scalable automatic evaluation pipeline using OpenAI's o1 as an LLM-as-a-judge. Cross-validation against three human expert annotators on 10 traces yields a Cohen's kappa of 0.77.
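The paper does not publish its judge prompt here, but the shape of such a pipeline can be sketched. In this hedged example the judge model is injected as a plain callable (so o1 or any other model can be swapped in), and the prompt text, mode names, and `keyword_stub` stand-in are all assumptions for illustration:

```python
# Sketch of an LLM-as-a-judge pipeline for tagging MAS traces with failure
# modes. The judge is a callable taking a prompt and returning text; a
# trivial keyword stub stands in for a real model call here.
FAILURE_MODES = [
    "disobey_task_specification",
    "premature_termination",
    "no_or_incomplete_verification",
]

JUDGE_PROMPT = (
    "You are given a multi-agent conversation trace. "
    "Return the failure modes you observe as a comma-separated list "
    "drawn from: {modes}.\n\nTrace:\n{trace}"
)

def judge_trace(trace: str, judge) -> list[str]:
    """Ask the judge model which taxonomy failure modes appear in one trace."""
    reply = judge(JUDGE_PROMPT.format(modes=", ".join(FAILURE_MODES), trace=trace))
    # Keep only labels that belong to the taxonomy, preserving order.
    found = [m.strip() for m in reply.split(",")]
    return [m for m in found if m in FAILURE_MODES]

def keyword_stub(prompt: str) -> str:
    # Stand-in judge: flags premature termination if the trace ends abruptly.
    return "premature_termination" if "[terminated early]" in prompt else ""

print(judge_trace("Agent A: done. [terminated early]", keyword_stub))
```

Filtering the judge's reply against the fixed taxonomy keeps the pipeline robust to free-form model output, which is what makes cross-validation against human annotators meaningful.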

    Experimental design

    1. Data Collection : Five popular open-source MAS (MetaGPT, ChatDev, HyperAgent, AppWorld, AG2) were analyzed across more than 150 tasks, with six expert annotators participating in the study.
    2. Sample selection : Representative multi-agent systems and tasks were selected, covering different application scenarios and system architectures to ensure broad applicability of the findings.
    3. Parameter configuration : Each MAS was analyzed and evaluated in detail, including tracing its execution trajectories and identifying and classifying the failure modes that occurred.

    Empirical Data

    System        Success rate    Failure rate    Test scenario
    MetaGPT       66.0%           34.0%           ProgramDev
    ChatDev       25.0%           75.0%           ProgramDev
    HyperAgent    25.3%           74.7%           SWE-bench Lite
    AppWorld      13.3%           86.7%           Test-C
    AG2           84.8%           15.2%           GSM-Plus
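The table lends itself to a quick programmatic sanity check; a minimal sketch (rates copied from the table above) verifying that success and failure rates are complementary and ranking systems by failure rate:

```python
# (system, success %, failure %) as reported in the table above.
RESULTS = [
    ("MetaGPT",    66.0, 34.0),
    ("ChatDev",    25.0, 75.0),
    ("HyperAgent", 25.3, 74.7),
    ("AppWorld",   13.3, 86.7),
    ("AG2",        84.8, 15.2),
]

# Success and failure rates should sum to 100% for each system.
for name, ok, fail in RESULTS:
    assert abs(ok + fail - 100.0) < 1e-9, name

# Rank systems from most to least failure-prone.
ranked = sorted(RESULTS, key=lambda r: r[2], reverse=True)
print([name for name, _, _ in ranked])
# → ['AppWorld', 'ChatDev', 'HyperAgent', 'MetaGPT', 'AG2']
```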

    Results and Analysis

    1. Failure Mode Analysis
    • Detailed analysis identified 14 distinct failure modes, grouped into three main failure categories. These modes appear across different MAS, degrading system performance and raising task failure rates.
    • For example, within specification and system design failures, disobeying task specifications is among the most frequent modes (15.2%) while disobeying role specifications is rare (1.57%); within inter-agent misalignment, conversation reset and reasoning-action mismatch are also relatively common.
  • Comparison with existing systems
    • Evaluating the state-of-the-art open-source MAS ChatDev with GPT-4o and Claude-3 as backbone models shows that its correctness can be as low as 25%, indicating that MAS still leave a large performance gap in practice.
  • Effects of interventions
    • Two interventions were implemented: refining agent role specifications and improving the orchestration strategy. A case study on AG2 and experiments on ChatDev show that while these interventions bring a +14% improvement for ChatDev, they do not resolve all failure cases, and the improved performance still falls short of what real deployment requires.