Comparison of 19 Intelligent Agent Frameworks: Which One Is Right for You?

An Agent is an intelligent entity that can autonomously perceive its environment and take actions to achieve its goals; in other words, the AI acts on behalf of a person or an organization to carry out specific tasks and transactions, reducing the complexity of their work and lowering workload and communication costs.
Background
We are currently exploring application directions for agents and took this opportunity to survey the mainstream agent frameworks. This article is a record of that research process.
Popular open-source Agents
Open-source agent applications are now flourishing. This article selects 19 agents with relatively high popularity and discussion, essentially covering the mainstream agent frameworks, and gives a brief summary of each as a reference for learning.
Agent Basics
The core decision-making logic of an agent is to let the LLM choose specific actions (or judge results) based on dynamically changing environmental information, act on the environment, and repeat these steps over multiple rounds of iteration until the goal is completed.
Streamlined decision process: P (Perception) → P (Planning) → A (Action)
-
Perception is the ability of an Agent to collect information from the environment and extract relevant knowledge from it.
-
Planning refers to the decision-making process made by the Agent for a certain goal.
-
Action is the behavior carried out based on the environment and the plan.
Among these, the Policy is the core by which the agent decides on an Action, and the Action in turn becomes the premise and basis for further Perception through Observation, forming an autonomous closed learning loop.
The engineering realization can be split into four core modules: Reasoning, Memory, Tools, and Action.
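To make the loop above concrete, here is a minimal, framework-agnostic sketch in Python. Everything in it (`call_llm`, the `TOOLS` table, the prompt format) is a hypothetical placeholder rather than the API of any framework discussed below.

```python
# Minimal sketch of the Perception -> Planning -> Action loop with a Memory list
# and a Tools table. All names here are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API; returns a canned reply here."""
    return "FINISH: (replace call_llm with a real LLM call)"

TOOLS = {
    "search": lambda q: f"search results for {q!r}",
    "calculator": lambda expr: str(eval(expr)),  # demo only, never eval untrusted input
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    memory: list[str] = []                       # Memory module
    for _ in range(max_steps):
        observation = "\n".join(memory[-5:])     # Perception: recent context
        plan = call_llm(                         # Planning / Reasoning
            f"Goal: {goal}\nObservations:\n{observation}\n"
            "Reply with 'FINISH: <answer>' or '<tool> <input>'."
        )
        if plan.startswith("FINISH:"):           # goal completed
            return plan.removeprefix("FINISH:").strip()
        tool_name, _, tool_input = plan.partition(" ")
        result = TOOLS.get(tool_name, lambda x: "unknown tool")(tool_input)
        memory.append(f"Action: {plan}\nObservation: {result}")  # Action feeds the next Perception
    return "max steps reached"

print(run_agent("say hello"))
```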
Decision Model
Currently, the mainstream decision-making model for agents is the ReAct framework, along with some variants of it. The two are compared below.
-
Traditional ReAct framework: Reason and Act
ReAct = few-shot prompt + Thought + Action + Observation. It is a common prompt structure for tool invocation: reason and plan first, then execute concrete actions based on the environment, while exposing the thought process (Thought).
-
Plan-and-Execute ReAct
A BabyAGI-like execution process: one part of the agent is responsible for planning while another handles task execution; a complex task is decomposed into multiple subtasks, which are then executed sequentially or in batches (a minimal sketch appears at the end of this section).
The advantage is that, for complex tasks requiring multiple tool calls, the large language model only needs to be invoked a small number of times for planning and replanning, rather than once for every tool call.
LLMCompiler: executes tasks in parallel by generating a DAG of actions during planning; this can be understood as aggregating multiple tools into a tool execution graph and executing actions along that graph.
paper: https://arxiv.org/abs/2312.04511?ref=blog.langchain.dev
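Below is a minimal sketch of the Plan-and-Execute pattern described above: one up-front planning call, then subtasks executed without consulting the LLM for every tool call. `call_llm` and `run_tool` are hypothetical stand-ins, not code from BabyAGI or LLMCompiler.

```python
# Plan-and-Execute sketch: one planning call up front, then subtasks are executed
# sequentially without re-consulting the LLM for every tool call.
# `call_llm` and `run_tool` are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    return "1. research the topic\n2. draft an outline\n3. write the summary"  # canned plan

def run_tool(subtask: str) -> str:
    return f"result of {subtask!r}"  # pretend tool execution

def plan_and_execute(task: str) -> list[str]:
    plan = call_llm(f"Break this task into numbered subtasks:\n{task}")
    subtasks = [line.split(". ", 1)[1] for line in plan.splitlines() if ". " in line]
    return [run_tool(s) for s in subtasks]       # execute sequentially / in batches

print(plan_and_execute("write a report on agent frameworks"))
```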
Agent framework
Based on differences in architecture and implementation, agent frameworks can be roughly divided into two categories: Single-Agent and Multi-Agent; Multi-Agent frameworks use multiple agents to tackle more complex problems.
Single-Agent
BabyAGI
git: https://github.com/yoheinakajima/babyagi/blob/main/babyagi.py
doc: https://yoheinakajima.com/birth-of-babyagi/
The babyAGI decision-making process: 1) Decompose tasks based on requirements; 2) Prioritize tasks; 3) Execute tasks and integrate results.
Highlights
As an early agent implementation, the BabyAGI framework is simple and practical; its task prioritization module is a relatively unique feature that is rarely seen in later agents.
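A minimal sketch of the three-step loop above (execute, create new tasks, re-prioritize). It mirrors the logic of babyagi.py at a high level but is not the original code; `call_llm` is a hypothetical stand-in.

```python
# Sketch of the BabyAGI-style loop: execute a task, create follow-up tasks,
# then re-prioritize the queue. `call_llm` is a hypothetical stand-in.
from collections import deque

def call_llm(prompt: str) -> str:
    return ""  # stand-in for a real completion call

def babyagi_style_loop(objective: str, first_task: str, max_iters: int = 5):
    tasks = deque([first_task])
    results = []
    for _ in range(max_iters):
        if not tasks:
            break
        task = tasks.popleft()
        result = call_llm(f"Objective: {objective}\nTask: {task}\nDo the task.")
        results.append((task, result))
        # 1) create new tasks based on the result
        new_tasks = call_llm(
            f"Objective: {objective}\nLast result: {result}\nReturn new tasks, one per line."
        ).splitlines()
        tasks.extend(t for t in new_tasks if t.strip())
        # 2) re-prioritize the whole queue -- the step that distinguishes BabyAGI
        ordering = call_llm(
            "Reorder these tasks by priority, one per line:\n" + "\n".join(tasks)
        ).splitlines()
        if ordering:
            tasks = deque(t for t in ordering if t.strip())
    return results

print(babyagi_style_loop("learn about agents", "make a reading list"))
```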
AutoGPT
git: https://github.com/Significant-Gravitas/AutoGPT
AutoGPT is positioned like a personal assistant that helps the user complete a specified task, such as researching a topic. AutoGPT places more emphasis on the use of external tools, such as search engines and web page browsing.
Similarly, as an early agent, AutoGPT is small and simple, and it has many drawbacks, such as no control over the number of iterations and a limited tool set. Nevertheless, it has had many imitators, and many frameworks have evolved from it.
HuggingGPT
git: https://github.com/microsoft/JARVIS
paper: https://arxiv.org/abs/2303.17580
HuggingGPT divides a task into four parts (a minimal sketch of this pipeline follows the list):
-
Task planning: splitting the task into different steps; this step is straightforward to understand.
-
Model Selection: In a task, different models may need to be invoked to accomplish it. For example, in a writing task, first a sentence is written, then the model is expected to help supplement the text, and then an image is expected to be generated. This involves calling different models.
-
Execution of the task: Different models are selected for execution depending on the task.
-
Response aggregation and feedback: The results of the execution are fed back to the user.
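As referenced above, here is a minimal sketch of the four-stage pipeline. The model catalog, model names, and helper functions are illustrative assumptions, not HuggingGPT's actual implementation.

```python
# Hypothetical sketch of HuggingGPT's four stages: plan -> select model -> execute -> aggregate.

def call_llm(prompt: str) -> str:
    return ""  # stand-in for the controller LLM

MODEL_CATALOG = {                     # tiny pretend Hugging Face catalog
    "text-generation": "gpt2",
    "image-generation": "stable-diffusion",
}

def plan_tasks(request: str) -> list[dict]:
    # 1) Task planning: the controller LLM splits the request into typed steps
    return [{"type": "text-generation", "input": request}]

def select_model(task: dict) -> str:
    # 2) Model selection: pick a model suited to the task type
    return MODEL_CATALOG[task["type"]]

def execute(task: dict, model: str) -> str:
    # 3) Task execution: call the chosen model (stubbed here)
    return f"[{model}] output for {task['input']!r}"

def aggregate(request: str, outputs: list[str]) -> str:
    # 4) Response aggregation: the controller LLM summarizes the outputs for the user
    return call_llm(f"Request: {request}\nOutputs: {outputs}\nSummarize for the user.")

def hugginggpt_style(request: str) -> str:
    tasks = plan_tasks(request)
    outputs = [execute(t, select_model(t)) for t in tasks]
    return aggregate(request, outputs)

print(hugginggpt_style("write a caption and generate an image"))
```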
Highlights of HuggingGPT: HuggingGPT differs from AutoGPT in that it can invoke different models on HuggingFace for more complex tasks, thus increasing the precision and accuracy of each task. However, the overall cost is not much lower.
GPT-Engineer
git: https://github.com/AntonOsika/gpt-engineer
Built on LangChain, a single "engineer" agent for coding scenarios.
The goal is to create a complete code repository, asking users to enter additional supplementary information when needed.
Highlights: an automated upgrade of Code-Copilot
Samantha
git: https://github.com/BRlkl/AGI-Samantha
tw: https://twitter.com/Schindler___/status/1745986132737769573
Inspired by the movie Her, its core reasoning logic is reflection + observation: GPT-4V continuously acquires image and voice information from the environment, and the agent can initiate questions autonomously.
AGI-Samantha Features
-
Dynamic voice communication: Samantha can autonomously decide when to communicate based on context and its reflection.
-
Real-time visual capabilities: It understands and reacts to visual information, such as the content of images or videos; for example, if it sees an object or scene, it can talk about it or act on it. Even when Samantha does not use visual information directly, that information continues to influence her thinking and behavior behind the scenes, shaping how she makes decisions and acts.
-
Externally categorized memory: Samantha has a special memory system that dynamically writes and reads the most relevant information based on context.
-
Continuous Evolution: It stores experiences that influence its behavior, such as personality, speech frequency, and style.
AGI-Samantha consists of a number of purpose-built LLM calls, each referred to as a "module". The main modules are Thinking, Consciousness, Subconsciousness, Answering, Memory Reading, Memory Writing, Memory Selection, and Vision. Through internal looping and coordination, these modules mimic the workflow of the human brain, allowing Samantha to receive and process visual and auditory information and respond accordingly. In short, AGI-Samantha is an AI system that strives to mimic human thinking and behavior.
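A highly simplified sketch of how such a module loop might fit together. The module prompts, helper names, and control flow here are illustrative guesses, not the actual AGI-Samantha code.

```python
# Illustrative sketch of the module loop described above (not the real code).

def module(name: str, context: str) -> str:
    """Stand-in for one purpose-built LLM call (a "module")."""
    return f"<{name} output>"

def samantha_tick(visual_input: str, audio_input: str, memory: list[str]) -> str | None:
    context = f"vision: {module('Vision', visual_input)}\naudio: {audio_input}"
    context += "\nmemory: " + module("Memory Reading", "\n".join(memory[-10:]))
    thought = module("Thinking", context)                 # internal monologue
    decision = module("Consciousness", thought)           # decide whether to speak this tick
    memory.append(module("Memory Writing", thought))      # store the experience
    if "speak" in decision:
        return module("Answering", thought)               # produce the spoken reply
    return None                                           # stay silent this tick

print(samantha_tick("camera frame", "hello there", []))
```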
Highlights
Combines visual information to assist decision making and has an optimized memory module; if interested, you can fork the code and run it locally to play with it.
AppAgent
doc: https://appagent-official.github.io/
git: https://github.com/X-PLUG/MobileAgent
A multimodal app agent based on Grounding DINO and GPT vision models.
Highlights: a vision/multimodal, OS-level app agent that can perform system-level operations and directly control multiple apps. Because system-level permissions are required, only Android is supported.
OS-Copilot
git: https://github.com/OS-Copilot/FRIDAY
doc: https://os-copilot.github.io/
An OS-level agent: FRIDAY can learn from images, videos, or text and perform a range of computer tasks, such as drawing in Excel or creating a website. Most importantly, FRIDAY can learn new skills by doing tasks, just as humans do, getting better through constant trial and practice.
Highlights: Self-learning improvements, learning how to use software applications more effectively, best practices for performing specific tasks, and more.
Langgraph
doc: https://python.langchain.com/docs/langgraph
A LangChain feature that lets developers define the execution flow inside a single agent as a graph, adding flexibility; it can also be combined with tools such as LangSmith.
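A minimal sketch of defining an agent's execution flow as a graph using LangGraph's StateGraph interface. The state fields and node logic are made up for illustration, and exact signatures may differ between langgraph versions.

```python
# Minimal LangGraph-style sketch: nodes are functions over a shared state,
# edges define the execution flow. Node logic here is a placeholder.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    draft: str
    done: bool

def plan(state: AgentState) -> AgentState:
    return {**state, "draft": f"plan for {state['question']}"}

def act(state: AgentState) -> AgentState:
    return {**state, "done": True}

workflow = StateGraph(AgentState)
workflow.add_node("plan", plan)
workflow.add_node("act", act)
workflow.set_entry_point("plan")
workflow.add_edge("plan", "act")
workflow.add_conditional_edges("act", lambda s: END if s["done"] else "plan")  # loop until done
app = workflow.compile()

print(app.invoke({"question": "hello", "draft": "", "done": False}))
```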
Multi-Agent
Stanford Virtual Town
git: https://github.com/joonspk-research/generative_agents
paper: https://arxiv.org/abs/2304.03442
As an early multi-agent project, Virtual Town's designs have influenced many other multi-agent frameworks. Its reflection and memory retrieval features are particularly interesting, simulating the way humans think.
Agents perceive their environment, and all the perceptions of the current agent (a complete record of the experience) are stored in a memory stream. Based on the agent's perceptions, the system retrieves relevant memories and uses these retrieved memories to determine the next behavior. These retrieved memories are also used to form long-term plans and create higher-level reflections, which are fed into the memory stream for future use.
The memory stream records all of the agent's experiences; retrieval selects a portion of it based on Recency, Importance, and Relevance and passes it to the language model.
Reflection is a higher-level, more abstract form of thinking generated by the agent. Because reflections are also memories, they are retrieved alongside other observations. Reflections are generated periodically.
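A minimal sketch of this retrieval scoring, combining recency, importance, and relevance as in the generative-agents paper. The decay factor, equal weighting, and toy embedding function are illustrative assumptions.

```python
# Sketch of memory-stream retrieval: score = recency + importance + relevance.
import math, time

def embed(text: str) -> list[float]:
    # stand-in for a real embedding model
    return [float(len(text) % 7), float(text.count("a"))]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def score(memory: dict, query: str, now: float, decay: float = 0.995) -> float:
    hours = (now - memory["timestamp"]) / 3600
    recency = decay ** hours                       # exponential time decay
    importance = memory["importance"] / 10         # rated 1..10 by the LLM when stored
    relevance = cosine(embed(memory["text"]), embed(query))
    return recency + importance + relevance        # equal weights in this sketch

def retrieve(stream: list[dict], query: str, k: int = 3) -> list[dict]:
    now = time.time()
    return sorted(stream, key=lambda m: score(m, query, now), reverse=True)[:k]

stream = [
    {"text": "saw Maria at the cafe", "importance": 4, "timestamp": time.time() - 7200},
    {"text": "planning a party tonight", "importance": 8, "timestamp": time.time() - 60},
]
print(retrieve(stream, "what should I do today?", k=1))
```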
MetaGPT
git: https://github.com/geekan/MetaGPT
doc: https://docs.deepwisdom.ai/main/zh/guide/get_started/introduction.html
MetaGPT is an open-source Multi-Agent framework from China. Community activity is currently high, new features keep landing, and Chinese documentation support is very good.
MetaGPT is organized the way a software company is, with the goal of fulfilling a software requirement: input a one-sentence boss requirement, and it outputs user stories / competitive analysis / requirements / data structures / APIs / documents, and so on.
Internally, MetaGPT includes Product Managers / Architects / Project Managers / Engineers, providing the full process of a software company with carefully orchestrated SOPs.
A Role will _observe Messages from the Environment. If there is a Message produced by an Action that the Role watches, it is a valid observation that triggers the Role's subsequent thinking and action. In _think, the Role selects an Action within its capabilities and sets it as the thing to do; in _act, the Role runs that Action and obtains the output. The output is encapsulated in a Message, which is finally published to the Environment, completing one full run of the agent.
Dialogue mode: each agent role maintains its own message queue and consumes messages from it according to its own settings; after completing an action, it publishes a message to the global environment for other agents to consume.
The overall code is concise, mainly including: actions (agent behaviors), documents (agent output documents), learn (learning new skills), memory (agent memory), prompts (prompt templates), providers (third-party services), utils (utility functions), and so on.
Interested readers can walk through the Role code; the core logic is there: https://github.com/geekan/MetaGPT/blob/main/metagpt/roles/role.py.
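A framework-agnostic sketch of the _observe → _think → _act → publish cycle described above. It mirrors the flow of role.py but deliberately does not use MetaGPT's real classes or signatures.

```python
# Sketch of the observe -> think -> act -> publish cycle (not MetaGPT's real API).
from dataclasses import dataclass, field

@dataclass
class Message:
    content: str
    cause_by: str                              # which Action produced it

@dataclass
class Environment:
    messages: list[Message] = field(default_factory=list)
    def publish(self, msg: Message) -> None:
        self.messages.append(msg)

class Role:
    def __init__(self, name: str, watch: set[str], actions: list[str]):
        self.name, self.watch, self.actions = name, watch, actions
        self.memory: list[Message] = []

    def _observe(self, env: Environment) -> list[Message]:
        return [m for m in env.messages if m.cause_by in self.watch]

    def _think(self, news: list[Message]) -> str:
        return self.actions[0]                 # pick an Action within its capabilities

    def _act(self, action: str, news: list[Message]) -> Message:
        output = f"{self.name} ran {action} on {len(news)} message(s)"
        return Message(content=output, cause_by=action)

    def run(self, env: Environment) -> None:
        news = self._observe(env)
        if not news:
            return                             # nothing it watches; no trigger
        msg = self._act(self._think(news), news)
        self.memory.append(msg)
        env.publish(msg)                       # publish back to the Environment

env = Environment([Message("user requirement: build a todo app", cause_by="UserRequirement")])
pm = Role("ProductManager", watch={"UserRequirement"}, actions=["WritePRD"])
pm.run(env)
print([m.content for m in env.messages])
```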
AutoGen
doc: https://microsoft.github.io/autogen/docs/Getting-Started
AutoGen is a framework developed by Microsoft for implementing complex workflows through agent communication. It is also currently the top-ranked Multi-Agent framework in terms of activity and is "on par" with MetaGPT.
Example: Suppose you are building an automated customer service system. In this system, one agent is responsible for receiving customer questions, another agent searches the database to find the answer, and another agent formats the answer and sends it to the customer. AutoGen can coordinate the work of these agents. This means you can have multiple "agents" (which can be LLMs, humans, or other tools) collaborating in a single workflow.
-
Customizability: AutoGen allows a high degree of customization. You can choose which type of LLM to use, which human input to use, and which tools to use. For example, in a content recommendation system, you may want to use a specially trained LLM to generate personalized recommendations, but also want human experts to provide feedback. AutoGen allows for seamless integration of the two.
-
Human Involvement: AutoGen also supports human input and feedback, which is useful for tasks that require human review or decision-making. Example: In a legal consulting application, initial legal advice may be generated by an LLM, but the final advice needs to be reviewed by a real legal expert. AutoGen can automate this process.
-
Workflow Optimization: AutoGen not only simplifies the creation and management of workflows, but it also provides tools and methods to optimize these processes. Example: If your application involves multiple steps of data processing and analysis, AutoGen can help you figure out which steps can be executed in parallel to speed up the process!
Multi-agent interaction framework:
https://microsoft.github.io/autogen/docs/Use-Cases/agent_chat
Three types of agents are provided, handling single-task execution, user input, and teamwork functionality respectively.
Basic two-agent interaction (a minimal code sketch follows these steps):
-
The assistant receives a message from user_proxy containing a task description.
-
The assistant then attempts to write Python code to solve the task and sends the response to the user_proxy.
-
Once user_proxy receives the response from the assistant, it tries to reply either by soliciting human input or by preparing an automatic response. If no human input is provided, user_proxy executes the code and uses the result as the automatic reply.
-
The assistant then generates a further response for user_proxy, which then decides whether to end the conversation. If not, steps 3 and 4 are repeated.
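As referenced above, a minimal sketch of this two-agent loop using AutoGen's AssistantAgent and UserProxyAgent. The model name, API-key handling, and task message are illustrative.

```python
# Minimal AutoGen two-agent sketch (see the AutoGen docs for the full options).
# The model name and api_key handling here are illustrative assumptions.
from autogen import AssistantAgent, UserProxyAgent

config_list = [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]  # placeholder

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",          # fully automatic replies, no human in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# user_proxy sends the task; the assistant writes code; user_proxy executes it
# and feeds the result back, repeating until termination.
user_proxy.initiate_chat(
    assistant,
    message="Plot NVDA and TSLA stock price change YTD and save it to a PNG.",
)
```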
Implementing a multi-agent communication approach:
Dynamic team communication: Register a reply function in the group chat manager, broadcast the message, and specify the role of the next speaker.
Finite state machine: Customize the DAG flowchart to define SOPs for inter-agent communication
Multi-Agent example:
Reference: https://microsoft.github.io/autogen/docs/Examples/#automated-multi-agent-chat
In addition, AutoGen has open-sourced a playground that supports operation via a web page and can be deployed locally. If you want to try it, refer to this tweet: https://twitter.com/MatthewBerman/status/1746933297870155992
ChatDEV
git: https://github.com/OpenBMB/ChatDev
doc: https://chatdev.modelbest.cn/introduce
ChatDev is a virtual software company that operates through agents with different roles: chief executive officer, chief product officer, chief technology officer, programmer, reviewer, tester, designer, etc. These agents form a multi-agent organizational structure whose mission is to "transform the digital world through programming". Agents within ChatDev collaborate by attending specialized functional workshops covering tasks such as design, coding, testing, and documentation.
ChatDev (2023.9) is easily mistaken for a concrete implementation of a general Multi-Agent framework for software development, but in fact it is not. ChatDev is based on Camel, which means its internal processes consist of multiple rounds of communication between two agents at a time, and the overall communication relationships and order between the different agent roles are configured by the developer. From this point of view, it does not look like a full-featured Multi-Agent framework implementation.
However, it is hard to say this is simply a shortcoming of building on Camel: if multi-agent communication routing is not done well, the result may be worse than this fixed, waterfall-style pairwise communication, and the authors of ChatDev describe the 1-to-1 communication as a feature.
The ChatDev codebase itself is not very reusable and depends on an old, essentially abandoned version of Camel. The project is more of an academic prototype supporting a paper than something designed for others to build on.
GPTeam
git: https://github.com/101dotxyz/GPTeam
A MetaGPT-like approach to multi-agent collaboration; an earlier Multi-Agent exploration with relatively fixed interaction patterns.
GPT Researcher
git: https://github.com/assafelovic/gpt-researcher
A serial Multi-Agent framework that can be adapted for content production.
The architecture of GPT Researcher runs two agents, a "planner" and an "executor": the planner generates research questions, the executor searches for relevant information for each question, and the planner then filters and summarizes all the collected information into a research report.
TaskWeaver
git: https://github.com/microsoft/TaskWeaver?tab=readme-ov-file
doc: https://microsoft.github.io/TaskWeaver/docs/overview
TaskWeaver is oriented towards data analysis tasks: it interprets user requests through code snippets and efficiently coordinates various plugins, in the form of functions, to carry out the analysis. TaskWeaver is more than just a tool; it is a system that interprets commands, converts them into code, and executes tasks with precision.
TaskWeaver's workflow involves several key components and processes. It consists of three key components: the Planner, the Code Generator (CG), and the Code Executor (CE); the Code Generator and Code Executor together form the Code Interpreter (CI).
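A schematic sketch of the Planner → Code Generator → Code Executor flow. It is not TaskWeaver's actual API; the `anomaly_detection` plugin and the generated snippet are hypothetical examples.

```python
# Schematic sketch of the Planner -> Code Generator -> Code Executor flow
# (not TaskWeaver's real API). `anomaly_detection` is a hypothetical plugin.

PLUGINS = {
    "anomaly_detection": lambda rows: [r for r in rows if r > 100],  # toy plugin function
}

def planner(request: str) -> list[str]:
    # Split the user request into steps (would normally be an LLM call)
    return ["load data", "run anomaly_detection", "report result"]

def code_generator(step: str) -> str:
    # Turn a step into a code snippet that may call registered plugins
    if "anomaly_detection" in step:
        return "result = PLUGINS['anomaly_detection']([3, 250, 7, 180])"
    return "result = None"

def code_executor(snippet: str) -> object:
    scope = {"PLUGINS": PLUGINS}
    exec(snippet, scope)                 # CG + CE together act as the Code Interpreter
    return scope.get("result")

def taskweaver_style(request: str) -> list[object]:
    return [code_executor(code_generator(step)) for step in planner(request)]

print(taskweaver_style("find anomalies in my sales numbers"))
```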
The paper mentions that subsequent exploration in the multi-agent direction can be combined with AutoGen.
Microsoft UFO
git: https://github.com/microsoft/UFO
UFO is a Windows-oriented agent that combines natural language and visual manipulation of the Windows GUI.
The working principle of UFO (UI-Focused Agent) is based on advanced visual language modeling techniques, in particular GPT-Vision, and a unique dual-agent framework that allows it to understand and perform graphical user interface (GUI) tasks in the Windows operating system. The following is a detailed explanation of how UFO works:
-
Dual-agent architecture: UFO consists of two main agents, the AppAgent and the ActAgent, responsible respectively for selecting and switching applications and for performing specific actions within those applications (a schematic sketch follows this list). Application Selection Agent (AppAgent): decides which application needs to be launched or switched to in order to fulfill a user request; it makes this choice by analyzing the user's natural-language command and a screenshot of the current desktop, and once the most suitable application is determined, it creates a global plan to guide execution of the task. Action Selection Agent (ActAgent): once an application is selected, the ActAgent performs specific actions in it, such as clicking buttons or entering text; it uses the application's screenshots and control information to determine the most appropriate next action and translates these actions into real operations on the application's controls through the Control Interaction Module (CIM).
-
Control Interaction Module: a key component that translates the actions recognized by the agents into actual execution in the application, enabling UFO to interact directly with the application's GUI elements (clicking, dragging, text entry, etc.) without human intervention.
-
Multimodal input processing: UFO can process multiple types of input, including text (the user's natural-language commands) and images (application screenshots), allowing it to understand the current GUI state, the available controls, and their properties in order to make accurate operational decisions.
-
User request parsing: when it receives natural-language instructions from the user, UFO first parses them to determine the user's intent and the task to be accomplished, then decomposes the task into a series of subtasks or operational steps executed sequentially by the AppAgent and ActAgent.
-
Seamless switching between applications: if fulfilling a user request requires operating multiple applications, UFO can switch between them seamlessly, deciding when and how to switch via the AppAgent and performing the specific operations in each application via the ActAgent.
-
Mapping of natural-language commands to GUI operations: one of UFO's core functions is to map the user's natural-language commands to specific GUI operations. This involves understanding the intent of the command, recognizing the relevant GUI elements, and generating and executing actions that manipulate those elements. In this way, UFO can automate a wide range of complex tasks, from document editing and information extraction to composing and sending email, greatly improving the efficiency and ease with which users work in Windows.
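As mentioned in the first item above, here is a schematic sketch of the AppAgent / ActAgent division of labor. All functions and data structures here are hypothetical illustrations, not code from the UFO repository.

```python
# Schematic sketch of UFO's dual-agent idea (hypothetical, not the real code).

def vision_llm(prompt: str, screenshot: bytes) -> str:
    return ""  # stand-in for a GPT-Vision-style call

def app_agent(request: str, desktop_screenshot: bytes) -> str:
    """Decide which application to launch/switch to and build a global plan."""
    return vision_llm(f"Which app best serves: {request}?", desktop_screenshot) or "Outlook"

def act_agent(request: str, app: str, app_screenshot: bytes) -> dict:
    """Pick the next concrete action inside the selected application."""
    _ = vision_llm(f"In {app}, next step for: {request}?", app_screenshot)
    return {"control": "New Email button", "operation": "click"}

def control_interaction_module(action: dict) -> None:
    """Translate the chosen action into a real click/keystroke on the GUI control."""
    print(f"{action['operation']} -> {action['control']}")

def ufo_style(request: str, desktop: bytes, app_screen: bytes) -> None:
    app = app_agent(request, desktop)                      # select / switch application
    control_interaction_module(act_agent(request, app, app_screen))  # execute within it

ufo_style("send an email to the team", b"", b"")
```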
CrewAI
git: https://github.com/joaomdmoura/crewAI
site: https://www.crewai.com/
A multi-agent framework based on LangChain.
Crew in CrewAI is a container layer that combines agents, tasks, and processes and is where tasks are actually executed. As a collaborative environment, Crew provides a platform for agents to communicate, cooperate, and execute tasks according to defined processes, helping them collaborate efficiently. Both sequential and hierarchical processes are supported.
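A minimal sketch using CrewAI's Agent / Task / Crew / Process classes. The roles, goals, and tasks are illustrative, an OpenAI-compatible key is assumed in the environment, and exact parameters may differ between versions.

```python
# Minimal CrewAI sketch: two agents, two tasks, executed sequentially by a Crew.
# Roles and goals are illustrative; an OpenAI-compatible key is assumed in the env.
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Collect key facts about agent frameworks",
    backstory="A careful analyst who verifies sources.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short readable summary",
    backstory="A technical writer focused on clarity.",
)

research_task = Task(
    description="List the main differences between single-agent and multi-agent frameworks.",
    expected_output="A bullet list of differences.",
    agent=researcher,
)
writing_task = Task(
    description="Write a 150-word summary based on the research notes.",
    expected_output="A 150-word summary.",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    process=Process.sequential,      # a hierarchical process is also supported
)
print(crew.kickoff())
```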
Benefits of CrewAI
Built on the LangChain ecosystem, CrewAI offers the flexibility of AutoGen's conversational agents and the structured process approach of ChatDev, without the rigidity. CrewAI's processes are designed to be dynamic and adaptable, integrating seamlessly into development and production workflows.
AgentScope
git: https://github.com/modelscope/agentscope/blob/main/README_ZH.md
An open-source Multi-Agent framework from Alibaba; its highlight is support for a distributed setup, along with engineering optimization of the pipeline and monitoring.
Camel
git: https://github.com/camel-ai/camel
site: https://www.camel-ai.org
An early Multi-Agent project implementing one-to-one dialog between agents; documentation beyond the git repo is sparse, and the site does not provide much useful information.
Agent framework summary
Single Agent = Large Language Model (LLM) + Observation (obs) + Thought + Action (act) + Memory (mem)
Multi-Agent = Agents + Environment + SOP + Review + Communication + Cost
Multi-Agent benefits:
-
Analyzing the problem from multiple perspectives: although a single LLM can play many roles, it quickly collapses to one particular perspective under the influence of the system prompt or the first few rounds of conversation.
-
Decomposing complex problems: each sub-agent is responsible for a specific domain, reducing the demands on memory and prompt length.
-
Highly controllable: you can freely choose the desired perspectives and personas;
-
Open-closed principle: extend functionality by adding sub-agents; new features can be added without modifying existing agents.
-
(Possibly) faster problem solving: addresses the concurrency limitations of a single agent;
Disadvantages:
-
Increased cost and time-consumption;
-
More complex interactions, high cost of custom development.
-
For simple problems, a single Agent is already sufficient;
Problems suited to Multi-Agent:
-
Solving complex problems;
-
Generating plots with multiple character interactions.
Multi-Agent is not the end state of agent frameworks. Multi-Agent frameworks are a product of currently limited LLM capabilities: they mostly compensate for the LLM's shortcomings by iterating with the LLM many times to correct obvious errors, and the learning and development cost still differs greatly between frameworks. As LLM capabilities improve, agent frameworks will certainly move in a simpler, easier-to-use direction.
What can be done?
Possible Directions
Game scenarios (NPC dialog, game asset production), content production, private-domain assistants, OS-level agents, and efficiency improvements for parts of existing work.
Multi-Agent Framework
Multi-agent systems should be like the human brain, with a clear division of labor and the ability to work together, e.g., the brain has different areas responsible for vision, taste, touch, walking, balance, and even controlling the movement of limbs.
Referring to MetaGPT and AutoGen, the two Multi-Agent frameworks with the most complete ecosystems, we can start from the following perspectives:
-
Environment & Communication: interaction between agents, message passing, shared memory, execution order, distributed agents, OS-level agents
-
SOP: defining SOPs, orchestrating customized Agents
-
Evaluation: Agent robustness assurance, input/output parsing.
-
Cost: resource allocation between Agents
-
Agent (proxy): customizable agents, programmability, execution with both large and small models
Single Agent Framework
Execution architecture optimization (supported by data in papers):
From CoT to XoT: from one thought per action step to one thought driving multiple actions, and from chained thinking to multi-dimensional thinking;
Long-term memory optimization:
Give the agent personalization capability by simulating the human recall process and adding long-term memory to the agent;
Multimodal capability building:
The agent's observations need not be limited to user-input questions; touch, vision, and perception of the surrounding environment can be added;
Self-thinking ability: actively ask questions and self-optimize.
Others
Deployment: configuration of agents and workflows and exposing them as services; in the longer term, distributed deployment also needs to be considered.
Monitoring: Multi-agent visualization, energy, and cost monitoring.
RAG: solving the problem of semantic isolation
Evaluation: agent evaluation, workflow evaluation, AgentBench.
Training corpus: data labeling and feeding production data back for training
Business Choice: Copilot or Agent, Single Agent or Multi-Agent?
References
-
What is an AI agent: https://www.breezedeus.com/article/ai-agent-part1#33ddb6413e094280aaa4ac82634d01d9
-
What is AI agent part 2: https://www.breezedeus.com/article/ai-agent-part2
-
ReAct: Synergizing Reasoning and Acting in Language Models: https://react-lm.github.io/
-
Plan-and-Execute Agents: https://blog.langchain.dev/planning-agents/
-
LLmCompiler: https://arxiv.org/abs/2312.04511?ref=blog.langchain.dev
-
agent: https://hub.baai.ac.cn/view/27683
-
TaskWeaver Creating Super AI Agents: https://hub.baai.ac.cn/view/34799
-
For a Multi-Agent Framework, CrewAI has its Advantages Compared to AutoGen: https://levelup.gitconnected.com/for-a-multi-agent-framework-crewai-has-its-advantages-compared-to-autogen-a1df3ff66ed3
-
AgentScope: A Flexible yet Robust Multi-Agent Platform: https://arxiv.org/abs/2402.14034
-
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models: https://arxiv.org/abs/2402.14207
-
Autogen's Basic Framework: https://limoncc.com/post/3271c9aecd8f7df1/
-
In-depth analysis by the MetaGPT authors: https://www.bilibili.com/video/BV1Ru411V7XL/?spm_id_from=333.999.0.0&vd_source=b27d8b2549ee8e4b490115503ac81017
-
Agent Product Design: https://mp.weixin.qq.com/s/pbCg1KOXK63U9QY28yXpsw?poc_token=HHAx12Wjjn0BqZd4N-byo0-rjRmpjhjjl6yN6Bdz
-
Building the Future of Responsible AI: A Reference Architecture for Designing Large Language Model-based Agents: https://arxiv.org/abs/2311.13148
-
Multi-Agent Policy Architecture Foundations: https://mp.weixin.qq.com/s?__biz=Mzk0MDU2OTk1Ng==&mid=2247483811&idx=1&sn=f92d1ecdb6f2ddcbc36e70e8ffe5efa2&chksm=c2dee5a8f5a96cbeaa66b8575540a416c80d66f7427f5095999f520a09717fa2906cfccddb59&scene=21#wechat_redirect
-
Learning Manual for Getting Started with MetaGPT Agent Development: https://deepwisdom.feishu.cn/wiki/BfS0wmk4piMXXIkHvn5czNT8nuh