What Is an AI Agent? A Beginner's Guide

An agent is a product of modern technology that can learn, reason, make decisions, and act, and it draws on theories from several disciplines. Its development was influenced by the ideas of philosophers such as Aristotle, and the theories of von Neumann, Wiener, and Shannon provide its foundation. Agents act independently and combine several qualities, such as learning, action, autonomy, and responsiveness. Their components include units for observation and perception, memory and retrieval, reasoning and planning, and action execution, which work in concert to achieve goals. With the development of large language model technology, agents are being widely applied and are expected to reach a high degree of autonomy in the future, bringing more convenience to human life.

In this article, we introduce the origin, definition, characteristics, constituent units, and applications of agents. We also discuss some open-source agent frameworks suitable for developers.

Why Did Agents Emerge?

With the arrival of the large language model era, artificial intelligence has become a hot topic. A variety of platforms built on large language models are emerging, offering users basic dialogue, image generation, and video or voice generation, and greatly improving the efficiency of people's work.

However, most people turn to these platforms only when they run into an isolated problem or a small, fragmentary task, and rely on the reasoning ability of the large language model to handle it. The public's exposure to so-called artificial intelligence may be largely limited to this, for example looking up a piece of knowledge or writing a single function. It is difficult to use these platforms for systematic work that so far only humans can complete, such as developing a piece of software or publishing a book. Achieving such a goal involves many more steps and cannot be accomplished with a few large language model conversations. Even when humans do this work, it is often not done by one person.

For example, if we want to produce a research report, the process generally looks like this:

To address this, the people researching artificial intelligence introduced a concept called the agent: something that acts on a person's behalf to carry out work in a specific domain, freeing people from the tedious, repetitive parts of a task.

Definition of an Agent

Academia and industry have proposed various definitions of the term "agent". Roughly speaking, an agent should have human-like thinking and planning capabilities, possess memory and even emotions, and have the skills to interact with its environment, other agents, and humans in order to accomplish specific tasks.

In the view taken in this guide, an agent can be pictured as a digital human in an environment, where

Agent = Large Language Model (LLM) + Observation + Thinking + Action + Memory

This formula summarizes the functional nature of an agent. To understand each component, let's draw an analogy with humans:

Large Language Model (LLM): the LLM serves as the "brain" of the agent, enabling it to process information, learn from interactions, make decisions, and perform actions.
Observation and perception: this is the agent's perceptual mechanism, which lets it sense its environment. An agent may receive a range of signals, such as a text message from another agent, visual data from a surveillance camera, or audio from a customer service recording. These observations form the basis for all subsequent actions.
Reasoning and thinking: thinking involves analyzing observations and memory content and considering possible actions. This is the decision-making process inside the agent, which may be driven by the LLM.
Action execution: these are the agent's explicit responses to its thinking and observations. An action can be code generated with the LLM or a manually predefined operation such as reading a local file. Agents can also perform actions that use tools, such as searching the Internet for the weather or using a calculator for arithmetic.
Memory and retrieval: the agent's memory stores past experiences. This is crucial for learning, because it lets the agent refer to previous results and adjust future actions accordingly.
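The formula above can be sketched as a small observe-think-act loop. This is a minimal illustration, not the implementation of any particular framework: the Agent class, its method names, and the echo_llm stand-in for a real model call are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call: replies to the last prompt line."""
    return "reply to: " + prompt.splitlines()[-1]


@dataclass
class Agent:
    llm: Callable[[str], str]                        # the "brain"
    memory: List[str] = field(default_factory=list)  # past experiences

    def observe(self, message: str) -> str:
        # Perception: record what was seen so later steps can refer to it.
        self.memory.append("observed: " + message)
        return message

    def think(self, observation: str) -> str:
        # Thinking: combine recent memory with the new observation in one prompt.
        prompt = "\n".join(self.memory[-5:] + [observation])
        return self.llm(prompt)

    def act(self, decision: str) -> str:
        # Action: here the action is simply emitting text; it is also remembered.
        self.memory.append("acted: " + decision)
        return decision

    def step(self, message: str) -> str:
        return self.act(self.think(self.observe(message)))


agent = Agent(llm=echo_llm)
print(agent.step("hello"))
```

Passing the LLM in as a plain callable keeps the sketch independent of any vendor API; a real agent would replace echo_llm with an actual model call and add tool use inside act.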

Characteristics of Agents

Agents place special emphasis on the ability to learn and the ability to act. The ability to learn is a key indicator of intelligence and one of the basic requirements for an agent. Learning agents are built to operate in initially unknown environments; by interacting with the environment they keep learning and iterating, so that their knowledge, skills, and decision-making improve. The ability to act, in turn, lets agents interact with the environment through perception, decision-making, and action, and reach a given goal through continued iteration. In the field of artificial intelligence, an agent can be any program or system that is autonomous and intelligent and that can perceive, plan, make decisions, and perform related tasks. Agents can be used to solve problems in areas such as autonomous driving, natural language processing, and gaming. They can be virtual, such as software programs, or physical, such as robots.

In addition to the learning and action abilities emphasized above, agents have other characteristics. For example, Wooldridge (1994) mentions that the characteristics of agents include:

Autonomy: agents can operate independently without external intervention and have a degree of control over their own behavior and state.
Responsiveness: agents can sense changes in their environment and respond quickly, whether that environment is the physical world, a user interface, other agents, or the Internet.
Proactiveness: agents are not limited to reacting passively to changes in their environment; they can also take the initiative and exhibit goal-directed behavior.
Sociality: agents can communicate and interact with other agents or with humans through specific means of communication.

With the advancement of AI technologies such as large language models (LLMs), expectations of agents are rising. Modern agents are expected not only to have the characteristics above, but also to reach a higher level of autonomy and to learn and perform tasks independently. In addition, agents are viewed as systems designed and built by humans that are endowed with human traits such as knowledge, beliefs, rationality, intentions, and responsibility.

To explore agents more systematically, we first define them from a narrower perspective: an agent is a computer system that can autonomously achieve a predefined goal within a certain scope. Borrowing from the level-based grading of self-driving cars, we can build agents step by step from basic to advanced, reaching progressively higher levels of autonomy. Managing agents by level in this way both helps contain potential risks and helps maximize their practical value.

Working Environment of an Agent

1. Environment

An agent cannot be separated from its working environment. The environment is the set of external factors that the agent must adapt to and influence; it is the object the agent controls. The agent senses the environment through sensors and acts on it through actuators, so that observation and perception, planning and decision-making, and action execution form a closed feedback loop with the environment in pursuit of the agent's predefined goals. In such a system, the environment is both the source of the agent's knowledge and the object of its actions: the agent observes the state of the environment, combines it with its built-in knowledge to form a strategy, acts on the environment to optimize its utility or benefit, and reaches its goal through continued iteration. The working environment of an agent can be either physical or virtual.

For example, the working environment of a self-driving agent includes roads, other vehicles, police, pedestrians, passengers, and weather. These factors play an important role in traffic management and safety.

2. Characteristics of the environment

In practical applications, an agent's working environment is often not static, and it also carries a certain degree of uncertainty.

Uncertainty (Nondeterminism)

If the next state of the environment is completely determined by the current state and the action performed by the agent, we call the environment deterministic; otherwise, it is nondeterministic. Most real-world situations are so complex that an agent cannot keep track of every aspect of the environment it fails to observe, so for practical purposes they are treated as nondeterministic. There are too many unknown variables and factors to predict the next state accurately, and treating the environment as nondeterministic is the usual way to account for that uncertainty.

If a model of the environment deals with uncertainty explicitly in terms of probabilities (e.g., "There is a 25% probability of rain tomorrow"), the model is stochastic, whereas if the possibilities are only listed without being quantified (e.g., "There is a chance of rain tomorrow"), the model is nondeterministic.
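The three kinds of model can be contrasted with a toy one-dimensional world in which an agent tries to move along a line. The function names and the 25% slip probability are assumptions chosen for illustration only.

```python
import random
from typing import Set


def deterministic_step(position: int, action: int) -> int:
    # Deterministic: the next state follows from the state and the action alone.
    return position + action


def possible_next_states(position: int, action: int) -> Set[int]:
    # Nondeterministic model: outcomes are listed, but not quantified.
    return {position, position + action}


def stochastic_step(position: int, action: int, rng: random.Random,
                    slip_probability: float = 0.25) -> int:
    # Stochastic model: the same outcomes, now with explicit probabilities.
    if rng.random() < slip_probability:
        return position  # the move "slips" and the agent stays put
    return position + action


rng = random.Random(0)  # seeded so repeated runs give the same sequence
print(deterministic_step(2, 1))
print(possible_next_states(2, 1))
print(stochastic_step(2, 1, rng))
```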

Dynamics

If the state of the environment can change while the agent is planning, the environment is dynamic; otherwise, it is static. Static environments are easier to deal with, because the agent does not have to keep observing the world or worry about the passage of time while deciding what to do, but most real-world applications have dynamic environments. The environment of a self-driving agent is clearly dynamic.

An agent coping with a dynamic or uncertain environment may need constant feedback on the environment's state. By inferring that state and how it is changing, the agent can make timely, appropriate decisions and act on the environment in response.

Key Component Units of an Agent

An agent itself comprises a sensor, a memory, a planner, and an actuator.

1. Observation and Sensing

The agent observes its environment through its perception unit to determine the state of the environment and how it changes, which serves as the source of information for planning and learning. Agents obtain information about the state of the environment through various sensing devices, and their perception space includes information from multiple modalities, such as text, sound, sight, touch, and smell. The agent uses its percept history, i.e., the complete record of everything it has ever perceived about the environment, together with its built-in knowledge, to form its action decisions through careful planning.

Ideally the environment is fully observable, i.e., the agent's sensors can observe the complete state of the environment at every point in time. In practice, a task environment can effectively be treated as observable if the sensors detect all aspects relevant to the choice of action. Observable environments are convenient because the agent can decide how to act by observing the complete state of the environment, without needing additional information or an internal state. This makes task execution and decision-making simpler and more direct.

If an environment is unobservable or only partially observable, because the sensed state may be noisy and inaccurate or because some parts of the state are missing from the sensor data altogether, the agent may need to draw on additional information or maintain an internal state in order to understand the environment.

2. Memory and Retrieval

Agents generate action strategies based on relevant information, including built-in knowledge and historical memory. To remember and retrieve this knowledge and experience, agents need effective storage mechanisms:

2.1 Built-in knowledge:

Agents often have certain built-in knowledge suited to their application scenarios, mainly of the following types:

Language: if the agent communicates in natural language, linguistic knowledge defines the grammar, which covers many aspects of the language such as morphology, syntax, semantics, and pragmatics. Only with linguistic knowledge can agents understand and take part in conversation. In addition, contemporary language models let agents acquire knowledge of multiple languages, which removes the need for separate translation.

Common sense: common-sense knowledge usually refers to general facts about the world that most humans know. For example, it is commonly known that medicines are used to treat diseases and umbrellas are used to keep off rain. This information may not be mentioned explicitly in the context in which the agent is communicating. Without the appropriate common sense, the agent may make wrong decisions, such as not carrying an umbrella on a rainy day.

Domain: specialized domain knowledge is knowledge tied to specific application domains and scenarios, such as mathematics, chemistry, medicine, programming, law, finance, industry, personnel, and sales. Agents need a certain amount of domain knowledge to solve problems effectively within a specific domain. For example, an agent designed for programming tasks needs programming knowledge, such as the code format of a programming language. Similarly, an agent used for diagnosis should have appropriate medical knowledge, such as the names of particular diseases and the medications prescribed for them.

This knowledge may be stored as parameters in a model, or processed and stored in a knowledge base for easy retrieval when needed. Knowledge can also be stored in memory in a variety of forms, such as natural language text, embeddings, and databases, each with its own advantages. For example, natural language retains comprehensive semantic information that is easy to apply in reasoning, while embeddings can make reading from memory more efficient.
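As a rough illustration of retrieval from a knowledge base, the sketch below ranks stored text by similarity to a query. The embed function here is a deliberately crude word-count vector, used only as an assumption so that the example is self-contained; a real system would use embeddings from a learned model. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (a stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, knowledge: List[str], top_k: int = 1) -> List[str]:
    # Return the top_k stored entries most similar to the query.
    q = embed(query)
    ranked = sorted(knowledge, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]


knowledge_base = [
    "umbrellas protect against rain",
    "medicines treat diseases",
]
print(retrieve("what protects against rain", knowledge_base))
```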

2.2 Historical Memory:

In an agent system, the historical memory stores the sequence of the agent's past observations, thoughts, and actions. Besides thinking and learning from its built-in knowledge and historical memory, an agent can also explore the environment, for example by acting on it and observing the response, to obtain information about its state. Agents rely on memory mechanisms to access prior experience so that they can formulate action strategies and decisions effectively. When faced with similar problems, memory helps agents reuse earlier strategies, and it also lets them draw on past experience to adapt to unfamiliar environments. Memory mechanisms must also address the following issues:

Length of records: agents based on language models interact with the model in natural language, attaching the historical records to each input. As these records grow, they may exceed the limits of the model architecture, such as the maximum input length.

Retrieval of memories: as agents accumulate large amounts of data from their history of observations and actions, the burden of remembering keeps growing. It becomes increasingly hard to retrieve the memories relevant to the current topic, which can lead to responses that are inconsistent with the current context.
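One simple way to handle the record-length problem is to keep only the most recent records that fit within a budget. The sketch below is one possible approach among many (summarizing or retrieving selectively are alternatives); the function name is hypothetical, and counting whitespace-separated words is an assumption standing in for a real tokenizer.

```python
from typing import Callable, List


def trim_history(records: List[str], max_tokens: int,
                 count_tokens: Callable[[str], int] = lambda s: len(s.split())
                 ) -> List[str]:
    """Keep the newest records whose combined size fits within max_tokens."""
    kept: List[str] = []
    used = 0
    # Walk backwards from the newest record and stop at the first one that
    # would push the total over the budget.
    for record in reversed(records):
        cost = count_tokens(record)
        if used + cost > max_tokens:
            break
        kept.append(record)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["user asked about rain", "agent searched weather", "agent replied"]
print(trim_history(history, max_tokens=5))
```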

3. Reasoning and Planning

Goal-based reasoning and planning are fundamental manifestations of intelligence, and they help agents analyze and solve complex problems. Reasoning is based on evidence and logic, and is the cornerstone of analyzing and solving problems and of making rational decisions. The main forms of human reasoning are deduction, induction, and abduction. Reasoning is essential for agents to handle complex tasks. Planning, on the other hand, helps agents develop strategies for complex challenges; it gives them a structured process for organizing their thoughts, setting goals, and identifying the steps to reach those goals. The core of an agent's planning ability is its ability to reason: through reasoning, an agent breaks a complex task down into more manageable subtasks and develops an appropriate plan for completing each one.

Reasoning and planning also let agents learn and accumulate knowledge and experience. An agent's initial configuration and plan may reflect some prior knowledge of the environment, but as the task proceeds, the agent gains more knowledge and experience through reasoning and planning, so that what it previously held can be revised and extended. Through reasoning and planning, the agent can also use introspection to refine its action strategies and plans so that they better match the reality of the environment, thereby adapting to the environment, performing tasks more efficiently, and reaching its goals.

4. Action and Execution

The action execution unit turns the agent's decisions into concrete actions and carries them out on the environment, influencing its future state in order to achieve the agent's goals. Typically, an agent's choice of action at any given moment can depend on its built-in knowledge and on the entire sequence of percepts it has observed so far, but not on anything it has not yet perceived. In other words, the agent acts on the information and experience it already has, not on unknown factors. An agent can act through its own output, such as the text produced by a language model, but it often needs to extend its action space with external capabilities, such as embodied action or external tools, in order to respond better to changes in the environment, provide feedback, and change or shape the environment. An agent's behavior is specified by its choice of action for every possible percept sequence. Mathematically, this behavior is described by an action function that maps any given percept sequence to an action.
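The idea of an action function over percept sequences can be made concrete with a table lookup. This is only a sketch to illustrate the mapping; the percept and action names are invented, and a practical agent would compute its actions rather than enumerate every sequence in a table.

```python
from typing import Callable, Dict, List, Tuple

Percept = str
Action = str


def make_table_driven_agent(table: Dict[Tuple[Percept, ...], Action],
                            default: Action) -> Callable[[Percept], Action]:
    """Build an agent whose action depends on the whole percept sequence so far."""
    percepts: List[Percept] = []

    def agent(percept: Percept) -> Action:
        percepts.append(percept)
        # The key is the full history, not just the latest percept.
        return table.get(tuple(percepts), default)

    return agent


# Hypothetical table for a very small driving scenario.
table = {
    ("clear",): "drive",
    ("clear", "obstacle"): "brake",
}
agent = make_table_driven_agent(table, default="wait")
print(agent("clear"), agent("obstacle"), agent("clear"))
```

The table grows with every possible history, which is why real agents replace it with reasoning, planning, or a learned model.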

Multi-Agent Systems

If humans were to accomplish the kind of systematic work described above, they would most likely proceed as follows:

Set up a project for the work;
Set the project goals and break the project down into tasks;
For each task, find a person suited to doing it;
Each person who receives a task assesses its size and, if needed, breaks it down further before carrying out the work;
Where necessary, people working on related tasks interact and communicate with each other to keep their progress in sync.

Since an agent can be brought in to solve a single problem, the multiple problems that result from breaking down a project can be handed to multiple agents, as long as they can work together.

A multi-agent system can be viewed as a society of agents, in which

Multi-Agent System = Agents + Environment + Standard Operating Procedures (SOPs) + Communication + Economy

Each of these components plays an important role:

Agents: following the individual definition above, the agents in a multi-agent system work together, each with its own LLM, observation, thinking, action, and memory.
Environment: the environment is the shared space where agents live and interact. Agents observe important information from the environment and publish the output of their actions for other agents to use.
Standard Operating Procedures (SOPs): these are established procedures that govern the actions and interactions of agents, ensuring orderly and efficient operation within the system. For example, in an automotive manufacturing SOP, one agent welds car parts while another installs cables, keeping the assembly line organized.
Communication: communication is the exchange of information between agents. It is essential for collaboration, negotiation, and competition within the system.
Economy: this refers to the system of value exchange in a multi-agent environment, which determines resource allocation and task prioritization.

An Example

This is a simple example showing how agents work:

In the environment, there are three agents, Alice, Bob, and Charlie, that interact with each other.
They can post messages or the outputs of their actions to the environment, where the other agents can observe them.
The following describes the internal process of the agent Charlie; it applies equally to Alice and Bob.
Internally, Charlie has some of the components described above, such as the LLM, observation, thinking, and action. Charlie's thinking and acting can be driven by the LLM, and Charlie can also use tools while acting.
Charlie observes the relevant documents from Alice and the requirements from Bob, retrieves helpful memories, thinks about how to write the code, performs the action of writing the code, and finally publishes the result.
Charlie notifies Bob of the result by posting it to the environment, and Bob responds with a compliment after receiving it.
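The Alice, Bob, and Charlie exchange can be sketched with a shared environment that agents publish to and observe from. The Environment and Message classes and the charlie_step function are hypothetical names for illustration, and the code-writing step is passed in as a plain function where a real system would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    sender: str
    content: str


@dataclass
class Environment:
    messages: List[Message] = field(default_factory=list)

    def publish(self, sender: str, content: str) -> None:
        # Anything an agent publishes becomes visible to the other agents.
        self.messages.append(Message(sender, content))

    def observe(self, reader: str) -> List[Message]:
        # An agent sees every message except its own.
        return [m for m in self.messages if m.sender != reader]


def charlie_step(env: Environment,
                 write_code: Callable[[List[Message]], str]) -> str:
    inputs = env.observe("Charlie")   # documents from Alice, requirements from Bob
    result = write_code(inputs)       # think and act (normally driven by an LLM)
    env.publish("Charlie", result)    # post the result for others to observe
    return result


env = Environment()
env.publish("Alice", "design doc")
env.publish("Bob", "requirements")
charlie_step(env, lambda msgs: "code for " + " + ".join(m.content for m in msgs))
env.publish("Bob", "nice work, Charlie")
print([f"{m.sender}: {m.content}" for m in env.messages])
```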

Agent Applications

1. Researcher: searching the web and summarizing a report

The research function mentioned earlier is carried out by the Researcher role. Given a user's research question, it searches the Internet, summarizes what it finds, and finally generates a report. Here we introduce the Researcher role in terms of its design ideas, code implementation, and usage examples.

Design Ideas

Before using MetaGPT to develop the Researcher role, we need to think about how we, as a researcher, would search the Internet and produce a research report. Generally it involves the following steps:

Analyze the problem to be researched and break it down into several sub-problems that can be searched with a search engine.
Search for each sub-problem with the search engine. The engine returns a number of results, each with a title, the original URL, a summary, and other information. Judge whether each result is relevant to the problem and whether its source is reliable, and decide whether to open the page through its URL.
Open the web page and browse further, determine whether its content helps with the problem being researched, and extract and record the relevant information.
Aggregate all the recorded information and write a report on the problem being studied.

Therefore, we try to let GPT simulate the research process above. The overall steps are as follows:

The user inputs a research question.
The researcher uses GPT to generate a set of research questions that together form an objective opinion on the given task.
The researcher receives the questions produced by GPT and, for each one, first searches with a search engine to obtain the initial search results.
The content of each web page is fetched from its URL through a browser and then summarized.
All summarized content is aggregated and its sources are tracked.
Finally, GPT generates the final research report based on the summarized content.
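The steps above can be sketched as a small pipeline. This is not MetaGPT's Researcher implementation; it is a minimal outline under the assumption that question decomposition, web search, page summarization, and report writing are supplied as functions. Every name below is hypothetical, and the stubs only fabricate placeholder strings so the flow can be run without network or LLM access.

```python
from typing import Callable, Dict, List


def research(topic: str,
             decompose: Callable[[str], List[str]],
             search: Callable[[str], List[str]],
             summarize: Callable[[str, str], str],
             write_report: Callable[[str, Dict[str, str]], str]) -> str:
    summaries: Dict[str, str] = {}           # URL -> summary, so sources are tracked
    for question in decompose(topic):        # break the topic into sub-questions
        for url in search(question):         # initial search results
            if url not in summaries:         # avoid summarizing the same page twice
                summaries[url] = summarize(url, question)
    return write_report(topic, summaries)    # final report from aggregated summaries


# Stub implementations standing in for LLM and search-engine calls.
def stub_decompose(topic: str) -> List[str]:
    return [topic + " basics", topic + " tools"]

def stub_search(question: str) -> List[str]:
    return ["https://example.com/" + question.replace(" ", "-")]

def stub_summarize(url: str, question: str) -> str:
    return "summary of " + question

def stub_report(topic: str, summaries: Dict[str, str]) -> str:
    return topic + ": " + "; ".join(f"{text} [{url}]" for url, text in summaries.items())


print(research("agents", stub_decompose, stub_search, stub_summarize, stub_report))
```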

2. Tutorial Assistant: Generate technical tutorials

Generates a technical tutorial from a single sentence of input.

Design Ideas

The LLM first generates an outline (table of contents) for the tutorial. The outline is then split into chunks by second-level heading, detailed content is generated for each chunk according to its headings, and finally the headings and content are joined together. Chunking works around the LLM's limit on the length of text it can generate at once.
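A minimal sketch of this chunked generation follows, assuming the outline and section generators are supplied as functions that would wrap LLM calls in practice. The function names and the stub generators are hypothetical.

```python
from typing import Callable, List


def build_tutorial(topic: str,
                   generate_outline: Callable[[str], List[str]],
                   generate_section: Callable[[str, str], str]) -> str:
    parts = [topic]
    # One generation call per heading keeps each request short enough for the
    # model, instead of asking for the whole tutorial in a single response.
    for heading in generate_outline(topic):
        parts.append(heading)
        parts.append(generate_section(topic, heading))
    return "\n\n".join(parts)  # splice headings and content together


# Stub generators standing in for LLM calls.
def stub_outline(topic: str) -> List[str]:
    return ["1. Install", "2. Use"]

def stub_section(topic: str, heading: str) -> str:
    return f"Details for {heading}"


print(build_tutorial("Git", stub_outline, stub_section))
```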

3. Receipt Assistant: Extracting Structured Information from Receipts

Supports OCR recognition of invoice files in pdf, png, jpg, and zip formats, and generates a csv file with the payee, city, total amount, and invoice date. For a pdf, png, or jpg invoice, i.e., a single-file invoice, you can also ask questions about the invoice content. Generating results in multiple languages is also supported.

Design Ideas

For pdf, png, and jpg invoice files, the open-source PaddleOCR API performs OCR on the file, the OCR output is passed to the LLM to extract the main information and write it to a table, and finally the LLM can be asked questions about the invoice content.

For zip invoice files, the archive is first unzipped to a specified directory, then the pdf, png, and jpg invoice files inside are traversed recursively and run through OCR, and the OCR output is passed to the LLM to extract the main information and write it to the same table. Asking questions about the content is not supported for multiple files.
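The file-handling part of this design can be sketched with the Python standard library. The OCR step and the LLM extraction step are passed in as functions, because the exact PaddleOCR and LLM calls are not shown in this article; all function names here are hypothetical.

```python
import zipfile
from pathlib import Path
from typing import Callable, Dict, List

INVOICE_SUFFIXES = {".pdf", ".png", ".jpg"}


def collect_invoice_files(path: Path, extract_dir: Path) -> List[Path]:
    """Return the invoice files for a single file or a zip archive."""
    if path.suffix.lower() == ".zip":
        # Unzip to the given directory, then walk it recursively.
        with zipfile.ZipFile(path) as archive:
            archive.extractall(extract_dir)
        return sorted(p for p in extract_dir.rglob("*")
                      if p.is_file() and p.suffix.lower() in INVOICE_SUFFIXES)
    return [path] if path.suffix.lower() in INVOICE_SUFFIXES else []


def process_invoices(path: Path, extract_dir: Path,
                     run_ocr: Callable[[Path], str],
                     extract_fields: Callable[[str], Dict[str, str]]
                     ) -> List[Dict[str, str]]:
    # run_ocr stands in for the OCR engine; extract_fields stands in for the
    # LLM call that pulls payee, city, total amount, and date out of the text.
    return [extract_fields(run_ocr(f)) for f in collect_invoice_files(path, extract_dir)]
```

Each returned dictionary corresponds to one row of the output table, which could then be written out with the standard csv module.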

Popular Open-Source Frameworks for Developers

At present, many companies have open-sourced their agent implementations, mainly for developers. The author has also translated several popular ones; most of the open-source frameworks are covered in "19 types of Agent (intelligent body) framework comparison".

Honestly, the concept of the agent is very good, and in the ideal case agents could take over a great deal of real work. In practice, however, current artificial intelligence technology is still some distance from agents that truly think for themselves, plan, make decisions, and carry them out. Looking through the source code, the frameworks basically follow the definition of an agent and implement abstractions for its various methods (observation, thinking, memory, action, and so on), but the concrete implementations do not reach real autonomy. The so-called memory is just a region of storage that the software sets aside for records of past events; it is not fused into problem solving the way a person's memory is. What open-source frameworks implement today is still largely an exhaustive enumeration of parameters and environmental conditions, so as soon as some factor changes (the machine it runs on, the large language model it calls, the network, and so on), an example that used to run may fail.

For example, Devin, promoted in Silicon Valley as the first digital programmer, was accused of being fake by a well-known blogger. In fact it is not fake; rather, its perception and decision-making are strongly affected by environmental parameters, so it cannot achieve real intelligence. I have not had the chance to read its source code, but its decision-making and execution most likely also rely on enumerating cases exhaustively.

That said, for now, work with simple rules (where most of the environments and processes can be enumerated) can indeed be handled by agent programs, and where fault tolerance is good, adding human intervention can achieve even better results. It is worth a try. The author has already started, with some initial results, and expects agents to soon take over a lot of tedious manual work.