Google AI Agent White Paper: What Is an Agent?

Google's AI agent white paper demystifies agents and explores how generative AI models can be extended beyond their standalone capabilities.
Core content:
1. The concept of an agent and its similarity to human tool use
2. The agent's ability to plan and execute tasks autonomously
3. An analysis of the core components of the agent's cognitive architecture
This combination of reasoning, logic, and access to external information, all connected to a generative AI model, leads to the concept of an "agent."
Introduction
Humans excel at complex pattern recognition, but they often supplement their existing knowledge with tools such as books, Google Search, or calculators before drawing conclusions. Like humans, generative AI models can be trained to use tools to obtain real-time information or to trigger real-world actions. For example, a model can use a database search tool to look up a customer's purchase history and generate personalized shopping recommendations. Or, following a user's instructions, it can call several APIs to send an email reply to a colleague or complete a financial transaction on the user's behalf.
To do this, the model needs more than a connection to external tools: it must also be able to plan and execute tasks autonomously. This combination of reasoning, logic, and access to external information, all connected to a generative AI model, leads to the concept of an agent: a program that extends beyond the standalone capabilities of a generative AI model.
The white paper explores these and related topics in more detail.
What is an Agent?
In its most basic form, a generative AI agent can be defined as an application that attempts to achieve a goal by observing its environment and acting on it with the tools at its disposal. Agents are autonomous: they can operate without human intervention, especially when given a clear goal. Agents can also be proactive in pursuing their goals. Even without explicit instructions from a human, an agent can reason about what to do next to complete its task.
Although the concept of an "agent" in AI is broad and powerful, this white paper focuses on the specific types of agents that current generative AI models can build.
To understand the inner workings of an agent, we must first understand the core components that drive its behavior, actions, and decisions. The combination of these components can be described as a cognitive architecture, and many such architectures can be built by mixing and matching components. Focusing on core functions, an agent's cognitive architecture consists of the following three basic components (as shown in the figure):
Next, let's look at the three major parts: the model, the tools, and the orchestration layer.
Model
In an agent's architecture, "model" refers to the language model (LM) that serves as the central decision maker. The model can be a single language model or several, of any size (small or large), and must be able to follow instruction-based reasoning and logic frameworks such as ReAct (reasoning and acting), Chain-of-Thought, or Tree-of-Thoughts.
Models can be general-purpose, multimodal, or fine-tuned, depending on the needs of the agent architecture. For the best production results, choose the model that best fits the target application and, ideally, one that has been trained on data signatures associated with the tools you plan to use.
Note: The model is usually not trained with the agent's specific configuration (tool choices, orchestration and reasoning setup). However, it can be further refined for the agent's tasks by providing examples, such as examples that show the agent using specific tools or performing reasoning steps in different scenarios.
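As a rough illustration of the note above, the following minimal sketch shows one way example-based guidance could be assembled into a prompt. The `ToolExample` class, the `build_prompt` function, the `get_weather` tool name, and the Question/Thought/Action layout are all assumptions made for illustration; they are not taken from the white paper or from any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolExample:
    """One worked example showing the agent choosing a tool (hypothetical format)."""
    question: str
    thought: str
    tool: str
    tool_input: str


def build_prompt(instructions: str, examples: List[ToolExample], question: str) -> str:
    """Concatenate instructions, worked examples, and the new question into one prompt."""
    parts = [instructions.strip(), ""]
    for ex in examples:
        parts += [
            f"Question: {ex.question}",
            f"Thought: {ex.thought}",
            f"Action: {ex.tool}",
            f"Action Input: {ex.tool_input}",
            "",  # blank line between examples
        ]
    # End with an open "Thought:" so the model continues the reasoning step.
    parts += [f"Question: {question}", "Thought:"]
    return "\n".join(parts)


examples = [
    ToolExample(
        question="What is the weather in Paris?",
        thought="I need current weather data, so I should call the weather tool.",
        tool="get_weather",  # hypothetical tool name
        tool_input="Paris",
    )
]
prompt = build_prompt("Use a tool when you need outside information.", examples, "Is it raining in Oslo?")
print(prompt)
```

The resulting string would be sent to whichever language model the agent uses; the model call itself is deliberately left out because it depends on the chosen provider.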
Tools
Although foundation models perform impressively at text and image generation, they remain inherently limited by their inability to interact with the outside world. Tools bridge this gap in three ways:
1. Expanding the boundaries of action:
Tools let agents interact with external data and services, taking them beyond the inherent limits of the underlying model.
2. Diverse forms and complexity:
Tools range from simple to complex implementations, but they typically align with common web API methods such as GET (data retrieval), POST (data submission), PATCH (partial update), and DELETE (data deletion).
Examples: updating customer information in a database, or fetching weather data to improve the travel recommendations the agent provides.
3. Supporting advanced systems:
Tools give agents access to real-time information, which enables techniques such as Retrieval Augmented Generation (RAG) and significantly raises the ceiling on what an agent can do.
Core value: Tools serve as a bridge that connects the agent's internal capabilities to the external world and unlocks a wider range of possibilities (see below for more details).
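To make the mapping between tools and web API methods concrete, here is a minimal sketch of a tool registry. The `Tool` class, the tool names, and the `api.example.com` endpoints are hypothetical and exist only for illustration. The code uses Python's standard `urllib.request.Request` to build a request object but never sends it.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import quote
from urllib.request import Request


@dataclass(frozen=True)
class Tool:
    """Describes one tool the agent may call (illustrative structure)."""
    name: str
    description: str   # text the model could read when choosing a tool
    method: str        # GET, POST, PATCH, or DELETE
    url_template: str  # hypothetical endpoint with {placeholders}

    def build_request(self, path_args: Dict[str, str], body: Optional[dict] = None) -> Request:
        """Build, but do not send, the HTTP request for this tool call."""
        # Percent-encode path arguments so values such as "New York" stay valid in a URL.
        safe_args = {key: quote(str(value), safe="") for key, value in path_args.items()}
        url = self.url_template.format(**safe_args)
        data = json.dumps(body).encode("utf-8") if body is not None else None
        headers = {"Content-Type": "application/json"} if data is not None else {}
        return Request(url, data=data, headers=headers, method=self.method)


# Hypothetical registry covering the two examples from the text.
TOOLS = {
    "get_weather": Tool(
        "get_weather", "Fetch current weather for a city.",
        "GET", "https://api.example.com/weather/{city}",
    ),
    "update_customer": Tool(
        "update_customer", "Partially update a customer record.",
        "PATCH", "https://api.example.com/customers/{customer_id}",
    ),
}

weather_req = TOOLS["get_weather"].build_request({"city": "New York"})
update_req = TOOLS["update_customer"].build_request(
    {"customer_id": "42"}, {"email": "new@example.com"}
)
print(weather_req.get_method(), weather_req.full_url)
print(update_req.get_method(), update_req.full_url)
```

Keeping the description alongside the HTTP details is one possible design: the model only needs the name and description to pick a tool, while the surrounding program handles the actual request.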
Orchestration Layer
The orchestration layer defines the cyclical process by which the agent operates. Its core mechanism is: receiving information → internal reasoning → action decision. The loop repeats until the goal is achieved or a termination condition is triggered. Its complexity varies significantly with the type of agent and the nature of the task:
Simple scenarios:
Calculations and decisions based on preset rules (such as "if the inventory level < threshold, trigger the replenishment API").
Complex scenarios:
Chained logic (multi-step task dependencies)
Integrated machine learning algorithms (such as dynamic prioritization)
Probabilistic reasoning techniques (handling uncertainty)
Key features:
Dynamic: adjusts its action strategy based on real-time feedback (such as tool results or changes in the environment).
Goal-oriented: always treats the preset goal as the end point and chooses the path to it flexibly (such as taking an indirect route through intermediate sub-tasks).
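The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the white paper's implementation: `run_agent` and its four callbacks are hypothetical names, the decision step is the simple inventory rule from the text rather than a language model call, and the restock quantity of 5 is made up.

```python
from typing import Callable, List, Optional


def run_agent(
    observe: Callable[[], int],
    decide: Callable[[int], Optional[str]],
    act: Callable[[str], None],
    goal_reached: Callable[[int], bool],
    max_steps: int = 10,
) -> List[str]:
    """Repeat observe -> reason -> act until the goal is met or a stop condition fires."""
    history: List[str] = []
    for _ in range(max_steps):        # termination condition: step budget
        observation = observe()       # receive information
        if goal_reached(observation): # termination condition: goal achieved
            break
        action = decide(observation)  # internal reasoning
        if action is None:            # nothing useful left to do
            break
        act(action)                   # action decision is carried out
        history.append(action)
    return history


# Simple scenario from the text: restock while inventory is below a threshold.
THRESHOLD = 10
inventory = {"stock": 3}


def observe() -> int:
    return inventory["stock"]


def decide(stock: int) -> Optional[str]:
    return "restock" if stock < THRESHOLD else None


def act(action: str) -> None:
    if action == "restock":
        inventory["stock"] += 5  # stand-in for calling a replenishment API


def goal_reached(stock: int) -> bool:
    return stock >= THRESHOLD


history = run_agent(observe, decide, act, goal_reached)
print(history, inventory["stock"])
```

In a more complex agent, `decide` would be replaced by a call to the language model that reasons over the observation and the available tools, while the surrounding loop and its stop conditions could stay largely the same.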
The specific implementation details of the orchestration layer in the cognitive architecture will be discussed in depth in the subsequent "Cognitive Architecture" chapter.
Agents vs. Models
To gain a clearer understanding of the distinction between agents and models, consider the following chart: