Google AI Agent White Paper (2) - Cognitive Architecture

Explore how the cognitive architecture of Google's AI agents mirrors the human decision-making process.
Core content:
1. The cognitive architecture model through which intelligent agents achieve their goals
2. Core agent functions: memory management, reasoning and planning, and environment interaction
3. Mainstream prompt engineering frameworks, with example applications
Imagine a chef in a busy kitchen whose goal is to cook delicious dishes for customers. The process follows a cycle of information collection → planning → execution → adjustment:
Information collection
Take customer orders and check the food inventory in the pantry and refrigerator.
Internal reasoning and planning
Based on available resources (such as ingredient types and quantities), work out achievable dishes and flavor combinations.
Action execution
Carry out concrete operations: cutting vegetables, mixing spices, frying meat.
Dynamic adjustment
Revise the plan based on real-time feedback (e.g., running out of an ingredient, customer taste reviews), and use past results to improve subsequent actions.
This cyclic mechanism constitutes the cognitive architecture through which the chef achieves the goal, and it closely mirrors the operating logic of intelligent agents.
Cognitive architecture: the goal-achieving mechanism of intelligent agents
Just as the chef completes the task through a cyclic process, an agent's cognitive architecture achieves its goal through a closed loop of iterative information processing → decision making → action optimization. At its core is the orchestration layer, which provides the following key functions (a minimal code sketch follows the list):
Memory and state management: maintain short-term and long-term memory and track the task execution context.
Reasoning and planning: use rapidly evolving prompt engineering techniques and frameworks to guide the model toward logically coherent decision chains.
Enhanced environment interaction: dynamically adjust strategies to improve both the efficiency of the agent's interaction with the external environment and the task completion rate.
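To make the orchestration layer concrete, here is a minimal Python sketch of such a loop. The `call_model` helper, the shape of its return value, and the `AgentState` type are assumptions for illustration, not an API from the white paper.

```python
# Minimal sketch of an orchestration loop: gather context -> reason/plan
# -> act -> observe -> adjust. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str                                         # the user's objective
    short_term: list = field(default_factory=list)    # recent steps (working memory)
    long_term: dict = field(default_factory=dict)     # persistent facts across sessions

def orchestrate(state: AgentState, tools: dict, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        # Reasoning and planning: ask the model for the next step given memory.
        decision = call_model(goal=state.goal, memory=state.short_term)  # hypothetical helper
        if decision["type"] == "final_answer":
            return decision["content"]
        # Environment interaction: execute the chosen tool.
        result = tools[decision["tool"]](decision["input"])
        # Memory and state management: record the step for the next iteration.
        state.short_term.append({"decision": decision, "observation": result})
    return "Step budget exhausted without a final answer."
```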
Research on prompt engineering frameworks and task planning for language models is progressing rapidly. Several mainstream methodologies (current as of this article's publication) are outlined below.
Mainstream Prompt Engineering Frameworks
ReAct (Reason + Act)
Core mechanism: guides the language model to reason about the user query (Reason) and trigger actions (Act), with or without in-context examples (a prompt sketch follows below).
Advantages:
Outperforms state-of-the-art (SOTA) baselines on multiple tasks.
Improves the human interpretability and trustworthiness of large language models (LLMs).
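To show what ReAct prompting looks like in practice, here is a minimal one-shot prompt template in the style popularized by the ReAct paper (Yao et al., 2022). The tool names (Search, Finish) and the worked example are illustrative placeholders.

```python
# A one-shot ReAct prompt: the example interleaves Thought, Action, and
# Observation to teach the model the format before the real question.
REACT_PROMPT = """Answer questions by interleaving Thought, Action, and
Observation steps. Actions: Search[query] or Finish[answer].

Question: Which city hosts the headquarters of the company that develops Android?
Thought: Android is developed by Google, so I should confirm where Google is based.
Action: Search[Google headquarters city]
Observation: Google is headquartered in Mountain View, California.
Thought: I now have the answer.
Action: Finish[Mountain View]

Question: {user_question}
"""
```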
Chain-of-Thought (CoT)
Core mechanism: makes the model's thinking process explicit through intermediate reasoning steps.
Derivative techniques:
Self-consistency: sample multiple reasoning paths and aggregate their results to improve accuracy (sketched in code after this list).
Active-prompt: dynamically select the most useful examples to optimize in-context learning.
Multimodal CoT: integrate text, images, and other modalities for joint reasoning.
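As a concrete illustration of self-consistency, the sketch below samples several independent CoT chains and majority-votes on the final answer. The `sample_cot` helper is hypothetical and stands in for a temperature-sampled model call that returns a (reasoning, answer) pair.

```python
from collections import Counter

def self_consistent_answer(question: str, n_paths: int = 5) -> str:
    """Sample several independent CoT chains and majority-vote on the answer."""
    answers = []
    for _ in range(n_paths):
        # Higher temperature encourages diverse reasoning paths.
        _reasoning, answer = sample_cot(question, temperature=0.8)  # hypothetical helper
        answers.append(answer)
    # Aggregate: the most frequent final answer wins.
    return Counter(answers).most_common(1)[0][0]
```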
Tree-of-Thoughts (ToT)
Core mechanism: generalizes CoT by letting the model explore multiple reasoning paths in parallel, forming a tree-shaped decision structure (a search sketch follows below).
Applicable scenarios:
Tasks that require strategic foresight (e.g., complex games, multi-step task decomposition).
Open-ended problem solving (e.g., idea generation, comparing multiple candidate solutions).
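The following compact sketch shows one common way to realize ToT as a breadth-first search over partial reasoning chains. `propose_thoughts` and `score_thought` are hypothetical stand-ins for model calls that generate and evaluate candidate thoughts.

```python
def tree_of_thoughts(problem: str, depth: int = 3, breadth: int = 3) -> str:
    """Breadth-first search over partial reasoning chains."""
    frontier = [""]  # each entry is a partial chain of thoughts
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            # Ask the model for several continuations of this partial chain.
            for thought in propose_thoughts(problem, partial, k=breadth):  # hypothetical
                candidates.append(partial + "\n" + thought)
        # Keep only the most promising branches (model-scored); prune the rest.
        candidates.sort(key=lambda c: score_thought(problem, c), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]  # the highest-scoring chain
```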
Agents can use the reasoning techniques above, or others, to select the best follow-up action for a user request. For example, if an agent is programmed to use the ReAct framework to pick the right actions and tools for a user query, its execution flow may look as follows:
1. The user sends a query to the agent.
2. The agent starts the ReAct process.
3. The agent provides a prompt to the model, asking it to generate the next ReAct step and its corresponding output:
a. Question: the input question from the user query, provided with the prompt
b. Thought: the model's reasoning about the next action
c. Action: the model's decision on which action to take next
i. This is where a tool can be chosen
ii. For example, an action may be one of [Flights, Search, Code, None], where the first three represent specific tools the model can choose and the last means "choose no tool"
d. Action input: the input parameters (if any) the model decides to pass to the tool
e. Observation: the result of executing the action with the action input
i. The thought/action/action input/observation sequence can repeat N times, as needed
f. Final answer: the model's final response to the original user query
4. The ReAct loop ends and the final answer is returned to the user (a code sketch of this loop follows).
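Putting the steps above together, a minimal version of this ReAct loop might look like the sketch below. The `generate` and `parse_step` helpers are assumptions: the first wraps a model call, the second extracts the Thought/Action/Action input fields from the model's text.

```python
def react_loop(query: str, tools: dict, generate, max_turns: int = 8) -> str:
    transcript = f"Question: {query}\n"
    for _ in range(max_turns):
        # Step 3: the model emits the next Thought / Action / Action input.
        step = parse_step(generate(transcript))   # parse_step is hypothetical
        transcript += step["raw"] + "\n"
        if step["action"] == "Final answer":
            return step["content"]                # step f: return to the user
        if step["action"] == "None":
            continue                              # no tool chosen; keep reasoning
        # Steps c-e: execute the chosen tool (e.g. Flights) and record the
        # observation so the model can use it on the next turn.
        observation = tools[step["action"]](step["action_input"])
        transcript += f"Observation: {observation}\n"
    return "No final answer produced within the turn limit."
```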
In this flow, the model, tools, and agent configuration work together to return a concise, fact-based response to the user's original query. Although the model could rely on prior knowledge and guess (risking hallucination), here it instead calls a tool (Flights) to retrieve real-time external information. This additional information is fed back to the model, enabling it to:
Make more reliable decisions based on real data
Integrate and summarize the information and return it to the user
In summary, the quality of the agent's response depends directly on:
The model's ability to reason and act across a variety of tasks, including its ability to choose the right tools
How well those tools themselves are defined
Just as a chef prepares a dish using fresh ingredients and pays attention to customer feedback, an intelligent agent relies on sound reasoning and reliable information to deliver the best results. In the next section, we’ll dive deeper into the many ways that an intelligent agent can connect to the latest data.