A Detailed Explanation of How Agents Work

Written by
Jasper Cole
Updated on: June 21, 2025
Recommendation

Deepen your understanding of how agents interact with the environment intelligently.

Core content:
1. The definition and core features of agents
2. The key role of prompts in agent workflows
3. The application and impact of large language models (LLMs) and memory knowledge bases

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

    An Agent (also translated as "intelligent body") is a computer program or entity that can perceive its environment, plan autonomously, make decisions, act independently, and interact with other agents or humans. Agents typically exhibit autonomy, reactivity, sociability, and adaptability, and can adjust their behavior as the environment changes in order to achieve preset goals.

Typical Agent Workflow

Key Step 1: Prompt [define the role and scope, explain the task background and behavioral habits]

The prompt is the initial input the agent receives: it describes the task the agent must complete or the problem it must solve. A prompt can take many forms, such as text, images, or voice. The agent must parse and understand the prompt in order to guide subsequent task planning and action execution.

An agent's prompts should be consistent across the dialogue. They draw on the development platform's ecosystem, including industry norms and background knowledge, as well as the agent's own prompts. A prompt contains context and instructions; state requirements clearly, keep pronouns consistent, and avoid industry jargon.

(1) General instruction composition

    • Context: describes the background against which you want the large model to perform the task

    • Instruction: states the task you want the model to perform

    • Input Data: describes what the user will enter

    • Output Indicator: specifies the expected output (state clearly both what you want and what you don't want)
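As a sketch, the four parts above can be assembled mechanically into one instruction string; the helper name and field order below are illustrative, not a fixed convention.

```python
def compose_instruction(context: str, instruction: str,
                        input_data: str, output_indicator: str) -> str:
    """Join the four instruction parts into a single prompt string."""
    return "\n".join([
        f"Context: {context}",
        f"Instruction: {instruction}",
        f"Input: {input_data}",
        f"Output requirements: {output_indicator}",
    ])
```

Keeping the parts as named fields makes it easy to audit which requirement each line of the prompt serves.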

(2) Tips

    • Avoid vagueness; state your needs clearly

    • Keep the pronouns in the instructions consistent; switching them repeatedly can confuse the large model

    • Avoid industry jargon in instructions, as it can be difficult for the large model to interpret


Summary: How well the prompt is written directly affects the quality of the result.
The simplest instruction formula: the role you want the agent to play + the result it should generate from the user's input + the detailed requirements for the generated content.
Example
You are an experienced tour guide. My destination and estimated time of visit are xxxx. Based on the destination and time I provide, give me suggestions and make a travel plan. The plan must be feasible, not too tight, and must account for travel time.
(Put the user's input into the wildcard position to form a complete instruction, then send it to the model to request a result.)
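A minimal sketch of that wildcard substitution, assuming the user's input arrives as a plain string; the template paraphrases the example above, and the function name is illustrative:

```python
# Template following the formula: role + expected result + detailed requirements.
TEMPLATE = (
    "You are an experienced tour guide. "
    "My destination and estimated time of visit are: {user_input}. "
    "Based on them, give me suggestions and make a travel plan. "
    "The plan must be feasible, not too tight, and must account for travel time."
)

def build_prompt(user_input: str) -> str:
    """Substitute the user's input into the wildcard position."""
    return TEMPLATE.format(user_input=user_input)
```

The completed string is what actually gets sent to the model as the request.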
Key Step 2: LLM Large Model [understand, extract, identify, select]

A Large Language Model (LLM) is the agent's main tool for task planning and knowledge reasoning. Trained on large amounts of text data, it has strong language-processing and reasoning capabilities. The agent uses the LLM to analyze the prompt in depth, generate candidate solutions, and then select and refine them.
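The generate-then-select loop described above can be sketched as follows. `query_llm` and `score` are stand-ins for real model calls (stubbed here with deterministic toy output), not a real API:

```python
import random

def query_llm(prompt: str, seed: int) -> str:
    """Stub: a real agent would call a hosted LLM here."""
    random.seed(seed)
    return f"plan-{seed} (score hint: {random.random():.2f})"

def score(candidate: str) -> float:
    """Stub scoring: a real agent might ask the LLM itself to rank candidates."""
    return float(candidate.split()[-1].rstrip(")"))

def best_solution(prompt: str, n: int = 3) -> str:
    """Generate n candidate solutions, then keep the highest-scoring one."""
    candidates = [query_llm(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)
```

Sampling several candidates and selecting among them is the simplest form of the "generate possible solutions, then select and optimize" step.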

Key Step 3: Memory Knowledge Base [retrieve, match; current input, context, vector database]

Memory classification (brief descriptions):

Sensory memory: the current user input (text, images, or other forms), retained briefly as a sensory impression.

Short-term memory: the context (including information written into the prompt); a temporary workspace for complex tasks, limited by the finite context length.

Long-term memory (text): a knowledge base stored as text fields in an external vector database; the agent can retrieve it quickly, and storage capacity is large.

Long-term memory (files): knowledge-base files stored in an external vector database, such as docx, xlsx, csv, pdf, ppt, jpg, and txt; quickly retrievable, with large storage capacity.

Long-term memory (web): supply a web page URL; the page is fetched automatically and its content is used as a knowledge base.
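Long-term memory retrieval from a vector store can be illustrated with a toy cosine-similarity search. Real systems embed text with a model and query a vector database; the three-dimensional vectors and memory entries below are fabricated for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector database": text keyed by a hand-made embedding.
MEMORY = {
    "visa rules for Japan":  [0.9, 0.1, 0.0],
    "best ramen in Kyoto":   [0.1, 0.9, 0.2],
    "JR rail pass pricing":  [0.2, 0.3, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k stored texts whose vectors best match the query."""
    ranked = sorted(MEMORY, key=lambda t: cosine(query_vec, MEMORY[t]), reverse=True)
    return ranked[:k]
```

The retrieved text is then appended to the prompt (short-term memory) so the model can use it in its answer.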

Key Step 4: Task Planning [analysis methods, analytical thinking, reasoning traces]

Task planning is the process of making decisions and plans based on the prompt, the LLM, and the knowledge base. It involves task decomposition, goal setting, path planning, and more. The agent must weigh many factors to develop the most appropriate execution plan.

Methods and techniques (brief descriptions):

Prompting: task decomposition can be accomplished in three ways:
1) enter a simple prompt into the large model, such as "What are the steps to XYZ?" or "What are the sub-goals for achieving XYZ?";
2) use task-specific instructions, such as asking the model to "write a story outline" when writing a novel;
3) provide the information manually, e.g. a site map or RPA process best practices.

CoT (Chain of Thought): a standard prompting technique for improving model performance on complex tasks. The model is asked to "think step by step", breaking a complex task into smaller, simpler steps. Chain-of-thought turns a large task into several manageable ones and helps people follow the model's reasoning.

ToT (Tree of Thoughts): extends chain-of-thought by exploring multiple reasoning possibilities at each step. It decomposes the problem into thinking steps and generates several candidate thoughts per step, forming a tree. The tree can be searched with BFS (breadth-first search) or DFS (depth-first search).
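A minimal beam-style BFS sketch of Tree of Thoughts: expand several candidate thoughts per step, keep the best few, and repeat. `expand` and `value` stand in for LLM calls, and the scoring rule is a toy heuristic:

```python
def expand(thought: str) -> list:
    """Stub: an LLM would propose alternative next reasoning steps here."""
    return [f"{thought}->a", f"{thought}->b"]

def value(thought: str) -> int:
    """Stub: an LLM would score partial solutions; here, prefer 'a' branches."""
    return thought.count("a")

def tot_bfs(root: str, depth: int = 3, beam: int = 2) -> str:
    """Breadth-first search over the thought tree, keeping `beam` best nodes."""
    frontier = [root]
    for _ in range(depth):
        children = [c for t in frontier for c in expand(t)]
        frontier = sorted(children, key=value, reverse=True)[:beam]
    return frontier[0]
```

Swapping the `sorted(...)[:beam]` pruning for a stack-based traversal would give the DFS variant mentioned above.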

Key Step 5: Action Tool Use [execute, return]

Action execution is the process by which the agent carries out concrete operations according to the task plan. It may involve interacting with the environment, collecting and processing data, and adjusting decisions. The agent must perform each step accurately to ensure the task is completed successfully.

Methods and techniques (brief descriptions):

Built-in tools: tools shipped with the large model that can be used directly, including a calendar, calculator, code interpreter, search, and so on.

Plugins: used to extend the agent's functionality, implementing specific features or customizing the agent's configuration. An agent plugin usually consists of:
1. a plugin configuration file, which sets the plugin's parameters and properties, usually in XML or JSON format;
2. a plugin library, containing the plugin code and its dependencies, usually a JAR or DLL file;
3. a plugin interface, which defines how the plugin and the agent interact, including initialization, start, and stop operations.

API: an application programming interface is an important part of an application. It provides an entry point for operating on data; the entry point can be a function or class method, or a URL or other network address.

RPA (desktop automation): robotic process automation executes business processes by configuring software to simulate human actions in software systems. RPA robots read data from application interfaces and operate applications the way a human would.
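A tool-dispatch step can be sketched as a registry of callables: the planner names a tool and an argument, the agent executes it, and the observation is returned. The tool names here are illustrative, not a real framework's registry:

```python
import datetime

def calculator(expression: str) -> str:
    # eval on untrusted input is unsafe; acceptable only for this toy demo
    return str(eval(expression, {"__builtins__": {}}))

def calendar(_: str) -> str:
    """Return today's date; the argument is ignored in this toy version."""
    return datetime.date.today().isoformat()

TOOLS = {"calculator": calculator, "calendar": calendar}

def act(tool: str, argument: str) -> str:
    """Execute the chosen tool and return its observation to the agent."""
    if tool not in TOOLS:
        return f"error: unknown tool '{tool}'"
    return TOOLS[tool](argument)
```

The returned observation is fed back into the agent's context so the next planning step can use it, which is the execute/return cycle this step describes.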

    Agent orchestration genuinely requires constant debugging. On Wednesday I visited our group company to discuss intelligent agents, and a department leader put it this way: orchestrating an agent means continuously refining its scenarios. At first, for example, only natural language was supported; later the scenarios needed multimodality, and multimodality in turn required handling various file types. It is a process of filling in gaps, and the agent gradually becomes more and more "intelligent".