Analyzing How Large Models Land in Practice: Comparing Prompt Structures for Smart Q&A → RAG → Agent

Written by Caleb Hayes
Updated on: June 10, 2025

Recommendation: An in-depth look at the differences between large model application modes, from intelligent Q&A to Agent, demystifying how LLMs are actually used.

Core content:
1. Common misunderstandings and confusion about large model application modes
2. A comparison of the input prompts and output results across the modes
3. Why contextual input matters in multi-round dialogues

 

When it comes to the mature application modes of large models (intelligent Q&A, RAG, Agent, Agent+MCP, and so on), people tend to fall into one of two extremes when trying to understand them:

When you first start learning, the concepts are confusing. You may think the large language model (LLM) is magical, able to do everything and suited to every business scenario.

When you use tools such as CherryStudio or Dify and follow online tutorials to build a few scenarios, the process often feels mechanical: you follow the steps without understanding the essence of the LLM.

This article treats the LLM as a black-box text system and looks at it from a plain information-flow perspective. By comparing the input prompts and output results across several LLM application modes, it analyzes how these modes differ and, in the process, "demystifies" how an LLM is used.

Personally, I think that even a vague understanding of LLM concepts and architectures such as encoder, decoder, transformer, attention, and MoE does not prevent us from understanding, from a usage perspective, how LLMs are applied in different scenarios.

 

Basic Q&A prompts (system instructions + user input)

 

In the most basic question-and-answer scenarios, you ask the LLM to play a certain role when answering the user's questions, or to generate content based on the user's input. At the same time, you require it to answer in a certain structure and impose some restrictions (for example, forbidding it from improvising freely).

These requirements correspond to the LLM's system prompt. The configuration interface in Dify looks like this:

The system prompt in the figure above is a high-level instruction that tells the LLM how to respond to user requests, what format to output, and so on. It is preset by the LLM application developer, which is why it is called a system-level instruction. The user prompt, by contrast, corresponds to what the user actually types.

As you can see, the messages sent to the service consist of two parts: a role: system entry containing the instruction content preset by the developer, and a role: user entry containing what the user actually entered.
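To make the structure concrete, here is a minimal sketch of such a request in Python, assuming an OpenAI-compatible chat completions API; the model name and prompt texts are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and an API key in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # role: system -- the high-level instruction preset by the application developer
        {"role": "system", "content": "You are a polite customer-service assistant. "
                                      "Answer in at most three sentences and do not improvise."},
        # role: user -- what the user actually typed
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
```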

Since we are looking at this from an information-flow perspective: where there is input, there must be output.

In the choices/message structure of the response, the role: assistant node marks the output of the large model (why it is marked this way will be explained later), and the content node alongside it holds the text the LLM actually generated.
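Continuing the sketch above, the generated text comes back under choices/message, with role set to assistant (field names follow the OpenAI-style response format assumed here):

```python
message = response.choices[0].message
print(message.role)     # "assistant" -- marks the model's own output
print(message.content)  # the text the LLM generated
```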

Multi-round dialogue prompts (adding context input)

 

The above is the most basic dialogue flow. If you want to keep asking follow-up questions, the core idea is to feed the result of the previous round of dialogue back to the LLM as part of the prompt.

The historical conversation context has to be sent to the LLM again because the LLM itself has no memory; it always generates content purely from the input it receives. To support multi-round conversation, the application (the agent) must handle context memory itself and include the previous conversation content in the input of the next round.
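A minimal sketch of that bookkeeping, still assuming the same OpenAI-style API as above: the agent keeps a running messages list and appends both the model's last answer and the user's new question before each call.

```python
# The running conversation the agent maintains between rounds.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
]

def ask(client, user_text: str) -> str:
    """One dialogue round: append the user turn, call the LLM, remember its reply."""
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = response.choices[0].message.content
    # Store the assistant turn so the next round carries the full context.
    history.append({"role": "assistant", "content": answer})
    return answer
```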
Seeing this, some readers will ask: since the history is just fed back in as input, can an LLM support an unlimited number of conversation rounds?
Of course not. Apps usually limit the number of conversation rounds for two reasons. First, the LLM has a token limit: each round appends to the previous context, so the input grows longer and longer, and anything beyond the token limit gets truncated.
Second, even if some LLMs support very large contexts, even millions of tokens, an overly long conversation history still interferes with the LLM's attention, much like a human brain gets overwhelmed by too much input at once, and its answer to your latest question will drift off target.
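In practice the agent therefore trims the history before each call. The sketch below uses a crude character budget purely for illustration; a real application would count tokens with the model's tokenizer.

```python
def trim_history(history: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit a rough size budget."""
    system_msgs = [m for m in history if m["role"] == "system"]
    other_msgs = [m for m in history if m["role"] != "system"]

    kept, used = [], 0
    for msg in reversed(other_msgs):  # walk from the newest turn backwards
        used += len(msg["content"])
        if used > max_chars:
            break
        kept.append(msg)
    return system_msgs + list(reversed(kept))  # restore chronological order
```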

RAG prompts (adding knowledge input)

 

Building on the conversation patterns above, when you find that the LLM lacks up-to-date knowledge or internal knowledge of a particular industry domain, you usually adopt a RAG solution (RAG stands for Retrieval-Augmented Generation), that is, enhancing the LLM's generation by retrieving from an external knowledge base.

The core idea is to feed the retrieved knowledge content to the LLM as part of the prompt.

In essence, the retrieved knowledge fragments are placed in the content of the role: user message so that the LLM has this background knowledge. They are fundamentally no different from information typed in directly by the user, which is why the role: user category is still used. The LLM itself does not need to know that a knowledge base exists at all.
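A minimal sketch of how an application might assemble such a prompt. The retrieve() function here is a hypothetical stand-in for whatever search you use (vector store, keyword index, or a fixed list); only the final string placed under role: user matters to the LLM.

```python
def build_rag_messages(question: str, retrieve) -> list[dict]:
    """Place retrieved knowledge fragments inside the user message; the LLM never sees the knowledge base itself."""
    chunks = retrieve(question)  # hypothetical retrieval step returning a list of text snippets
    knowledge = "\n\n".join(chunks)
    return [
        {"role": "system", "content": "Answer only based on the provided reference material."},
        {
            "role": "user",
            "content": f"Reference material:\n{knowledge}\n\nQuestion: {question}",
        },
    ]
```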

Tip: not every scenario that needs knowledge requires a vectorized external knowledge base with chunked vector storage and semantic-similarity retrieval, nor does the LLM depend on a vector knowledge base in any special way. That combination is just one engineering paradigm.

For example, some simple scenarios only need a few fixed pieces of knowledge. There is no need for chunking, vectorization, or retrieval; the knowledge can simply be attached to the role: user content automatically every time the LLM is called.

Tool-calling prompts (the early function method)

 

Building on RAG, suppose your AI Agent needs to go further and call tools, for example querying a database, checking the weather, or booking tickets based on the user's intent.

The core idea is to feed the descriptions of the available tools to the LLM as part of the prompt.

In the corresponding diagram, a functions input node is added to carry the description of each tool or function, including the function name and its input-parameter information. Note that the functions node is an array, which means descriptions of multiple available tools can be supplied for the LLM to evaluate and choose from.

Slightly different from the previous sections: the system, user, and assistant roles all live inside the messages node, while the functions node sits at the same level as the messages node.
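A sketch of such a request using the legacy functions parameter of an OpenAI-style API; the get_weather function and its schema are made-up examples:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a travel assistant."},
        {"role": "user", "content": "What's the weather in Hangzhou tomorrow?"},
    ],
    # functions sits beside messages, not inside it; each entry describes one callable function.
    functions=[
        {
            "name": "get_weather",
            "description": "Query the weather forecast for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "date": {"type": "string", "description": "ISO date, e.g. 2025-06-11"},
                },
                "required": ["city"],
            },
        },
    ],
)
```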

Given this kind of input with tool descriptions, the LLM decides, based on the user's question, whether a tool should be called and which one is appropriate, and then returns the tool to be called in its output.

Comparing the LLM's output structure with the basic dialogue scenario above, the role is still assistant, but content is null and an additional function_call node appears.

By parsing the function_call node in the LLM's response, your AI Agent obtains the name of the function to call and the parameter values to use, and then executes the function itself.
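Continuing that sketch, the agent reads function_call from the response, runs the matching local function, and can then feed the result back (get_weather is the same hypothetical function as above):

```python
import json

message = response.choices[0].message
if message.function_call is not None:
    name = message.function_call.name                   # e.g. "get_weather"
    args = json.loads(message.function_call.arguments)  # arguments arrive as a JSON string
    result = get_weather(**args)                        # the agent, not the LLM, executes the call
```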

The later switch to tools mode

 

The above is the early function-call mechanism of LLMs. Since Anthropic released the MCP protocol at the end of 2024, the latest LLMs from various vendors, after further training, also support the standard tool-call mode.

The input structure changes very little: the functions node is simply replaced by a tools node, which is also an array.

1. In the output structure, a tool_calls node replaces the function_call node, and the single function output becomes an array of tool calls. This means the LLM may decide, based on the user's intent, that a group of tools needs to be called, so it outputs them as an array.
2. In more complex scenarios, when your AI Agent needs to let the LLM decide the next action based on the tool-call result (the underlying logic of the ReAct-style autonomous AI Agent introduced in the previous article), you need to feed the call result back to the LLM.

This is similar to the multi-round dialogue input described above: the LLM's output from the previous round is fed back via a role: assistant message, and the actual tool-call result is fed back via a role: tool message.
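Putting the two feedback roles together, a second-round request in tools mode might look like the following sketch (tools, first_response, and the weather payload carry over from the hypothetical example above):

```python
first_msg = first_response.choices[0].message  # role "assistant", content null, tool_calls filled in

followup = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=tools,  # the same tool descriptions as in the first round
    messages=[
        {"role": "system", "content": "You are a travel assistant."},
        {"role": "user", "content": "What's the weather in Hangzhou tomorrow?"},
        first_msg,  # role: assistant -- echo the model's previous turn, including its tool_calls
        {
            "role": "tool",  # role: tool -- the result the agent produced for that tool call
            "tool_call_id": first_msg.tool_calls[0].id,
            "content": '{"city": "Hangzhou", "forecast": "sunny", "high_c": 31}',
        },
    ],
)
```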

Of course, when you actually use an MCP framework, you do not need to assemble the tools array for the LLM by hand. The framework pulls the tool descriptions from the MCP server automatically, and the tool-calling flow itself is also encapsulated in the MCP client SDK.