Recommendation
Explore the collaborative working mechanism of key technologies in the AI full-stack engineering system.
Core content:
1. Large language model (LLM) as the core driving force of the AI full-stack engineering system
2. The key role of prompt engineering in guiding and controlling LLM behavior
3. How RAG technology solves the knowledge limitations of LLM and enhances its generation ability
4. How AI Agent integrates multiple capabilities to achieve autonomous action and task coordination
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
1. In the AI full-stack engineering system, how do Prompt Engineering, AI Agent, and RAG work together?
LLMs, Prompt Engineering , AI Agent , and RAG play interrelated and collaborative key roles in building complex AI applications (which can be regarded as part of the "AI full-stack engineering system").
Foundation: Large Language Model (LLM)
First, large language models (LLMs) are the core drivers or "brains" of the entire system . They have powerful natural language processing capabilities and are responsible for understanding, reasoning, and generating text. However, LLMs have inherent limitations, such as the possibility of "hallucination" (i.e., fabricating inaccurate or false information) and "knowledge deadlines" (inability to understand events that occurred after the training data).
Guidance and Control: Prompt Engineering is a technique for interacting with LLMs and guiding their behavior and capabilities. Through carefully designed prompts, the quality, relevance, and accuracy of LLM outputs can be significantly improved. In intelligent agents and RAG systems, Prompt Engineering is used to: guide LLMs in reasoning and planning . For example, the ReAct (Reason+Act) framework is a technique that uses Prompt Engineering to guide LLMs to think (Thought) and take action (Action), which is one of the core mechanisms of intelligent agent operation.
In RAG, the retrieved information is used to “enhance” the original user query, forming a new prompt word that is passed to the LLM to provide additional context. This is essentially an application of Prompt Engineering, ensuring that the LLM can leverage external knowledge when generating answers.
Retrieval Augmented Generation
RAG (Retrieval Augmented Generation) is a key technology to solve the knowledge limitations and hallucination problems of LLM. It enhances the model's generation ability by dynamically retrieving relevant information from external knowledge bases (such as vector databases, documents, etc.) and providing this information as context to LLM, making its answers more accurate, fact-based and timely. A typical RAG workflow includes preparing external knowledge sources (such as indexing and vectorization), retrieving based on user queries, and then combining the retrieved information with the query (augmentation) and inputting it into LLM to generate answers. RAG provides LLM with external knowledge beyond its training data , which is crucial for handling tasks that require the latest information or domain-specific knowledge.
Autonomous Action and Coordination : AI Agent is a more complex system that uses LLM as its core “brain” and integrates planning, memory, tool use , etc. The goal of the agent is to be able to autonomously understand the goal, make a plan, execute a series of steps, interact with the external environment (through tools), and adjust its behavior based on feedback to complete complex tasks without continuous human intervention.
Agents connect to the outside world through tools . These tools can be search tools, code interpreters, API calls, and the RAG system itself can also be regarded as an important tool that agents can call .
Collaboration: Agentic RAG as the core model The collaborative work of these three is prominently reflected in the Agentic RAG model. In Agentic RAG, one or more AI Agents are integrated into the RAG process.
The agent plays the role of "intelligent coordinator" here , which uses the reasoning ability of LLM (guided by Prompt Engineering) to: Understand complex queries and possibly decompose them into smaller sub-problems (query decomposition). Decide when to retrieve external information and which tool to use for the retrieval (for example, whether to perform vector search, web search, or call a specific API). Develop a retrieval plan, possibly performing multiple rounds of retrieval to collect comprehensive information. Evaluate the quality and relevance of the retrieval results, and even re-search when necessary. Pass the final retrieved information together with the user query to LLM through Prompt Engineering for final answer generation.
In addition, the agent's memory component (usually also relying on technologies such as vector databases, linked to the RAG infrastructure) helps the agent maintain context across multiple rounds of interactions and learn from past experiences to further optimize its planning and tool usage (including the use of RAG). The agent's instructions define its rules of behavior and how to use its tool set (including RAG).
In general, in an AI full-stack engineering system:
- LLM: Provides basic intelligence and text processing capabilities.
- Prompt Engineering: It is a bridge to communicate with LLM and guide its behavior, which runs through the agent's reasoning, planning, and RAG enhancement stages.
- RAG: As a powerful knowledge enhancement mechanism, it provides external, dynamic, and real-time information for LLM to overcome knowledge limitations.
- AI Agent: It integrates LLM, Prompt Engineering, RAG and other tools to achieve autonomous planning, action and task completion .
- In particular, Agentic RAG demonstrates how agents can intelligently orchestrate RAG processes to make their knowledge acquisition more dynamic and flexible.
-
- They build on each other layer by layer to build intelligent applications that can handle more complex tasks requiring external knowledge and autonomous decision-making.
2. How do key components and techniques in RAG systems (such as chunking and embedding) affect performance?
In the Retrieval-Augmented Generation (RAG) system, chunking and embeddings are two key techniques that have a significant impact on the overall performance of the system, mainly reflected in how they affect the quality of information retrieval and the accuracy and relevance of the final generated responses.
Here's how they affect performance:
Definition and Purpose: Chunking is the process of breaking down large documents into smaller, more manageable pieces of text. It is a key step in the indexing phase of the RAG system, aiming to improve retrieval quality.
How it affects performance: retrieval accuracy and context preservation: Chunking strategies (such as fixed size, recursion-based, semantic chunking, document-based chunking) and chunk size and overlap are critical to retrieval accuracy and context preservation. A suitable chunk size helps ensure that the retrieved snippets contain enough information to answer the question but not too much irrelevant information.
Retrieval efficiency: Small and independent chunks may lead to improved retrieval efficiency, but if the context is insufficient, it may affect the LLM's ability to understand and generate a complete answer.
Missing information: If the chunks are too small or the chunking strategy is inappropriate, related information may be split into different chunks or key information may not be included in the few chunks retrieved.
Redundancy and noise: If the blocks are too large or the overlap is too high, the retrieval results may contain a lot of redundant information, increase the burden of LLM processing, and even introduce noise.
Evaluation Challenges: Evaluating the impact of chunking on the final output of a RAG system is challenging.
Evaluation Metrics: Chunk Utilization is a metric that measures how much of the retrieved chunk text is used to compose the final response. Low chunk utilization may indicate that the chunk size is too large and contains a lot of unused text.
Definition and purpose: Embeddings are numerical representations of text (or data), usually dense, continuous vectors, that capture the semantic meaning of the text and the relationship between words in a high-dimensional space. Converting text chunks into embeddings is a key step in the indexing phase to enable semantic search.
How it affects performance: Semantic search capabilities: The quality of the embedding model directly determines the effectiveness of semantic search in the vector database. A good embedding model can more accurately capture the semantic similarity of text, allowing the system to retrieve the text blocks that are most relevant to the user's query, even if these blocks do not contain the exact terms in the query.
Relevance of retrieval results: Choosing the right embedding type (e.g., dense embedding, sparse embedding, multi-vector embedding) is crucial to improving retrieval performance. Different embedding models are suitable for different use cases and domains.
Cost and performance trade-off: Different embedding models differ in performance and cost. The source material mentions that some smaller encoder models (such as gte-small) can bring significant improvements in attribution (i.e., the proportion of blocks used to construct responses), with performance similar to that of large models, indicating that cost can be saved while maintaining performance.
Evaluation Challenges: Evaluating the effectiveness of embedding models is challenging because their impact on downstream RAG performance may not be transparent enough.
Retrieval failure analysis: By checking whether the retrieved chunks (found based on embedding similarity) contain specific information required to answer the query, we can analyze whether retrieval failed and evaluate the effectiveness of the embedding model or chunking strategy.
Database operation speed: Insertion speed and query speed are key indicators that affect the performance and cost of vector databases.
Chunking and embedding are the basis for effective retrieval in the RAG system. An appropriate chunking strategy helps organize the original document into meaningful and easy-to-retrieve units. A high-quality embedding model can accurately map these text units into the vector space, allowing similarity search to accurately find chunks relevant to the user query. If chunking or embedding performs poorly, the retrieval component may not be able to find the most relevant context, resulting in inaccurate, incomplete, or irrelevant answers generated by the LLM. Techniques such as reranking can make up for the shortcomings of retrieval to a certain extent, by rearranging the retrieved chunks to improve the quality of the context ultimately provided to the LLM.
3. How to evaluate and improve the reliability and security of systems based on large language models (LLMs), AI agents, and retrieval-augmented generation (RAGs)?
Assessing these complex systems requires a multidimensional approach. As system complexity increases, assessment becomes increasingly difficult, but also increasingly important. Assessment should move away from single indicators toward multidimensional, scenario-based assessments.
Common evaluation indicators and methods: Task Completion: Evaluate whether the AI Agent can complete the preset task.
Answer Relevancy: Determines whether the LLM output responds to the input in an informative and concise manner.
Correctness: Determine whether the LLM output is accurate based on the facts.
Hallucination: Determines whether the LLM output contains false or fabricated information. Although the goal of RAG is to reduce hallucinations, they can still occur.
QAG Score: It uses the reasoning ability of LLM to evaluate LLM output by calculating scores for closed questions (usually “yes” or “no” answers), which is considered reliable because it does not directly let LLM generate scores.
Combine automated and manual evaluation methods.
RAG System Assessment:RAG is a multi-phase, multi-step framework that requires both holistic and granular assessments. This ensures component-level reliability and a high level of accuracy.
Component-level evaluation: mainly focuses on evaluating the quality of retrievers and generators.
Retrieval Metrics: Such as Chunk Attribution (whether the document chunk was used to generate the answer) and Chunk Utilization (the extent to which the document chunk text was used to generate the answer). Low utilization may indicate that the chunk size is too large.
Generator Evaluation: Evaluate the quality of LLM generated answers.
System Metrics: Monitor the health, performance, and resource utilization of the RAG deployment infrastructure.
RAGAS: A popular evaluation framework.
Agent system evaluation: Evaluating an agent requires measuring its design decisions, tool usage efficiency, and robustness of task completion.
Behavioral patterns: Evaluate the step-by-step decision-making process.
Tool effectiveness: Evaluates the efficiency of an agent using a specific tool or API.
Cost efficiency: Monitor resource usage across multiple iterations.
TaskSuccessRate: Checks whether the Agent declared success, or actually completed the task.
Evaluator-optimizer pattern: One agent attempts a task, another evaluator agent provides feedback (e.g. “needs more vivid imagery”), and then the original agent makes modifications based on the feedback, and this cycle can be repeated iteratively to improve the output quality.
Improve system reliability
Improving reliability involves optimizing the reasoning ability of LLM, enhancing the knowledge acquisition ability of RAG, and improving the planning and execution capabilities of Agent.
Prompt Engineering: Guide LLM behavior by providing clear and concise instructions, defining roles, clarifying output formats, and using a small number of sample prompts.
Reasoning techniques: Use Chain-of-Thought (CoT) prompts to guide the model to think step by step, Tree-of-Thoughts (ToT) allows exploration of multiple reasoning paths, and ReAct (Reason+Act) combines natural language reasoning with external tools to form a think-act loop.
Model fine-tuning: The model can be further fine-tuned to the agent's task by providing examples that demonstrate the agent's capabilities, including instances of using specific tools or reasoning steps.
RLHF/RLAIF: Reinforcement learning through human feedback or AI feedback enables the model to generate responses that are more in line with human preferences.
Core Capabilities: RAG itself enhances models by ingesting real-time, niche data from external sources, thereby reducing illusions and increasing accuracy and detail.
Optimization techniques: Chunking strategies: Break documents into smaller, more manageable chunks, such as fixed size, recursive segmentation, semantic chunking, etc. Chunk size and overlap are critical to retrieval accuracy.
Embedding models: Selecting an appropriate embedding model to convert text into vector representation affects the effectiveness of semantic search.
Re-ranking: Use a re-ranking model after the initial retrieval to optimize the order of document chunks and improve the quality of the context provided to the LLM.
Agentic RAG: Integrate AI agents into the RAG process, giving agents the ability to autonomously decide when, how, and what information to retrieve. Agents can perform multi-step retrieval, access multiple tools, and be able to verify the retrieved information, thus overcoming the limitations of traditional RAG and improving the accuracy and robustness of responses.
Improving Agent Reliability: Core Components: The reliability of an agent is based on its core components: LLM as a reasoning engine, tools for interacting with the external world, memory for learning from experience and storing context, and reasoning capabilities (planning and reflection) for decomposing tasks and evaluating results.
Clear instructions: High-quality instructions are crucial for agent decision-making, reducing ambiguity and improving workflow execution efficiency.
Tool Use: Agents leverage external tools (e.g., search, API calls, code interpreters, data stores) to obtain real-time information, suggest real-world actions, and perform complex tasks. This extends the capabilities of LLMs to dynamically interact with the real world.
Planning and Reflection: The agent breaks down complex problems into small steps through task decomposition and evaluates the results of each step through reflection and adjusts the plan as needed.
Iterative development: Start with a simple prototype, gradually increase complexity, and refine agent behavior through continuous experimentation and feedback.
Flexibility and composability: Consider flexibility and composability when designing Agents.
Fault handling: Incorporate appropriate failure modes to help the agent "break free" when it is unable to complete the task.
Human-in-the-loop: Plan for human intervention, such as pausing the agent’s execution at checkpoints or when encountering obstacles to obtain human feedback.
Improve system security (Guardrails)
Security guardrails are essential to ensure that the Agent system operates safely and predictably. They are a key component of any LLM deployment.
The role of safety guardrail:
Helps manage data privacy risks (e.g., preventing system-prompted breaches) and reputation risks (e.g., enforcing brand-consistent model behavior). Identified risks can be addressed and added in layers based on new vulnerabilities discovered. Especially important for use cases based on complex decisions, unstructured data, or fragile rule systems.
Implementation of safety guardrails: can include input filtering, tool use control, and manual intervention. It should be combined with strong authentication and authorization protocols, strict access control, and standard software security measures. For example, input guardrails can be used to check specific conditions, such as potential customer churn risk, before the agent processes user input. Anthropic emphasizes building "useful, honest, and harmless" agents. Safety metrics are used to identify sensitive information (such as PII) and harmful content (Toxicity) in model responses.
Evaluating the reliability and safety of LLM, agent, and RAG systems requires a comprehensive approach that covers common metrics, RAG- and agent-specific evaluation strategies. Improving reliability relies on optimizing the reasoning capabilities of LLMs, leveraging RAGs to enhance knowledge acquisition, and designing agent planning, tool use, and reflection cycles. Safety is ensured by implementing strong safety guardrails, incorporating standard safety measures, and continuous monitoring.
4. How does RAG improve the accuracy of LLM?
The Retrieval Augmented Generation (RAG) system significantly improves the accuracy of Large Language Models (LLMs) by:
Solve the knowledge limitations and hallucination problems of LLM: Large language models learn knowledge in their training data, but this part of knowledge is static and has a "knowledge deadline" . LLM cannot perceive events that occur after the training data or new information. LLM may also produce "hallucinations", that is, generate information that sounds reasonable but is actually false or fabricated . Introduce external, dynamic, and real-time knowledge: The core idea of RAG is to dynamically retrieve relevant information from external knowledge bases , which can be vector databases, documents, etc. This external information is real-time, niche, or domain-specific . Through RAG, LLM obtains external knowledge beyond its training data .
Providing retrieved information as context to LLM : A typical RAG workflow involves chunking and embedding external data into a vector database (indexing phase). When a user asks a question, the system retrieves the most relevant chunks of documents. These retrieved chunks of documents are then combined with the original user question to construct a new, “enhanced” prompt . This enhanced prompt is fed into the LLM to guide it in generating an answer.
Grounding LLM responses: By using retrieved, external, verifiable information as context, RAG forces LLM to generate responses based on this factual evidence . This significantly reduces the likelihood that LLM will make up information (reducing hallucinations) and enables its responses to contain the latest or domain-specific information , thereby improving the accuracy and detail of the response.
Improved reliability and credibility of output: By basing LLM responses on external, verifiable information, RAG significantly improves the reliability and credibility of LLM output , making it suitable for a wider range of applications that require factual accuracy. This capability enhances the reliability and accuracy of generated responses .
In summary, RAG effectively solves the knowledge timeliness and illusion problems of LLM by dynamically retrieving external relevant knowledge and providing it as context to LLM , making the generated answers based on factual evidence , thereby significantly improving the accuracy, reliability and credibility of LLM output . In particular, Agentic RAG further improves the dynamics and flexibility of knowledge acquisition by giving the agent the ability to autonomously decide when, how and where to retrieve information, so that more accurate and robust responses can be generated.
5. What is the difference between Agentic RAG and traditional RAG?
The main difference between Agentic RAG and traditional (or ordinary, naive/vanilla) RAG lies in whether an AI agent is integrated into the retrieval-augmented generation (RAG) process, and the autonomous decision-making and dynamic capabilities brought about by this integration .
Here are the specific differences between them:
Core architecture and processes:
Traditional RAG: Usually a relatively fixed, "retrieve then generate" process. It is described as a "one-shot solution". Its basic components include a retrieval component (usually an embedding model and a vector database) and a generation component (LLM). User queries are directly used to perform similarity searches in the vector database to retrieve relevant document blocks, and then the retrieved information is input into the LLM together with the original query for answer generation.
Agentic RAG: Integrate one or more AI Agents into the RAG process . The Agent becomes the core of the architecture and plays the role of "intelligent coordinator" . The Agent uses the reasoning ability of LLM to autonomously manage the information retrieval process .
Dynamic and autonomous retrieval process:
Traditional RAG: The retrieval process is relatively static and passive . The system directly executes the preset retrieval steps based on user input without further thinking or adjustment. It does not have the ability to pre-process queries, perform multi-step retrieval, and verify retrieval information.
Agentic RAG: The retrieval process is more dynamic, flexible, and agent-driven . The agent can proactively decide when to retrieve external information based on context, task progress, and current understanding . The agent can decide how to retrieve , such as which tool to use (see next point). The agent can also decide what to retrieve and even reformulate the query .
Traditional RAG: Usually limited to accessing pre-indexed external knowledge sources (mainly documents in vector databases). It has no ability to access other external tools .
Agentic RAG: Agent, as a more complex system, can connect to the outside world through tools . These tools can include vector search (the traditional RAG function itself can be regarded as an important tool that the agent can call), web search , calculator , or calling various APIs to access email, chat history or other software systems. The use of tools expands the capabilities of LLM, enabling it to dynamically interact with the real world.
Planning, reflection and multi-step skills:
Traditional RAG: lacks planning and reflection capabilities, and usually completes retrieval and generation in a single step .
Agentic RAG: Agents have planning and reflection capabilities. Agents can break down complex queries into smaller sub-questions, develop search plans, and even perform multiple rounds of searches to collect comprehensive information. Agents can also evaluate the quality and relevance of search results and re-search if necessary or adjust plans based on feedback. Agentic RAG can handle complex queries that require multi-step reasoning .
Accuracy, reliability and robustness of results:
Traditional RAG: By leveraging external knowledge, it inherently reduces hallucinations and improves accuracy and detail.
Agentic RAG: Through the agent's reasoning capabilities and fine-grained control over the retrieval process, such as being able to verify retrieved information , route queries to more specialized knowledge sources , and iterate and adjust, Agentic RAG is able to generate more accurate , robust , and reliable responses. It overcomes the limitations of traditional RAG in handling complex queries.
Complexity and application scenarios:
Traditional RAG: Suitable for problems that can be solved by simple searching.
Agentic RAG: Better suited for handling open-ended problems , complex decisions, unstructured data , or tasks that require unfixed steps . Agentic RAG enables autonomous execution of tasks and enhanced human-machine collaboration.
If you compare traditional RAG to a searchable, indexed encyclopedia, Agentic RAG is like a smart research assistant who knows how to consult the encyclopedia when needed, borrow books from the library (other databases), and even search the Internet for the latest information. He will think, plan, and verify information to ensure that he gives you a comprehensive, accurate, and verified answer. The autonomy, tool-using ability, and multi-step reasoning brought by the agent are the essential differences between Agentic RAG and traditional RAG.
6. How to evaluate the performance of the RAG system?
Evaluating the performance of a retrieval-augmented generation (RAG) system is a multifaceted task that involves multiple components and the quality of the overall output. The evaluation aims to ensure the reliability, accuracy, and usefulness of the system.
The following are key aspects and methods for evaluating RAG system performance:
Component-level evaluation: The performance of RAG is highly dependent on its core components, so they need to be evaluated individually.
Evaluation of the Retriever:
Chunking strategy: Chunking is the process of breaking large documents into smaller pieces, which is critical to retrieval quality. It is necessary to evaluate the impact of different chunking strategies (such as fixed size, recursive chunking, semantic chunking, etc.) and chunk size and overlap on retrieval accuracy and context retention. Low Chunk Utilization may indicate that the chunk is too large and contains text that is not used to generate responses.
Quality of embeddings: Embeddings convert text into vectors, which directly affects the performance of semantic search and the relevance of retrieval results. Choosing a suitable embedding model is crucial to improving retrieval performance. Retrieval failures can be analyzed and the effectiveness of embeddings can be evaluated by checking whether the retrieved chunks contain specific information required to answer the query. Some metrics, such as Chunk Attribution , can measure whether the retrieved chunks are used to generate answers.
Re-ranking: After the initial search, using a re-ranking model can optimize the order of documents, putting the most relevant ones at the top. Evaluating re-ranking algorithms (such as Pointwise, Pairwise, Listwise methods) can ensure the best quality of context provided to the LLM.
Retrieval efficiency: Small and independent blocks may improve retrieval efficiency. Evaluating the embedding insertion speed and query speed of vector databases has a key impact on performance and cost.
Generator evaluation: Evaluate the LLM’s ability to generate a final answer after receiving the retrieved context. It is necessary to evaluate whether the LLM can effectively integrate multiple pieces of relevant information to avoid generating accurate but incomplete answers.
System-level and overall performance evaluation: Evaluate the output quality and behavior of the entire RAG system.
Accuracy and Reliability: One of the main goals of RAG is to reduce the "hallucinations" (false information) of large language models and make their answers grounded in factual evidence. It evaluates the degree to which the system generates hallucinations and whether the answers can be confirmed by the retrieved external information. This ability enhances the reliability and accuracy of the response.
Relevance and Completeness: Evaluate whether the generated answer directly and completely answers the user's query and includes necessary information. It is necessary to overcome the problem of missing or incomplete information caused by insufficient retrieval or LLM's failure to integrate all relevant information.
Ability to handle out-of-domain queries: Evaluate whether the RAG system can give appropriate responses when faced with queries that are unrelated to the domain of its training data or external knowledge base, such as admitting that it cannot answer or stating that the query is beyond its knowledge.
Safety Metrics: Evaluate whether the model response contains harmful content (such as toxicity) or sensitive personally identifiable information (PII). This includes checking information such as credit card numbers, phone numbers, addresses, etc.
System operation indicators: Monitor the operation status, performance, response delay (latency) and resource utilization of the RAG infrastructure to ensure optimal operation of the system. Cost is also an important consideration.
Product Metrics: Collect user feedback, such as likes/dislikes or star ratings, to understand user satisfaction.
Assessment methods and tools:
Automated Assessment Framework: RAGAS is a popular RAG assessment framework.
Human evaluation: For complex evaluation, especially understanding the nuances and contextual relevance of responses, human evaluation remains important.
Generating evaluation data using LLM: LLMs (such as GPT-4) can be used to generate queries for testing based on text blocks to create evaluation datasets. Synthetic questions can also be used to evaluate the tone, PII, and toxicity of responses.
Specific evaluation of Agentic RAG: In Agentic RAG, the evaluation also needs to consider the agent's autonomous decision-making process, the efficiency of the use of tools (including RAG tools), and the robustness of achieving goals in complex tasks.
Tool Platform: Galileo GenAI Studio is mentioned as a tool for evaluating and monitoring RAG systems, providing detailed analysis metrics and visualization interfaces. Weaviate also provides RAG integration and examples, suggesting a platform for testing and experimentation.
Iteration and Experimentation: Optimizing the RAG system requires continuous iterative development and continuous experimentation, adjusting the chunking strategy, embedding model, retrieval parameters, etc., and measuring the improvement effect through evaluation indicators.
In general, evaluating the performance of a RAG system requires considering multiple dimensions, from the efficiency and quality of its components (retrieval, generator) to the accuracy, relevance, and security of the entire system output, and continuously optimizing it through a combination of automated tools, manual evaluation, and iterative experiments.