Why is the RAG system "easy to use at first glance, but useless once you put it into practice"?

Written by Audrey Miles
Updated on: July 9, 2025

The RAG system performs well in combining external knowledge bases and generating models, but faces many challenges in engineering practice.

Core content:
1. The principle and application scenarios of the RAG system
2. 7 common problems of the RAG system in engineering practice
3. 5 other problems of the RAG system and potential solutions


With the widespread application of large language models (LLMs), retrieval-augmented generation (RAG) has attracted much attention as an architecture that combines retrieval with LLM prompting and performs outstandingly at connecting external knowledge bases with generative models. By combining the LLM with structured or unstructured external data sources, a RAG system significantly improves the accuracy, relevance, and credibility of the generated content.

RAG performs well in chatbots, knowledge-intensive tasks, and enterprise applications. However, going from theory to engineering practice, developing and optimizing an efficient RAG system is not easy, and RAG systems face many challenges. This article analyzes the development difficulties and optimization paths of RAG systems as systematically as possible.

12 Questions about the RAG System

First, let's take a look at the paper "Seven Failure Points When Engineering a Retrieval Augmented Generation System" by Scott Barnett et al. from the Institute of Applied Artificial Intelligence in Geelong, Australia. The paper discusses the problems RAG systems run into in engineering practice and summarizes seven common ones:

1. Missing Content

When the answer to a user's question cannot be retrieved from the document library, the large model may hallucinate. Ideally, the RAG system would simply reply "Sorry, I don't know." However, if the user's question does retrieve documents but their content is irrelevant to the question, the large model may still be misled.

2. Missed Top Ranked

Due to the context length limit of the large model, when we retrieve from the document library we generally return only the top K passages. If the passage containing the answer falls outside this top-K range, problems arise.

3. Not In Context

The document containing the answer was successfully retrieved from the database, but it did not make it into the context given to the large model. This happens when many documents are retrieved and a consolidation step is used to decide which content to pass on.

4. Not Extracted

The answer is in the context provided, but the large model fails to accurately extract it, which usually happens when there is too much noise or conflicting information in the context.

5. Wrong Format

The question requires information to be extracted in a specific format, such as a table or list, yet the large model ignores this instruction.

6. Incorrect Specificity

Although the large model answers the user's question, the answer is either not specific enough or too specific, and so does not meet the user's needs. Incorrect specificity can also occur when the user is unsure how to phrase the question or asks something too general.

7. Incomplete Answers

Consider a question like "What key points do files A, B, and C contain?" Asking this directly may retrieve only partial information about each file, leading to an incomplete answer from the large model. A more effective approach is to ask the question for each file separately to ensure comprehensive coverage.

Wenqi Glantz wrote an article titled "12 RAG Pain Points and Proposed Solutions" and raised five more questions:

8. Data Ingestion Scalability

Data ingestion scalability issues arise when the RAG pipeline struggles to manage and process large volumes of data efficiently, creating performance bottlenecks and potential system failures. They show up as extended ingestion times, system overload, degraded data quality, and limited availability.

9. Structured Data QA

It can be difficult to accurately retrieve the required structured data from the user's question, especially when the question is complex or ambiguous: text-to-SQL conversion is not flexible enough, and current large models still have limitations in handling such tasks effectively.

10. Data Extraction from Complex PDFs

Complex PDF documents may contain embedded content such as tables and images. When answering questions about such documents, traditional retrieval methods often perform poorly, so we need a more effective way to handle extraction from complex PDFs.

11. Fallback Model(s)

When relying on a single large model, we may worry about it failing, for example hitting the rate limits of the OpenAI API. In that case we need one or more backup models to fall back on when the primary model is unavailable.
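A minimal sketch of this idea, assuming LangChain's model wrappers are available: a backup model is wired behind the primary one with with_fallbacks(), which LangChain exposes on its runnables. The model names are illustrative.

```python
# Minimal sketch: a backup model behind a primary one via LangChain's
# Runnable.with_fallbacks(). Model names are illustrative placeholders.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

primary = ChatOpenAI(model="gpt-4o-mini")
backup = ChatAnthropic(model="claude-3-5-sonnet-latest")

# If the primary call fails (e.g. a rate-limit error), the backup is tried.
llm = primary.with_fallbacks([backup])

answer = llm.invoke("Summarize the retrieved context in one sentence.")
print(answer.content)
```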

12. Security of Large Language Models (LLM Security)

How to effectively prevent malicious input, ensure output security, and protect sensitive information from being leaked are important challenges that every AI architect and engineer needs to face.

12 Optimization Strategies for RAG Systems

From the perspective of RAG's workflow, every stage leaves plenty of room for optimization. Below we walk through concrete optimization strategies across the five stages of the workflow, tying them back to the 12 problems above.

1. Clean your data

High-performance RAG systems rely on accurate and clean raw knowledge data. On the one hand, to ensure data accuracy, document readers and multimodal models need to be optimized. In particular, when processing files such as CSV tables, naive text conversion may lose the original table structure, so we need additional mechanisms to restore that structure in the text, for example using semicolons or other separators to delimit the data. On the other hand, the knowledge documents also need some basic cleaning, which can include:

  •  1.1 Text cleaning: Standardize the text format; remove special characters, noise, and irrelevant information; delete duplicate documents or redundant content that could bias retrieval; and identify and correct spelling and grammatical errors, with the help of tools such as spell checkers and language models (a minimal cleaning sketch follows this list).
  •  1.2 Entity resolution: Disambiguate entities and terms so that references are consistent, for example standardizing "LLM", "Large Language Model", and "Large Model" to a common term.
  •  1.3 Document partitioning: Divide documents of different topics sensibly. Are related topics gathered in one place or scattered across several? If a human cannot easily tell which document to consult to answer a common question, the retrieval system will not be able to either.
  •  1.4 Data augmentation: Use synonyms, paraphrases, and even translations into other languages to increase the diversity of the corpus.
  •  1.5 User feedback loop: Continuously update the knowledge base based on feedback from real-world users, flagging whether that feedback is accurate.
  •  1.6 Time-sensitive data: For frequently updated topics, implement a mechanism to invalidate or refresh outdated documents.
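A minimal sketch of steps 1.1 and 1.2, basic text cleaning plus entity/term normalization. The regexes and the synonym map are illustrative, not an exhaustive cleaning pipeline.

```python
# Minimal sketch of 1.1 (text cleaning) and 1.2 (entity resolution).
# The regexes and TERM_MAP are illustrative, not exhaustive.
import re

TERM_MAP = {
    "large language model": "LLM",
    "large model": "LLM",
}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)            # strip leftover HTML tags
    text = re.sub(r"[ \t]+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"[^\w\s.,;:!?()\-']", "", text)   # drop stray special characters
    return text.strip()

def normalize_terms(text: str) -> str:
    for variant, canonical in TERM_MAP.items():
        text = re.sub(variant, canonical, text, flags=re.IGNORECASE)
    return text

def deduplicate(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        key = clean_text(doc).lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The  Large Language Model answered...", "The large model answered..."]
print([normalize_terms(clean_text(d)) for d in deduplicate(docs)])
```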

2. Block processing

In a RAG system, documents need to be segmented into multiple text chunks before vector embedding. Setting aside the input length limits and cost of large models, the goal is to minimize noise in the embedded content while maintaining semantic coherence, so that the parts of the documents most relevant to the user query can be found. If the chunks are too large, they may contain too much irrelevant information, reducing retrieval accuracy. Conversely, if the chunks are too small, necessary contextual information may be lost, and the generated responses lack coherence or depth. A good chunking strategy aims to strike this balance and preserve both the integrity and the relevance of the information. As a rule of thumb, an ideal chunk of text should still make sense to a human without the surrounding context; if so, it will also make sense to a language model.

  •  2.1 Fixed-size chunking: This is the simplest and most direct method: we set the number of words per chunk and decide whether content should be repeated between chunks. Some overlap between chunks is usually kept so that semantic context is not lost at chunk boundaries. Compared with other approaches, fixed-size chunking is simple to use and requires little computation.
  •  2.2 Content-based segmentation: As the name implies, the document is split according to its content, for example at punctuation marks such as periods, or using the sentence-splitting functions provided by NLTK or spaCy.
  •  2.3 Recursive chunking: The recommended approach in most cases. The text is broken up by applying chunking rules recursively. For example, in LangChain the text is first split on the paragraph break character (\n\n); the size of each resulting chunk is then checked, and chunks within a threshold are kept. Chunks that are still too large are split again on a single line break (\n), and so on, moving to progressively finer separators (spaces, periods) according to chunk size. This allows flexible adjustment of chunk size: information-dense passages can be segmented more finely to capture detail, while less informative passages can use larger chunks. The challenge is designing good rules for when and how to split (a short LangChain example follows this list).
  •  2.4 Small-to-large chunking: Since small chunks and large chunks each have their advantages, a more direct solution is to split the same document into every size from large to small, store all the differently sized chunks in the vector database, and record the parent-child relationships between chunks for recursive retrieval. The obvious downside is that storing so much repeated content requires considerably more storage space.
  •  2.5 Structure-aware segmentation: Use dedicated splitters for specific structured content. These splitters are designed to process such documents while correctly preserving and understanding their structure. LangChain, for example, provides splitters for Markdown files, LaTeX files, and various mainstream programming languages.
  •  2.6 Choice of chunk size: All of the methods above ultimately require setting one parameter, the chunk size, so how do we choose it? First, different embedding models have their own optimal input sizes; for example, OpenAI's text-embedding-ada-002 model performs better on chunks of 256 or 512 tokens. Second, the type of document and the length and complexity of user queries also matter: for long articles or books, larger chunks help retain more context and topical coherence, while for social media posts, smaller chunks may better capture the precise meaning of each post. If user queries are typically short and specific, smaller chunks may be more appropriate; conversely, more complex queries may call for larger chunks. In practice we still need to experiment and adjust. In some tests, a chunk size of 128 turns out to be the best choice; if you don't know where to start, it is a reasonable starting point.
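A minimal sketch of the recursive chunking described in 2.3, using LangChain's RecursiveCharacterTextSplitter. The chunk size, overlap, separator list, and input file name are illustrative starting points, not recommendations.

```python
# Minimal sketch of recursive chunking (2.3) with LangChain.
# chunk_size, chunk_overlap, and the file name are illustrative.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    # Try paragraph breaks first, then line breaks, sentences, words.
    separators=["\n\n", "\n", ". ", " ", ""],
)

with open("knowledge_doc.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(len(chunks), "chunks; first chunk:\n", chunks[0][:200])
```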

3. Embedding Model

The embedding model can convert text into vectors. Obviously, different embedding models have different effects. For example, the Word2Vec model, although powerful, has an important limitation: the word vectors it generates are static. Once the model is trained, the vector representation of each word is fixed, which may cause problems when dealing with polysemous words.

For example, the original illustration hinges on the Chinese word "光盘": in "I bought a 光盘" it means an optical disc (a CD), while in the "光盘行动" (the "Clean Plate Campaign") it means finishing the food on your plate, a behavior that promotes thrift. A static model assigns this word a single fixed vector even though its two senses are completely different. In contrast, models built on the self-attention mechanism, such as BERT, provide dynamic, context-aware word representations: the same word receives different vectors in different contexts, so "光盘" would get different vectors in the two sentences, capturing its semantics more accurately.

To give the model a better understanding of domain-specific vocabulary, some projects fine-tune the embedding model. We do not recommend this as a first step: it demands high-quality training data, consumes significant human and compute resources, and the results may not justify the cost. For choosing an embedding model, we recommend consulting the MTEB leaderboard hosted by Hugging Face, which compares the performance of many models and helps us make a more informed choice. Note also that not all embedding models support Chinese, so check the model description when selecting one.
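A minimal sketch of using an off-the-shelf embedding model and comparing texts by cosine similarity. The model name is illustrative; pick one from the MTEB leaderboard that matches your language and latency budget.

```python
# Minimal sketch: embed texts with sentence-transformers and compare them
# by cosine similarity. The model name is an illustrative choice.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

texts = ["How do I reset my password?", "Steps to recover a forgotten password"]
emb = model.encode(texts, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is just a dot product.
print(float(np.dot(emb[0], emb[1])))
```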


4. Metadata

When storing vectors in a vector database, some databases support storing metadata (i.e., non-vectorized data) together with the vectors. Annotating vectors with metadata is an effective strategy for improving retrieval efficiency, and it plays an important role in post-processing search results.

For example, date is a common metadata tag. It helps us filter based on chronological order. Imagine that we are developing an application that allows users to query their email history. In this case, the most recent emails may be more relevant to the user's query. However, from the embedding point of view, we cannot directly judge how similar these emails are to the user's query. By attaching the date of each email as metadata to its embedding, we can prioritize the most recent emails during retrieval, thereby improving the relevance of search results.

In addition, we can also add metadata such as chapter or section references, key information of the text, section titles or keywords, etc. These metadata not only help improve the accuracy of knowledge retrieval, but also provide end users with a richer and more precise search experience.
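A minimal sketch of the email example above, using Chroma as the vector database. The collection name, field names, and date encoding are hypothetical; the point is that metadata stored next to each vector can be used as a filter at query time.

```python
# Minimal sketch: store metadata alongside vectors and filter on it at query
# time, using Chroma. Collection and field names are hypothetical.
import chromadb

client = chromadb.Client()
emails = client.create_collection("emails")

emails.add(
    ids=["mail-1", "mail-2"],
    documents=["Quarterly report attached.", "Team offsite schedule for May."],
    metadatas=[{"date": 20240110, "folder": "work"},
               {"date": 20240502, "folder": "work"}],
)

# Retrieve only recent emails; recency can then also influence final ranking.
hits = emails.query(
    query_texts=["when is the offsite?"],
    n_results=5,
    where={"date": {"$gte": 20240101}},
)
print(hits["documents"])
```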

5. Multi-level indexing

In cases where metadata cannot adequately distinguish between different types of context, we can go a step further and experiment with multiple indexes. The core idea of multi-index technology is to classify the data and the information needs and organize them at different levels so they can be managed and retrieved more efficiently. The system does not rely on a single index but builds several indexes for different data types and query requirements: for example, one index for summary-type questions, another for questions seeking a specific, direct answer, and another for questions where time matters. This multi-index strategy lets the RAG system pick the most appropriate index for retrieval based on the nature and context of the query, improving both retrieval quality and response speed. However, introducing multiple indexes also requires adding a multi-level routing mechanism.

A multi-level routing mechanism ensures that each query is efficiently directed to the most appropriate index. Queries are routed to one or more specific indexes based on their characteristics, such as complexity and the type of information required. This not only improves processing efficiency but also optimizes resource allocation and usage, ensuring accurate matching for all types of queries.

For example, for the query "recommendations for the latest science fiction movies", the RAG system may first route it to an index that specializes in current hot topics, and then use an index focused on entertainment and film and television content to generate relevant recommendations.

In general, multi-level indexing and routing technologies can further help us process large-scale data efficiently and extract accurate information, thereby improving user experience and the overall performance of the system.
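A hand-rolled sketch of the routing idea. Here classify_query() is a placeholder for an LLM-based or rule-based classifier, and DummyIndex stands in for real vector indexes; none of these names come from a specific library.

```python
# Hand-rolled sketch of multi-index routing. classify_query() stands in for an
# LLM- or rule-based classifier; DummyIndex stands in for real vector indexes.
class DummyIndex:
    def __init__(self, name: str):
        self.name = name

    def search(self, query: str, top_k: int = 5) -> list[str]:
        return [f"[{self.name}] result for: {query}"][:top_k]

INDEXES = {
    "summary": DummyIndex("summary"),    # tuned for summary-style questions
    "factual": DummyIndex("factual"),    # tuned for direct factual lookups
    "temporal": DummyIndex("temporal"),  # time-aware index for "latest X" queries
}

def classify_query(query: str) -> str:
    # Placeholder routing rule; in practice an LLM can pick the label.
    q = query.lower()
    if any(w in q for w in ("latest", "recent", "newest")):
        return "temporal"
    if q.startswith(("summarize", "give an overview")):
        return "summary"
    return "factual"

def route_and_search(query: str) -> list[str]:
    return INDEXES[classify_query(query)].search(query)

print(route_and_search("recommendations for the latest science fiction movies"))
```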

6. Indexing/Query Algorithms

You can use indexes to filter data, but in the end you still have to retrieve the relevant text vectors from the filtered set. Because vector data is large and high-dimensional, finding the absolute optimal match becomes computationally expensive and sometimes infeasible. Moreover, large models are not fully deterministic systems by nature; they look for semantic similarity, and a reasonably good match is enough. From an application perspective this makes sense: in a recommendation system, for example, users are unlikely to notice or care whether every recommended item is the absolute best match; they care that the recommendations broadly fit their interests. So the goal is usually not to find items identical to the query vector but items that are "close enough" or "similar", which is approximate nearest neighbor search (ANNS). This not only meets the need but also opens up huge room for retrieval optimization. In vertical domains such as law, healthcare, and finance, however, this uncertainty of large models can have serious consequences, which is precisely why LLM and RAG are combined: each compensates for the other's shortcomings in these scenarios.
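A minimal sketch of approximate nearest neighbor search with FAISS using an HNSW index. The vectors are random placeholders for real embeddings; the dimension and HNSW parameter are illustrative.

```python
# Minimal sketch of ANN search with FAISS (HNSW index).
# Random vectors stand in for real embeddings.
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)   # 32 neighbors per node in the HNSW graph
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)   # "close enough" matches, not exact
print(ids[0], distances[0])
```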

7. Query conversion

In the RAG system, the user's query is converted into a vector and then matched in the vector database. It is not difficult to imagine that the wording of the query will directly affect the search results. If the search results are not ideal, you can try the following methods to rewrite the question to improve the recall effect:

  •  7.1 Re-expression with dialogue history: Two questions that look identical to a human are not necessarily close to each other in the vector space. We can simply ask the LLM to reformulate the question and try again. In addition, during multi-turn dialogue a word in the user's question may refer to information from earlier in the conversation, so the conversation history can be given to the LLM together with the question for re-expression.
  •  7.2 Hypothetical Document Embeddings (HyDE): The core idea of HyDE is to have the LLM generate a hypothetical answer, without external knowledge, after receiving the user's question. This hypothetical answer is then used together with the original query for vector retrieval. The hypothetical answer may contain false information, but it carries the information and document patterns the LLM considers relevant, which helps locate similar documents in the knowledge base (a short sketch follows below).
  •  7.3 Step-back prompting: If the original query is too complex or returns too broad a set of results, we can generate a "step-back" question at a higher level of abstraction and use it together with the original question for retrieval. For example, if the original question is "Which school did a particular person attend during a specific period," the step-back question would ask about their "education history." This more general question may be easier to answer from the corpus.
  •  7.4 Multi-query retrieval: Use the LLM to generate multiple search queries, which is particularly useful when a single question depends on several sub-questions.

Through these methods, the RAG system can process and respond to complex user queries more accurately, thereby improving overall search efficiency and accuracy.
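A hand-rolled sketch of the HyDE idea from 7.2: generate a hypothetical answer first, then use its embedding (here averaged with the query's) for retrieval. The llm_complete and vector_store.search interfaces are assumed for illustration, not a real library API; the embedding model name is also illustrative.

```python
# Hand-rolled sketch of HyDE (7.2). `llm_complete` and `vector_store.search`
# are assumed interfaces supplied by the caller, not a specific library's API.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative model

def hyde_search(question: str, llm_complete, vector_store, top_k: int = 5):
    # 1. Ask the LLM to answer from its own knowledge. Errors are acceptable:
    #    we only need text that "looks like" a relevant document.
    hypothetical = llm_complete(
        f"Write a short passage that answers the question:\n{question}"
    )
    # 2. Embed the question and the hypothetical answer, then average them.
    vecs = embedder.encode([question, hypothetical], normalize_embeddings=True)
    query_vec = vecs.mean(axis=0)
    query_vec /= np.linalg.norm(query_vec)
    # 3. Retrieve with the combined vector.
    return vector_store.search(query_vec, top_k)
```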

8. Search Parameters

After preparing the query, you can enter the vector database for searching. In the specific search process, you can optimize some search parameters according to the specific settings of the vector database. The following are some common parameters that can be set:

  •  8.1 Sparse and dense search weights: Dense search is search via vectors, but it has limitations in some scenarios. In such cases you can also run a sparse, keyword-matching search over the original strings. An effective sparse search algorithm is Best Match 25 (BM25), which is based on word frequency statistics for the input phrase: frequently occurring words score low, while rare words are treated as keywords and score higher. We can combine sparse and dense search to produce the final result. Vector databases usually let you set the weight between the two in the final score; for example, a dense weight of 0.6 means 60% of the score comes from dense search and 40% from sparse search (a hybrid-scoring sketch follows this list).
  •  8.2 Number of results (topK): The number of search results is another key factor. Enough results ensure that the system covers all aspects of the user's query; for multifaceted or complex questions, more results provide richer context and help the RAG system understand the question's context and implicit details. However, too many results can cause information overload, reduce answer accuracy, and increase the system's time and resource costs.
  •  8.3 Similarity metric: How the similarity of two vectors is computed is also a configurable parameter. Options include Euclidean distance and Jaccard distance, which measure the difference between two vectors, and cosine similarity, which measures the angle between them. Cosine similarity is generally the more popular choice because it is unaffected by vector length and reflects only directional similarity, letting the model ignore differences in text length and focus on semantic similarity. Note that not all embedding models support every metric; refer to the documentation of the embedding model you use.
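A minimal sketch of the hybrid weighting described in 8.1: BM25 scores and dense cosine scores are min-max normalized and mixed with a dense weight of 0.6. The corpus, query, and embedding model are illustrative.

```python
# Minimal sketch of hybrid retrieval (8.1): mix normalized BM25 and dense
# cosine scores with a dense weight of 0.6. Corpus and model are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = ["refund policy for damaged goods",
          "how to track an order",
          "contact support about a late delivery"]
query = "my package arrived broken, can I get my money back?"

bm25 = BM25Okapi([doc.split() for doc in corpus])
sparse = np.array(bm25.get_scores(query.split()))

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)
q_vec = embedder.encode(query, normalize_embeddings=True)
dense = doc_vecs @ q_vec

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

dense_weight = 0.6
final = dense_weight * minmax(dense) + (1 - dense_weight) * minmax(sparse)
print(corpus[int(final.argmax())])
```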

9. Advanced search strategies

The most critical and complex step is how to develop or improve the strategy of the whole system based on vector database retrieval. This part is enough to be written into a separate article. To keep it concise, we only discuss some commonly used or newly proposed strategies.

  •  9.1 Contextual compression: As mentioned earlier, when a document chunk is too large it may contain a lot of irrelevant information, and passing the entire chunk on can mean more expensive LLM calls and worse responses. The idea of contextual compression is to use the LLM to compress the content of an individual document in light of the query context, or to filter the returned results and pass along only the relevant information.
  •  9.2 Sentence-window retrieval: Conversely, if the document chunk is too small, context is lost. One solution is window retrieval: after a question matches a chunk, the chunks surrounding it are also given to the LLM as context, increasing the LLM's understanding of the document (a small sketch follows this list).
  •  9.3 Parent document retrieval: Similarly, parent document retrieval first splits the document into larger parent chunks and then splits each parent into shorter child chunks. User questions are matched against the child chunks, and the parent chunk a matched child belongs to is then sent to the LLM together with the user's question.
  •  9.4 Auto-merging: Auto-merging is a more elaborate variant of parent document retrieval. Again the document is first structured, for example into a three-level tree with top-level chunks of 1024, intermediate chunks of 512, and leaf chunks of 128. During retrieval, only leaf nodes are matched against the question; when most of the leaf nodes under a parent match, the parent node is returned as the result instead.
  •  9.5 Multi-vector retrieval: Multi-vector retrieval also converts a knowledge document into multiple vectors stored in the database; the difference is that these vectors include not only chunks of different sizes but also a summary of the document, questions users are likely to ask, and other information that aids retrieval. Because each vector may represent a different aspect of the document, the system can consider its content more comprehensively and give more accurate answers to complex or multi-faceted queries. For example, if the query is more relevant to a particular section or to the summary, the corresponding vector helps raise that content in the retrieval ranking.
  •  9.6 Multi-agent retrieval: Multi-agent retrieval, in short, picks several of the 12 optimization strategies discussed here and assigns them to agents used in combination. For example, combining sub-question querying with multi-level indexing and multi-vector retrieval: a sub-question agent first breaks the user's question into several smaller questions, a document agent then performs multi-vector or multi-index retrieval for each sub-question, and a ranking agent finally aggregates all retrieved documents and hands them to the LLM. The advantage is that the methods complement one another: the sub-question engine may lack depth when exploring each sub-query, especially over interrelated or relational data, while the document agent's recursive retrieval excels at digging into specific documents and retrieving detailed answers. Note that multi-agent retrieval setups with different structures exist, and there is no settled conclusion on which optimization steps to combine; we can explore this based on the usage scenario.
  •  9.7 Self-RAG: Self-reflective retrieval augmentation is a newer RAG framework. Its biggest difference from traditional RAG is that it introduces retrieval tokens and reflection (critique) tokens to improve quality, working in three steps: retrieve, generate, and critique. Self-RAG first uses the retrieval token to decide whether the user's question requires retrieval at all; if it does, the LLM calls the external retrieval module to fetch relevant documents. The LLM then generates an answer for each retrieved chunk and produces a reflection score for each answer to judge whether the retrieved document is actually relevant. Finally, the high-scoring documents are handed to the LLM to produce the final result.
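A hand-rolled sketch of the sentence-window idea from 9.2: match on small chunks, but hand the LLM a window of neighboring chunks as context. The example chunks are illustrative.

```python
# Hand-rolled sketch of sentence-window retrieval (9.2): the retriever matches
# a small chunk, but the LLM receives that chunk plus its neighbors.
def expand_with_window(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the matched chunk plus `window` chunks on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])

chunks = ["Chapter 3 covers returns.",
          "Items can be returned within 30 days.",
          "Refunds are issued to the original payment method."]

# Suppose the retriever matched chunk 1; the LLM receives chunks 0-2 as context.
context = expand_with_window(chunks, hit_index=1, window=1)
print(context)
```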

10. Re-ranking

After completing the optimizations for semantic search, you can retrieve the most semantically similar documents. But notice a key problem: is the most semantically similar result always the most relevant? Not necessarily. For example, for the query "recommendations for the latest science fiction movies", the top result might be "the historical evolution of science fiction movies". It is semantically related to science fiction movies, but it does not answer the user's request for the latest ones.

A re-ranking model can help alleviate this problem. The re-ranking model performs a deeper relevance assessment and ordering of the initial search results, ensuring that the results ultimately shown to the user better match their query intent. This step is usually implemented with deep learning models, such as Cohere's reranker. These models take into account additional features such as query intent, multiple word senses, the user's historical behavior, and contextual information.

For example, for the query "recommendations for the latest science fiction movies", the first retrieval phase may return keyword-based results including historical articles on science fiction movies, introductions to science fiction novels, news about the latest releases, and so on. In the re-ranking phase, the model analyzes these results in depth and ranks the ones that best match the user's intent (such as lists, reviews, or recommendations of the latest science fiction movies) at the top, while ranking content about the history of science fiction movies or other less relevant results lower. In this way, the re-ranking model effectively improves the relevance and accuracy of the search results and better meets the user's needs.

In practice, any system built with RAG should consider trying a re-ranking step and evaluating whether it improves system performance.
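A minimal sketch of re-ranking with a cross-encoder. The checkpoint name is a commonly used public model on Hugging Face, shown here as one option; Cohere's rerank API or another reranker could be swapped in.

```python
# Minimal sketch of re-ranking (10) with a cross-encoder. The model name is
# one commonly used public checkpoint; swap in another reranker as needed.
from sentence_transformers import CrossEncoder

query = "recommendations for the latest science fiction movies"
candidates = [
    "The historical evolution of science fiction movies",
    "Top science fiction movies released this year, reviewed",
    "An introduction to classic science fiction novels",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```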

11. Prompts

The decoder of a large language model predicts the next token based on the given input, which means the way you phrase the prompt or question directly affects the probability distribution over what the model generates next. This gives us some leverage: by changing the form of the prompt, you can effectively influence how the model accepts and answers different types of questions. For example, telling the LLM explicitly in the prompt what task it is performing is very helpful.

To reduce the probability of subjective answers and hallucinations, the prompt in a RAG system should, in general, clearly instruct the model to answer based only on the retrieved results and not to add any other information. For example, you can set a prompt such as:
"You are an AI agent. Your goal is to provide accurate information and solve the questioner's problem as well as possible. Be friendly but not too wordy. Answer queries based only on the context provided, without relying on prior knowledge."

Of course, depending on the scenario, you can also allow the model to bring some of its own judgment or knowledge into its answers. In addition, using a few-shot approach, adding the desired question-and-answer examples to the prompt and guiding the LLM on how to use the retrieved knowledge, is also an effective way to improve the quality of the generated content. This not only makes the model's answers more accurate but also improves their usefulness in specific situations.
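A minimal sketch of a grounded RAG prompt with one few-shot example. The wording, the example, and the retrieved chunks are illustrative and should be adapted to the scenario.

```python
# Minimal sketch of a grounded RAG prompt with one few-shot example.
# All wording and example content is illustrative.
RAG_PROMPT = """You are an AI assistant. Answer the user's question using ONLY the
context below. If the context does not contain the answer, say "Sorry, I don't know."

Example
Context: The warranty period for model X is 24 months.
Question: How long is the warranty for model X?
Answer: The warranty for model X is 24 months.

Context:
{context}

Question: {question}
Answer:"""

retrieved_chunks = ["Returns are accepted within 30 days."]
prompt = RAG_PROMPT.format(
    context="\n".join(retrieved_chunks),
    question="Can I return an item after two weeks?",
)
print(prompt)
```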

12. Large Language Model

The last step is to generate the answer with the LLM, the core component for producing responses. As with the embedding model, you can choose an LLM according to your needs, weighing open versus proprietary models, inference cost, context length, and so on. In addition, you can use an LLM development framework such as LlamaIndex or LangChain to build the RAG system. Both frameworks provide good debugging tools that let us define callbacks, see which contexts were used, check which document a retrieval result came from, and so on.
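A minimal end-to-end sketch with LlamaIndex: load documents from a local directory, build a vector index, and query it. It assumes the recent llama_index.core packaging, documents in a ./data directory, and an OpenAI API key in the environment; the query text is illustrative.

```python
# Minimal end-to-end sketch with LlamaIndex: load documents from ./data,
# build a vector index, and query it. Assumes recent llama-index packaging
# (llama_index.core) and an OpenAI API key in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What are the key points of the onboarding guide?")

print(response)
# Inspect which chunks the answer was grounded on.
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))
```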

Summary

Building a RAG system requires both technical depth and engineering practice, with careful design spanning retrieval accuracy, generation quality, and system stability. By combining the latest optimization strategies and tool chains (such as Cursor and MCP technologies), it is possible to effectively solve the core problems in development and build an efficient, reliable, and adaptable RAG system. As multimodal technology, long-context understanding in large models, dynamic learning mechanisms, and privacy-protection techniques continue to mature, RAG systems will unleash their potential in more vertical scenarios and become one of the core technologies of the next generation of intelligent applications, providing users with smarter and more reliable services.