A complete analysis of RAG technology: from basic principles to practical optimization

Get an in-depth understanding of RAG technology and explore its application and optimization techniques in enhancing large language models.
Core content:
1. Definition of RAG technology and the problems it solves
2. Detailed analysis of the RAG process, including document processing and retrieval generation
3. Practical RAG optimization, including code implementation and performance evaluation
What is RAG, and what does the RAG process look like? Why is RAG needed? How should documents be chunked, and how do you choose the chunk size? How is similarity calculated during retrieval, and what are the drawbacks of cosine similarity? How can the RAG process be optimized? What is re-ranking, and how is it done? What is GraphRAG? How do you evaluate the effectiveness of a RAG system? What are the disadvantages of RAG?
This article explains in detail what RAG is, how the RAG process works, and advanced RAG optimization.
It is packed with practical information; after reading it, you should have a much deeper understanding of RAG.
About RAG
Retrieval-Augmented Generation (RAG) is a common way to augment a model's knowledge without fine-tuning. With RAG, an LLM can retrieve contextual documents from a database to improve the accuracy of its answers.
Large language models are trained on massive but fixed datasets, so their knowledge is time-bound: if you ask about the latest documents or specialized domain knowledge, the LLM cannot answer. Retrieval-Augmented Generation solves this problem by supplementing the LLM's existing knowledge with your own data.
RAG addresses the limitations of purely generative models (such as hallucinations and outdated knowledge) and improves the credibility and timeliness of generated results by dynamically retrieving external knowledge.
RAG Process
The typical RAG process is divided into two parts:
1. Building the vector store
Creating the vector store is the first step in building the Retrieval-Augmented Generation (RAG) pipeline: documents are loaded, split, and embedded into the vector database.
- Load documents: load various kinds of unstructured data, such as TXT, PDF, JSON, HTML, and Markdown. LangChain encapsulates DocumentLoaders for these formats.
- Split text: divide the text into smaller chunks.
- Embed: use an embedding model to convert each chunk of text into a vector (an array of floats).
- Vector store (VectorStore): stores the embedding vectors and can efficiently retrieve and query the "most similar" entries based on vector similarity.
2. Retrieval and generation
Use the vector store to perform a similarity search on the user's input, then feed the user's question together with the retrieved context to the LLM; the LLM analyzes, reasons, and answers the user's question.
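The two stages can be sketched with LangChain. This is a minimal, illustrative example only; the file path, model names, and chunking parameters are assumptions, not values from this article.

```python
# A minimal sketch of the two RAG stages with LangChain.
# Assumptions: langchain, langchain-openai, and faiss-cpu are installed,
# OPENAI_API_KEY is set, and "docs/guide.md" is a placeholder document path.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# 1. Build the vector store: load -> split -> embed -> store
docs = TextLoader("docs/guide.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 2. Retrieval + generation: similarity search, then answer grounded in the context
question = "What is RAG?"
context = "\n\n".join(d.page_content for d in vector_store.similarity_search(question, k=4))
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer strictly based on the following context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```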
I built a basic RAG flow using LangGraph:
A retrieval tool is called to search for relevant documents, and a GradeDocument step then scores the documents retrieved from the vector store: if a retrieved document is relevant to the user's input, GenerateAnswer generates an answer and returns it; if it is not relevant, Rewrite reformulates the query and retrieval runs again.
The code has been uploaded to GitHub: https://github.com/Liu-Shihao/ai-agent-demo/tree/main/src/rag_agent
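For reference, a stripped-down sketch of such a graph is shown below (assuming a recent version of langgraph). The node names mirror the description above, but the node bodies are placeholders rather than the code from the repository.

```python
# Minimal LangGraph sketch of the flow described above:
# retrieve -> grade -> (generate | rewrite -> retrieve).
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class RagState(TypedDict):
    question: str
    documents: list[str]
    answer: str

def retrieve(state: RagState) -> dict:
    return {"documents": ["...retrieved chunks..."]}  # call the retrieval tool here

def generate_answer(state: RagState) -> dict:
    return {"answer": "...LLM answer grounded in the documents..."}

def rewrite(state: RagState) -> dict:
    return {"question": state["question"] + " (rewritten)"}

def grade_documents(state: RagState) -> str:
    # Route to the next node based on document relevance (placeholder check).
    relevant = bool(state["documents"])
    return "generate_answer" if relevant else "rewrite"

builder = StateGraph(RagState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate_answer", generate_answer)
builder.add_node("rewrite", rewrite)
builder.add_edge(START, "retrieve")
builder.add_conditional_edges("retrieve", grade_documents,
                              {"generate_answer": "generate_answer", "rewrite": "rewrite"})
builder.add_edge("rewrite", "retrieve")
builder.add_edge("generate_answer", END)
graph = builder.compile()
```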
Advanced - RAG Optimization
Document Chunking
The number of tokens in a conversation with a large model is limited. Document chunking divides a document into small text blocks that are suitable for retrieval and save tokens; the length of the chunks also affects the quality of the LLM's answer.
Common methods for splitting documents:
- Fixed-length chunking (with overlapping boundaries): split by characters or number of tokens (e.g. 512 tokens), overlapping adjacent chunks to avoid losing information at the boundaries. This is the simplest method but may truncate semantics.
- Sentence-boundary chunking: split on punctuation, e.g. using an NLP framework such as SpaCy; long paragraphs may still be broken mid-thought.
- Custom-rule chunking: use regular expressions or a DOM parser (such as BeautifulSoup) to split the text into logical blocks (titles, paragraphs). Suitable for structured documents, but the splitting rules must be designed by hand.
- Semantic chunking: use a Transformer model to analyze semantic relationships and split at semantic boundaries.
Optimization principles:
The chunk size must match the token limits of the embedding model and the LLM. Try to keep key information (entities and their relationships) within the same chunk.
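As a concrete illustration of fixed-length chunking with overlap, here is a small sketch using LangChain's RecursiveCharacterTextSplitter; the chunk size, overlap, separators, and file path are illustrative assumptions, not recommendations from this article.

```python
# Fixed-length chunking with overlapping boundaries (all values are illustrative).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # maximum characters per chunk
    chunk_overlap=64,    # overlap to avoid losing boundary information
    separators=["\n\n", "\n", "。", ".", " "],  # prefer paragraph/sentence boundaries
)
text = open("docs/guide.md", encoding="utf-8").read()  # placeholder document path
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks[:3]):
    print(f"--- chunk {i} ({len(chunk)} chars) ---\n{chunk[:80]}")
```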
Similarity Algorithm
In RAG (Retrieval Augmented Generation) and other information retrieval tasks, similarity algorithms are used to measure the degree of association between text, vectors, or entities.
- Euclidean distance (L2): measures the length of the line segment connecting two points, i.e. the straight-line distance between two vectors. It is the most commonly used distance metric and is very useful for continuous data. The smaller the value, the higher the similarity.
- Cosine similarity (COSINE): uses the cosine of the angle between two vectors to measure their similarity. Cosine similarity always lies in the interval [-1, 1]; the larger the cosine value, the smaller the angle between the two vectors, meaning the vectors are more similar to each other. Well suited to comparing text embeddings.
- BM25 (Best Matching 25): based on term frequency (TF) and inverse document frequency (IDF). It scores relevance using term frequency, inverse document frequency, and document-length normalization, and is used to evaluate the relevance of a document to a query. It is widely used in search engines and question-answering systems; for example, Elasticsearch ranks with BM25 by default.
  - Term frequency (TF): measures how often the query terms occur in the document, while the parameter k1 controls term-frequency saturation so that high-frequency words do not dominate the score.
  - Inverse document frequency (IDF): penalizes common words (such as "的" and "是") and increases the weight of rare words. It reflects how important a term is across the whole corpus: terms that appear in fewer documents have higher IDF values and contribute more to relevance.
  - Document-length normalization: longer documents tend to score higher simply because they contain more terms; BM25 mitigates this bias by normalizing for document length, adjusting the scores of long documents to avoid term-frequency bias caused by length.
- Jaccard similarity (Jaccard index): compares the ratio of the intersection to the union of two sets. Applicable scenarios: keyword sets, recommendation systems (such as matching user interests). Its range is [0, 1]; the larger the value, the higher the similarity.
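To make these metrics concrete, here is a small self-contained sketch that computes Euclidean distance, cosine similarity, and Jaccard similarity; the example vectors and token sets are made up for illustration.

```python
# Self-contained examples of the vector/set similarity measures described above.
import math

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

query_vec = [0.2, 0.8, 0.1]    # made-up embedding of a query
doc_vec = [0.25, 0.75, 0.05]   # made-up embedding of a document
print(euclidean_distance(query_vec, doc_vec))   # smaller = more similar
print(cosine_similarity(query_vec, doc_vec))    # closer to 1 = more similar
print(jaccard_similarity({"cat", "dog", "pet"}, {"cat", "pet", "food"}))  # 0.5
```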
Typical applications in RAG
- Preliminary retrieval: cosine similarity (quickly screen candidate documents).
- Re-ranking: a cross-encoder (finely sort the Top-K results).
- Deduplication: Jaccard similarity (merge duplicate fragments).
By flexibly combining these algorithms, the recall, precision, and response speed of a RAG system can all be optimized.
Disadvantages of Cosine Similarity (COSINE)
- Ignores vector length: cosine similarity only considers the angle between vector directions and ignores the length (modulus) of the vectors, so any information carried by vector magnitude is lost.
- High-frequency word interference: long texts with high term-frequency or TF-IDF values may dominate the vector direction even when the actual semantics are irrelevant, amplifying the influence of unimportant words.
- Normalization dependency: unnormalized vectors can bias the similarity calculation; long texts contain more words, so after the values of each dimension accumulate, their vector length (modulus) is significantly larger than that of short texts.
- Semantic similarity ≠ relatedness: cosine similarity is based on surface-level semantic matching, and related documents are not necessarily semantically similar.
Surface matching: if two texts share many of the same keywords (such as "cat", "dog", "pet"), the cosine similarity may still be high even if the logic differs. For example:
Document 1: "Cats and dogs are common pets." (positive statement)
Document 2: "Cats and dogs are not suitable as pets." (negative opinion)
The cosine similarity is high, but the meanings are opposite.
Reversed word order can leave the cosine similarity unchanged (for bag-of-words style representations). Example:
Sentence A: "The doctor treats the patient."
Sentence B: "The patient treats the doctor."
Solution:
- Vector normalization: force all vectors to unit length (e.g. L2 normalization).
- Combine with other metrics: e.g. dot-product similarity (which takes length into account) or BM25 (weighted by term frequency).
- Re-rank: use a cross-encoder (such as MiniLM) to refine the ranking.
- Hybrid retrieval: combine with keyword matching (BM25) or knowledge-graph relationships.
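The length-invariance and word-order issues can be demonstrated with simple bag-of-words vectors; this is an illustrative sketch, not code from the article.

```python
# Demonstrates two pitfalls of cosine similarity using bag-of-words vectors:
# (1) it ignores word order, (2) it ignores vector magnitude.
from collections import Counter
import math

def bow_vector(text: str, vocab: list[str]) -> list[float]:
    counts = Counter(text.lower().replace(".", "").split())
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vocab = ["the", "doctor", "treats", "patient"]
a = bow_vector("The doctor treats the patient.", vocab)
b = bow_vector("The patient treats the doctor.", vocab)
print(cosine(a, b))                      # 1.0 -- word order is lost entirely
print(cosine(a, [x * 10 for x in a]))    # 1.0 -- scaling the vector changes nothing
```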
Rerank
Re-ranking is a technique for optimizing the order of preliminary retrieval results, aiming to improve their relevance and accuracy.
The initial retrieval (e.g. by cosine similarity) may return fragments that are semantically related but redundant or of low quality; re-ranking can incorporate additional features to refine the order.
Methods:
- Cross-Encoder: a model such as MiniLM-L6-v2 computes a relevance score between the query and each candidate document (more accurate than the embedding model, but slower).
- Learning to Rank: train a ranking model over multiple features (such as keyword match and click-through rate).
- Rule adjustments: remove duplicate content and prioritize documents with high freshness.
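A minimal cross-encoder re-ranking sketch using the sentence-transformers library is shown below; the model name and candidate documents are illustrative assumptions.

```python
# Re-rank the Top-K candidates from the initial retrieval with a cross-encoder.
# Assumes sentence-transformers is installed; the model name is illustrative.
from sentence_transformers import CrossEncoder

query = "How to treat a cold?"
candidates = [
    "Common symptoms of a cold include a runny nose and sore throat.",
    "Rest, fluids, and over-the-counter medicine help treat a cold.",
    "Cats and dogs are common pets.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, doc) for doc in candidates])  # one score per (query, doc) pair
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```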
Graph RAG
Using a knowledge graph (KG) to enhance RAG (retrieval-augmented generation) can significantly improve complex reasoning, multi-hop question answering, and relationship mining. By extracting the entities and relations in documents into a knowledge graph, the retrieval phase returns not only text fragments but also the related subgraph structure, which strengthens the contextual understanding of the generation model.
Differences from traditional RAG: traditional RAG retrieves isolated text chunks by vector similarity, while GraphRAG additionally retrieves the entities and relationships around them, so the context can follow explicit paths across documents.
Implementation steps:
1. Entity recognition (NER): use a SpaCy NLP model or an LLM to perform named-entity extraction, identifying entities such as people, organizations, locations, and dates in the text.
2. Relation extraction: use an LLM to extract triples 〈Subject, Predicate, Object〉.
3. Graph storage: store the nodes and relationships in a graph database such as Neo4j.
A triple is the basic data unit of a knowledge graph and is used to represent the relationship between entities; its structure is 〈Subject, Predicate, Object〉.
By introducing a knowledge graph, the RAG system can be upgraded from "flat retrieval" to "structured, multi-hop reasoning", which is particularly suitable for complex scenarios that require deep exploration of entity relationships.
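As a sketch of steps 2 and 3, the snippet below stores LLM-extracted triples in Neo4j. The connection details and example triples are placeholders, and the graph schema (an Entity label with a RELATED relationship carrying a predicate property) is an assumption, not a design from this article.

```python
# Store extracted <Subject, Predicate, Object> triples in Neo4j.
# Connection details and the example triples are placeholders.
from neo4j import GraphDatabase

triples = [
    ("Company A", "competitor_of", "Company B"),   # e.g. extracted by an LLM
    ("Alice", "ceo_of", "Company B"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for subj, pred, obj in triples:
        session.run(
            "MERGE (s:Entity {name: $subj}) "
            "MERGE (o:Entity {name: $obj}) "
            "MERGE (s)-[:RELATED {predicate: $pred}]->(o)",
            subj=subj, pred=pred, obj=obj,
        )
driver.close()
```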
RAG Evaluate
RAG can be evaluated along the following two dimensions:
Retrieval quality
- Context Precision: the proportion of relevant chunks within the retrieved context; the precision at rank k is the number of relevant chunks among the top k divided by k.
- Context Recall: the proportion of relevant documents that appear in the top-K results. It measures how many of the relevant documents (or pieces of information) were successfully retrieved; higher recall means fewer relevant documents were missed.
Generation quality
- Response Relevancy: how well the generated answer fits the question, i.e. the relevance of the answer to the user input. The higher the score, the better the match; if the generated answer is incomplete or contains redundant information, the score is lower.
- Faithfulness: measures the factual consistency between the answer and the retrieved content, i.e. whether the answer is strictly grounded in the retrieved context, reducing hallucinations.
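The two retrieval metrics can be computed directly from relevance labels. The snippet below is a minimal illustration with made-up labels, not an excerpt from any evaluation framework.

```python
# Compute context precision@k and context recall@k from relevance judgments.
# The retrieved IDs and the relevant set are made-up example data.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

retrieved = ["doc3", "doc7", "doc1", "doc9"]   # ranked chunks returned by the retriever
relevant = {"doc1", "doc3", "doc5"}            # chunks that actually answer the question

print(precision_at_k(retrieved, relevant, k=4))  # 2/4 = 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 2/3 ≈ 0.67
```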
What are the disadvantages of RAG?

Disadvantage Category | Specific issues | Solution
---|---|---
Search quality | Chunking causes context fragmentation: fixed-size chunks may cut off key information, and the answer may be scattered across multiple chunks. | Dynamic chunking (split along semantic boundaries such as paragraphs and chapters).
Search quality | Semantic relevance ≠ answer relevance: vector search (e.g. cosine similarity) may return documents that are semantically related but contain no actual answer (the query "How to treat a cold?" may retrieve descriptions of cold symptoms instead of treatment plans). | Introduce re-ranking models (e.g. cross-encoders); hybrid retrieval (combine with keyword retrieval such as BM25).
Generation deviation | The generative model may ignore the retrieved documents and still rely on its own parametric knowledge (hallucination). | Strengthen prompt engineering (e.g. "Answer strictly based on the following context").
Efficiency issues | The two-stage retrieve-then-generate process leads to long response times (especially when re-ranking is involved). | Cache high-frequency query results.
Knowledge coverage | The quality of retrieval depends on the external knowledge base: if it is incomplete, outdated, or noisy, the retrieved content may be irrelevant or wrong, degrading the quality of the generated answers. | Regularly update the knowledge base (e.g. crawl authoritative data sources).
Complex reasoning | Traditional RAG has difficulty answering questions that require multi-step reasoning (such as "Who is the CEO of Company A's competitor?"). | Introduce a knowledge graph (GraphRAG) to explicitly model entity relationships.
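As a tiny illustration of the caching mitigation in the table, a query-level cache can short-circuit repeated questions. The in-memory lru_cache below is a simplification of the idea (production systems would typically use an external cache and handle near-duplicate queries).

```python
# Cache answers for repeated (high-frequency) queries to cut end-to-end latency.
# rag_answer() is a placeholder for the full retrieve -> rerank -> generate pipeline.
from functools import lru_cache

def rag_answer(question: str) -> str:
    # ...run retrieval, re-ranking, and generation here...
    return f"answer for: {question}"

@lru_cache(maxsize=1024)
def cached_rag_answer(question: str) -> str:
    # Normalizing the query improves the hit rate for trivially different phrasings.
    return rag_answer(question.strip().lower())

print(cached_rag_answer("What is RAG?"))   # computed by the pipeline
print(cached_rag_answer("What is RAG?"))   # served from the cache
```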