One article to understand the practice of quickly building a local RAG knowledge base with a large language model

Master large language model and RAG technology to quickly build an enterprise knowledge base.
Core content:
1. The principles of RAG technology and its value in knowledge base construction
2. Three solutions for knowledge base question answering based on large language models
3. How RAG technology enables AI to retrieve information and answer questions accurately
1. Explanation of Terms Related to the RAG Knowledge Base
1. Document parsing (document → structured data): parses unstructured text documents in PDF (non-scanned), Word, and TXT formats into standard structured document data.
2. Knowledge construction (structured data → index construction): based on the structured document data and the chosen index construction strategy, generates several types of knowledge indexes for the retrieval tasks in the subsequent question-answering process.
3. Knowledge retrieval (user question → retrieval index → retrieval results): identifies the intent of the user's question, rewrites and optimizes it, and recalls knowledge according to the retrieval strategy for the subsequent large-model generation task.
4. Prompt: an instruction to the AI that guides the model to generate a response suited to the business scenario.
5. RAG question-answering generation (question-answering prompt → large-model generation): following the Retrieval-Augmented Generation (RAG) paradigm, the user's question and the knowledge retrieval results are assembled into a question-answering prompt and given to the large model, whose text generation capability produces the answer.
2. Three solutions for building knowledge base question and answer based on large language models
1. Fine-Tuning
Build a training data set from proprietary knowledge and fine-tune the large model on it; changing the weights of the parameters in the neural network is equivalent to letting the model learn this knowledge. Fine-tuning suits specific tasks, but it has problems: it does not solve reliable factual question answering, it carries high compute and time costs, it requires building a fine-tuning corpus for the specific domain, and the results of fine-tuning are uncertain.
2. Use Prompt Engineering
Domain expertise is fed to the model as input, similar to short-term memory: limited in capacity but clear. The advantage is that the model's answers are correct and accurate. The disadvantage is that every large model limits the maximum input length, so the amount of text that can be processed at one time is bounded. For a knowledge base, this approach is not feasible or efficient.
3. Combine knowledge retrieval to enhance large models
When answering a question, information retrieval is used to query the knowledge base, and the retrieval results are provided to the large model for understanding and generation. This lets the large model act as an intermediary between the user and the search system and exercise its natural language processing capabilities: preprocessing user requests (error correction, key point extraction) to achieve "understanding", and summarizing, analyzing, and reasoning over the output while ensuring correctness. In terms of data scale, query efficiency, and update method, this approach meets the needs of common knowledge base application scenarios.
RAG mainly refers to the third solution above. Its core idea is: let the AI large language model learn to search for information, and then use the found content to answer the question. Before DeepSeek came out, many models did not have the ability to search online; the [online search] currently on the market is actually a kind of RAG, also known as a knowledge base plug-in. In enterprise applications, however, RAG often works over business data and knowledge bases that cannot be made public outside the enterprise. The value of RAG is especially obvious for large models in vertical fields, for example internal user data, data warehouses accumulated over many years, search platform data, research reports, legal texts, and contracts. RAG connects data islands, so that companies without the ability to develop large models themselves or to purchase GPUs can still quickly extract value from isolated data.
3. Development History of Knowledge Base
1. Early knowledge bases were mainly paper documents, which were very inefficient to retrieve and update. Over time the number of files grew, and paper documents deteriorated from years without maintenance and could no longer provide value.
2. With the information age, document management gradually became electronic, and documents stored in computer systems became far easier to use. However, they were stored in isolation, without relationships established between pieces of document knowledge and without linkage of related knowledge.
3. Cloud computing and big data have promoted the implementation and development of a series of artificial intelligence technologies, enabling people to move from the information age to the intelligent age.
4. Development History of RAG Technology
1. Naive RAG
The earliest RAG practice used simple full-text search or vector search to obtain data relevant to the input. Lacking semantic understanding, Naive RAG leaves considerable room for improvement in output quality.
2. Advanced RAG
It optimizes on top of Naive RAG, strengthening the stages before, during, and after retrieval.
(1) In the indexing (pre-retrieval) stage, techniques such as sliding windows, fine-grained segmentation, and metadata integration improve the quality of indexed content. This includes knowledge base data quality optimization, index optimization, query rewriting, and embedding fine-tuning to generate semantic vectors that understand the context more accurately.
(2) In the retrieval stage, the retrieval strategy itself is optimized to recall more relevant candidates.
(3) After retrieval, the retrieved documents are post-processed (expanded, rewritten, sorted, summarized) and re-ranked by relevance, achieving higher retrieval efficiency and accuracy; the information finally provided to the large model is more focused, making the generated content richer and more accurate.
3. Agentic RAG
This is the most powerful RAG approach to date. It designs and orchestrates RAG modules as a pipeline, can dynamically decide to integrate multiple APIs or system tool capabilities, and calls LLM-based agents to solve complex problems in real time. These designs not only improve the overall performance of the system but also give developers customizable solutions.
5. Core Principles of RAG Technology
The advantage of an external knowledge base is that massive external data can supplement the model's knowledge, improving the quality and accuracy of answers; this matters especially in dynamic scenarios where the knowledge base is updated frequently. It is a low-cost approach: local expertise is used to process high-quality data into a knowledge base, and a large model then completes retrieval, recall, and summary generation, helping every industry achieve accurate question answering over professional knowledge. RAG is a hybrid deep learning approach that combines retrieval and generation and is often used for complex natural language processing tasks. By combining information from an external knowledge base with a large language model, RAG can provide more accurate and contextual answers.
Specifically, RAG core technology mainly includes three steps: retrieval, enhancement and generation.
1. When a question comes, search first (Retrieval)
When a user submits a query, RAG uses vector retrieval to fetch the documents or information fragments most relevant to the query from the pre-built external knowledge base (such as databases, web pages, or document collections). Specifically, the user's query is converted into a vector by the embedding model so that it can be compared with the knowledge vectors stored in the vector database. A similarity search finds the top K best-matching chunks, and the preliminary results are re-ranked by a rerank model to improve their relevance and quality. In [online search], the analogous step is first searching the Internet for relevant information, articles, and content.
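To make the step concrete, here is a minimal sketch of vector retrieval, assuming sentence-transformers is installed; the model name and the toy corpus are illustrative placeholders, not a prescribed setup:

```python
# Minimal vector-retrieval sketch. The model name and corpus are assumptions
# for illustration; swap in your own embedding model and knowledge base.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-zh")  # embedding model

corpus = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings for similarity search.",
    "Fine-tuning changes the weights of the model.",
]
# normalize_embeddings=True makes the dot product equal cosine similarity
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

query = "How does RAG answer questions?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]

k = 2
scores = doc_vecs @ q_vec              # cosine similarity to every chunk
top_k = np.argsort(-scores)[:k]        # indices of the K best-matching chunks
retrieved = [corpus[i] for i in top_k]
```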
2. Once you have the knowledge, process it (Augmentation)
The user's query and the retrieved related content are embedded into a preset prompt template.
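A minimal sketch of such a template; the wording is an assumption for illustration:

```python
# Minimal prompt-assembly sketch; the template wording is illustrative.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context does not contain the answer, say that you do not know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)  # splice the K retrieved chunks together
    return PROMPT_TEMPLATE.format(context=context, question=question)
```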
3. Generate the answer, and write it well (Generation)
Feed the filled-in prompt template into a large language model (such as DeepSeek). The model generates a response with a clear basis in the retrieved content, which greatly improves the interpretability of the large model and reduces the risk of it making things up out of thin air.
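A sketch of the generation call, reusing build_prompt from the sketch above and assuming an OpenAI-compatible client; the base_url, API key, and model name are placeholders for whatever provider you use:

```python
# Generation sketch via an OpenAI-compatible API. base_url, api_key, and the
# model name are assumptions; build_prompt comes from the sketch above.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

def answer(question: str, chunks: list[str]) -> str:
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
        temperature=0.1,  # low temperature for precise, grounded answers
    )
    return resp.choices[0].message.content
```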
6. Building an efficient RAG knowledge base involves five basic processes: knowledge text preparation, embedding model, vector database, encapsulated retrieval interface, and answer generation from queries.
1. Knowledge text preparation: preprocess and load documents, then cut them into segments according to set conditions
The text quality of the knowledge base itself is crucial to the final result: it is the original corpus from which the large model generates answers. Document preprocessing should follow standardized document templates to ensure consistency and readability of the content.
(1) Document naming: keep names consistent, limited in length, and concise in meaning, free of meaningless numbers, symbols, or abbreviations.
(2) Document language: unify the language of descriptions, because embedding models handle Chinese, English, and Traditional versus Simplified Chinese differently, and mixed-language text may vectorize into garbled or useless data.
(3) Document content: set clear hierarchical headings, and give special handling to pictures, tables, formulas, hyperlinks, and attachments.
(4) Establish question-answer pairs: construct question-answer pairs based on the ways users are likely to ask, as raw data for the knowledge base.
2. Embedding model: when the text corpus is uploaded and the local knowledge base is built, long texts must be cut into chunks for analysis and processing (a minimal chunking sketch follows this list).
(1) Set a reasonable chunk_size for document cutting, and verify at which chunk_size the selected embedding model performs best.
(2) Document paragraph processing: based on the chunk_size set during segmentation, split or merge the paragraphs of knowledge base documents so that coherent semantic units are not cut apart.
(3) Knowledge base document annotation: to improve recall accuracy, annotate the knowledge base document content before importing.
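A minimal chunking sketch along the lines of point (2) above; the chunk_size default is illustrative and should be tuned against your embedding model:

```python
# Paragraph-aware chunking sketch: merge consecutive paragraphs up to
# chunk_size so coherent semantic units are not cut apart. The chunk_size
# default is an illustrative value to tune per embedding model.
def chunk_by_paragraph(paragraphs: list[str], chunk_size: int = 500) -> list[str]:
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)  # current chunk is full, start a new one
            current = para
        else:
            current = current + "\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```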
3. Vector database: put the cut text fragments into the vector database
(1) After the knowledge base documents are cut into text chunks, they are converted via an embedding model into vectors the algorithms can process, and stored in the vector database.
(2) Verify different embedding models against each other; in practice, bge-large-zh performs better than m3e-base.
(3) Recall accuracy: Top5 and Top10 are generally used to evaluate the quality of the embedding model. TopN recall accuracy = (number of queries whose answer is contained in the TopN recalled chunks) / (total number of queries).
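A sketch of how this TopN recall accuracy can be measured, assuming an evaluation set of (query, id of the chunk containing the answer) pairs, and the embedder and doc_vecs from the retrieval sketch above:

```python
# TopN recall evaluation sketch. eval_set is assumed to be a list of
# (query, gold_chunk_id) pairs; embedder and doc_vecs come from the
# retrieval sketch above.
import numpy as np

def recall_at_n(eval_set, doc_vecs, embedder, n: int = 5) -> float:
    hits = 0
    for query, gold_chunk_id in eval_set:
        q_vec = embedder.encode([query], normalize_embeddings=True)[0]
        top_n = np.argsort(-(doc_vecs @ q_vec))[:n]
        if gold_chunk_id in top_n:  # the answer chunk was recalled
            hits += 1
    return hits / len(eval_set)     # TopN recall accuracy
```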
4. Encapsulated retrieval interface (question understanding and query retrieval): user asks a question -> the question is embedded by the embedding model -> the question vector is matched for similarity in the knowledge base's vector database -> a rerank model selects the K chunks with the highest recall scores
(1) After the user asks a question, the question is also vectorized and matched against the chunks in the vector database; the top K chunks most similar to the question vector are returned.
(2) The value of K must be tuned continuously against the actual scenario. Generally, increasing K raises the probability that the recalled segments contain the correct answer, but it also recalls more irrelevant information, which degrades the quality of the answers the model generates.
(3) Temperature parameter: 0 means very precise and deterministic, while values toward 1 make the output more divergent and creative. Set an appropriate temperature for your own scenario.
(4) Top K sorting: the Top K returned by the search are sorted according to their order in the database, preserving the context structure of the original documents. The Top K can also be enlarged, for example from 10 to 30, and a more accurate algorithm then used for reranking (see the sketch below).
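A sketch of this expand-then-rerank pattern, using a cross-encoder as the more accurate reranking step; the reranker model name is an assumption, and corpus/doc_vecs come from the earlier sketches:

```python
# Expand-then-rerank sketch: recall a generous Top 30 by vector similarity,
# rerank with a cross-encoder, keep the Top 10. The reranker model name is
# an assumption; corpus and doc_vecs come from the earlier sketches.
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def retrieve_and_rerank(query, q_vec, corpus, doc_vecs, recall_k=30, final_k=10):
    candidates = np.argsort(-(doc_vecs @ q_vec))[:recall_k]  # coarse vector recall
    pairs = [(query, corpus[i]) for i in candidates]
    scores = reranker.predict(pairs)                         # fine-grained relevance
    best = np.argsort(-scores)[:final_k]
    return [corpus[candidates[i]] for i in best]
```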
5. RAG knowledge base project. For the specific construction process, refer to "Start the Journey of Exploring Intelligent Agents and Knowledge Base: Dify Knowledge Base Construction RAG": Query -> Search -> Prompt -> LLM -> Reply
(1) Connecting to the vector database
(2) Loading vector model
(3) Create an index
(4) Load the local text file, split it, and add the data to the index
(5) Use queries to retrieve data from the vector database
(6) Summarize the search results
(7) The retrieval-result prompt (original query text + user prompt + K chunks) is sent to the large language model (LLM), which generates an answer and returns it to the user.
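Putting steps (1) through (7) together, a compact end-to-end sketch using FAISS as the vector database; "docs.txt" is a placeholder file, and chunk_by_paragraph() and answer() come from the earlier sketches:

```python
# End-to-end sketch of steps (1)-(7). faiss and sentence-transformers are
# assumed installed; "docs.txt" is a placeholder, and chunk_by_paragraph()
# and answer() come from the earlier sketches.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-zh")        # (2) load vector model

text = open("docs.txt", encoding="utf-8").read()           # (4) load the local file,
chunks = chunk_by_paragraph(text.split("\n\n"))            #     split it into chunks
vecs = np.asarray(
    embedder.encode(chunks, normalize_embeddings=True), dtype="float32"
)

index = faiss.IndexFlatIP(vecs.shape[1])                   # (1)(3) create the index
index.add(vecs)                                            # inner product = cosine here

query = "What is RAG?"
q = np.asarray(
    embedder.encode([query], normalize_embeddings=True), dtype="float32"
)
_, ids = index.search(q, 5)                                # (5) retrieve the Top 5
top_chunks = [chunks[i] for i in ids[0]]                   # (6) collect the results

print(answer(query, top_chunks))                           # (7) prompt -> LLM -> reply
```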
7. An enterprise knowledge base also needs some targeted functional design
1. Knowledge base management: Supports the full life cycle management of knowledge bases and knowledge data. Users can use the knowledge base management function through interfaces and other methods.
2. Knowledge processing: Supports knowledge processing of multiple file formats and multiple data types, and obtains high-quality knowledge information from disordered raw data.
3. Knowledge construction: Supports a variety of knowledge construction strategies, and users can configure them according to the actual knowledge data to achieve the best knowledge construction effect.
4. Knowledge retrieval: Supports vector, full-text, hybrid and other knowledge retrieval modes, and provides a wealth of retrieval parameter configuration items. Users can use the knowledge retrieval function through task flow components, interfaces and other methods.
8. Typical knowledge base application scenarios of large language model LLM in the industry
1. Document fragment retrieval: use the knowledge base to split complete documents into chunks and build knowledge from them; retrieve with the text the user enters and return the Top K most similar document chunks.
2. Text retrieval: use the knowledge base to build knowledge directly from already-segmented text; retrieve with the text the user enters and return the Top K most similar texts.
3. QA retrieval: use the knowledge base to build knowledge from QA data; retrieve QA pairs with the text the user enters and return the Top K most similar QA data.
4. Simple retrieval and answering: the three types of knowledge above are searched with the user's question, and the retrieval results plus the user's question are concatenated into a prompt and given to the large model to obtain its response (the typical RAG retrieve-then-generate question-answering process).
5. Complex retrieval dialogue: complex dialogue logic (such as refusal to answer, question classification, and answer splitting) must be orchestrated and developed through task flows. The business side needs to design the dialogue logic and paths and clarify exactly where knowledge retrieval takes place.
9. Basic steps of a RAG-based knowledge base and optimization suggestions for practical application
1. Dynamic update of knowledge base: As time goes by, the information in the knowledge base may become outdated or invalid. Therefore, it is necessary to design an automated knowledge update mechanism to ensure the accuracy and timeliness of the system’s answers.
2. Model fine-tuning: in different application scenarios, you may need to choose which model to call, or optimize the prompts to guide the model toward the expected output.
3. Hybrid search strategy: vector search can be combined with a traditional keyword search strategy to improve recall while ensuring search accuracy (a minimal sketch follows this list).
4. System scalability: Ensure that the system can scale as the amount of data and requests increases, avoiding performance bottlenecks. Using distributed retrieval and generation technology is key to achieving this goal.
5. User feedback loop: Introduce a user feedback mechanism, regularly analyze user queries and system responses, continuously improve the model, and regularly maintain and update the knowledge base to enhance the overall system intelligence level.
6. Multimodal knowledge processing: use OCR and other technologies to process non-text knowledge sources such as images and videos, converting them into plain text the system can understand and enriching the content of the knowledge base.
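A minimal sketch of the hybrid strategy from point 3 above, fusing a BM25 keyword score with the vector similarity score; rank_bm25 is an assumed dependency, doc_vecs and q_vec come from the earlier sketches, and the weight alpha is just a starting point to tune:

```python
# Hybrid-retrieval sketch: weighted fusion of BM25 keyword scores and vector
# similarity. rank_bm25 is an assumed dependency; alpha is illustrative.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, q_vec, corpus, doc_vecs, alpha=0.5, k=5):
    bm25 = BM25Okapi([doc.split() for doc in corpus])  # keyword index
    kw = np.asarray(bm25.get_scores(query.split()))
    kw = kw / (kw.max() + 1e-9)                        # scale scores to [0, 1]
    vec = doc_vecs @ q_vec                             # cosine similarity
    combined = alpha * vec + (1 - alpha) * kw          # weighted fusion
    return [corpus[i] for i in np.argsort(-combined)[:k]]
```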