RAG from Entry to Mastery: An Introduction to RAG

Written by Jasper Cole
Updated on: July 9, 2025

In-depth exploration of the RAG architecture to create a personalized knowledge base question-answering system.

Core content:
1. An introduction to the RAG framework and the advantages of combining it with LLMs
2. How RAG works: the three stages of knowledge indexing, knowledge retrieval, and answer generation
3. An analysis of the advantages and limitations of RAG

This post opens a new series that will walk you step by step through building your own knowledge-base question-answering system with the RAG architecture, from calling the official ChatGPT API to deploying an open-source large model yourself. Stay tuned.

Preface

RAG (Retrieval-Augmented Generation) is an engineering framework that combines large language models (LLMs) with retrieval from external knowledge sources to improve question-answering capabilities. This article gives a brief introduction to RAG.

The problem of updating knowledge in LLMs

Before introducing RAG, we need to understand one thing first: it is very difficult to update the knowledge of an LLM. The main reasons are:

  1. The LLM's training dataset is fixed; once training is complete, it is difficult to update its knowledge through continued training.

  2. LLMs have a huge number of parameters, so fine-tuning them consumes a lot of resources and takes considerable time.

  3. The LLM's knowledge is encoded across tens of billions of parameters and, unlike a knowledge graph, cannot be directly queried or edited.

The knowledge of an LLM is therefore static, closed, and limited. RAG came into being to give LLMs the ability to continuously learn and acquire new knowledge.

How it works

RAG essentially solves the difficulty of updating LLM knowledge through engineering means. Its core idea is to attach an external knowledge store to the LLM (usually a vector database) that holds new data and domain data absent from the training set. Generally speaking, RAG divides knowledge question answering into three stages: knowledge indexing, knowledge retrieval, and answer generation.

The first stage is knowledge indexing. Text data must be processed in advance: documents are split into chunks, mapped into a vector space with embedding techniques, and the resulting vectors are stored in a database to build a searchable vector index. Across this pipeline, RAG involves components such as data loaders, text splitters, vector databases, prompt engineering, and the LLM itself.
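
To make the indexing stage concrete, here is a minimal, self-contained sketch. The hash-based `embed` function is a toy stand-in for a real embedding model (such as a sentence-transformer or an embedding API), and the corpus, chunk size, and vector dimension are made up purely for illustration:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word into
    # one of `dim` buckets and L2-normalize the resulting counts.
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 200) -> list[str]:
    # Naive fixed-size splitter; production systems use smarter segmenters.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Stage 1: build the index as (chunk_text, vector) pairs.
corpus = [
    "RAG combines retrieval over an external knowledge base with LLM generation.",
    "A vector database stores embeddings so similar text can be found quickly.",
]
index = [(c, embed(c)) for doc in corpus for c in chunk(doc)]
```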

The second stage is knowledge retrieval. When a question comes in, RAG searches the knowledge base for the batch of documents most relevant to the question. This relies on the vector index built in the first stage to perform fast retrieval based on similarity between vectors.
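
Continuing the toy sketch above, retrieval amounts to embedding the query with the same `embed` function used at indexing time and ranking the stored chunks by similarity. Because the toy vectors are unit length, a dot product equals cosine similarity; `top_k` is an assumed tuning parameter:

```python
def retrieve(query: str, index, top_k: int = 3) -> list[str]:
    # Embed the query with the same model used at indexing time, then
    # rank chunks by dot product (equal to cosine similarity here,
    # because the toy vectors are unit length).
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), text)
              for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

hits = retrieve("What does RAG combine?", index, top_k=2)
```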

The third stage is answer generation. RAG provides the input question together with the retrieved documents to the LLM, letting the LLM integrate this external knowledge into its context and generate the answer. RAG also controls the generation length to avoid producing irrelevant content.
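
Here is a hedged sketch of this stage, using the OpenAI Python client purely as an example backend. The model name, the instruction wording, and passing in chunks from the earlier `retrieve` sketch are all assumptions; any chat-completion API works the same way:

```python
from openai import OpenAI  # any chat-completion client works similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, context_chunks: list[str]) -> str:
    # Stuff the retrieved chunks into the prompt so the model grounds
    # its answer in external knowledge without any retraining.
    context = "\n\n".join(context_chunks)
    messages = [
        {"role": "system",
         "content": ("Answer using only the provided context. "
                     "If the context is insufficient, say so.")},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=messages)
    return resp.choices[0].message.content
```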

In this way, the LLM can make full use of the external knowledge base without modifying its own parameters. When the knowledge base is updated, new knowledge can be injected into the LLM in real time through the prompt. This design exploits the LLM's powerful language-generation ability while avoiding its knowledge-update dilemma, letting it answer questions more intelligently, especially those that require external knowledge.

Advantages

The advantages of RAG are mainly reflected in the following aspects:

  • Large-scale external knowledge can be exploited to improve the LLM's reasoning ability and factuality.

  • Using frameworks such as LangChain allows for rapid prototyping.

  • New data can be added to the knowledge index at any time with negligible latency, so the RAG architecture can, in principle, keep its knowledge up to date in near real time (see the sketch after this list).

  • Strong interpretability: because answers are grounded in retrieved documents, techniques such as prompt engineering can make the answers generated by the LLM more interpretable, improving users' trust in and satisfaction with the answers.
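
To make the real-time-update advantage concrete, here is a sketch extending the toy index from earlier (the `chunk` and `embed` helpers are the assumed ones defined in the indexing example). Adding knowledge is an append, not a retraining run:

```python
def add_document(index, text: str) -> None:
    # New chunks become searchable as soon as they are embedded;
    # the LLM's parameters are never touched.
    for c in chunk(text):
        index.append((c, embed(c)))

add_document(index, "New product documentation published today ...")
```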

Shortcomings

The disadvantages of RAG are mainly manifested in the following aspects:

  • The knowledge retrieval stage relies on similarity search rather than exact retrieval, so the retrieved documents may not actually be relevant to the question.

  • When generating answers in the third stage, the LLM is asked to summarize the retrieved knowledge, so it may set aside its basic world knowledge and fail to answer simple user questions that fall outside the knowledge base.

  • Vector databases are a relatively immature technology without a general solution for handling very large amounts of data, so speed and performance become challenging at scale.

  • At inference time, user input must be preprocessed and vectorized, which adds latency and computational cost.

  • Updating and synchronizing external knowledge bases requires significant human effort, resources, and time.

  • The need for additional retrieval components increases the complexity of the architecture and maintenance costs.

How to improve

Because of these shortcomings, a RAG pipeline implemented directly with frameworks such as LangChain can hardly be used in production as-is; it requires substantial engineering optimization, generally including at least the following (a combined sketch follows the list):

  • Check and clean input data quality.

  • Tune chunk size, top-k retrieval, and chunk overlap.

  • Leverage document metadata for better filtering.

  • Improve prompts to provide helpful instructions.
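
The sketch below gathers these knobs in one place, in plain Python rather than any particular framework's API. The parameter values are illustrative starting points, not recommendations, and the `(chunk_text, vector, metadata)` index layout is an assumption for the example:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RagConfig:
    # Illustrative defaults; tune these against your own evaluation set.
    chunk_size: int = 500    # characters per chunk
    chunk_overlap: int = 50  # characters shared between adjacent chunks
    top_k: int = 4           # retrieved chunks handed to the LLM

def chunk_with_overlap(text: str, cfg: RagConfig) -> list[str]:
    # Overlap keeps sentences that straddle a chunk boundary
    # retrievable from both neighboring chunks.
    step = cfg.chunk_size - cfg.chunk_overlap
    return [text[i:i + cfg.chunk_size] for i in range(0, len(text), step)]

def filter_by_metadata(index, keep: Callable[[dict], bool]):
    # Metadata filtering (source, date, section, ...) prunes candidates
    # before similarity ranking; index entries here are assumed to be
    # (chunk_text, vector, metadata) triples.
    return [(text, vec) for text, vec, meta in index if keep(meta)]
```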

Summary

RAG is a promising but still evolving technology that requires careful tuning and optimization to achieve reliable performance. As research continues, it will likely become more robust and suitable for industrial applications.