LangChain Advanced Guide: RAG Practice Summary

Written by Jasper Cole
Updated on: July 12, 2025

Recommendation

Explore RAG's breakthrough progress in improving the accuracy and reliability of language models.

Core content:
1. The principle and advantages of RAG in solving the "hallucination" problem of LLMs
2. Analysis of RAG's model architecture and workflow
3. The potential and challenges of RAG in practical applications


Why use RAG?


Current leading large language models (LLMs) acquire broad general knowledge through large-scale training, and that knowledge is stored in the weights of their neural networks. However, when LLMs are asked about knowledge beyond their training data (such as recent, proprietary, or domain-specific information), they tend to produce factual errors known as "hallucinations".


This problem can be addressed with fine-tuning or with retrieval-augmented generation (RAG). However, fine-tuning usually consumes substantial computing resources, is costly, and requires considerable fine-tuning experience. It also needs a representative dataset of sufficient size, and it is poorly suited to adding new knowledge to a model or to scenarios that require rapid iteration. This article introduces the alternative: retrieving information through RAG, which is low-cost and quick to implement.


Retrieval-augmented generation (RAG) is a concept that aims to provide large language models (LLMs) with additional information from external knowledge sources to improve the quality and reliability of AI application responses. In this way, LLMs can generate more accurate and contextual answers while effectively reducing the possibility of misleading information. RAG mainly addresses common challenges of LLMs, such as providing false information, outdated information, and non-authoritative information.


The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduced a technique called Retrieval-Augmented Generation (RAG). It proposes a model architecture that addresses the challenges of knowledge-intensive NLP tasks by combining a retrieval stage with a generation stage. RAG uses external knowledge bases to enhance model performance, making it easier for an LLM to draw on external knowledge.

RAG and LLMs are complementary. RAG can be regarded as an LLM with extended capabilities: by introducing an additional retrieval step, it enables the LLM to make effective use of external knowledge. Combining the two achieves better results on many complex natural language processing tasks and opens new possibilities for building more powerful NLP systems.

RAG Workflow


  • Retrieval: When a user query is received, the most relevant documents are found through the retrieval index. Specifically, the user's query is converted into a vector, and matching context is then retrieved from the vector database. Through this similarity search, the data in the vector database that best matches the query can be found.


  • Enhancement:  Put the information retrieved from the vector database and the user query information into our customized prompt template.

  • Generate: Finally, the retrieval-enhanced prompt is fed into the LLM, which generates the final answer based on that information.
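The three steps above can be written as a rough Python-style sketch. Everything below is illustrative only: embed, vector_db.search, prompt_template, and llm.generate are placeholder names rather than any specific library's API.

# Hypothetical sketch of the RAG workflow; all names are placeholders.
def rag_answer(user_query, vector_db, llm, embed, prompt_template):
    # 1. Retrieval: embed the query and look up the most similar chunks.
    query_vector = embed(user_query)
    context_chunks = vector_db.search(query_vector, top_k=4)

    # 2. Enhancement: place the retrieved context and the user query into the prompt template.
    prompt = prompt_template.format(
        context="\n".join(chunk.text for chunk in context_chunks),
        question=user_query,
    )

    # 3. Generation: the LLM produces the final answer from the augmented prompt.
    return llm.generate(prompt)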


The vector database stores the external knowledge: unstructured data is converted through the embedding model and written into the vector database. The specific process is as follows:
  • Load: Load the specified data. Different file types can be loaded through different document loaders.


  • Split: Split the content into smaller chunks using a text splitter. This is useful both for indexing data and for passing it to models, as large chunks are harder to search and don't fit into the limited context window of a model.


  • Embedding: Embedding techniques map high-dimensional data (such as text, images, and audio) into a lower-dimensional vector space; that is, text, images, and audio are ultimately represented as vectors. A vector is a set of values that represents the position of a point in a multidimensional space (a short sketch of vectors and similarity appears after this list).


  • Storage: A vector database is needed to store and index the vectors of the split chunks so that the data can be retrieved quickly later.
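To make vectors and similarity search concrete, here is a small optional sketch that embeds two sentences with OpenAI's embedding model (the same model used later in this article) and compares them with cosine similarity. The sentences are just examples, and an OpenAI API key is assumed.

from math import sqrt
from langchain.embeddings import OpenAIEmbeddings  # requires an OpenAI API key

embeddings = OpenAIEmbeddings()
v1 = embeddings.embed_query("The president spoke about the Supreme Court.")
v2 = embeddings.embed_query("Remarks about a Supreme Court justice.")

# Cosine similarity: values closer to 1 mean the two texts are semantically closer.
dot = sum(a * b for a, b in zip(v1, v2))
cosine = dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))
print(len(v1), cosine)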

Retrieval-Augmented Generation with LangChain
LangChain is an open-source framework that helps developers quickly build LLM applications. It provides components for loading and managing data, connecting to models and vector stores, and chaining these steps together. With LangChain, we can quickly assemble a RAG application.


In this section, we will show how to use Python with OpenAI's large language model, Weaviate's vector database, and OpenAI's embedding model to implement a retrieval-augmented generation (RAG) pipeline. In this process, we will use LangChain for overall orchestration.
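Before running the code in this section, the relevant packages need to be installed (roughly langchain, openai, weaviate-client, and tiktoken; the exact package set depends on your LangChain version) and an OpenAI API key must be available, for example:

import os

# Assumption: you have an OpenAI API key. Replace the placeholder below,
# or export OPENAI_API_KEY in your shell before running the examples.
os.environ["OPENAI_API_KEY"] = "sk-..."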


  • Part 1: Writing external knowledge into the vector database


Prepare a vector database. A vector database stores and queries data represented as vectors in a vector space. With it, we can convert our data into vectors and write them into the database. The specific steps are as follows:
   i. Collect data and load it into the system


  ii. Process the document in chunks

 iii. Embed the chunked content and store the chunks

  • First, load the specified data. Different file types are supported, such as text, PDF, Word, Excel, CSV, and HTML. The following example uses TextLoader to load a text file. Different loaders can be used for different file types: for example, CSVLoader loads CSV files, WebBaseLoader loads web pages from a URL, and PDFs can be loaded in several ways, PDFMinerLoader being one of them (a brief sketch of alternative loaders follows the loading code below).

import requests
from langchain.document_loaders import TextLoader

url = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)

loader = TextLoader('./state_of_the_union.txt')
documents = loader.load()
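For other file types, only the loader class changes and the rest of the pipeline stays the same. A brief sketch with placeholder paths and URLs (PDFMinerLoader additionally requires the pdfminer.six package):

from langchain.document_loaders import CSVLoader, WebBaseLoader, PDFMinerLoader

# Placeholder paths and URLs; swap in your own files.
csv_docs = CSVLoader("./data/example.csv").load()
web_docs = WebBaseLoader("https://example.com/article").load()
pdf_docs = PDFMinerLoader("./data/example.pdf").load()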
  • Next, the document needs to be chunked. Chunking breaks large pieces of text into smaller ones, which helps improve the accuracy of the content recalled from the vector database. LangChain provides many text-splitting tools; in this example, CharacterTextSplitter is used, with chunk_size set to 500 and chunk_overlap set to 50 to keep adjacent text chunks coherent.



from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
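As an optional check (not part of the original pipeline), you can inspect how many chunks were produced and preview one of them:

# Optional sanity check on the splitting step.
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the first chunk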
  • Finally, we convert the chunks into vectors through an embedding model and store them. For the LLM to search the file's content, the split data must be embedded and stored. There are many embedding options; in this example we use OpenAI's embedding model, and we use the Weaviate vector database to save the vectors. In LangChain, calling .from_documents() automatically fills the vector database with these chunk vectors.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions

client = weaviate.Client(embedded_options=EmbeddedOptions())
vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False,
)
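At this point the vector store can already be queried directly. As a quick optional check, a similarity search can be run before wiring up the full chain (the query string is just an example):

# Optional check: query the vector store directly.
hits = vectorstore.similarity_search("What did the president say about Justice Breyer", k=2)
for doc in hits:
    print(doc.page_content[:120], "...")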
  • Part 2: Using the data in the vector store to answer user questions


  • Retrieval: Now that the file has been converted to vector data and written into the vector database, it can be set up as a retrieval component. This component can retrieve additional contextual information based on the semantic similarity between the user query and the embedded text block.

retriever = vectorstore.as_retriever()
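The retriever can also be configured and tested on its own. For example, the number of returned chunks can be limited via search_kwargs, and the results inspected directly (this assumes the retriever API of the LangChain version used in this article):

# Optionally limit how many chunks are returned per query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Inspect what the retriever returns for a sample question.
docs = retriever.get_relevant_documents("What did the president say about Justice Breyer")
print(len(docs))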
  • Enhancement: Prepare a prompt template to enhance the original prompt by adding preset prompt information and retrieved context information.

from langchain.prompts import ChatPromptTemplate

template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
print(prompt)
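To see what the enhancement step produces, the prompt can be formatted by hand with placeholder values; in the RAG chain below this happens automatically:

# Illustration only: the chain below fills "context" and "question" automatically.
print(prompt.format(
    context="(retrieved text chunks would go here)",
    question="What did the president say about Justice Breyer",
))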
  • Generate: The RAG chain combines the retriever, the prompt template, and the LLM. The chain below first performs a vector search for the user's question, then combines the retrieved data with the prompt template, and finally has the LLM generate the answer based on that information.


from langchain.chat_models import ChatOpenAI
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

query = "What did the president say about Justice Breyer"
rag_chain.invoke(query)
Here, rag_chain is the core piece. This step builds a RAG chain around the GPT-3.5 model: the user's question is used to retrieve relevant chunks from the vector database, those chunks enhance the prompt, and the model then generates the answer.


In the above example, we asked the LLM: "What did the president say about Justice Breyer?" After searching the vector store, the LLM replied:
"The President thanked Justice Breyer for his service and praised his contribution to the nation."


"The President also mentioned that he has nominated Judge Ketanji Brown Jackson to succeed Justice Breyer and continue his distinguished legacy."




Summary

There are three benefits to using RAG in an LLM-based question-answering system:


  1. It ensures that the LLM answers with the latest and most accurate content, and users can access the sources behind the model's output, so its claims can be checked for accuracy and ultimately trusted.


  2. By grounding the LLM on an external, verifiable set of facts, the model has fewer opportunities to pull information baked into its parameters. This reduces the chances of the LLM leaking sensitive data or "hallucinating" incorrect or misleading information.


  3. RAG also reduces the need to continuously retrain the model and update its parameters as the data changes, which lowers the associated compute and financial costs for enterprises.