A comprehensive guide to building a RAG system based on DeepSeek (with code)

Written by
Silas Grey
Updated on: July 15, 2025

Master RAG technology and improve AI information processing capabilities.
Core content:
1. RAG technology principles and their advantages in information retrieval
2. RAG system workflow and core capabilities
3. The DeepSeek model family and its application in RAG systems



1. RAG Technology Principles and Advantages

As artificial intelligence develops rapidly, efficiently processing, understanding, and retrieving information from massive document collections has become a key requirement in many fields. Retrieval-Augmented Generation (RAG) emerged in response and represents a major advance in AI information processing. Traditional language models rely only on their pre-trained data, whereas a RAG system dynamically retrieves relevant information before generating a response, much like equipping the AI with a dedicated "library" it can consult at any time before answering questions.

The RAG system is built on two core capabilities: retrieval and generation. Retrieval is responsible for accurately finding the most relevant information in the knowledge base, while generation uses the retrieved information to build a coherent, accurate response. The workflow follows a clear path: user query → document retrieval → context enhancement → large language model (LLM) response.

In the document retrieval stage, the system first converts the user query into a vector embedding, then searches the vector database for similar documents, and finally extracts the most relevant text chunks. In the context enhancement stage, the retrieved documents are combined and arranged in a format the large language model can easily consume. Finally, the LLM combines the query with this context to generate an insightful response.
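The whole flow can be summarized in a few lines of Python. This is only an illustrative sketch: the vectorstore and llm objects and the prompt format are stand-ins for the concrete components introduced later in this article.

def answer(query: str, vectorstore, llm) -> str:
    # 1. Document retrieval: embed the query and find similar chunks in the vector store
    docs = vectorstore.similarity_search(query, k=5)
    # 2. Context enhancement: arrange the retrieved chunks into a single context string
    context = "\n\n".join(doc.page_content for doc in docs)
    # 3. LLM response: combine context and query into one prompt and generate the answer
    prompt = f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
    return llm.invoke(prompt).content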

The RAG system offers several significant advantages. First, by providing relevant context it greatly improves response accuracy and reduces model "hallucination", that is, content that sounds plausible but has no factual basis. Second, its knowledge is easy to update: simply adding new documents expands what the system knows. Third, its responses are verifiable, since every answer can be traced back to the source documents. From a cost perspective, there is no need to continuously retrain the model, which saves time and resources. Finally, by updating the document library, a RAG system can quickly adapt to new fields, giving it strong domain adaptability.

2. Analysis of DeepSeek Model Family

DeepSeek has emerged as a rising star in artificial intelligence, offering an open-source family of language models that covers a wide range of application scenarios across two categories: base models and specialized models.

The base models range from 7 billion to 67 billion parameters and provide general language understanding and generation capabilities, serving as the foundation for a wide variety of natural language processing tasks. The specialized models are optimized for specific domains: DeepSeek Coder is designed for programming tasks and efficiently handles code generation, code understanding, and related work, while DeepSeek-MoE adopts a mixture-of-experts architecture to significantly improve overall model performance.

When building a RAG system, choosing the DeepSeek-R1-Distill-Llama-70B model brings several advantages. In terms of performance, it is comparable to GPT-3.5 on many tasks and can provide users with high-quality responses. Thanks to distillation, the model improves computing efficiency and reduces operating costs while maintaining performance. Multilingual support is another highlight: it performs strongly across multiple languages, providing a solid foundation for a multilingual RAG system. Its open-source nature gives developers broad room for customization and modification, and active community support means the model is regularly updated and continuously improved.

3. Building a RAG System Based on DeepSeek

1. System Architecture Overview

This RAG system is composed of several key components. FastAPI serves as the backend framework, exposing the API endpoints for external interaction; ChromaDB provides vector storage and efficiently manages the vector representations of documents; LangChain handles orchestration and coordination between components; the DeepSeek-R1-Distill-Llama-70B model generates the reply content; and HuggingFace embeddings convert text into vector form so it can be retrieved from the vector database.
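As a minimal wiring sketch (not part of the original listing), the FastAPI application and a shared chat system instance might be created once at startup. The collection name and persistence path below are illustrative assumptions, and RetrievalChatSystem is the class implemented later in this section.

from fastapi import FastAPI

# Create the API application; the title is an illustrative choice
app = FastAPI(title="DeepSeek RAG API")

# A single RetrievalChatSystem (defined later in this section) shared by the /chat endpoint;
# collection name and persistence path are assumptions for illustration
chat_system = RetrievalChatSystem(
    collection_name="documents",
    persist_directory="path/to/chroma_db",
)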

2. Core Component Implementation Details

1. Document Processing Pipeline: The DataIngestionPipeline class takes on the core task of document processing. When initialized, it receives a configuration object containing information such as the file path and persistence directory (a possible definition of this object is sketched after the code below). In the processing flow, the file is first loaded and verified through the FileComponent class to ensure the accuracy and integrity of the data. Next, the SplitTextComponent class splits the file content into text chunks of an appropriate size; the recommended chunk size is 1,000 characters with a 200-character overlap to maintain context coherence, and the document's structure is taken into account when splitting. Subsequently, text embeddings are created with the HuggingFaceEmbeddings class and stored in the Chroma vector database. Finally, the split text is converted into the LangChain document format and added to the vector database, completing the preprocessing and storage of the document.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# FileComponent, SplitTextComponent, and DataIngestionConfig are project-specific
# helpers; a possible DataIngestionConfig definition is sketched after this block.

class DataIngestionPipeline:
    def __init__(self, config: DataIngestionConfig):
        self.config = config

    def process(self):
        # Load and validate files
        file_loader = FileComponent(
            path=self.config.file_path,
            silent_errors=self.config.silent_errors,
        )
        loaded_data = file_loader.load_file()

        # Split into manageable chunks
        text_splitter = SplitTextComponent(
            data_inputs=[loaded_data],
            chunk_size=self.config.chunk_size,
            chunk_overlap=self.config.chunk_overlap,
        )
        split_data = text_splitter.split_text()

        # Initialize embeddings and the vector store
        embeddings = HuggingFaceEmbeddings(
            model_name=self.config.model_name,
            encode_kwargs={"device": self.config.encode_device},
        )
        vector_store = Chroma(
            persist_directory=self.config.persist_directory,
            embedding_function=embeddings,
            collection_name=self.config.collection_name,
        )

        # Convert chunks to LangChain documents and store them
        documents = [data.to_lc_document() for data in split_data]
        vector_store.add_documents(documents)
        return vector_store
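The pipeline above reads its settings from a DataIngestionConfig object whose definition the article does not show. The following dataclass is a hypothetical sketch of the fields the pipeline accesses, with defaults matching the chunking recommendations given in this article.

from dataclasses import dataclass

@dataclass
class DataIngestionConfig:
    # Hypothetical configuration container: field names mirror the attributes
    # read by DataIngestionPipeline; defaults follow this article's recommendations
    file_path: str
    persist_directory: str
    silent_errors: bool = False
    chunk_size: int = 1000
    chunk_overlap: int = 200
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    encode_device: str = "cpu"
    collection_name: str = "documents"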
2. Chat System Implementation: The RetrievalChatSystem class processes the user's chat input and generates replies. During initialization, it creates a HuggingFaceEmbeddings instance for text vectorization, loads the Chroma vector database, and initializes the DeepSeek model. When processing chat input, it first retrieves the 5 documents most relevant to the user's question via the vector database's similarity_search method, then concatenates their contents into a context string. Next, it constructs a prompt containing the context and the question and sends it to the DeepSeek model. Once the model generates a reply, the result is wrapped in a Message object and returned, completing the flow from user question to system reply.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq

class RetrievalChatSystem:
    def __init__(self, collection_name: str, persist_directory: str):
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        self.vectorstore = Chroma(
            persist_directory=persist_directory,
            embedding_function=self.embeddings,
            collection_name=collection_name,
        )
        self.llm = ChatGroq(
            model_name="DeepSeek-R1-Distill-Llama-70B",
            temperature=0.7,
        )

    def process_chat_input(self, input_message: Message) -> Message:
        # Retrieve the documents most relevant to the user's question
        retrieved_docs = self.vectorstore.similarity_search(
            input_message.text,
            k=5,
        )

        # Create context from the retrieved documents
        context = "\n\n".join([doc.page_content for doc in retrieved_docs])

        # Generate the response
        prompt = f"""Based on the following context, provide a relevant response in Hinglish language. If the question isn't related to the context, indicate that clearly:

Context:
{context}

Question:
{input_message.text}

Response:"""
        response = self.llm.invoke(prompt)

        return Message(
            text=response.content,
            sender="AI",
            sender_name="AI Assistant",
        )
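Both the chat system and the API endpoints pass around Message objects, whose definition is likewise not shown in the article. An assumed minimal version containing only the fields used here would be:

from dataclasses import dataclass

@dataclass
class Message:
    # Assumed minimal message container: only the fields referenced in this article
    text: str
    sender: str
    sender_name: str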

3. API Endpoint Design

1. Document upload endpoint: The /upload endpoint receives documents uploaded by users. After receiving an uploaded file, it first saves it to a temporary path, then creates a DataIngestionConfig object and initializes the DataIngestionPipeline. The pipeline processes the uploaded document, converts it into vectors, and stores them in ChromaDB. If processing succeeds, a "success" status is returned; if an exception occurs, an HTTP 500 error is raised with the error details.

@app.post("/upload")async def upload_pdf(file: UploadFile = File(...)):try:# Save uploaded filefile_path = f"temp_{file.filename}"with open(file_path, "wb") as f:f.write(file.file.read())# Process fileconfig = DataIngestionConfig(file_path=file_path,persist_directory="path/to/chroma_db")pipeline = DataIngestionPipeline(config)vector_store = pipeline.process()return {"status": "success"}except Exception as e:raise HTTPException(status_code=500, detail=str(e))
2. Chat endpoint: The /chat endpoint handles user chat requests. It receives a ChatRequest object containing the user message and the sender's name and converts it into a Message object. The message is then processed by the RetrievalChatSystem instance's process_chat_input method to obtain the system's reply. Finally, the reply is wrapped in a ChatResponse object and returned to the user. If an error occurs during processing, an HTTP 500 error is raised with the error details.

@app.post("/chat", response_model=ChatResponse)async def chat(request: ChatRequest):try:input_message = Message(text=request.message,sender="user",sender_name=request.sender_name)response = chat_system.process_chat_input(input_message)return ChatResponse(message=response.text,sender=response.sender,sender_name=response.sender_name)except Exception as e:raise HTTPException(status_code=500, detail=str(e))

4. System Performance Optimization Strategies

1. Chunking Strategy

A sensible chunking strategy is key to system performance. Dividing documents into chunks of about 1,000 characters ensures that each chunk carries enough information while remaining easy for the model to process. A 200-character overlap helps preserve context across chunk boundaries and avoids information loss. Splitting should also respect the document's structure, such as chapter divisions and paragraph logic, so that the resulting chunks retain semantic integrity.
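The article's SplitTextComponent is a project-specific wrapper. As an illustration of the same settings with a standard LangChain splitter (an alternative sketch, not the author's exact implementation):

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(document_text: str) -> list[str]:
    # 1000-character chunks with a 200-character overlap, preferring paragraph,
    # then line, then word boundaries so chunks keep their semantic integrity
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""],
    )
    return splitter.split_text(document_text)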

2. Vector Search Optimization

The maximal marginal relevance (MMR) algorithm balances relevance and diversity in the search results, surfacing more valuable documents. The number of documents retrieved (the k value) can be adjusted to the application: reduce k for scenarios that demand precise answers, and increase it when broad reference material is needed. In addition, an effective indexing mechanism significantly improves the speed and efficiency of vector search.
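With Chroma through LangChain, MMR retrieval can be expressed directly on the vector store; the fetch_k value below is an illustrative assumption.

def retrieve_with_mmr(vectorstore, query: str, k: int = 5):
    # Fetch a larger candidate pool (fetch_k), then keep k documents that balance
    # relevance to the query with diversity among themselves
    return vectorstore.max_marginal_relevance_search(query, k=k, fetch_k=20)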

3. Error Handling Mechanism

Build a complete logging system that records events and errors in detail during operation so that problems are easy to troubleshoot. Handle boundary conditions such as empty queries and overly long texts gracefully to avoid crashes, and give users clear, meaningful error messages so they understand what went wrong.
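As a concrete sketch (not the article's exact code), query validation and logging for the chat flow might look like this; the size limit is an illustrative assumption.

import logging

from fastapi import HTTPException

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

def validate_query(text: str, max_chars: int = 8000) -> str:
    # Reject empty queries with a clear client-facing message
    if not text or not text.strip():
        raise HTTPException(status_code=400, detail="Query must not be empty.")
    # Truncate overlong queries and record the event for troubleshooting
    if len(text) > max_chars:
        logger.warning("Query truncated from %d to %d characters", len(text), max_chars)
        return text[:max_chars]
    return text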

Building a multilingual RAG system on DeepSeek's advanced language models opens a new path for intelligent document interaction. By integrating efficient vector storage and retrieval, the system provides accurate, context-aware responses while retaining the ability to update its knowledge dynamically. Whether for a customer support system, a document query tool, or a knowledge base, this architecture provides a solid foundation for real-world needs, and as the technology continues to mature its application prospects will only broaden.