Overview: A comprehensive guide to building a RAG system from scratch (with code)

Master RAG technology to improve language model performance.
Core content:
1. Analysis of RAG system concepts and core principles
2. How RAG solves key problems of traditional large language models
3. RAG's development history and practical application cases
Although large language models have excellent reasoning capabilities and extensive general knowledge, they often struggle to retrieve accurate facts, access up-to-date data, or provide verifiable answers. Retrieval-Augmented Generation (RAG) emerged to address these shortcomings. This innovative method improves the performance of large language models by combining them with external knowledge sources. This article explores the concept and importance of RAG in depth, and builds a complete RAG system from scratch using Python and popular open-source libraries.
1. What is RAG
RAG is an architecture that combines information retrieval with text generation. Its core principle is to retrieve relevant information from an external knowledge base before generating an answer, thereby enhancing the capabilities of the language model. This process mainly includes the following key steps:
- Search
When the system receives a query, the retrieval system searches the knowledge base for the most relevant documents or text chunks. For example, when a user asks "What are Apple's latest products?", the retriever searches a knowledge base containing information about Apple products.
- Augmentation
The retrieved information is injected into the prompt sent to the language model. This additional information gives the model richer context, helping it generate more accurate responses.
- Generation
The language model combines its pre-trained knowledge with the retrieved information to generate the final answer. In the example above, the model would draw on the retrieved Apple product information and respond with something like "Apple's latest products include the iPhone 15 series, Apple Watch Series 9, etc."
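These three steps can be captured in a few lines of pseudocode. The sketch below is illustrative only: retrieve() and llm() are hypothetical stand-ins for the concrete components built later in this article.

# Minimal sketch of the RAG loop (retrieve() and llm() are hypothetical helpers)
def rag_answer(query, knowledge_base, llm, k=5):
    # 1. Retrieval: find the k chunks most relevant to the query
    relevant_chunks = knowledge_base.retrieve(query, k=k)
    # 2. Augmentation: inject the retrieved text into the prompt
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 3. Generation: the language model produces a grounded answer
    return llm(prompt)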
The emergence of RAG effectively solves several key problems existing in traditional large language models:
- Knowledge limitations
The knowledge of standard large language models is limited to their training data, while RAG allows the model to access newer or more specialized information. Taking the medical field as an example, a large language model may have been trained on research that is several years old, while RAG can provide users with the latest medical research and treatment plans by connecting to up-to-date medical databases.
- Hallucination Problems
Large language models can sometimes generate information that seems plausible but is actually wrong. RAG greatly reduces such "hallucinations" by grounding responses in verifiable sources. For example, when answering questions about historical events, RAG relies on reliable sources such as historical documents and avoids making up details of events that never happened.
- Transparency
Models in a RAG system can cite their information sources, which makes answers easier to verify. This is particularly important in academic research, where researchers can consult the sources provided by the model to confirm the accuracy of the information.
- Adaptability
The RAG system can adapt to new information by updating the knowledge base without retraining the entire model. This means that when faced with rapidly changing information, such as financial market data and technology news, the RAG system can provide the latest information in a timely manner.
2. Development History of RAG
The concept of RAG was formally proposed in 2020 by researchers at Facebook AI Research (now Meta AI) in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The paper proposed combining a dense retriever with a sequence-to-sequence generator to handle knowledge-intensive tasks.
However, the philosophy behind RAG has deep roots in multiple fields:
- Question answering system
Early question-answering systems used document retrieval to find relevant information before attempting to answer the question. These early systems laid the foundation for the development of RAG and inspired the idea of combining information retrieval with answer generation.
- Information Retrieval
Decades of research in the search engine field have provided a solid foundation for efficient document retrieval. From simple keyword matching to complex semantic understanding, the continuous advancement of information retrieval technology has provided strong support for the retrieval process in RAG.
- Neural Information Retrieval
The application of neural networks in information retrieval enables retrieval to focus on meaning at the semantic level. By converting text into vector representations, neural networks can better capture the semantic associations between texts and improve retrieval accuracy.
- Transfer Learning in Natural Language Processing
The emergence of pre-trained language models such as BERT makes document representation and retrieval more effective. Pre-trained language models can learn rich language features and semantic information, providing a powerful tool for text processing in RAG systems.
With the rise of models such as GPT-3, GPT-4, and Claude, and of open-source alternatives such as LLaMA, RAG's popularity has skyrocketed. Enterprises soon realized that although these models are powerful, they need to be combined with trusted sources of information in order to be used reliably in commercial applications. Today, RAG has become a cornerstone of large language model application development, and frameworks such as LangChain and LlamaIndex provide rich tooling to simplify RAG implementations.
3. Why is RAG important?
RAG has many significant advantages in the field of artificial intelligence:
- Get the latest information
The RAG system can access the latest information, overcoming the knowledge-cutoff limitations of large language models. In areas such as news and technology trends, users can obtain the latest event reports and technical progress through the RAG system.
- Domain Specialization
By providing a knowledge base for a specific field, RAG can make a general large language model behave like a specialized one. In the legal field, combining a knowledge base of statutes and cases lets a RAG system provide professional legal advice; in finance, connecting a knowledge base of financial data and market analysis lets it provide investors with well-grounded investment guidance.
- Reduce hallucinations
RAG bases its answers on the retrieved documents, significantly reducing the chance that the language model generates erroneous information. This is particularly critical in healthcare, where it helps ensure that the medical advice provided to patients is accurate and reliable.
- Reduce costs
Compared with fine-tuning or retraining large models, RAG can adapt to new domains by simply changing the knowledge base, which greatly reduces cost. For small companies or research teams with limited resources, this makes it feasible to build capable applications at much lower expense.
- Verifiability
The RAG system can cite information sources, making its output more transparent and verifiable. In scenarios such as academic research and business reporting, this increases the credibility of the information and makes it easier for users to review and verify it.
- Privacy and Security
Sensitive information can be kept in a controlled knowledge base without being included in the model’s training data, which effectively protects data privacy and security when dealing with sensitive information such as personal medical records and corporate trade secrets.
4. Building a RAG System: Core Components
A typical RAG system consists of several key components:
- Document Loader
Responsible for importing documents from various sources (such as PDF files, web pages, and databases). When processing PDF files, it extracts their text content in preparation for subsequent processing.
- Text Chunker
Splits documents into small chunks that are easy to index and retrieve. A sound chunking strategy is crucial to system performance: chunks that are too large may contain too much irrelevant information, while chunks that are too small may lose important context.
- Embedding Model
Converts text chunks into numerical vectors that capture their semantic meaning. With a vector representation, semantic similarity between texts can be measured by computing the distance between vectors.
- Vector Storage
Indexes and stores vectors for efficient retrieval. Vector stores such as FAISS provide fast similarity search.
- Retriever
Finds the documents in the vector store most relevant to a given query. The retriever's performance directly affects the quality of the results the system returns.
- Language Model
Generates answers based on the query and the retrieved information. The choice and configuration of the language model affect the quality and style of the answers.
- Prompt Template
Guide the language model on how to use the retrieved information. A well-designed prompt template can guide the language model to generate answers that better meet the user's needs.
5. Implementation: Building the RAG System Step by Step
1. Setting up the environment
Building the RAG system requires several Python libraries, including langchain, langchain-core, langchain-community, langchain-experimental, pymupdf, langchain-text-splitters, faiss-cpu, langchain-ollama, and langchain-openai. These libraries each serve different purposes:
- LangChain
Provides the overall framework and components for building large language model applications, simplifying development.
- PyMuPDF
Extracts text from PDF documents and supports a wide range of PDF features.
- FAISS
Provides efficient similarity search over vectors.
- Ollama and OpenAI integrations
Allow different language models to be plugged in, giving users more choice.
These libraries can be installed using the pip command:
pip install langchain langchain-core langchain-community langchain-experimental pymupdf langchain-text-splitters faiss-cpu langchain-ollama langchain-openai
Component 1: PDF Loader
from langchain_community.document_loaders import PyMuPDFLoader
class PdfLoader:
    def __init__(self):
        pass

    def read_file(self, file_path):
        # Load the PDF; each page becomes a Document with page_content and metadata
        loader = PyMuPDFLoader(file_path)
        docs = loader.load()
        return docs
The above code defines a PdfLoader class whose read_file method uses PyMuPDFLoader to load a document from the specified PDF file path. PyMuPDFLoader is built on the PyMuPDF library (also known as fitz), which efficiently handles a variety of PDF features, including text, tables, and embedded images (OCR for scanned documents requires additional tooling). The load() method returns a list of Document objects, each representing one page of the PDF file and containing the extracted text (page_content) and metadata (metadata) such as the source file path and page number. In practice, this class can be extended to handle other document types.
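A short usage sketch follows; the file name example.pdf is only a placeholder.

# Usage sketch: load a PDF and inspect the first page ("example.pdf" is a placeholder)
loader = PdfLoader()
docs = loader.read_file("example.pdf")
print(len(docs))                    # number of pages
print(docs[0].page_content[:200])   # first 200 characters of page 1
print(docs[0].metadata)             # source path, page number, etc.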
Component 2: Text Segmentation
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
class Chunker:
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        self.text_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n", " ", ".", ",", "\u200b", "\uff0c", "\u3001", "\uff0e", "\u3002", ""],
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            is_separator_regex=False,
        )

    def chunk_docs(self, docs):
        # Split each page into overlapping chunks, preserving the source metadata
        list_of_docs = []
        for doc in docs:
            tmp = self.text_splitter.split_text(doc.page_content)
            for chunk in tmp:
                list_of_docs.append(
                    Document(
                        page_content=chunk,
                        metadata=doc.metadata,
                    )
                )
        return list_of_docs
The Chunker class is responsible for splitting the loaded document into smaller chunks of text. At initialization, the chunk size and overlap are controlled by setting chunk_size (default 1000 characters) and chunk_overlap (default 100 characters). RecursiveCharacterTextSplitter uses a series of delimiters (including paragraph separators, line breaks, spaces, punctuation marks, etc.) to split text, giving priority to splitting at natural boundaries. The chunk_docs method processes the input document list, creates a new Document object for each text chunk, and retains the metadata of the original document.
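A short usage sketch, reusing PdfLoader from the previous component (the smaller chunk size here is chosen purely for illustration):

# Usage sketch: chunk the loaded pages ("example.pdf" is a placeholder)
docs = PdfLoader().read_file("example.pdf")
chunker = Chunker(chunk_size=500, chunk_overlap=50)
chunks = chunker.chunk_docs(docs)
print(f"{len(docs)} pages -> {len(chunks)} chunks")
print(chunks[0].page_content[:200])
print(chunks[0].metadata)  # metadata of the original page is preserved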
Component 3: Vector Storage
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings
from uuid import uuid4
class VectorStore:
    def __init__(self):
        self.embeddings = OllamaEmbeddings(model="llama3.2:3b")
        # The index dimension is inferred from a sample embedding
        self.index = faiss.IndexFlatL2(len(self.embeddings.embed_query("hello world")))
        self.vector_store = FAISS(
            embedding_function=self.embeddings,
            index=self.index,
            docstore=InMemoryDocstore(),
            index_to_docstore_id={},
        )

    def add_docs(self, list_of_docs):
        # Assign a unique ID to each chunk and index it
        uuids = [str(uuid4()) for _ in range(len(list_of_docs))]
        self.vector_store.add_documents(documents=list_of_docs, ids=uuids)

    def search_docs(self, query, k=5):
        results = self.vector_store.similarity_search(query, k=k)
        return results
The VectorStore class is the core of the retrieval system. During initialization, an OllamaEmbeddings embedding model is created (llama3.2:3b model is used here), and an index for L2 distance calculation is created based on FAISS. At the same time, a vector storage containing embedding functions, indexes, and document storage is initialized. The add_docs method generates a unique ID for each document and adds the document to the vector storage, which calculates the embedding of the document content and indexes it. The search_docs method converts the input query into an embedding, performs a similarity search in the vector storage, and returns the most similar k documents. In actual production, you can consider using persistent vector storage, adding metadata filtering functions, or implementing hybrid search.
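A usage sketch follows. It assumes a local Ollama server is running and the llama3.2:3b model has already been pulled, since the embeddings are computed by Ollama; the query text is only an example.

# Usage sketch: index the chunks and run a similarity search
# (assumes Ollama is running locally with the llama3.2:3b model pulled)
store = VectorStore()
store.add_docs(chunks)  # chunks produced by the Chunker above
for doc in store.search_docs("What fees does the fund charge?", k=3):
    print(doc.metadata.get("page"), doc.page_content[:100])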
Component 4: RAG system
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from langchain_ollama import OllamaLLM
from pdf_loader import PdfLoader
from vector_store import VectorStore
from chunk_text import Chunker
class RAG:
    def __init__(self):
        self.instructor_prompt = """Instruction: You're an expert problem solver you answer questions from context given below. You strictly adhere to the context and never move away from it. You're honest and if you do not find the answer to the question in the context you politely say "I Don't know!"
So help me answer the user question mentioned below with the help of the context provided
User Question: {user_query}
Answer Context: {answer_context}
"""
        self.prompt = PromptTemplate.from_template(self.instructor_prompt)
        self.llm = OllamaLLM(model="llama3.2:3b")  # or OpenAI()
        self.vectorStore = VectorStore()
        self.pdfloader = PdfLoader()
        self.chunker = Chunker()

    def run(self, filePath, query):
        # Ingest: load the PDF, chunk it, and index the chunks
        docs = self.pdfloader.read_file(filePath)
        list_of_docs = self.chunker.chunk_docs(docs)
        self.vectorStore.add_docs(list_of_docs)
        # Retrieve: find the chunks most relevant to the query
        results = self.vectorStore.search_docs(query)
        answer_context = "\n\n"
        for res in results:
            answer_context = answer_context + "\n\n" + res.page_content
        # Generate: pass the query and retrieved context to the language model
        chain = self.prompt | self.llm
        response = chain.invoke(
            {
                "user_query": query,
                "answer_context": answer_context,
            }
        )
        return response

if __name__ == "__main__":
    rag = RAG()
    filePath = "investment.pdf"
    query = "How to invest?"
    response = rag.run(filePath, query)
    print(response)
The RAG class integrates the components built previously to form a complete RAG system. At initialization, a prompt template that guides the language model is defined, a PromptTemplate object is created, and the language model, vector storage, PDF loader, and text chunker are initialized. The run method implements the complete RAG workflow: load the PDF document, chunk it, add it to the vector storage, search for relevant text chunks based on the user query, combine the retrieved text chunks to form a context, and combine the prompt template with the language model to generate an answer. In the main program, create a RAG instance, specify the PDF file path and query, run the system, and print the results.
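One practical refinement: as written, run() reloads, re-chunks, and re-indexes the PDF on every call. A minimal sketch of separating ingestion from querying is shown below; the subclass and its method names (index_file, ask) are illustrative, not part of the original design.

# Sketch: index once, query many times (class and method names are illustrative)
class ReusableRAG(RAG):
    def index_file(self, filePath):
        docs = self.pdfloader.read_file(filePath)
        self.vectorStore.add_docs(self.chunker.chunk_docs(docs))

    def ask(self, query):
        results = self.vectorStore.search_docs(query)
        answer_context = "\n\n".join(res.page_content for res in results)
        chain = self.prompt | self.llm
        return chain.invoke({"user_query": query, "answer_context": answer_context})

# rag = ReusableRAG()
# rag.index_file("investment.pdf")   # one-time ingestion
# print(rag.ask("How to invest?"))   # repeated queries reuse the same index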
6. Advanced Considerations and Improvements
Although the above implementation has laid a solid foundation for the RAG system, there are still many aspects that can be further optimized and improved in actual production applications:
- Document processing enhancements
Support multiple document formats, such as Word documents, web pages, and databases; extract document metadata such as creation date, author, and title; integrate OCR to process scanned documents or images; and implement specialized extraction and handling of table data.
- Chunking strategy optimization
Use semantic chunking to segment text by meaning rather than simply by character count; implement hierarchical chunking to preserve document structure and establish parent-child relationships between chunks; and include chapter titles or structural information in chunk metadata to improve retrieval and comprehension.
- Embedding and retrieval improvements
Add a re-ranking step to refine the initial search results; combine vector similarity with keyword-based search (such as BM25) in a hybrid retriever to improve accuracy (a sketch of this follows the list); automatically expand queries to improve recall; and use cross-encoder re-ranking, which has higher computational cost but yields more accurate results.
- Large language model integration optimization
Implement streaming responses to improve the user experience, especially for long answers; adjust prompts to guide the model through step-by-step reasoning; let the model evaluate and refine its own answers; and decompose complex queries into sub-problems to handle complex tasks better.
- Assessment and monitoring
Evaluate the relevance of retrieved documents to the query; compare generated answers with standard answers when available and evaluate the accuracy of the answers; detect whether the model produces hallucination information; establish a user feedback loop and continuously improve system performance based on user feedback.
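To make the hybrid-search idea concrete, the sketch below combines the FAISS vector store built earlier with LangChain's BM25Retriever through an EnsembleRetriever. This is only a sketch: it assumes the optional rank_bm25 package is installed, reuses the chunks and store objects from the earlier usage sketches, and the weights and k values are illustrative.

# Sketch: hybrid retrieval combining BM25 keyword matching with vector similarity
# (assumes: pip install rank_bm25; `chunks` and `store` come from earlier components)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(chunks)
keyword_retriever.k = 5
vector_retriever = store.vector_store.as_retriever(search_kwargs={"k": 5})

hybrid = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.4, 0.6],  # illustrative weighting between keyword and semantic scores
)
results = hybrid.invoke("What fees does the fund charge?")
print(len(results), results[0].page_content[:100])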
7. RAG Extension in Production Environment
1. Security and Compliance
In a production environment, the data processed by the RAG system may contain sensitive information, such as the company's business secrets, customers' personal data, etc. Therefore, it is crucial to implement strict security and compliance measures.
- Access Control
Set up multi-level access permissions for sensitive documents to ensure that only authorized personnel or services can access specific knowledge base content. Permissions can be divided by factors such as user role, department, and data sensitivity. For example, users in the finance department may only access finance-related documents, and staff at different levels have different access rights.
- Logging
Keep detailed operation logs, including document access records, query content, and model responses. These logs help track system usage and provide a basis for security audits. Analyzing them can surface potential risks in a timely manner, such as abnormal query behavior or unauthorized access attempts.
- Data compliance processing
Ensure that the processing of personally identifiable information (PII) complies with relevant regulations, such as GDPR, CCPA, etc. In the process of data collection, storage, use and sharing, follow strict data protection principles, encrypt the storage and transmission of PII, and avoid legal risks caused by data leakage.
2. Performance Optimization
In order to meet the needs of a large number of users and complex queries in the production environment, comprehensive performance optimization of the RAG system is required.
- Precomputed Embeddings
For large document collections, pre-compute the embedding vectors of text chunks when the system is initialized or when documents are updated. This eliminates the need to compute embeddings in real time at query time, greatly reducing response latency. Embeddings can be recalculated periodically to reflect changes in document content or to adopt more advanced embedding models.
- Cache mechanism
Implement caching at multiple levels, including a query cache, an embedding cache, and a response cache (a minimal embedding-cache sketch follows this list). The query cache stores common queries and their corresponding search results, so repeated queries return cached results directly; the embedding cache keeps already-computed chunk embedding vectors to avoid recomputation; the response cache stores model-generated answers to speed up repeated questions.
- Quantization techniques
Use quantization to convert high-dimensional embedding vectors into lower-precision representations, reducing storage space and computational cost without significantly losing semantic information. For example, converting 32-bit floating-point vectors to 16-bit or 8-bit representations speeds up similarity search while reducing memory and compute consumption.
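The sketch below illustrates the embedding-cache idea with a simple in-memory wrapper around the embedding model; the class name and the unbounded dictionary cache are illustrative only, and a production system would use a bounded or persistent cache instead.

# Sketch: a simple in-memory embedding cache (class name and cache policy are illustrative)
class CachedEmbeddings:
    def __init__(self, base_embeddings):
        self.base = base_embeddings   # e.g. OllamaEmbeddings(model="llama3.2:3b")
        self._cache = {}

    def embed_query(self, text):
        # Reuse the cached vector when the same text has been embedded before
        if text not in self._cache:
            self._cache[text] = self.base.embed_query(text)
        return self._cache[text]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]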
3. Infrastructure
Reasonable infrastructure architecture is the key to ensuring the stable operation and scalability of the RAG system in a production environment.
- Containerized deployment
Use container technology (such as Docker) to package the components of the RAG system (document loader, text chunker, vector storage, language model, etc.) into independent containers. Containerized deployment makes components easier to deploy, manage, and update, while isolating their runtime environments and improving stability and security.
- Microservices Architecture
Split the RAG system into multiple microservices, each responsible for a specific function, such as a document processing service, a retrieval service, and a language model service. A microservices architecture improves scalability, allowing the resources of each service to be scaled independently according to demand, and reduces coupling, making maintenance and upgrades easier.
- Queue system
Introduce a queue system (such as Kafka or RabbitMQ) to handle asynchronous tasks over large numbers of documents, such as document loading and embedding computation (a minimal worker-queue sketch follows this list). When many documents need to be processed, tasks are placed on a queue and handled in turn by background workers, avoiding performance degradation from task pile-up and keeping the system stable under high load.
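To make the queue idea concrete, here is a minimal sketch using Python's standard-library queue and a background worker thread; in production, a broker such as Kafka or RabbitMQ with dedicated worker processes would replace this, and `rag` refers to the RAG instance built earlier.

# Sketch: asynchronous document ingestion with a worker queue
# (standard-library stand-in for Kafka/RabbitMQ; `rag` is the RAG instance from earlier)
import queue
import threading

ingest_queue = queue.Queue()

def ingestion_worker():
    while True:
        file_path = ingest_queue.get()   # block until a task arrives
        try:
            docs = rag.pdfloader.read_file(file_path)
            rag.vectorStore.add_docs(rag.chunker.chunk_docs(docs))
        finally:
            ingest_queue.task_done()     # mark the task as processed

threading.Thread(target=ingestion_worker, daemon=True).start()
ingest_queue.put("investment.pdf")       # enqueue documents without blocking callers
ingest_queue.join()                      # optionally wait for the backlog to drain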
4. Persistence
Ensure that the data and model status in the RAG system can be persisted so that the system can be quickly restored when it is restarted or fails.
- Persistent Storage
Choose reliable persistent storage (such as vector databases like Pinecone, Weaviate, or Chroma, plus a relational or NoSQL database for document metadata) to store embedding vectors and document information. These databases provide durable storage, efficient indexing and querying, and help ensure data security and accessibility (a sketch of persisting the FAISS index used in this article appears below).
- Incremental Updates
Implement an incremental update mechanism: when a new document is added or an existing one is updated, only the changed part is processed instead of reprocessing the entire collection. For example, in the vector store, only the embedding vectors of newly added or modified documents are updated, reducing processing overhead and improving update efficiency.
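For the FAISS store used in this article, a simple form of persistence is to serialize the index to disk and reload it at startup. The sketch below uses LangChain's save_local/load_local helpers; the directory name "faiss_index" is a placeholder, and allow_dangerous_deserialization is needed in recent versions because the docstore is pickled on disk.

# Sketch: persist the FAISS index to disk and reload it later
# ("faiss_index" is a placeholder directory name; `store` is the VectorStore from earlier)
from langchain_community.vectorstores import FAISS

store.vector_store.save_local("faiss_index")

restored = FAISS.load_local(
    "faiss_index",
    store.embeddings,
    allow_dangerous_deserialization=True,  # the docstore is pickled on disk
)
print(restored.similarity_search("How to invest?", k=3))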
Retrieval-augmented generation (RAG) is an important breakthrough in the development of large language models. By combining external knowledge sources, it significantly improves the practicality, reliability and credibility of language models. This article introduces the concept, development history and importance of RAG in detail, as well as the whole process of building a RAG system from scratch using Python and open source libraries, including the implementation of core components such as document loading, text segmentation, vector storage and response generation.
At the same time, to address the needs of production environments, a series of advanced improvement strategies and extension points were discussed, covering document processing optimization, chunking strategy improvement, embedding and retrieval enhancement, language model integration optimization, system evaluation and monitoring, and production deployment. Through these measures, a RAG system can be continuously improved to better fit a wide range of practical application scenarios.