Analysis of the implementation of RAG technology in the RAGFlow project

Written by Caleb Hayes
Updated on: June 13, 2025
Recommendation

In-depth exploration of the innovative practices and core implementations of RAG technology in the RAGFlow project.

Core content:
1. Overview of the RAGFlow project and its application in deep document understanding
2. Multi-format document parsing and visual information processing technology
3. Key points of document segmentation strategies and embedding model selection and configuration


Project Overview

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It provides a streamlined RAG workflow, combines large language models (LLMs) to deliver truthful question-answering capabilities, and backs its answers with citations drawn from data in a variety of complex formats. This article analyzes RAGFlow's source code and official documentation to summarize the best practices it adopts when implementing RAG technology.

Document processing and segmentation

Deep Document Understanding (DeepDoc)

A core feature of RAGFlow is its deep document understanding capability, which is mainly implemented in the deepdoc module:

  1. Multi-format document parsing:
     • Supports multiple document formats, including PDF, DOCX, Excel, PPT, TXT, images, etc.
     • Each format has a dedicated parser (e.g. pdf_parser.py, docx_parser.py, etc.)
     • Handles complex layouts and formats while preserving the document's structural information
  2. Visual information processing:
     • OCR: extracts text from images and scanned documents
     • Layout recognition: recognizes 10 basic layout components (text, title, figure, table, etc.)
     • Table Structure Recognition (TSR): handles complex tables, including hierarchical headers and cells that span rows or columns
  3. Document segmentation strategy:
     • Template-based chunking
     • Preserves the semantic structure and context of the document
     • Visualizes segmentation results, allowing manual intervention and adjustment

From the source code, we can see that RAGFlow implements specialized parsers for different document types in the deepdoc/parser directory, ensuring that the content and structure of each kind of document can be accurately extracted. This deep understanding capability is the foundation for high-quality retrieval in a RAG system.
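As a rough illustration of this design, the sketch below dispatches a file to a format-specific parser by extension. The class and function names here are hypothetical simplifications for this article, not RAGFlow's actual API.

import os

class BaseParser:
    # Hypothetical base class standing in for the format-specific parsers
    # under deepdoc/parser (pdf_parser.py, docx_parser.py, ...).
    def parse(self, path: str) -> list:
        # Placeholder: a real parser would return structured sections
        # (headings, paragraphs, tables) rather than raw bytes.
        with open(path, "rb") as f:
            return [f.read().decode(errors="ignore")]

class PdfParser(BaseParser): pass
class DocxParser(BaseParser): pass
class ExcelParser(BaseParser): pass
class TxtParser(BaseParser): pass

# Map each file extension to the parser responsible for it.
PARSERS = {
    ".pdf": PdfParser,
    ".docx": DocxParser,
    ".xlsx": ExcelParser,
    ".txt": TxtParser,
}

def parse_document(path: str) -> list:
    # Pick a parser by extension so each format gets format-aware handling.
    ext = os.path.splitext(path)[1].lower()
    parser_cls = PARSERS.get(ext)
    if parser_cls is None:
        raise ValueError(f"Unsupported document format: {ext}")
    return parser_cls().parse(path)

The point of the dispatch table is the same as RAGFlow's one-parser-per-format layout: each format keeps its own structure-aware extraction logic instead of being flattened through a single generic text extractor.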


Document segmentation implementation

In rag/nlp/rag_tokenizer.py, RAGFlow implements the tokenization and text-processing logic:

class RagTokenizer:
    # Implements a variety of tokenization and text-processing methods

    def dfs_(self, chars, s, preTks, tkslist, _depth=0, _memo=None):
        # Depth-first search over candidate segmentations
        # ...

    def maxForward_(self, line):
        # Maximum forward matching algorithm
        # ...

    def maxBackward_(self, line):
        # Maximum backward matching algorithm
        # ...
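To make the matching methods concrete, here is a minimal, self-contained sketch of maximum forward matching against a toy vocabulary. It is not RAGFlow's implementation, only an illustration of the algorithm named above.

def max_forward_match(text: str, vocab: set, max_len: int = 4) -> list:
    # Greedy maximum forward matching: at each position take the longest
    # vocabulary entry that matches, falling back to a single character.
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: one character
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

# Example: the longest known word wins at each step.
print(max_forward_match("deepdocumentunderstanding",
                        {"deep", "document", "understanding"}, max_len=13))
# -> ['deep', 'document', 'understanding']

Maximum backward matching is the mirror image (scanning from the end of the string), and the dfs_ method explores multiple candidate segmentations rather than committing greedily.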

RAGFlow uses a variety of segmentation strategies, including:

  1. Semantic-based segmentation: maintain semantic integrity and avoid fragmenting context
  2. Length-based segmentation: control the number of tokens in each chunk so chunks are neither too long nor too short
  3. Structure-based segmentation: respect the original structure of the document (paragraphs, headings, etc.)
  4. Configurable segmentation templates: multiple template options to suit different document types and retrieval requirements

This flexible segmentation strategy ensures that sufficient context is obtained during retrieval while avoiding interference from irrelevant information.
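A minimal sketch of how length-based and structure-based segmentation can be combined: split on paragraph boundaries first, then pack paragraphs into chunks under a token budget. The whitespace token count is a naive approximation for illustration, not RAGFlow's tokenizer.

def chunk_by_structure_and_length(text: str, max_tokens: int = 256) -> list:
    # Split on blank lines (paragraph structure), then pack paragraphs
    # into chunks that stay under a rough token budget.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n_tokens = len(para.split())  # naive token estimate
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

Because paragraphs are never split mid-way, each chunk keeps a coherent piece of context, which is the property the template-based strategies above aim for.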

Embedding model selection and configuration

RAGFlow supports multiple embedding models. A rich set of model interfaces is implemented in rag/llm/embedding_model.py:

class DefaultEmbedding(Base):
    # Uses the FlagEmbedding model by default
    def __init__(self, key, model_name, **kwargs):
        # BAAI/bge-large-zh-v1.5 is used by default
        # ...

class OpenAIEmbed(Base):
    # OpenAI's embedding model
    # ...

class QWenEmbed(Base):
    # Tongyi Qianwen's embedding model
    # ...

class ZhipuEmbed(Base):
    # Zhipu AI's embedding model
    # ...

RAGFlow's embedding model practices include:

  1. Multiple model support:
     • Supports mainstream embedding models such as OpenAI, Zhipu AI, Tongyi Qianwen, etc.
     • Uses the BAAI/bge-large-zh-v1.5 model by default, which is optimized for Chinese and English
     • Supports locally deployed models (such as Ollama, LocalAI, etc.)
  2. Batch processing optimization:
     • Encodes texts in batches to improve efficiency
     • Automatically truncates long texts to avoid exceeding the model's maximum input length
  3. Consistency guarantee:
     • Ensures that all documents in the same knowledge base use the same embedding model
     • Keeps the vector space consistent, which improves retrieval quality
  4. Flexible configuration:
     • Allows different embedding models to be selected for different knowledge bases
     • Supports adjusting model parameters through configuration files

The official RAGFlow Docker image (non-slim version) comes with two optimized embedding models pre-installed: BAAI/bge-large-zh-v1.5 and maidalun1020/bce-embedding-base_v1. Both are optimized for Chinese and English, providing good bilingual support.
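As an illustration of the batch-encoding and truncation points above, the following sketch wraps an arbitrary embedding callable. The batch size, the character-based length guard, and the embed_fn parameter are assumptions for the example, not RAGFlow's actual interface.

def encode_in_batches(texts: list, embed_fn,
                      batch_size: int = 16, max_chars: int = 2048) -> list:
    # Truncate overly long texts, then encode them in fixed-size batches
    # to limit per-request payloads and the number of API round trips.
    truncated = [t[:max_chars] for t in texts]  # crude length guard
    vectors = []
    for start in range(0, len(truncated), batch_size):
        batch = truncated[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # embed_fn: list[str] -> list[vector]
    return vectors

A real implementation would truncate by tokens rather than characters, but the shape of the loop is the same; the consistency guarantee simply means the same embed_fn is used for every document and every query in a given knowledge base.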

Selection and implementation of the vector database

RAGFlow uses a flexible architecture for its vector database. Database connections and operations are abstracted in rag/utils/doc_store_conn.py:

class DocStoreConnection:
    # Abstract base class for vector database connections
    # ...

class OpenSearchConnection(DocStoreConnection):
    # OpenSearch implementation
    # ...

RAGFlow's vector database practices include:

  1. OpenSearch by default:
     • Serves as the default vector database, providing high-performance vector search
     • Supports complex query and filtering operations
     • Scales horizontally well, making it suitable for large-scale data
  2. Index design:
     • Creates a separate index for each knowledge base
     • Uses the naming convention ragflow_{uid} to ensure index names are unique
     • Stores document metadata alongside vector data and supports multiple query methods
  3. Hybrid search strategy:
     • Combines keyword search with vector similarity search
     • Uses weighted fusion of the two scores to improve search quality
     • Supports multiple similarity measures (such as cosine similarity)
  4. Performance optimization:
     • Uses batch operations to improve write efficiency
     • Uses caching to reduce repeated computation
     • Supports sharding and replication to improve availability and performance

RAGFlow's vector database is designed with emphasis on flexibility and performance, capable of handling large-scale document collections and supporting complex retrieval requirements.
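To illustrate the abstraction and the per-knowledge-base index naming described above, here is a minimal sketch. The method names and the in-memory stand-in are simplified assumptions for this article, not the actual DocStoreConnection interface.

from abc import ABC, abstractmethod

def index_name(kb_uid: str) -> str:
    # Per-knowledge-base index, following the ragflow_{uid} convention.
    return f"ragflow_{kb_uid}"

class DocStoreConnection(ABC):
    # Abstract store so different engines can plug in behind one interface.

    @abstractmethod
    def create_index(self, kb_uid: str) -> None: ...

    @abstractmethod
    def insert(self, kb_uid: str, chunks: list) -> None: ...

    @abstractmethod
    def search(self, kb_uid: str, query: str, vector: list, topk: int) -> list: ...

class InMemoryStore(DocStoreConnection):
    # Toy stand-in for an engine such as OpenSearch, for illustration only.

    def __init__(self):
        self._indexes = {}

    def create_index(self, kb_uid: str) -> None:
        self._indexes.setdefault(index_name(kb_uid), [])

    def insert(self, kb_uid: str, chunks: list) -> None:
        self._indexes[index_name(kb_uid)].extend(chunks)

    def search(self, kb_uid: str, query: str, vector: list, topk: int) -> list:
        # A real implementation would combine keyword and vector scores here.
        return self._indexes[index_name(kb_uid)][:topk]

Keeping one index per knowledge base means each collection can be created, queried, and dropped independently, and a different engine can be swapped in behind the same interface.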

Retrieval implementation mechanism

RAGFlow's retrieval mechanism is mainly implemented in rag/nlp/search.py, which combines several techniques to improve search quality:

class Dealer:
    def search(self, req, idx_names: str | list[str], kb_ids: list[str],
               emb_mdl=None, highlight=False, rank_feature: dict | None = None):
        # Implements the retrieval logic
        # ...

    def insert_citations(self, answer, chunks, chunk_v, embd_mdl,
                         tkweight=0.1, vtweight=0.9):
        # Implements the citation insertion logic
        # ...

RAGFlow's retrieval practices include:

  1. Hybrid search strategy:
     • Combines full-text search with vector similarity search
     • Uses default weights of 0.05 for the keyword score and 0.95 for vector similarity
     • Supports adjusting the weights to suit different scenarios
  2. Multi-stage retrieval:
     • First performs a preliminary search to obtain candidate documents
     • Then applies reranking to improve relevance
     • Finally filters and sorts the results
  3. Adaptive retrieval:
     • Automatically lowers the matching threshold and searches again when too few results are returned
     • Supports filtering by similarity threshold to ensure result quality
     • Exposes configurable search parameters such as topk and the similarity threshold
  4. Citation and traceability:
     • Automatically inserts citations into generated answers
     • Supports viewing citation sources, increasing transparency and credibility
     • Uses the citation format [ID:n], making it easy for users to trace the source of information
  5. Keyword enhancement:
     • Supports adding keywords to document chunks to improve ranking for specific queries
     • Implements synonym expansion to improve recall
     • Supports custom stop words and weight adjustment

RAGFlow's retrieval mechanism emphasizes accuracy and explainability: it improves retrieval quality by combining multiple techniques and offers flexible configuration options for different application scenarios.
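A minimal sketch of the weighted-fusion idea, using the default weights quoted above (0.05 keyword, 0.95 vector similarity). The keyword score and vectors are placeholders, not RAGFlow's actual scorers.

import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fused_score(keyword_score: float, query_vec: list, chunk_vec: list,
                tkweight: float = 0.05, vtweight: float = 0.95) -> float:
    # Weighted fusion of a full-text (keyword) score and vector similarity.
    return tkweight * keyword_score + vtweight * cosine_similarity(query_vec, chunk_vec)

# Rank candidate chunks by the fused score, highest first.
candidates = [
    {"id": 1, "kw": 0.8, "vec": [0.1, 0.9]},
    {"id": 2, "kw": 0.2, "vec": [0.7, 0.7]},
]
query_vec = [0.6, 0.8]
ranked = sorted(candidates,
                key=lambda c: fused_score(c["kw"], query_vec, c["vec"]),
                reverse=True)

With vector similarity weighted so heavily, the keyword score mostly acts as a tiebreaker and an exact-match boost, which is why the weights are exposed as tunable parameters for different scenarios.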

RAG Process Integration and Optimization

RAGFlow provides a complete RAG workflow, from document upload, parsing, segmentation, and indexing to retrieval and generation, forming an automated pipeline:

  1. Process automation:
     • Provides an intuitive web interface that simplifies operation
     • Automates document processing and indexing
     • Supports batch processing of multiple documents
  2. Human intervention:
     • Visualizes segmentation results and allows manual adjustment
     • Supports adding keywords and editing chunk content
     • Provides a retrieval test function to verify the configuration
  3. Multi-model integration:
     • Supports configuring different LLMs as the generation model
     • Supports multiple embedding models
     • Supports image understanding models for handling multimodal content
  4. Performance optimization:
     • Uses parallel processing to speed up document parsing
     • Uses batch operations to reduce the number of API calls
     • Uses caching to reduce repeated computation
  5. Scalable architecture:
     • Modular design for easy extension and customization
     • Supports integration into other systems via API
     • Supports Docker deployment to simplify environment setup

RAGFlow's overall architecture focuses on ease of use and flexibility, and is suitable for needs of all sizes, from personal use to large-scale enterprise applications.
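The end-to-end workflow described above (parse, segment, embed, index, then retrieve and generate) can be tied together in a short sketch that reuses the hypothetical helpers from the earlier sections. All names here are placeholders for this article, not RAGFlow's API; embed_fn and llm_fn stand for whichever embedding and generation models are configured.

def build_knowledge_base(store, kb_uid: str, paths: list, embed_fn):
    # Parse, chunk, embed, and index a batch of documents.
    store.create_index(kb_uid)
    for path in paths:
        text = "\n\n".join(parse_document(path))        # deep document parsing
        chunks = chunk_by_structure_and_length(text)    # structure-aware chunking
        vectors = encode_in_batches(chunks, embed_fn)   # batch embedding
        store.insert(kb_uid, [{"text": c, "vec": v} for c, v in zip(chunks, vectors)])

def answer(store, kb_uid: str, question: str, embed_fn, llm_fn, topk: int = 5):
    # Retrieve relevant chunks and let the LLM answer with [ID:n] citations.
    qvec = encode_in_batches([question], embed_fn)[0]
    hits = store.search(kb_uid, question, qvec, topk)   # hybrid retrieval
    context = "\n\n".join(f"[ID:{i}] {h['text']}" for i, h in enumerate(hits))
    prompt = f"Answer using the context and cite sources as [ID:n].\n\n{context}\n\nQ: {question}"
    return llm_fn(prompt)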

Summary

RAGFlow's best practices in implementing RAG technology are mainly reflected in the following aspects:

  1. Deep document understanding: specialized parsers and visual processing technology enable deep understanding of complex documents, laying the foundation for high-quality retrieval.
  2. Intelligent document segmentation: a template-based segmentation strategy preserves the document's semantic structure and contextual relationships, improving retrieval accuracy.
  3. Flexible model support: multiple embedding models and LLMs are supported, so users can choose the combination that best fits their needs.
  4. Efficient vector storage: OpenSearch serves as the vector database, providing high-performance retrieval and good scalability.
  5. Hybrid retrieval strategy: keyword search and vector similarity search are combined through weighted fusion to improve retrieval quality.
  6. Explainable design: citation and traceability features enhance the credibility and transparency of generated content.
  7. Human-machine collaboration: human intervention and adjustment are allowed alongside automated processes, improving overall efficiency.

These best practices of RAGFlow provide a valuable reference for building high-quality RAG systems, especially the designs for processing complex documents, improving retrieval quality, and enhancing the credibility of generated content.

Later I will conduct a more detailed analysis of each module covered in this article.