Analysis of RagFlow document parsing process

In-depth analysis of the innovation and optimization of the RagFlow document processing engine, exploring new ideas for open source RAG applications.
Core content:
1. RagFlow's document parsing and retrieval mechanism
2. Task segmentation and deduplication optimization strategy
3. The application of multiple document parsers and PDF document parsing process
RagFlow is a popular open source RAG application. Its defining feature is a document processing engine based on DeepDoc, which can greatly improve the practical effectiveness of RAG. A while ago I read through the RagFlow source code (version 0.17.0) for work, and found that it really does have some unique strengths in document parsing and retrieval. Here I'll share my understanding, in the hope of giving you some new ideas for RAG optimization.
The most important part of RAG is document parsing: "garbage in, garbage out". If parsing is poor and information that should be captured is lost, then no amount of optimization in the later retrieval stages will help. So let's first look at how RagFlow does document parsing.
Task generation and management
When a user submits a document parsing request on a page, RagFlow will encapsulate it as an asynchronous task and process it in the background.
1. Task segmentation: the system splits tasks according to document type and configuration rules. For example:
• PDF files are split by page range (pages 1-50, 51-100, and so on).
• Excel files are split by rows (every 3,000 rows becomes a subtask).
The resulting subtasks are placed in an asynchronous task queue, managed and dispatched via Redis.
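The page-range splitting described above can be sketched roughly as follows (the `Subtask` type and function name are illustrative, not RagFlow's actual code):

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    doc_id: str
    from_page: int
    to_page: int

def split_pdf_task(doc_id: str, total_pages: int, pages_per_task: int = 50) -> list[Subtask]:
    """Split a PDF parsing job into page-range subtasks (e.g. pages 0-50, 50-100, ...)."""
    tasks = []
    for start in range(0, total_pages, pages_per_task):
        tasks.append(Subtask(doc_id, start, min(start + pages_per_task, total_pages)))
    return tasks
```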
2. Task deduplication: a hash is computed from the task information and used to deduplicate the queue, so identical tasks are not processed twice.
RagFlow's parsers come in two layers:
• File-type parsers: the core parsing logic for PDF, PPT, Word and other file formats; the source code lives under deepdoc/parser.
• Content-type parsers: refine the processing for different kinds of content (papers, Q&A, tables, etc.) based on the document's characteristics. Users choose the parser that best fits their documents.
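The hash-based deduplication step can be sketched like this (a minimal illustration; the exact payload fields and hash choice are assumptions, not RagFlow's implementation):

```python
import hashlib
import json

def task_digest(task: dict) -> str:
    """Stable hash of the task payload, used as a dedup key."""
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def enqueue_unique(queue: list, seen: set, task: dict) -> bool:
    """Enqueue the task only if an identical one has not been seen before."""
    key = task_digest(task)
    if key in seen:
        return False
    seen.add(key)
    queue.append(task)
    return True
```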
Document Parser
When processing document parsing tasks, RagFlow will determine how to parse the file based on the file type of the document and the parser selected by the user. RagFlow provides a variety of parsers that are optimized for different document types and content characteristics. Parsers are divided into two categories:
```python
class ParserType(StrEnum):
    PRESENTATION = "presentation"
    LAWS = "laws"
    MANUAL = "manual"
    PAPER = "paper"
    RESUME = "resume"
    BOOK = "book"
    QA = "qa"
    TABLE = "table"
    NAIVE = "naive"
    PICTURE = "picture"
    ONE = "one"
    AUDIO = "audio"
    EMAIL = "email"
    KG = "knowledge_graph"
    TAG = "tag"
```
Document parsing process
Let's take PDF parsing as the example. PDF is probably the most common document type in daily work, and because of the variety of sources (exported from Word or PPT, scanned image-only PDFs, natively generated PDFs, etc.), its processing is the most complex of all document types. Parsing proceeds in six main steps (using the general-purpose parser here; the source code is in rag/app/naive.py).
```python
def __call__(self, filename, binary=None, from_page=0,
             to_page=100000, zoomin=3, callback=None):
    start = timer()
    first_start = start
    callback(msg="OCR started")
    self.__images__(
        filename if not binary else binary,
        zoomin,
        from_page,
        to_page,
        callback
    )
    callback(msg="OCR finished ({:.2f}s)".format(timer() - start))
    logging.info("OCR({}~{}): {:.2f}s".format(from_page, to_page, timer() - start))

    start = timer()
    self._layouts_rec(zoomin)
    callback(0.63, "Layout analysis ({:.2f}s)".format(timer() - start))

    start = timer()
    self._table_transformer_job(zoomin)
    callback(0.65, "Table analysis ({:.2f}s)".format(timer() - start))

    start = timer()
    self._text_merge()
    callback(0.67, "Text merged ({:.2f}s)".format(timer() - start))

    tbls = self._extract_table_figure(True, zoomin, True, True)
    # self._naive_vertical_merge()
    self._concat_downward()
    # self._filter_forpages()

    logging.info("layouts cost: {}s".format(timer() - first_start))
    return [(b["text"], self._line_tag(b, zoomin)) for b in self.boxes], tbls
```
1. Image conversion and OCR extraction
• Convert PDF pages to high-resolution images.
• Use OCR to extract text, combined with the PDF's native text-extraction capability, to improve extraction quality.
• Advantage: unified handling of image and text content, compatible with scanned-PDF scenarios.
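Combining native text with OCR output might look like the following minimal sketch (the function and its threshold are illustrative assumptions; RagFlow's actual logic is more involved):

```python
def merge_page_text(native_text: str, ocr_text: str, min_native_chars: int = 10) -> str:
    """Prefer the PDF's embedded text layer when it looks usable;
    fall back to OCR output for scanned / image-only pages."""
    if native_text and len(native_text.strip()) >= min_native_chars:
        return native_text.strip()
    return ocr_text.strip()
```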
2. Layout Analysis
• Use pre-trained models to analyze each page's layout and divide it into regions of different types: text, title, chart, header/footer, etc.
• Record each region's type and its coordinates in the image, and associate it with the OCR text-box results, providing data for the subsequent steps.
3. Table enhancement
• For areas identified as tables in layout analysis, use a pre-trained table model to extract more detailed structured table data (row and column information).
4. Simple text block merging
• Merge the previously recognized text boxes to improve coherence and readability. Unlike the merging in step 5, no pre-trained model is used here, only simple rules based on layout.
• Merge conditions:
  • Layout consistency: the boxes belong to the same layout region and are normal text.
  • Vertical alignment: the vertical gap between boxes is less than 1/3 to 1/5 of the page's average line height.
  • Horizontal continuity: the horizontal spacing is within a threshold, or the boxes are joined by punctuation.
• Merge operations: expand coordinates, align centers, concatenate text, remove redundancy.
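A toy version of such a rule-based merge check (box keys, thresholds, and punctuation set are all illustrative; RagFlow's actual rules are richer):

```python
def can_merge(a: dict, b: dict, avg_line_height: float) -> bool:
    """Rule-based check for merging two adjacent OCR text boxes.
    Boxes are dicts with layout id, bounding-box coordinates, and text."""
    # Layout consistency: same layout region.
    same_layout = a["layout_id"] == b["layout_id"]
    # Vertical alignment: gap below a fraction of the average line height.
    vertical_ok = (b["top"] - a["bottom"]) < avg_line_height / 3
    # Horizontal continuity: b starts close to where a ends,
    # or a ends with a connecting punctuation mark.
    horizontal_ok = abs(b["left"] - a["right"]) < avg_line_height or a["text"].endswith((",", "-"))
    return same_layout and vertical_ok and horizontal_ok
```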
5. Merge text blocks vertically
• Further merge text boxes in the vertical direction: vertically adjacent, semantically related boxes (such as paragraphs spanning lines or pages) are merged into complete paragraphs, fixing incorrect segmentation in the OCR results. An XGBoost model is used here for the continuity judgment.
• Model feature inputs: geometric features (spacing and height ratio of the boxes), contextual features (trailing punctuation, page span), semantic features (tokenization continuity), and layout features (table relevance).
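Building a feature vector for such a continuity classifier might look like this (a simplified illustration; the real model in RagFlow uses a richer feature set and different names):

```python
def continuity_features(up: dict, down: dict, avg_height: float) -> list[float]:
    """Toy feature vector for deciding whether `down` continues `up`."""
    # Geometric: normalized vertical gap and height ratio.
    gap = (down["top"] - up["bottom"]) / max(avg_height, 1e-6)
    height_ratio = up["height"] / max(down["height"], 1e-6)
    # Contextual: does the upper box end a sentence? Same page?
    ends_with_punct = float(up["text"].rstrip().endswith((".", "!", "?", "。", "！", "？")))
    same_page = float(up["page"] == down["page"])
    return [gap, height_ratio, ends_with_punct, same_page]
```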
After parsing completes, each text block (chunk) generated and inserted into ES contains five parts of information:
• The title of the document the chunk belongs to (title text and its tokenization).
• The chunk's text (content and tokenization; the content length is bounded by the user-specified maximum tokens, but the bound is soft and may be exceeded during merging).
• The embedding of the chunk text (used for subsequent vector-based similarity comparison).
• The image of the document page the chunk comes from.
• The chunk's coordinates within the page image.
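Put together, a chunk document before indexing might be shaped roughly like this (a simplified sketch: the field names loosely echo RagFlow's conventions but are not its exact schema):

```python
def build_chunk(doc_title, title_tokens, content, content_tokens,
                embedding, page_image_id, positions):
    """Assemble the five parts of a chunk into one dict for indexing."""
    return {
        "docnm_kwd": doc_title,          # title of the source document
        "title_tks": title_tokens,       # tokenized title
        "content_with_weight": content,  # raw chunk text
        "content_ltks": content_tokens,  # tokenized chunk text
        "q_vec": embedding,              # embedding for vector search
        "img_id": page_image_id,         # rendered page image the chunk came from
        "positions": positions,          # box coordinates within the page image
    }
```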
Looking at the whole PDF pipeline, a large number of small pre-trained models handle OCR, layout recognition, table structure recognition and so on, which genuinely earns the name **"DeepDoc"**. The trade-off is that PDF parsing is slower than in comparable applications and places real demands on hardware. In return, this series of processing steps does noticeably improve how much of a document's useful content is captured.
Other types of parsers
Other parsers are mostly the general parser with some steps adjusted or removed; the overall flow does not differ much. Two quick examples:
• Presentation parser: only performs image conversion and text extraction; each page becomes an independent chunk (with no token limit), and table parsing and complex merging are skipped.
• QA parser: performs the first four parsing steps, then matches questions and answers via regular expressions and emits each question-answer pair as a complete chunk (again with no token limit):
```python
QUESTION_PATTERN = [
    r"Question ([0一23456789十一百0-9]+)",
    r"Article ([0一23456789十一百0-9]+)",
    r"[\((]([0一23456789十一百]+)[\))]",
    r"Question ([0-9]+)",
    r"Article ([0-9]+)",
    r"([0-9]{1,2})[\. 、]",
    r"([0一23456789十一百]+)[ 、]",
    r"[\((]([0-9]{1,2})[\))]",
    r"QUESTION (ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN)",
    r"QUESTION (I+V?|VI*|XI|IX|X)",
    r"QUESTION ([0-9]+)",
]
```
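To see how such patterns are used, here is a minimal matcher over a subset of them (the helper function is illustrative, not RagFlow's actual code):

```python
import re

PATTERNS = [
    r"Question ([0-9]+)",
    r"([0-9]{1,2})[\. 、]",
]

def match_question(line: str):
    """Return the question number if the line starts a new question, else None."""
    for pat in PATTERNS:
        m = re.match(pat, line.strip())
        if m:
            return m.group(1)
    return None
```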
LLM auxiliary enhancement
After completing the original text block parsing process, RagFlow also supports further enhancement of the slicing process through LLM to improve the subsequent retrieval recall rate. The main functions include:
1. Automatic keyword extraction (auto_keywords)
The LLM automatically extracts keywords from each text block (the count is set by the `topn` configuration). The extracted keywords are written into the chunk's `important_kwd` (raw keywords) and `important_tks` (tokenized keywords) fields.
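A minimal sketch of this keyword-extraction step, assuming a generic `llm_chat` callable and an illustrative prompt (not RagFlow's actual prompt):

```python
def extract_keywords(llm_chat, chunk_text: str, topn: int = 5) -> list[str]:
    """Ask the LLM for up to `topn` keywords and parse a comma-separated reply."""
    prompt = (
        f"Extract at most {topn} keywords from the text below. "
        f"Reply with keywords only, comma-separated.\n\n{chunk_text}"
    )
    reply = llm_chat(prompt)
    keywords = [k.strip() for k in reply.split(",") if k.strip()]
    return keywords[:topn]
```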
2. Automatic question generation (auto_questions)
The LLM automatically generates questions that a text block could answer (the count is set by the `topn` configuration). The generated questions are written into the chunk's `question_kwd` (raw questions) and `question_tks` (tokenized questions) fields.
These additional fields are stored in ES alongside the chunk. During hybrid search at query time (keyword matching + vector similarity), keyword matching assigns different weights to different chunk fields (see below). This is exactly why these fields exist: they sharpen the keyword-matching stage. The retrieval process deserves its own article, so I won't expand on it here.
```python
self.query_fields = [
    "title_tks^10",
    "title_sm_tks^5",
    "important_kwd^30",
    "important_tks^20",
    "question_tks^20",
    "content_ltks^2",
    "content_sm_ltks",
]
```
3. RAPTOR recall enhancement strategy
When this strategy is enabled, after the original document is parsed, RagFlow additionally clusters, refines, and summarizes the generated chunks layer by layer (which can greatly increase the number of chunks per document). The general process:
1. Cluster the original set of chunks by vector similarity into groups (using a GMM).
2. Concatenate the text of all chunks in a group and summarize it into a new chunk with the LLM.
3. Repeat clustering and summarizing until only one group remains.
4. Return the original chunks plus all summary chunks produced along the way.
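The layer-by-layer loop above can be sketched as follows. Note the simplification: RagFlow clusters by embedding similarity with a GMM, while this toy version just pairs adjacent chunks; `summarize` stands in for the LLM call.

```python
def raptor(chunks: list[str], summarize) -> list[str]:
    """Summarize chunks layer by layer until one top-level summary remains,
    returning originals plus every intermediate summary."""
    all_chunks = list(chunks)
    layer = list(chunks)
    while len(layer) > 1:
        # Group (here: adjacent pairs) and summarize each group into a new chunk.
        layer = [summarize(" ".join(layer[i:i + 2])) for i in range(0, len(layer), 2)]
        all_chunks.extend(layer)
    return all_chunks
```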
There is also knowledge-graph enhancement (GraphRAG), which has been covered extensively elsewhere, so I won't go into it here. In practice, enabling LLM-based parsing enhancement (especially RAPTOR) does noticeably improve parsing quality, but it also greatly increases parsing time (and not by a little: with many large documents, the wait can drive you crazy), and if you connect to an external LLM it consumes a lot of tokens. Whether it's worth it depends on your business scenario.
Summary
RagFlow offers a wealth of configuration options for document slicing, covering almost all of the recent research results in the RAG field. In particular, its use of a series of deep-learning models for layout recognition, table structure analysis, and other specialized tasks during parsing effectively improves the quality of extracted content, arguably state of the art among open source RAG systems. But precisely because there are so many options, you need to choose carefully based on your documents' content and form: blind configuration not only makes parsing extremely slow, it may not actually help. I hope this article helps you configure and use RagFlow better.