Application of multimodal models in RagFlow

Written by Caleb Hayes
Updated on: June 13, 2025

New in RagFlow 0.19.0: a multimodal model assists document parsing.

Core content:
1. RagFlow introduces a multimodal (image2text) model to improve image parsing
2. How the image2text model is applied to PDF document content extraction
3. The detailed rules and prompts used to extract text from images


In the latest version of RagFlow (0.19.0), in order to improve how the various images in documents are parsed, a multimodal model (image2text) has been introduced to enhance the parsing of image content. Let's analyze the relevant process in detail.

First, you need to configure an image2text model under the current tenant (there is a pitfall here, which will be discussed later). The image2text model is used in three main scenarios during RagFlow's document parsing. Let's look at them one by one:

PDF document content extraction

If an image2text model is configured, then on the knowledge base configuration page the PDF parser can select your image2text model in addition to the original DeepDoc and naive options.


The whole parsing process is relatively simple: every PDF page is converted into an image, image2text is called to extract the text, and then the usual post-processing (such as tokenization) follows. It does not do the deeper analysis of layout, tables, and so on that DeepDoc performs. It is worth noting the prompt used to extract text from the images:
INSTRUCTION:
Transcribe the content from the provided PDF page image into clean Markdown format.
- Only output the content transcribed from the image.
- Do NOT output this instruction or any other explanation.
- If the content is missing or you do not understand the input, return an empty string.

RULES:
1. Do NOT generate examples, demonstrations, or templates.
2. Do NOT output any extra text such as 'Example', 'Example Output', or similar.
3. Do NOT generate any tables, headings, or content that is not explicitly present in the image.
4. Transcribe content word-for-word. Do NOT modify, translate, or omit any content.
5. Do NOT explain Markdown or mention that you are using Markdown.
6. Do NOT wrap the output in ```markdown or ``` blocks.
7. Only apply Markdown structure to headings, paragraphs, lists, and tables, strictly based on the layout of the image. Do NOT create tables unless an actual table exists in the image.
8. Preserve the original language, information, and order exactly as shown in the image.

As you can see, the main function of this prompt is essentially OCR: it prohibits the large model from further refining or processing the information and only asks for a word-for-word transcription. I personally think there is still a lot of room for optimization here. The current prompt does not make full use of a multimodal model's reasoning and summarization abilities to integrate and filter the information in the image; the model is only used as a more capable OCR engine, so the actual benefit is relatively limited. However, this is currently only an experimental feature, and we can look forward to subsequent updates and optimizations.
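
To make the flow concrete, here is a minimal sketch of the idea (not RagFlow's internal code): each PDF page is rendered to an image and sent, together with the transcription prompt above, to an image2text model behind an OpenAI-compatible endpoint. PyMuPDF, the endpoint URL, and the model name are all illustrative assumptions.

# Minimal sketch, not RagFlow's implementation: rasterize PDF pages and ask an
# image2text model (served via an OpenAI-compatible API) to transcribe each page.
import base64
import fitz  # PyMuPDF, assumed available
from openai import OpenAI

TRANSCRIBE_PROMPT = "Transcribe the content from the provided PDF page image into clean Markdown format. ..."  # the full prompt shown above

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")  # hypothetical endpoint

def pdf_to_markdown(path, model="qwen2-vl-7b-instruct"):  # model name is a placeholder
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            png = page.get_pixmap(dpi=150).tobytes("png")  # render the page as an image
            b64 = base64.b64encode(png).decode()
            resp = client.chat.completions.create(
                model=model,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": TRANSCRIBE_PROMPT},
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    ],
                }],
            )
            pages.append(resp.choices[0].message.content or "")  # one Markdown string per page
    return pages

The transcribed pages would then go through the normal tokenization and chunking steps.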

In addition, when the image2text model is used to parse a PDF, the image corresponding to each chunk is not stored with the chunk result. As a result, only the text can be retrieved, without the corresponding image. In theory the PDF has already been converted into images, so this could be kept consistent with the DeepDoc parsing flow; it is another point that could be optimized.

Extraction of chart information from documents

Chart information in documents has always been difficult to extract and retrieve accurately, even though the data in charts is often very important; this has been a pain point for RAG applications. The new version of RagFlow finally addresses it by using a large multimodal model to extract the information and data hidden in charts. By default the user does not need to make any explicit configuration: as long as an image2text model exists for the current tenant and the PDF parser is DeepDoc, the image2text model is automatically called to try to extract information whenever a chart is recognized during parsing. The relevant code is as follows:

if layout_recognizer == "DeepDOC":
    pdf_parser = Pdf()

    try:
        vision_model = LLMBundle(kwargs["tenant_id"], LLMType.IMAGE2TEXT)
        callback(0.15, "Visual model detected. Attempting to enhance figure extraction...")
    except Exception:
        vision_model = None

    if vision_model:
        # pdf_parser splits the content into three parts according to the layout recognition
        # results: ordinary sections, tables and figures; the figures are then handed to
        # VisionFigureParser for further information extraction
        sections, tables, figures = pdf_parser(filename if not binary else binary, from_page=from_page, to_page=to_page, callback=callback, separate_tables_figures=True)
        callback(0.5, "Basic parsing complete. Proceeding with figure enhancement...")
        try:
            pdf_vision_parser = VisionFigureParser(vision_model=vision_model, figures_data=figures, **kwargs)
            boosted_figures = pdf_vision_parser(callback=callback)
            tables.extend(boosted_figures)
        except Exception as e:
            callback(0.6, f"Visual model error: {e}. Skipping figure parsing enhancement.")
            tables.extend(figures)
    else:
        sections, tables = pdf_parser(filename if not binary else binary, from_page=from_page, to_page=to_page, callback=callback)

    res = tokenize_table(tables, doc, is_english)
    callback(0.8, "Finish parsing.")

There is a tricky part here. The code determines whether the current tenant has a vision_model by checking whether img2txt_id is set in the tenant table. I am not sure whether it is because my environment was upgraded from an older version or some other problem, but even after creating an image2text model for the current tenant, this field stayed empty. Only after manually editing the tenant table and setting the field to the name of the image2text model I created could the enhanced chart parsing be used normally, as sketched below.
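
For reference, this is the kind of manual fix I applied, sketched with pymysql. The tenant table and img2txt_id column are the ones mentioned above; the connection details, tenant id, and model name are placeholders for your own environment (back up the table before touching it).

# Hedged workaround sketch: manually set img2txt_id for the current tenant so that
# the chart enhancement path detects a vision model. All credentials and ids below
# are placeholders.
import pymysql

conn = pymysql.connect(host="<mysql-host>", user="<user>", password="<password>",
                       database="rag_flow")  # adjust the database name to your deployment
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE tenant SET img2txt_id = %s WHERE id = %s",
            ("<your-image2text-model-name>", "<your-tenant-id>"),
        )
    conn.commit()
finally:
    conn.close()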

Let's take a look at the prompt used for chart enhancement:

You are an expert visual data analyst. Analyze the image and provide a comprehensive description of its content. Focus on identifying the type of visual data representation (e.g., bar chart, pie chart, line graph, table, flowchart), its structure, and any text captions or labels included in the image.

Tasks:
1. Describe the overall structure of the visual representation. Specify if it is a chart, graph, table, or diagram.
2. Identify and extract any axes, legends, titles, or labels present in the image. Provide the exact text where available.
3. Extract the data points from the visual elements (e.g., bar heights, line graph coordinates, pie chart segments, table rows and columns).
4. Analyze and explain any trends, comparisons, or patterns shown in the data.
5. Capture any annotations, captions, or footnotes, and explain their relevance to the image.
6. Only include details that are explicitly present in the image. If an element (e.g., axis, legend, or caption) does not exist or is not visible, do not mention it.

Output format (include only sections relevant to the image content):
- Visual Type: [Type]
- Title: [Title text, if available]
- Axes / Legends / Labels: [Details, if available]
- Data Points: [Extracted data]
- Trends / Insights: [Analysis and interpretation]
- Captions / Annotations: [Text and relevance, if available]

Ensure high accuracy, clarity, and completeness in your analysis, and include only the information present in the image. Avoid unnecessary statements about missing elements.

The effect is quite noticeable. For the same picture, if chart enhancement is not applied, by default only OCR is performed to extract the text:


2012 Sales price trend chart 120 103 100 S0 60 40 Sales price Average price Maximum Minimum 20 January February March April May June July August September October November December

After chart enhancement is turned on, essentially all of the information in the chart can be extracted:

- Visual Type: Line Graph
- Title: 2012 Sales Price Chart
- Axes / Legends / Labels:
  - X-axis: Months (January to December)
  - Y-axis: Sales Price
  - Legend:
    - Black solid line: Sales Price
    - Green dashed line: Average Price
    - Red dots: Maximum/Minimum
- Data Points:
  - Sales Price:
    - January: 41
    - February: 82
    - March: 79
    - April: 32
    - May: 41
    - June: 75
    - July: 67
    - August: 49
    - September: 76
    - October: 105
    - November: 95
    - December: 103
  - Average Price: Not explicitly marked with values but implied by the green dashed line.
  - Maximum/Minimum:
    - January: 41
    - February: 82
    - March: 79
    - April: 32
    - May: 41
    - June: 75
    - July: 67
    - August: 49
    - September: 76
    - October: 105
    - November: 95
    - December: 103
- Trends / Insights:
  - The sales price shows significant fluctuations throughout the year.
  - There is a peak in February (82) and another peak in October (105).
  - The lowest point occurs in April (32).
  - The trend generally increases from January to October before decreasing slightly in November and then rising again in December.
- Captions / Annotations:
  - The red dots indicate the maximum and minimum sales prices for each month.
  - The black solid line represents the sales price over time.
  - The green dashed line represents the average price, although specific values are not provided.

2012 sales price trend chart 120 103 100 S0 60 40 一Sales price Average price Maximum Minimum 20 January February March April May June July August September October November December

Parsing of image files

When parsing image files, the multimodal model is also used by default to improve the results. The core code is in rag/app/picture.py; the key points are noted in my comments:

def chunk(filename, binary, tenant_id, lang, callback=None, **kwargs):
    img = Image.open(io.BytesIO(binary)).convert('RGB')
    doc = {
        "docnm_kwd": filename,
        "title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename)),
        "image": img,
        "doc_type_kwd": "image"
    }
    # First, use OCR to extract text from the image
    bxs = ocr(np.array(img))
    txt = "\n".join([t[0] for _, t in bxs if t[0]])
    eng = lang.lower() == "english"
    callback(0.4, "Finish OCR: (%s ...)" % txt[:12])
    # If the extracted text is longer than 32 words (English) or 32 characters, the
    # multimodal model is skipped and the OCR result is returned directly.
    # This is probably a performance trade-off: if OCR already yields enough
    # information, the large model is not called for parsing.
    if (eng and len(txt.split()) > 32) or len(txt) > 32:
        tokenize(doc, txt, eng)
        callback(0.8, "OCR results is too long to use CV LLM.")
        return [doc]
    try:
        callback(0.4, "Use CV LLM to describe the picture.")
        cv_mdl = LLMBundle(tenant_id, LLMType.IMAGE2TEXT, lang=lang)
        img_binary = io.BytesIO()
        img.save(img_binary, format='JPEG')
        img_binary.seek(0)
        # Describe the image with the large model. Note that the input here is only
        # the image, without any additional prompt
        ans = cv_mdl.describe(img_binary.read())
        callback(0.8, "CV LLM respond: %s ..." % ans[:32])
        txt += "\n" + ans
        tokenize(doc, txt, eng)
        return [doc]
    except Exception as e:
        callback(prog=-1, msg=str(e))

    return []
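
As a quick usage sketch (assuming it runs inside a RagFlow environment where the module's dependencies such as ocr, rag_tokenizer and LLMBundle are available), the function can be called directly with an image file and a simple progress callback; the file name and tenant id below are placeholders.

# Hypothetical invocation of the chunk() above; file name and tenant id are placeholders.
from rag.app.picture import chunk

def progress(prog=None, msg=""):
    # matches both callback(0.4, "...") and callback(prog=-1, msg="...") call styles
    print(prog, msg)

with open("sales_chart.png", "rb") as f:
    docs = chunk("sales_chart.png", f.read(), tenant_id="<your-tenant-id>",
                 lang="English", callback=progress)

if docs:
    print(docs[0]["doc_type_kwd"], docs[0]["docnm_kwd"])  # fields set in the excerpt above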

In general, as multimodal models keep improving, even models with relatively small parameter counts (3B, 7B, etc.) have made significant progress in image understanding, which keeps expanding their application scenarios. Introducing multimodal models into the RAG pipeline can effectively solve the long-standing problem of extracting useful information from images, avoid the complexity of chaining multiple small models for OCR, layout analysis, and text merging, and further improve retrieval quality. There are still issues in large-scale deployments, such as performance, latency, and how to accurately extract business-relevant information from images, and plenty of room for optimization remains, but I believe these problems will be solved gradually over time. End-to-end extraction of image information with multimodal large models, and eventually direct semantic indexing and retrieval of images, is the direction of future development.