Adobe launches MDocAgent, a multi-agent, cross-modal framework: complex document understanding performance rises by 12% and the error rate drops by 21%

Adobe's breakthrough AI technology, the MDocAgent multi-agent framework, achieves a new level of document understanding.
Core content:
1. Existing challenges and limitations of complex document understanding
2. Innovative design of the MDocAgent multi-agent framework
3. Cross-modal information fusion and document question-answering performance improvement
Document Q&A is hard
When answering questions over long documents that mix rich text with visual elements (such as charts and images), existing solutions fall short:
Traditional large language models (LLMs) can only process textual information. Large Vision-Language Models (LVLMs) can handle visual content, but they are often inefficient on long documents and struggle to effectively fuse and reason over textual and visual information. Existing retrieval-augmented generation (RAG) methods can extract key information from long documents, but they usually retrieve with only a single modality (text or image) and lack the ability to integrate information across modalities.
MDocAgent
5 Agents
MDocAgent introduces multiple specialized agents to collaboratively process text and image information to achieve a deep understanding of document content. Specifically, MDocAgent contains the following five agents:
1. General Agent: performs the initial integration of multimodal information and provides a basis for subsequent analysis.
2. Critical Agent: identifies and extracts the textual and visual information that is critical to answering the question and provides guidance to the other agents.
3. Text Agent: focuses on the textual information, extracting question-relevant details from the text.
4. Image Agent: focuses on the visual information, extracting question-relevant details from the images.
5. Summarizing Agent: combines the outputs of all agents to generate the final answer.
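To make the division of labor concrete, here is a minimal sketch of the five roles as system prompts. This is an illustration only: the prompt wording, the AgentSpec class, and the AGENTS list are assumptions made for this sketch, not names or prompts from the paper.

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    """One agent role: a name plus the system prompt that defines its job."""
    name: str
    system_prompt: str

# Hypothetical prompts paraphrasing the roles described above.
AGENTS = [
    AgentSpec("general",
              "Read the retrieved text passages and page images and draft a preliminary answer."),
    AgentSpec("critical",
              "Identify the text snippets and visual elements that are critical to answering the question."),
    AgentSpec("text",
              "Using the critical text clues, analyze the text passages and answer in detail."),
    AgentSpec("image",
              "Using the critical visual clues, analyze the page images and answer in detail."),
    AgentSpec("summarizing",
              "Combine all agents' answers, reconcile differences, and produce the final answer."),
]

if __name__ == "__main__":
    for spec in AGENTS:
        print(f"{spec.name}: {spec.system_prompt}")
```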
MDocAgent Architecture
MDocAgent achieves a comprehensive understanding of document content through the collaboration of its specialized, multimodal agents. The framework not only handles the independent analysis of text and image information but also emphasizes fusing and reasoning over cross-modal information. By chaining document preprocessing, multimodal context retrieval, key information extraction, specialized agent processing, and answer synthesis, MDocAgent can locate and integrate key information in complex documents and generate accurate answers.
Document preprocessing: converts the document into a format suitable for subsequent analysis. For each page, OCR is used to recognize the text rendered in the page image, and PDF parsing is used to extract the digitized text. The extracted text is represented as a sequence of text passages, each containing part or all of a page's text. The original image of each page is also retained for later visual analysis.
Multimodal context retrieval: retrieves the text and image information most relevant to the question. ColBERT indexes the document's text passages and retrieves the passages most relevant to the question. ColPali processes the document's page images, generates visual embeddings, and retrieves the most relevant pages. By comparing the relevance scores of text and images, the most relevant text passages and page images are selected as context for the subsequent analysis. Combining text and image retrieval gives the downstream agents rich contextual information.
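A minimal sketch of the retrieval step described above, assuming two hypothetical scoring helpers (score_text for ColBERT-style passage scores and score_pages for ColPali-style page scores). The real libraries' APIs differ; retrieve_context, the parameter names, and top_k are choices made for this sketch only.

```python
from typing import Callable, List, Tuple

def retrieve_context(
    question: str,
    passages: List[str],           # OCR + PDF-parsed text, one or more passages per page
    page_images: List[bytes],      # original page renderings kept for visual analysis
    score_text: Callable[[str, List[str]], List[float]],    # e.g. a ColBERT wrapper
    score_pages: Callable[[str, List[bytes]], List[float]], # e.g. a ColPali wrapper
    top_k: int = 4,
) -> Tuple[List[str], List[bytes]]:
    """Return the top-k passages and top-k page images most relevant to the question."""
    text_scores = score_text(question, passages)
    image_scores = score_pages(question, page_images)

    top_text = sorted(range(len(passages)), key=lambda i: text_scores[i], reverse=True)[:top_k]
    top_pages = sorted(range(len(page_images)), key=lambda i: image_scores[i], reverse=True)[:top_k]

    return [passages[i] for i in top_text], [page_images[i] for i in top_pages]
```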
Key information extraction: extracts, from the retrieved context, the information that is crucial to answering the question. The General Agent performs a preliminary analysis of the retrieved text and images and produces a preliminary answer. The Critical Agent then analyzes this information further and distills the textual and visual clues that are crucial to the answer. These clues are passed to the specialized agents to guide their analysis, which improves the efficiency and accuracy of the system.
Specialized agent processing: conducts in-depth analysis of the extracted key information. The Text Agent receives the key textual clues and the relevant text context and generates a detailed text-based answer. The Image Agent receives the key visual clues and the relevant page images and generates a detailed vision-based answer. The two agents examine the question from the textual and visual perspectives respectively, and their answers feed into the answer-synthesis stage, providing a rich basis for the final answer.
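The two stages above can be sketched as a pair of functions. Everything here is an assumption for illustration: call_llm is a placeholder for whichever LLM/LVLM client is used, and the prompts merely paraphrase the roles; none of these names come from the paper.

```python
from typing import List, Optional, Tuple

def call_llm(system: str, user: str, images: Optional[List[bytes]] = None) -> str:
    """Placeholder for an LLM/LVLM call; plug in your own client here."""
    raise NotImplementedError

def extract_key_information(question: str, passages: List[str], pages: List[bytes]) -> Tuple[str, str]:
    """General Agent drafts an answer; Critical Agent distills the crucial clues."""
    draft = call_llm(
        "Draft a preliminary answer from the retrieved context.",
        f"Question: {question}\nText passages: {passages}",
        images=pages,
    )
    clues = call_llm(
        "List the text clues and visual clues that are critical to answering the question.",
        f"Question: {question}\nPreliminary answer: {draft}\nText passages: {passages}",
        images=pages,
    )
    return draft, clues

def run_specialized_agents(question: str, clues: str, passages: List[str], pages: List[bytes]) -> Tuple[str, str]:
    """Text Agent answers from the text context; Image Agent answers from the page images."""
    text_answer = call_llm(
        "Answer using only the text context and the critical text clues.",
        f"Question: {question}\nClues: {clues}\nText passages: {passages}",
    )
    image_answer = call_llm(
        "Answer using only the page images and the critical visual clues.",
        f"Question: {question}\nClues: {clues}",
        images=pages,
    )
    return text_answer, image_answer
```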
Answer synthesis: integrates the outputs of all agents into the final answer. The Summarizing Agent receives the answers from the General Agent, the Text Agent, and the Image Agent, identifies their commonalities, differences, and complementary information, and, on that basis, generates a comprehensive answer that accounts not only for the text and image information but also for the relationship between them.
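Finally, a minimal sketch of the Summarizing Agent, again with an assumed call_llm placeholder and illustrative prompt wording rather than the paper's actual prompts.

```python
def call_llm(system: str, user: str) -> str:
    """Placeholder for an LLM call; plug in your own client here."""
    raise NotImplementedError

def synthesize_answer(question: str, general: str, text_answer: str, image_answer: str) -> str:
    """Reconcile the agents' answers (commonalities, differences, complements) into one final answer."""
    return call_llm(
        "You receive three candidate answers. Identify agreements, conflicts, and "
        "complementary details, then produce a single, well-supported final answer.",
        f"Question: {question}\n"
        f"General Agent: {general}\n"
        f"Text Agent: {text_answer}\n"
        f"Image Agent: {image_answer}",
    )
```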
MDocAgent vs. M3DocRAG and ColBERT + Llama-3.1-8B
Case 1
The task is to compare the counts of two Latino groups in the document: foreign-born Latinos and Latinos interviewed via cell phone. The document contains the relevant text descriptions and tabular data, but the information is scattered across different locations and has to be extracted and integrated from both the text and the images.
Retrieval phase: ColBERT and ColPali both retrieved the pages containing the relevant information, but retrieving the pages alone is not enough; the specific content on those pages still has to be analyzed. ColBERT + Llama-3.1-8B, relying only on text, failed to parse the numerical data correctly and wrongly concluded that “there are more foreign-born Latinos.” M3DocRAG combines text and image information but still answered incorrectly because it lacks fine-grained key-information extraction and cross-modal integration. MDocAgent avoids these single-modality limitations by using both text and images through multimodal context retrieval.
Preliminary analysis and key information extraction: the General Agent produced a preliminary but vague answer, noting that “more Latinos were interviewed via cell phone.” The Critical Agent identified the critical information: “Foreign Born (Excluding Puerto Rico)” in the text and the “Cell Phone Sampling Frame” table in the image.
Specialized agent processing: guided by the Critical Agent's clues, the Text Agent extracted from the text that the number of foreign-born respondents (excluding Puerto Rico) is 795, and the Image Agent extracted from the table that 1051 people were interviewed via cell phone.
Answer synthesis: the Summarizing Agent combined the agents' outputs into the final answer: “The number of Latinos interviewed via phone (1051) is greater than the number of foreign-born Latinos (795).”
Case 2
The task is to identify which reason in a list corresponds to the only image that does not contain a person. The document contains a list of reasons for NTU's smart campus, but the list is not clearly numbered, and each reason is accompanied by an image.
Retrieval phase: ColBERT failed to retrieve the correct evidence page, so ColBERT + Llama-3.1-8B could not answer the question. ColPali retrieved the page containing the evidence, but M3DocRAG still answered incorrectly because it lacks fine-grained key-information extraction and cross-modal integration. MDocAgent located the page containing the key information through multimodal retrieval.
Preliminary analysis and key information extraction: the General Agent produced a preliminary answer but could not pinpoint the right item. The Critical Agent identified the key text clue “Most Beautiful Campus” and the corresponding visual element (an image of the NTU campus).
Specialized agent processing: guided by the Critical Agent's clues, the Text Agent tried to find the relevant information in the text but could not answer directly because the list has no explicit numbering. The Image Agent used the key clues to correctly identify the people-free NTU campus image as the answer.
Answer synthesis: the Summarizing Agent combined the agents' outputs and determined the answer to be “Most Beautiful Campus”, noting that the image for this reason contains no people.
Case 3
The task is to identify Professor Lebour's degree from the document. The relevant text descriptions and images are scattered across different locations and have to be extracted and integrated from both the text and the images.
Retrieval phase: ColBERT retrieved the page containing the relevant information, but ColBERT + Llama-3.1-8B misread “FGS” as a degree and produced an inaccurate answer. ColPali failed to retrieve the correct page, so M3DocRAG could not answer the question. MDocAgent located the page containing the key information through multimodal retrieval.
Preliminary analysis and key information extraction: the General Agent produced a preliminary answer but also misidentified “FGS” as a degree. The Critical Agent identified the key text clue “MA” and extracted the relevant visual clues from the image.
Specialized agent processing: guided by the Critical Agent's clues, the Text Agent extracted “GA Lebour, MA, FGS” from the text and confirmed that “MA” is a degree. The Image Agent confirmed that the image provides no additional degree information but supports the “MA” in the text.
Answer synthesis: the Summarizing Agent combined the agents' outputs and determined the answer to be “Prof. Lebour holds a Master of Arts (MA) degree.”