SuperRAG: Layout-aware graph modeling beyond RAG

Written by

Jasper Cole

Updated on:July-09th-2025

This paper introduces layout-aware graph modeling for multimodal RAG. Unlike traditional RAG methods that mainly deal with flat text blocks, the proposed method considers the relations between multimodalities by using graph structures. To this end, a graph modeling structure is defined based on document layout parsing. The structure of the input document is preserved through the connection of text blocks, tables and graphs. This representation allows the method to handle complex problems that require information from multiple modalities. To confirm the efficiency of graph modeling, a flexible RAG pipeline is developed using powerful components. Experimental results on four benchmark test sets confirm the contribution of layout-aware modeling to the performance improvement of the RAG pipeline.

Retrieval-augmented generation (RAG) is an emerging paradigm that improves the reasoning ability of large language models (LLMs) by providing them with additional context to reduce their hallucination phenomena. This approach has received extensive attention in recent years due to its significant effect in enhancing the capabilities of LLMs. In this field, graph-based RAG methods have emerged, which further improve system performance and interpretability by introducing structured knowledge.

Unlike traditional RAG methods that directly process raw data as independent text blocks, graph-based RAG methods innovatively represent input data as a graph structure that considers the relationship between text blocks. Although existing RAG pipelines perform well in text modality, they still face major challenges when processing multimodal inputs. This is mainly due to two key factors: first, input documents usually contain diverse layout, structure, and multimodal information, which need to be effectively captured in the RAG pipeline, among which layout information plays an important role in LLM understanding documents; second, input questions often require the integration of information from different modalities. For example, when answering questions such as "Please list the standard steps for creating teaching materials for Internet navigation software", it may be necessary to refer to both flowcharts and text content.

This paper proposes a novel graph-based RAG solution that effectively addresses the above challenges in multimodal question answering. The solution consists of four core steps: document parsing, data modeling, advanced information retrieval, and reasoning. In the document parsing stage, the system is able to handle multiple input types and integrate internal and third-party readers. In the data modeling stage, a knowledge graph (KG) that retains the document layout and structure is innovatively introduced. This layout-aware representation method significantly improves the performance of the information retrieval (IR) step. By combining data modeling in the form of KG with full-text search and vector search, the system constructs an advanced IR module.

The innovation of this study is mainly reflected in three aspects: first, a new layout-aware graph modeling (LAGM) structure is proposed to represent the input document of RAG, which effectively retains the document layout information; second, the most advanced powerful technology is integrated to build a unified RAG pipeline; finally, the experimental results on public benchmark datasets show that the proposed SuperRAG method has achieved significant improvement compared with other strong RAG baselines. In addition, the study also provides a RAG pipeline system for user experience.

RAG (Retrieval-Augmented Generation) is an innovative approach designed to enable large language models (LLMs) to fill knowledge gaps and reduce hallucinations. By retrieving relevant information from external knowledge sources, RAG can help LLMs generate more accurate and reliable answers. This approach has shown significant results in multiple tasks, including code generation, domain-specific question answering, and open-domain question answering.

Graph-based RAG methods further extend this paradigm by using graph structures to capture the relationships between concepts. Graph structures have been widely used in various scenarios, such as building knowledge graphs, processing long context information, and integrating multimodal data. Graph structures have also been used to improve the quality of RAG in different ways, such as through hyper-relational knowledge graphs, graph-based agents for processing long context, knowledge graph summarization, and graph neural networks. However, most existing studies focus on text modalities, and relatively little attention has been paid to multimodal data.

This study follows the direction of building multimodal knowledge graphs and proposes a new layout-aware graph modeling (LAGM) approach. Compared with previous work, SuperRAG places special emphasis on structural granularity and document layout analysis, introduces a modern and general data model, and combines the table of contents (ToC) and main section information to improve the retrieval effect of large documents. These innovations not only preserve the structure of the document, but also significantly improve the accuracy and efficiency of retrieval. In addition, our method enhances the processing ability of diverse document types through an internal reader, not just limited to the text structure of PDF files.

Layout-Aware Graph Modeling (LAGM)

Layout-aware graph modeling aims to efficiently represent input documents while preserving their original layout and structure. This approach stems from the need to enhance the understandability and manageability of property graphs, especially in applications involving multimodal and complex data. For example, when a query requires extracting information from a table or chart, the RAG pipeline needs to be clear about the section or subsection to which these contents belong.

Document layout parsing

The first step in building LAGM is to parse input documents of different modalities, including text, tables, charts, and images, using specialized readers. This step outputs a structured format that lays the foundation for graph creation. We combine enhancements from our in-house document parser and Azure Document Intelligence (DI) to ensure robust handling of diverse layouts.

Internal document parser
Our internal parser is designed as a modular pipeline that processes each page of a document independently. It starts with a loading layer for format conversion and preprocessing, followed by an AI model to extract layout, table structure, OCR text, and chart content. The processed data goes through post-processing steps such as reading order sorting and relationship extraction, and is finally output in JSON or Markdown format.

The key components of the internal parser include document layout analysis (DLA), reading order detection, table structure recognition, and chart classification. The DLA module is pre-trained on the DocLayNet dataset and further fine-tuned using a large number of internally annotated PDF pages, enabling the model to recognize 9 different layout tags such as titles, tables, and charts. This design ensures the system's efficiency and accuracy when processing complex documents.

Figure 1: Pipeline of the internal parser. For reading order detection, the parser adopts the method proposed by Wang et al. (2021) to extract natural reading sequences using 5010 annotated document images. Table structure recognition is implemented using an internal library to accurately recognize various table formats. Finally, chart and table classification relies on a curated dataset to classify tables into subtypes (e.g., full lines, borderless) and charts into specific types (e.g., charts, diagrams), ensuring accurate extraction of visual elements. Table 1 reports

Comparison of the internal reader with other strong read methods. NID stands for Normalized Insertion Deletion Distance for Layout and Sequential Reads. TEDS is a tree edit distance similarity based text and table structure identification. TEDS-S is a tree edit distance similarity based structure for table structure identification only. We can observe that the internal reader achieves competitive results, which is good for implementing a practical RAG pipeline.

Enhance PDF parsing with Azure DI

Azure DI enhances the parser’s capabilities by excelling in section heading and paragraph detection. It supports both searchable and non-searchable PDFs and helps in creating a table of contents (ToC). To generate the ToC, we use the tables, chapters, and figures output by Azure DI to do the following: (1) Match physical and printed page numbers. (2) Detect ToC based on keywords. (3) Replace printed page numbers with physical page numbers. This integration ensures superior layout-aware diagram modeling and improves ToC generation for structured navigation.

Data Modeling

After parsing, each document page can be decomposed into titles, headers, chapters, text blocks, tables, and charts. The data modeling step aims to create a granular level design for the property graph. Figure 2 shows the definition of LAGM.

Figure 2: Knowledge graph used for data modeling. The company node is the root node, representing the overall entity or corpus, such as a company, and captures metadata such as the company's name. Each document node is linked to a company and represents a single document, with attributes such as document name, type, and path.

The document is linked to the Page node, which represents each page and includes properties such as page index, header, footer, and text content. The Table of Contents node is also linked to the document, provides an overview of the document's structure, and is linked to the Main Section node. The Main Section organizes the content hierarchically and is linked to the Section, Table, and Figure nodes.

Chapter nodes represent logical divisions in a document and include properties such as chapter title and content. Chapters are connected sequentially via “has_next” relationships, ensuring the flow of content. They can also be linked to more fine-grained SectionChunk nodes, capturing the text under the chapter. Table nodes represent tabular data, and graph nodes represent visual elements, providing additional structure. Tables may be further connected to TableChunk nodes, which are used to store text content within the table. These explicit “is_under” and “has_next” relationships reflect the natural hierarchy and flow of documents. This design supports layout-aware graph modeling and efficient information retrieval, enhancing applications like the RAG pipeline by enabling precise navigation and knowledge extraction.

SuperRAG Frame

Based on Layout-Aware Graph Modeling (LAGM), we propose an advanced retrieval enhancement framework that combines a large language model (LLM) and a heuristic-driven approach to achieve flexible and efficient retrieval. This framework significantly enhances the performance of RAG-based pipelines by improving application adaptability and scalability.

LLM-based graph traversal

The approach leverages a large language model (LLM) for context-aware graph traversal. By taking a graph schema (as shown in Figure 2) as input, the LLM dynamically generates Cypher queries, enabling intelligent and relation-driven retrieval. This approach is particularly well suited for processing complex multimodal data as well as complex document structures encoded in graphs. Detailed information about the LLM hint can be found in the Appendix.

Heuristic-based retrieval

To complement LLM-based approaches, the framework uses table of contents, tables, and figures as heuristics for information retrieval (IR) enhancement. In terms of table of contents processing, the framework combines the structured output of LLM with hint engineering (as shown in Figure 4) and uses heuristics to extract the table of contents for indexing. This is because the table of contents contains important structured information during the indexing process. During the retrieval process, the system calculates the semantic similarity score between the section title and the query to achieve targeted content retrieval. In addition, with a small number of example hints, LLM is able to directly extract relevant pages based on a given query.

In terms of table processing, the framework uses the DETR model for table detection and recognition, and then reconstructs the table structure through the OCR engine to ensure that the table content is accurately captured and retrieved in the SuperRAG pipeline. For chart processing, the system uses an OCR model to extract text from charts, and inputs image and text information into a multimodal LLM (such as GPT-4) to better interpret the chart content. This approach achieves context-aware understanding of visual elements, ensuring that charts are better integrated in the retrieval and reasoning process. These heuristic methods show high efficiency, robustness, and significant effects when processing structured content.

Comparison and Advantages

The dual design of the SuperRAG framework balances flexibility and efficiency. LLM-based traversal performs well in unstructured and exploratory tasks, while heuristic methods provide predictable performance for high-throughput systems. The two complement each other to build a scalable and adaptive RAG pipeline that fully utilizes the graph structure to achieve optimal retrieval. This combination not only improves the overall performance of the system, but also provides strong support for processing complex multimodal data.

Graph Enhancement

To further enrich LAGM (Language-Augmented Graph Model), we introduce K-Nearest Neighbors (KNN) (Cover & Hart, 1967) as a graph augmentation technique that aims to establish new "is_similar" relationships between nodes in the graph structure. The KNN algorithm achieves this goal by calculating the similarity of node attributes, where the choice of similarity metric (such as cosine similarity, Jaccard similarity coefficient, or Euclidean distance, etc.) depends on the specific data type. In addition, we also generate "has_stem" relationships through synonym expansion and stemming techniques to establish connections between term nodes representing related concepts.

application

Figure 3 shows the overall processing flow of LAGM, which integrates multiple retrievers and rerankers, and combines heuristic graph traversal, similarity search, and language model-based technologies to achieve efficient retrieval and ranking functions. The process has the following notable features: first, it uses graph representation to integrate contextual information across pages; second, for documents containing structured information, the system is equipped with a dedicated table of contents (TOC) retriever to improve the contextual quality of specific queries; in addition, the process also uses a graph expansion mechanism to handle queries that need to extract information from tables and graphs, and optimizes retrieval results through a self-reflection layer.

Figure 3: The proposed SuperRAG framework. It evaluates whether the query intent requires tabular or graphical information. It selectively integrates these elements only when they contribute to more accurate answers, reducing the retrieval of irrelevant content. It is worth noting that LAGM is pipeline-agnostic and can be integrated into any RAG pipeline.

For SPIQA, SuperRAG performs well in all three test sets, especially in the graph- and table-based question answering tasks. In Test-A, it achieves the highest average accuracy (59.9%) and a remarkable 63.5% on table-based questions, 7% higher than the best baseline. For Test-B, SuperRAG leads again with an average accuracy of 63.2%, surpassing the strongest baseline Claude3.5 Sonet (49.5%). It achieves 66.2% on graph-based tasks and 58.9% on table-based tasks, demonstrating a balanced advantage across modalities. In Test-C, SuperRAG achieves an overall accuracy of 57.2%, with outstanding performance on graphs (58.2%) and tables (56.7%). In contrast, the runner-up Claude-3.5 Sonnet only achieves 46.0%, a gap of 12.2%. These results highlight SuperRAG's ability to effectively handle multimodal inputs even when competing with enterprise systems.

This paper introduces layout-aware graph modeling for multimodal data construction of RAG. The modeling takes into account the structure of the input document to build a graph containing the relationships between text blocks, tables, and charts. A RAG pipeline is also developed to confirm the effectiveness of the modeling. Experimental results on four public test sets show two important points. First, layout-aware modeling is beneficial for improving the performance of RAG compared to non-layout-aware and other strong RAG pipelines. Second, the designed RAG pipeline is flexible, and adding more complex RAG-related components can improve the performance of the system. The modeling and RAG pipeline are practical in commercial scenarios. ## Limitations

First, our approach is highly dependent on accurate document layout parsing and high-quality data modeling. If these components are not aligned or if document structure extraction tools are limited, the effectiveness of the pipeline may be reduced. In particular, noisy layouts or document structure variations in different domains may affect the quality of information retrieval (IR) and, in turn, the reasoning performance of the pipeline. In addition, incorporating tables, diagrams, and non-text elements into a coherent graph structure may increase the computational overhead and make the pipeline resource-intensive. This may affect scalability, especially in real-world applications that require high throughput or have limited computational resources.