Tsinghua pioneers multimodal + knowledge graph + RAG, achieving question-answering accuracy above 94%

Tsinghua University has made a major breakthrough in AI question-answering systems: by combining multimodal data, knowledge graphs, and RAG, its DO-RAG framework achieves a question-answering accuracy of over 94%.
Core content:
1. Difficulties and challenges faced by multimodal RAG
2. System architecture and key stages of the DO-RAG solution
3. Implementation process of hybrid retrieval and query decomposition
1. Difficulties faced by multimodal RAG
Knowledge graphs (KGs) encode entities and their relationships in structured form, enabling multi-hop reasoning and precise, high-recall context retrieval.
In multimodal sources, however, the relationships between entities are highly complex, which leads to fragmented retrieval output and persistent hallucination.
Moreover, building and maintaining knowledge graphs requires substantial manual effort, and combining them with vector search and LLM prompting adds considerable engineering overhead.
Existing hybrid KG-RAG frameworks face scalability bottlenecks and require extensive manual tuning to remain robust as knowledge is updated.
2. DO-RAG’s Solution
The core idea of DO-RAG is to transform unstructured, multimodal domain data into a dynamic, multi-level knowledge graph, and retrieve structured, context-rich information by combining graph traversal and semantic vector search.
In the generation phase, the output is verified through fact-grounded refinement steps to reduce hallucinations and improve the factual accuracy of answers.
The DO-RAG system architecture consists of four key phases:
Multimodal document extraction and chunking
Multi-level entity-relation extraction to build the knowledge graph
Hybrid retrieval combining graph traversal and dense vector search
A multi-stage generation pipeline for fact-grounded, user-aligned answers
2.1 Multimodal Document Extraction and Chunking
First, heterogeneous domain data (such as logs, technical manuals, diagrams, and specifications) are parsed into meaningful chunks.
These chunks are stored, together with their vector embeddings, in a pgvector-enabled PostgreSQL instance.
Meanwhile, an agent-based chain-of-thought entity-extraction pipeline transforms the document content into a structured multimodal knowledge graph (MMKG) that captures multi-granularity relations such as system parameters, behaviors, and dependencies.
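A minimal sketch of the storage step, assuming psycopg2 and an arbitrary embedding function; the table schema, column names, and 1536-dimensional embeddings are illustrative choices rather than details from the paper:

```python
import json
import psycopg2
from psycopg2.extras import Json

def store_chunks(chunks, embed, dsn="postgresql://localhost/dorag"):
    """Persist parsed chunks together with their vector embeddings."""
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE EXTENSION IF NOT EXISTS vector;
            CREATE TABLE IF NOT EXISTS chunks (
                id        bigserial PRIMARY KEY,
                content   text NOT NULL,
                metadata  jsonb,          -- source file, chapter, layout tags, ...
                embedding vector(1536)    -- dimension depends on the embedding model
            );
        """)
        for chunk in chunks:
            vec = embed(chunk["text"])  # any embedding model returning list[float]
            cur.execute(
                "INSERT INTO chunks (content, metadata, embedding) "
                "VALUES (%s, %s, %s::vector)",
                (chunk["text"], Json(chunk.get("meta", {})),
                 json.dumps(vec, separators=(",", ":"))),
            )
    conn.close()
```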
2.2 Multi-level knowledge graph construction
DO-RAG designs and implements a hierarchical agent-based extraction pipeline to automatically build and update knowledge graphs that capture entities, relations, and attributes.
The pipeline consists of four specialized agents, each operating at a different level of abstraction:
High-level agents: identify structural elements such as chapters and paragraphs
Mid-level agents: extract domain-specific entities such as system components, APIs, and parameters
Low-level agents: capture fine-grained operational relationships such as thread behavior or error propagation
Covariate agents: attach attributes to existing nodes, such as default values and performance impact
In this way, DO-RAG can extract structured knowledge from documents in multiple modalities such as text, tables, code snippets, and images, and build a dynamic knowledge graph.
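As an illustration of how such a pipeline could be wired up (the networkx graph and the llm_extract helper below are assumptions, not the paper's code):

```python
import networkx as nx

# One instruction per abstraction level, paraphrasing the four agents above.
LEVELS = {
    "high":      "Identify structural elements: chapters, sections, paragraphs.",
    "mid":       "Extract domain entities: system components, APIs, parameters.",
    "low":       "Extract fine-grained operational relations, e.g. thread behavior.",
    "covariate": "Attach attributes (default values, performance impact) to entities.",
}

def build_graph(chunks, llm_extract):
    """Run every agent over every chunk and merge the returned triples."""
    kg = nx.MultiDiGraph()
    for chunk in chunks:
        for level, instruction in LEVELS.items():
            # llm_extract is a hypothetical helper that prompts an LLM and
            # yields (head, relation, tail, confidence) tuples.
            for head, rel, tail, conf in llm_extract(instruction, chunk):
                kg.add_node(head)
                kg.add_node(tail)
                kg.add_edge(head, tail, relation=rel, level=level, weight=conf)
    return kg
```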
2.3 Hybrid Retrieval and Query Decomposition
When a user submits a question, DO-RAG uses an LLM-based intent analyzer to perform structural decomposition of the user query and generate subqueries to guide retrieval from the knowledge graph and vector storage.
First, relevant nodes are retrieved from the knowledge graph by semantic similarity, and then a multi-hop traversal is performed to expand the retrieval scope and generate structured, domain-specific context.
Next, the original query is rewritten using the graph-derived context to make it more specific and unambiguous.
Finally, the rewritten query is vectorized and used to retrieve semantically similar text fragments from the vector database.
All relevant information sources (the original query, the rewritten query, knowledge graph context, retrieved text snippets, and the user interaction history) are integrated into a unified prompt structure and passed to the generation pipeline.
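Putting these steps together, the retrieval flow could look roughly as follows; decompose, match_by_similarity, multi_hop_expand, rewrite, and search are hypothetical stand-ins for LLM calls and database queries:

```python
def hybrid_retrieve(query, kg, vdb, llm, history):
    """End-to-end hybrid retrieval: decompose, traverse, rewrite, search."""
    subqueries = llm.decompose(query)                 # LLM-based intent analysis
    graph_ctx = []
    for sq in subqueries:
        seeds = kg.match_by_similarity(sq)            # semantic entry points
        graph_ctx.extend(kg.multi_hop_expand(seeds))  # widen via graph traversal
    rewritten = llm.rewrite(query, graph_ctx)         # graph-aware disambiguation
    snippets = vdb.search(rewritten, top_k=8)         # dense vector retrieval
    return {                                          # unified prompt structure
        "original_query": query,
        "rewritten_query": rewritten,
        "graph_context": graph_ctx,
        "snippets": snippets,
        "history": history,
    }
```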
2.4 Fact-based answer generation and hallucination mitigation
In the generation phase, DO-RAG adopts a staged prompting strategy.
An initial simple prompt instructs the LLM to answer questions based only on the evidence retrieved and to explicitly avoid unsupported content.
The output is then refined with further prompts that restructure and verify the answer, before a compression phase ensures coherence and brevity.
In addition, DO-RAG generates follow-up questions based on the refined answers and proposes the next step of inquiry based on the overall conversation context, enhancing user engagement and supporting multi-round interactions.
If the system cannot find enough evidence, the model returns “I don’t know” to maintain reliability and prevent hallucinations.
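A compact sketch of this staged strategy; the prompt wording is invented for illustration and llm is a generic callable, not the paper's interface:

```python
GROUNDED = (
    "Answer using ONLY the evidence below. If the evidence is insufficient, "
    "reply exactly: I don't know.\n\nEvidence:\n{ctx}\n\nQuestion: {q}"
)
REFINE = (
    "Check the draft against the evidence, remove any claim the evidence "
    "does not support, and restructure the answer clearly.\n\n"
    "Evidence:\n{ctx}\n\nDraft:\n{draft}"
)
CONDENSE = "Rewrite the answer concisely, matching the tone of the question:\n{ans}"

def generate(llm, ctx, question):
    """Staged prompting: grounded draft -> verified refinement -> compression."""
    draft = llm(GROUNDED.format(ctx=ctx, q=question))
    if draft.strip() == "I don't know":
        return draft                    # refuse rather than hallucinate
    refined = llm(REFINE.format(ctx=ctx, draft=draft))
    return llm(CONDENSE.format(ans=refined))
```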
3. Implementation details
3.1 Knowledge Graph Construction
Multimodal document ingestion:
The system receives heterogeneous domain data containing text, tables, and images, normalizes them, and segments them into meaningful chunks.
At the same time, metadata such as source file structure, chapter hierarchy, and layout tags are preserved for easy traceability.
Entity-relation extraction:
Structured knowledge is extracted via a multi-agent pipeline:
High-level agents identify structural elements of documents
Mid-level agents extract domain-specific entities
Low-level agents capture fine-grained operational relationships
Covariate agents attach attributes to existing nodes
The output of each agent is integrated into a dynamic knowledge graph, where nodes represent entities, edges represent relationships, and weights represent confidence.
Deduplication and optimization:
Redundancy in the knowledge graph is avoided by computing the cosine similarity between each new entity embedding and the existing ones.
In addition, summary nodes are synthesized to group similar entities and reduce graph complexity.
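The similarity check itself is straightforward; the sketch below assumes numpy arrays, and the 0.9 threshold is an illustrative choice (the paper's actual threshold is not given here):

```python
import numpy as np

def is_duplicate(new_emb, existing_embs, threshold=0.9):
    """True if the new entity is near-identical to an already-stored one."""
    for emb in existing_embs:
        cos = np.dot(new_emb, emb) / (np.linalg.norm(new_emb) * np.linalg.norm(emb))
        if cos >= threshold:
            return True
    return False
```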
3.2 Hybrid retrieval steps
Query decomposition:
User queries are decomposed into multiple sub-queries by the LLM-based intent analyzer, each representing a discrete information need (see the prompt sketch after this list).
Knowledge graph retrieval:
The initial query is embedded into the vector space and matched with relevant entities in the knowledge graph.
Then, a multi-hop traversal is performed to expand the retrieval scope and generate structured, domain-specific context.
Query rewriting:
Leveraging graph-derived context, the original query is rewritten through graph-aware prompt templates to make it more specific and explicit.
Vector search:
The rewritten query is vectorized and semantically similar text snippets are retrieved from the vector database.
Information integration:
All relevant information sources (original query, rewritten query, knowledge graph context, retrieved text snippets, and user interaction history) are integrated into a unified prompt structure.
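For the query-decomposition step, an intent-analysis prompt could be as simple as the following; the JSON contract and the fallback behavior are assumptions, not the paper's actual prompt:

```python
import json

DECOMPOSE = (
    "Split the user question into minimal sub-queries, one per distinct "
    "information need. Return a JSON list of strings.\n\nQuestion: {q}"
)

def decompose(llm, question):
    raw = llm(DECOMPOSE.format(q=question))
    try:
        subs = json.loads(raw)          # e.g. ["what is X?", "how does X affect Y?"]
        return subs if isinstance(subs, list) else [question]
    except json.JSONDecodeError:
        return [question]               # fall back to the undecomposed query
```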
3.3 Generation and refinement steps
Initial answer generation:
Based on the retrieved evidence, a simple prompting strategy is used to generate an initial answer. At this point, the LLM is explicitly instructed to avoid generating unsupported content.
Answer refinement:
The initial answer is passed to a refinement prompt that restructures it and verifies its factual accuracy.
Answer compression:
The compression phase adjusts the tone, language, and style of the answer to align it with the original query while ensuring coherence and brevity.
Follow-up question generation:
Based on the refined answer and the overall conversation context, follow-up questions are generated to guide users toward deeper exploration.
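A possible shape for this last step, again with invented prompt text and a generic llm callable:

```python
FOLLOW_UP = (
    "Given the conversation and the final answer, propose up to three natural "
    "follow-up questions the user might ask next, one per line.\n\n"
    "Conversation:\n{history}\n\nAnswer:\n{answer}"
)

def follow_ups(llm, history, answer):
    lines = llm(FOLLOW_UP.format(history=history, answer=answer))
    return [q.strip() for q in lines.splitlines() if q.strip()]
```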