Using RAG's ideas to build a document-level knowledge graph framework - RAKG

Written by Clara Bennett
Updated on: June 30, 2025
Recommendation

Explore new paradigms for document-level knowledge graph construction and break through the limitations of traditional methods.

Core content:
1. RAKG framework solves the challenges of long text processing and cross-document integration
2. Knowledge base vectorization processing and pre-entity construction process
3. Relationship network construction: integration method from text blocks to knowledge graphs


GraphRAG has verified through some scenarios that the KG+LLM paradigm can effectively enhance the performance of the RAG system. As for how to combine documents to establish multimodal GraphRAG, the author has previously shared related information, such as: "Preliminary Exploration of Multimodal GraphRAG: Document Intelligence + Knowledge Graph + Large Model Combination Paradigm, https://mp.weixin.qq.com/s/coMc5jNPJldPk9X74tDAbA".

Next, let's look at a framework that uses the RAG idea to build a document-level knowledge graph. The problem it aims to solve is how to automatically construct a document-level knowledge graph: traditional construction methods suffer from long-distance forgetting when processing long texts, complex entity disambiguation, and insufficient cross-document knowledge integration. The overall idea is outlined below.

Method

The process in the figure above: the RAKG framework segments documents into sentences and vectorizes them, extracts preliminary entities, and performs entity disambiguation and vectorization. For each processed entity, corpus retrospective retrieval fetches the relevant text, and graph structure retrieval fetches the relevant subgraph of the knowledge graph. An LLM then integrates the retrieved information to build a relationship network, and the per-entity networks are merged. Finally, the newly constructed knowledge graph is fused with the original knowledge graph.

A. Assumptions of an Ideal Knowledge Graph

RAKG assumes that there exists a theoretically perfect knowledge graph construction process that can transform a document into an ideal, complete knowledge graph. This ideal knowledge graph can be expressed as:

$$KG_{\mathrm{ideal}} = F_{\mathrm{ideal}}(D)$$

where $KG_{\mathrm{ideal}}$ is the ideal knowledge graph constructed from the document $D$ by the ideal process $F_{\mathrm{ideal}}$, containing all of the document's semantic relationships.

B. Knowledge Base Vectorization

RAKG vectorizes documents and knowledge graphs to facilitate subsequent retrieval and generation operations.

  1. Document segmentation and vectorization:  The document is segmented into multiple text chunks, usually in units of sentences. Each text chunk is vectorized for subsequent processing and analysis. Similar to RAG, this method can reduce the amount of information processed by LLM each time while ensuring the semantic integrity of each segment, thereby improving the accuracy of named entity recognition.

  2. Knowledge graph vectorization:  Each node (such as entity) in the initial knowledge graph is vectorized by extracting its name and type. The BGE-M3 model is used for vectorization to facilitate use in the retrieval process.
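The segmentation and vectorization steps above can be sketched as follows. Note the hashed bag-of-words `embed` is only a toy stand-in for the BGE-M3 embedder, and the regex sentence splitter is deliberately simplistic:

```python
import math
import re

def embed(text, dim=64):
    """Toy hashed bag-of-words embedding, unit-normalized.
    A stand-in for the BGE-M3 model used by RAKG."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk_document(doc):
    """Split a document into sentence-level chunks, as RAKG does."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

doc = ("RAKG segments documents into sentences. "
      "Each chunk is vectorized for retrieval. "
      "Smaller chunks improve named entity recognition.")
chunks = chunk_document(doc)
# (chunk-id, chunk text, chunk vector) — the retrieval index used later
chunk_index = [(i, c, embed(c)) for i, c in enumerate(chunks)]
```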

C. Pre-construction

RAKG identifies entities in text through named entity recognition (NER) and processes these entities as pre-entities.

  1. Entity recognition and vectorization:  The entire NER process is handled by an LLM (Qwen2.5-72B). First, named entity recognition is performed on each text chunk to identify the entities in it. Then, a type and an attribute description are assigned to each pre-entity to distinguish different entities with similar names. Finally, the name and type of the entity are combined and vectorized.

  2. Entity disambiguation:  After completing entity recognition and vectorization of the entire document, a similarity check is performed. For entities whose similarity exceeds a threshold, further disambiguation processing is performed to ensure that each entity has only one unique representation.
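A minimal greedy sketch of the threshold-based disambiguation step. The 0.9 threshold, the same-type constraint, and all names are illustrative assumptions, not the paper's exact procedure:

```python
def cosine(a, b):
    # vectors are assumed unit-normalized, so the dot product is cosine similarity
    return sum(x * y for x, y in zip(a, b))

def disambiguate(pre_entities, threshold=0.9):
    """Map each pre-entity to the first canonical entity of the same type
    whose vector similarity exceeds the threshold.
    pre_entities: list of (name, type, unit_vector)."""
    canonical = []
    mapping = {}
    for name, etype, vec in pre_entities:
        match = next((cn for cn, ct, cv in canonical
                      if ct == etype and cosine(vec, cv) >= threshold), None)
        if match is None:
            canonical.append((name, etype, vec))
            match = name
        mapping[name] = match
    return mapping

pre = [
    ("LLM", "Model", [1.0, 0.0]),
    ("Large Language Model", "Model", [0.995, 0.0998]),  # near-duplicate
    ("Paris", "City", [0.0, 1.0]),
]
aliases = disambiguate(pre)  # near-duplicates collapse to one representation
```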

D. Relationship network building

RAKG builds the relationship network using the RAG approach.

  1. Document text chunk retrieval:  For a specified entity, retrieve the relevant text chunks by chunk identifier (chunk-id). Use vector retrieval to obtain text chunks similar to the selected entity.

  2. Graph structure retrieval:  Perform vector retrieval in the initial knowledge graph to obtain other entities similar to the selected entity and their relationship network.

  3. Relation network generation and evaluation:  The retrieved text and relation network information are integrated and input into LLM to generate the attributes and relations of the central entity. LLM is used as a judge to evaluate the generated triples to ensure their authenticity and accuracy.
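A minimal sketch of the two retrieval steps above, using dot products over unit vectors; the ids, vectors, and k values are toy assumptions:

```python
def top_k(query_vec, index, k=2):
    """Return ids of the k index entries most similar to the query.
    index: list of (id, unit_vector); similarity is the dot product."""
    scored = sorted(index,
                    key=lambda item: -sum(q * v for q, v in zip(query_vec, item[1])))
    return [item[0] for item in scored[:k]]

entity_vec = [1.0, 0.0]  # vector of the entity currently being expanded

# toy retrieval indexes: text chunks and initial-KG entities
chunk_index = [("c1", [0.9, 0.436]), ("c2", [0.0, 1.0]), ("c3", [1.0, 0.0])]
graph_index = [("EntityA", [0.8, 0.6]), ("EntityB", [0.1, 0.995])]

related_chunks = top_k(entity_vec, chunk_index)         # -> fed to the LLM as context
related_entities = top_k(entity_vec, graph_index, k=1)  # -> neighboring subgraph
```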

E. Knowledge Graph Fusion

RAKG fuses the newly constructed knowledge graph with the initial knowledge graph. Naturally, KG fusion has two core components.

  1. Entity merging:  disambiguate and merge entities in the new knowledge graph with those in the initial knowledge graph to ensure entity consistency.

  2. Relation integration:  Integrate the relations in the new knowledge graph with those in the initial knowledge graph to obtain a more comprehensive knowledge graph.
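The two fusion steps can be sketched as a single set-union pass, assuming disambiguation has already produced an alias map (all names here are illustrative):

```python
def fuse(kg_old, kg_new, alias):
    """Fuse two knowledge graphs.
    Each KG is (entities, triples) with triples as (head, relation, tail);
    alias maps new-KG entity names to their canonical names after disambiguation."""
    ents_old, rels_old = kg_old
    ents_new, rels_new = kg_new
    entities = ents_old | {alias.get(e, e) for e in ents_new}
    relations = rels_old | {(alias.get(h, h), r, alias.get(t, t))
                            for h, r, t in rels_new}
    return entities, relations

kg_old = ({"RAKG", "RAG"}, {("RAKG", "builds_on", "RAG")})
kg_new = ({"R.A.K.G", "LLM"}, {("R.A.K.G", "uses", "LLM")})
entities, relations = fuse(kg_old, kg_new, alias={"R.A.K.G": "RAKG"})
```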

Evaluation Metrics

These indicators are mainly used to evaluate the resulting KG, so this is also a good opportunity to review common KG evaluation metrics.

1. Entity Density (ED)

Entity density refers to the number of entities in the knowledge graph. The formula is as follows:

$$ED = N_e$$

where $N_e$ is the number of entities extracted into the knowledge graph. A higher entity density usually means that more information was extracted from the text and that the knowledge graph has wider coverage.

2. Relationship Richness (RR)

Relationship richness refers to the ratio of the number of relations to the number of entities in the knowledge graph. The formula is as follows:

$$RR = \frac{N_r}{N_e}$$

where $N_r$ is the number of relations extracted from the knowledge graph. The higher the relationship richness, the more complex the relations between entities, and the better the graph captures the interactions between them.
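Both metrics on a toy graph represented as (entities, triples); a minimal sketch:

```python
def entity_density(kg):
    """ED: the number of entities in the graph."""
    entities, _relations = kg
    return len(entities)

def relationship_richness(kg):
    """RR: relations per entity."""
    entities, relations = kg
    return len(relations) / len(entities) if entities else 0.0

kg = ({"A", "B", "C"}, {("A", "knows", "B"), ("B", "knows", "C")})
```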

3. Entity Fidelity (EF)

Entity fidelity is used to evaluate the credibility of the extracted entities. The formula is as follows:

$$EF = \frac{1}{N_e} \sum_{i=1}^{N_e} \mathrm{Judge}(e_i)$$

where $\mathrm{Judge}(e_i)$ is a function that evaluates each extracted entity $e_i$ based on an LLM's judgment and returns a value between 0 and 1 indicating the entity's credibility.

4. Relationship Fidelity (RF)

Relation fidelity is used to evaluate the credibility of the extracted relations. The formula is as follows:

$$RF = \frac{1}{N_r} \sum_{i=1}^{N_r} \mathrm{Judge}(r_i)$$

where $\mathrm{Judge}(r_i)$ is a function that evaluates each extracted relation $r_i$ based on an LLM's judgment and returns a value between 0 and 1 indicating the relation's credibility.

5. Accuracy

Accuracy measures the performance of the constructed knowledge graph on a question-answering task, i.e., the proportion of questions answered correctly using the graph. Higher accuracy means the knowledge graph better preserves the semantic information of the text.

6. Entity Coverage (EC)

Entity coverage measures how well the entities in the evaluated knowledge graph match those in the standard knowledge graph. The formula is as follows:

$$EC = \frac{|E_{\mathrm{eval}} \cap E_{\mathrm{std}}|}{|E_{\mathrm{std}}|}$$

where $E_{\mathrm{eval}}$ is the entity set of the evaluated knowledge graph and $E_{\mathrm{std}}$ is the entity set of the standard knowledge graph. The higher the entity coverage, the more complete the knowledge graph is at the entity level.
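EC as a set computation, assuming entities are matched by canonical name:

```python
def entity_coverage(eval_entities, std_entities):
    """EC: fraction of standard-KG entities also present in the evaluated KG."""
    std = set(std_entities)
    if not std:
        return 0.0
    return len(set(eval_entities) & std) / len(std)

# toy example: 2 of the 4 standard entities are recovered
ec = entity_coverage({"A", "B", "X"}, {"A", "B", "C", "D"})
```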

7. Relation Network Similarity (RNS)

Relation network similarity measures how similar the evaluated knowledge graph and the standard knowledge graph are at the relation level. The formula is as follows:

$$RNS = \sum_{i} w_i \cdot \mathrm{sim}\big(\mathcal{N}_{\mathrm{eval}}(e_i),\, \mathcal{N}_{\mathrm{std}}(e_i)\big)$$

where $\mathrm{sim}(\cdot,\cdot)$ is the similarity between the relationship networks of the same entity $e_i$ in the evaluated and standard knowledge graphs, and $w_i$ is the weight of the corresponding entity. The higher the relation network similarity, the more accurate the knowledge graph is at the relation level.
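A sketch of RNS using Jaccard overlap of per-entity edge sets as a stand-in similarity function (the paper's exact similarity and weighting scheme may differ):

```python
def jaccard(a, b):
    """Set-overlap similarity; a stand-in for the paper's sim function."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def relation_network_similarity(eval_nbrs, std_nbrs, weights):
    """RNS: weighted sum over entities of the similarity between their
    relationship networks in the evaluated and standard KGs.
    eval_nbrs / std_nbrs: entity -> set of (relation, neighbor) edges."""
    return sum(weights[e] * jaccard(eval_nbrs.get(e, set()), std_nbrs[e])
               for e in std_nbrs)

std_nbrs = {"A": {("knows", "B"), ("knows", "C")}, "B": {("knows", "A")}}
eval_nbrs = {"A": {("knows", "B")}, "B": {("knows", "A")}}
rns = relation_network_similarity(eval_nbrs, std_nbrs, weights={"A": 0.5, "B": 0.5})
```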

These metrics are used to comprehensively evaluate the quality of the knowledge graph, ensuring its performance in entity extraction, relationship construction, and overall accuracy.

Experimental results