Walk&Retrieve: a simple framework for reducing hallucinations in RAG question answering with a KG-generated corpus

Written by
Clara Bennett
Updated on: June 18, 2025
Recommendation

A new framework for solving the RAG question-answering hallucination problem using knowledge graphs to improve answer accuracy and robustness.

Core content:
1. Knowledge graph-based corpus generation method for the Walk&Retrieve framework
2. Application of random walks and breadth-first search in graph traversal
3. Experimental performance evaluation on MetaQA and CRAG datasets

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Walk&Retrieve builds on a knowledge graph (KG): it uses graph traversal and knowledge representation to generate the corpus for zero-shot RAG, with the goal of reducing hallucinations in RAG systems. The idea behind the framework is relatively simple, and its core is corpus generation for zero-shot RAG. The rest of this article walks through it for reference.

Corpus Generation

Corpus generation is the core step of the framework. This stage extracts relevant information from the knowledge graph and converts it into a text format suitable for LLM processing. It consists of three steps: graph-based traversal, knowledge representation, and indexing.

1. Graph-based traversal

  1. Random walk: A random walk is a stochastic process that starts from a node and, at each step, moves to one of the current node's neighbors chosen with uniform probability:

    $$P(v_{i+1} \mid v_i) = \frac{1}{|N(v_i)|}, \quad v_{i+1} \in N(v_i)$$

    where $|N(v_i)|$ denotes the number of neighbors of node $v_i$. For each node $v$, $n$ random walk paths of length $l$ are generated. The final corpus $C$ is the set of random walk paths of all nodes.

    • Advantages: Simple and easy to use, suitable for large-scale graphs.
    • Disadvantages: May generate duplicate paths and noise.

  2. Breadth-first search (BFS) walk: BFS is a graph traversal algorithm that starts from a root node and visits its neighbors layer by layer. For each root node $r$, a hierarchical structure is built in which the nodes at level $k$ are those whose shortest-path distance to the root is $k$:

    $$L_k(r) = \{\, v : d(r, v) = k \,\}, \quad k = 0, 1, \dots, D$$

    where $d(r, v)$ is the shortest-path distance from $r$ to $v$, and $D$ is the maximum depth. The levels are then traversed in order, so that each node is visited only once.

    • Advantages: Avoids repeated paths and generates more diverse walk paths.
    • Disadvantages: High computational complexity, especially when traversing to large depths.
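The two traversal strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph `KG`, the function names `random_walks` and `bfs_walks`, and the choice to emit one root-to-node path per node discovered by BFS are all hypothetical. A path is stored as an alternating tuple of nodes and relations.

```python
import random
from collections import deque

# Hypothetical toy knowledge graph: node -> list of (relation, neighbor) edges.
KG = {
    "Inception": [("directed_by", "Christopher Nolan"),
                  ("starred_actors", "Leonardo DiCaprio")],
    "Christopher Nolan": [("directed", "Interstellar")],
    "Leonardo DiCaprio": [("acted_in", "Titanic")],
    "Interstellar": [],
    "Titanic": [],
}

def random_walks(graph, start, num_walks, length, seed=0):
    """Return num_walks paths of at most `length` hops starting at `start`.

    Each step picks an outgoing edge uniformly at random, which equals a
    uniform choice over neighbors when every neighbor is reached by one edge.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        path, node = [start], start
        for _ in range(length):
            edges = graph.get(node, [])
            if not edges:  # dead end: stop this walk early
                break
            relation, node = rng.choice(edges)
            path += [relation, node]
        walks.append(tuple(path))
    return walks

def bfs_walks(graph, root, max_depth):
    """Return one root-to-node path per node reached within max_depth levels.

    The `visited` set ensures that each node is expanded only once.
    """
    visited = {root}
    queue = deque([(root, (root,), 0)])
    walks = []
    while queue:
        node, path, depth = queue.popleft()
        if depth == max_depth:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            new_path = path + (relation, neighbor)
            walks.append(new_path)
            queue.append((neighbor, new_path, depth + 1))
    return walks
```

Calling `random_walks(KG, "Inception", num_walks=3, length=2)` may return the same path more than once, which illustrates the duplicate-path drawback noted above, whereas `bfs_walks(KG, "Inception", max_depth=2)` yields each reachable node exactly once.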

2. Knowledge Representation

An LLM requires text input, so the extracted graph traversal paths must be converted into natural language descriptions. Using predefined prompt templates, the walk path of each node is converted into a natural language sentence. For example, for a random walk path $(v_0, r_1, v_1, r_2, v_2)$, the generated sentence could look like "$v_0$ is connected to $v_1$ through the relationship $r_1$, and $v_1$ is connected to $v_2$ through the relationship $r_2$."
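A template-based conversion of this kind can be sketched as follows. The function name `verbalize` and the exact wording of the template are assumptions made for illustration, and plain string formatting stands in for whatever prompt template the framework actually uses. The path is assumed to be an alternating tuple of nodes and relations.

```python
def verbalize(path):
    """Turn (v0, r1, v1, r2, v2, ...) into one sentence with a fixed template."""
    # Slice the alternating path into (head, relation, tail) triples.
    triples = [(path[i], path[i + 1], path[i + 2])
               for i in range(0, len(path) - 2, 2)]
    if not triples:  # a single node carries no relation to describe
        return ""
    clauses = [f"{head} is connected to {tail} through the relationship {rel}"
               for head, rel, tail in triples]
    return ", and ".join(clauses) + "."


print(verbalize(("Inception", "directed_by", "Christopher Nolan",
                 "directed", "Interstellar")))
```

Relation identifiers such as `directed_by` are passed through unchanged here; a real template would probably map them to more readable phrases.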

3. Indexing

Each walk path is converted into a vector representation, and the global representation of each node is computed as the concatenation of all its walk path vectors. The nodes and their corresponding walk path vectors are stored for fast retrieval during the inference phase.
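The indexing step might look like the sketch below. It is only an illustration: `toy_embed` is a hypothetical hashed bag-of-words encoder standing in for a real sentence-embedding model, and `DIM`, `build_index`, and the dictionary layout are likewise assumptions rather than part of the framework.

```python
import hashlib
import math

DIM = 16  # hypothetical embedding size for the toy encoder

def toy_embed(text, dim=DIM):
    """Stand-in for a real sentence encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def build_index(node_to_sentences):
    """Map each node to its walk sentences, walk vectors, and node vector."""
    index = {}
    for node, sentences in node_to_sentences.items():
        walk_vectors = [toy_embed(s) for s in sentences]
        # Node-level representation: concatenation of all walk path vectors.
        node_vector = [x for vec in walk_vectors for x in vec]
        index[node] = {
            "sentences": sentences,
            "walk_vectors": walk_vectors,
            "node_vector": node_vector,
        }
    return index
```

One consequence of concatenation is that nodes with different numbers of walks get node vectors of different lengths, so this sketch keeps the per-walk vectors alongside the node vector for similarity search.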

Retrieval and Question Answering

This stage is not the focus of the framework and works the same way as in traditional RAG: query encoding, similarity retrieval (k-nearest neighbor search), context integration, and answer generation.
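For completeness, the retrieval side can be sketched as below. This is a generic k-nearest-neighbor lookup by cosine similarity followed by prompt assembly, not code from the framework: the function names, the `(sentence, vector)` entry layout, and the prompt wording are all hypothetical, and the query vector is assumed to come from the same encoder that produced the indexed vectors. The final LLM call is left out.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, entries, k):
    """Return the k (sentence, vector) entries most similar to the query."""
    return heapq.nlargest(k, entries, key=lambda e: cosine(query_vec, e[1]))

def build_prompt(question, retrieved):
    """Integrate the retrieved walk sentences into a prompt for the LLM."""
    context = "\n".join(f"- {sentence}" for sentence, _ in retrieved)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

A brute-force scan like this is fine for a small corpus; a larger index would typically use an approximate nearest-neighbor library instead.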

Experimental Performance

Performance on MetaQA: Walk&Retrieve-BFS performs best in both answer accuracy and reduction of false answers, with a reported relative improvement of 38.64%. Other KG-based RAG systems achieve high accuracy but produce more false answers. Walk&Retrieve-BFS also performs well on truthfulness and on reducing non-responses across 1-hop, 2-hop, and 3-hop questions.

Performance on CRAG: The Walk&Retrieve variants outperform LLM-only and text-based RAG in answer accuracy, while being comparable to them in false-answer and non-response rates. Because CRAG is more complex, the performance of Walk&Retrieve drops slightly, but it still shows good robustness.