Knowledge graphs, big models, and hallucinations: a natural language processing perspective

Written by
Silas Grey
Updated on: June 13th, 2025
Recommendation

How can knowledge graphs become "fact checkers" for large models? Explore the latest technical paths to combat hallucinations in the field of NLP.

Core content:
1. Multi-dimensional analysis of large model hallucination problems (world knowledge/self-contradiction/instruction deviation)
2. Three types of integration frameworks and evaluation systems for knowledge graph-enhanced models
3. Existing challenges and future directions revealed by cross-domain benchmarking


Summary

Large language models (LLMs) have revolutionized natural language processing (NLP) applications, including automated text generation, question answering systems, and chatbots. However, they face a major challenge: hallucination, where the model generates content that sounds plausible but is factually wrong. This undermines trust and limits the applicability of LLMs across domains. Knowledge graphs (KGs), on the other hand, provide a structured collection of interconnected facts, represented as entities (nodes) and their relationships (edges). In recent studies, KGs have been used to provide context that fills gaps in an LLM's understanding of specific topics, offering a promising way to mitigate hallucinations and enhance the reliability and accuracy of LLMs while preserving their wide applicability. Despite this, it remains a very active research area with many unresolved open problems. In this paper, we discuss these open challenges, covering the latest datasets and benchmarks as well as methods for knowledge integration and hallucination evaluation. In our discussion, we consider the current use of KGs in LLM systems and identify future directions for each challenge.

Core Overview

Background

  1. Research Questions: The paper addresses the factual inconsistency, or "hallucination", that large language models (LLMs) are prone to when generating text. Such hallucinations undermine users' trust in AI systems and can produce misleading information.
  2. Research Difficulties: These include the multifaceted nature of hallucinations (world knowledge errors, self-contradiction, and outputs inconsistent with the prompt instructions or given context), the complexity of evaluating hallucinations (the semantic consistency of the output must be assessed), and the limitations of existing datasets and benchmarks.
  3. Related Work: Prior work includes using knowledge graphs (KGs) to provide structured factual information that alleviates LLM hallucinations, as well as existing hallucination detection methods and knowledge integration models.

Research Methods

This paper proposes to use knowledge graphs (KGs) to alleviate the hallucination problem of LLMs. Specifically,

  1. Utilization of knowledge graphs: A KG is a structured knowledge representation consisting of entities (nodes) and the relationships (edges) between them. Integrating KG information into LLMs provides a factual basis during reasoning or generation, improving the consistency and accuracy of the output (see the sketch after this list).

  2. Classification of knowledge integration models: Knowledge integration models can be classified by their underlying architecture. The paper proposes a classification framework showing where additional information can be added at different stages to enhance factuality.

  3. Hallucination detection methods: GraphEval proposes a two-stage hallucination detection and mitigation method that extracts atomic assertions from LLM output and compares them with the given textual context. Other methods such as KGR and Fleek follow similar approaches, but all have limitations.

  4. Multi-prompt evaluation: The DefAn dataset evaluates the robustness and consistency of LLMs by providing 15 different restatements of each question-answering data point.
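
The prompt-level KG integration described above can be made concrete with a small sketch. This is a minimal illustration, not the paper's method: `generate` stands in for any LLM completion call, and the example triples are invented for demonstration.

```python
# Minimal sketch: serialize KG triples into the prompt so the model can ground
# its answer in explicit facts (retrieval-augmented generation style).
def serialize_triples(triples):
    """Turn (subject, relation, object) triples into plain-text facts."""
    return "\n".join(f"- {s} {r} {o}." for s, r, o in triples)

def build_grounded_prompt(question, triples):
    facts = serialize_triples(triples)
    return (
        "Answer the question using only the facts below. "
        "If the facts are insufficient, say so.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    )

# Illustrative triples; in practice these would be retrieved from a real KG.
kg_triples = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
]
prompt = build_grounded_prompt("Where was Marie Curie born?", kg_triples)
# answer = generate(prompt)  # plug in the LLM of your choice here
```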

Experimental design

  1. Datasets: The paper reviews multiple hallucination detection datasets and benchmarks, including Shroom SemEval 2024, MuShroom SemEval 2025, MedHalt, HaluEval, TruthfulQA, FELM, HaluBench, DefAn, SimpleQA, etc. These datasets cover many domains and task types, such as law, politics, medicine, technology, art, and finance.
  2. Evaluation Metrics: Metrics such as accuracy, calibration, and F1 score are used to evaluate hallucination detection models. For knowledge integration methods, semantic similarity metrics such as BERTScore and BARTScore are also used (a small metric-computation sketch follows this list).
  3. Experimental setup: The experimental setup covers the split of each dataset (training, validation, test), the definition of subtasks, and the sources of external knowledge (such as textual context and web pages).
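
As a hedged illustration of the two metric families mentioned above, the following sketch computes detection metrics with scikit-learn and semantic similarity with the open-source bert-score package. The labels and sentences are made up, and the packages are assumed to be installed (e.g. `pip install scikit-learn bert-score`).

```python
from sklearn.metrics import accuracy_score, f1_score
from bert_score import score as bert_score

# Binary hallucination-detection labels: 1 = hallucinated, 0 = faithful.
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Semantic similarity between a generated answer and a reference answer.
candidates = ["Marie Curie was born in Warsaw."]
references = ["Marie Curie was born in Warsaw, Poland."]
P, R, F1 = bert_score(candidates, references, lang="en")
print("BERTScore F1:", F1.mean().item())
```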

Results and Analysis

  1. Hallucination detection: Existing hallucination detection methods have made progress in identifying and handling hallucinations, but problems remain. For example, multi-stage pipeline methods have limited robustness and scalability and depend heavily on LLM prompting.
  2. Knowledge integration: Integrating KG information into LLMs can significantly improve the consistency and accuracy of the output. However, existing knowledge integration methods still struggle with rapid knowledge updates and prompt brittleness.
  3. Multi-prompt evaluation: Results on the DefAn dataset show that the multi-prompt approach improves the robustness and consistency of LLMs, but further research is needed to verify its effectiveness across scenarios (a minimal consistency check is sketched below).
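
To make the multi-prompt idea concrete, here is a minimal, hypothetical consistency check in the spirit of DefAn's paraphrase-based evaluation (not the dataset's official protocol): `ask_llm` is a placeholder for a model call, and agreement is measured by simple string matching.

```python
from collections import Counter

def consistency_rate(paraphrases, ask_llm):
    """Fraction of paraphrased prompts that receive the model's most common answer."""
    answers = [ask_llm(p).strip().lower() for p in paraphrases]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

paraphrases = [
    "Where was Marie Curie born?",
    "In which city was Marie Curie born?",
    "Name the city where Marie Curie was born.",
]
# rate = consistency_rate(paraphrases, ask_llm)  # 1.0 means fully consistent answers
```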

Overall conclusion

This paper summarizes the current status and challenges of using knowledge graphs (KGs) to alleviate the hallucination problem of LLMs. Although existing methods have made progress, hallucination mitigation remains an open research problem. The paper proposes future research directions, including large-scale datasets, multilingual and multi-task evaluation, fine-grained hallucination detection, reducing reliance on textual prompts, and combining different hallucination mitigation methods. Through these directions, the paper hopes to provide more effective solutions to the hallucination problem of LLMs.

Paper Evaluation

Advantages and innovations

  1. Comprehensiveness: The paper discusses in detail the potential of knowledge graphs (KGs) in alleviating generative hallucinations in large language models (LLMs), covering the current state of research, its limitations, and future research directions.
  2. Classification method: It proposes an architecture-based classification of knowledge integration models and summarizes the kinds of additional information that can be added at each stage.
  3. Resource survey: It reviews existing datasets and benchmarks for evaluating hallucinations and provides a detailed overview of these resources.
  4. Multi-dimensional evaluation: It emphasizes the importance of multidimensional evaluation, including multilingual, multi-task, and multi-angle evaluation methods.
  5. Fine-grained detection: It discusses fine-grained hallucination detection, such as sentence-level and paragraph-level detection, to better capture the details of hallucinations.
  6. Future directions: It proposes several future research directions, including large-scale datasets, robust evaluation, fine-grained hallucination detection, knowledge integration methods that do not rely on textual prompts, and the exploration of hybrid methods.

Shortcomings and reflections

  1. Dataset limitations: Most existing datasets lack high-quality knowledge graph triples as external knowledge, which limits the development of parameterized knowledge integration methods.
  2. Limitations of the evaluation methods: Current evaluation methods rely mainly on single prompts and lack multilingual evaluation, so they cannot comprehensively assess the robustness and generalization ability of a system.
  3. Method dependency: Many methods still rely on textual prompts, which suffer from prompt brittleness and high computational cost.
  4. Limitations of knowledge graphs: Existing knowledge graphs have limitations in completeness, accuracy, and multilingual coverage, which may reduce the effectiveness of hallucination mitigation.
  5. Suggestions for future research: Further research is needed on integrating knowledge in parameterized settings, reducing reliance on textual prompts, and effectively combining different methods.

Key questions and answers

Question 1: According to the paper, how are knowledge graphs (KGs) applied to alleviate the hallucination problem of LLMs?

  1. Pre-training phase: KG triples are used as part of the training data and fused with the raw text input through a masked entity prediction task. For example, the Ernie 3.0 model improves language understanding and generation through large-scale knowledge-enhanced pre-training.
  2. Reasoning stage: KG triples are combined with the query through prompting to form the input pair P = {K, Q} (the knowledge context K plus the query Q) for retrieval-augmented generation (RAG). Semantic similarity metrics such as BERTScore and BARTScore are then used to evaluate the quality of the LLM output.
  3. Post-generation phase: After an answer is generated, it is fact-checked against an external KG and the original output is corrected based on the verification result (a minimal verification sketch follows this list). For example, the GECKO method relies entirely on KG information for text generation.
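
As a hedged illustration of the post-generation stage, the sketch below checks claims from a model answer against a toy KG. The `extract_triples` step is assumed (in practice an information-extraction model or a second LLM call), and all triples are invented for demonstration.

```python
# Toy KG as a set of (subject, relation, object) triples.
KG = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
}

def verify_against_kg(claimed_triples, kg):
    """Split claims extracted from the model output into verified and unsupported."""
    verified, unsupported = [], []
    for triple in claimed_triples:
        (verified if triple in kg else unsupported).append(triple)
    return verified, unsupported

# e.g. claims = extract_triples(answer)  -- extractor assumed, not shown here
claims = [("Marie Curie", "born_in", "Paris")]
verified, unsupported = verify_against_kg(claims, KG)
# Unsupported claims can be corrected or regenerated with the relevant KG facts in the prompt.
```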

Question 2: What are the hallucination detection methods mentioned in the paper? What are their respective advantages and disadvantages?

  1. GraphEval: A two-stage hallucination detection and mitigation method. The first stage extracts atomic assertions and forms subgraphs via LLM prompting; the second stage compares these subgraphs with the given textual context. Its advantage is fine-grained error analysis; its disadvantage is dependence on the robustness of LLM prompts.
  2. KGR: Extracts KG subgraphs via named entities and compares the alignment between the source text and the generated text. It can identify the specific erroneous parts, but details of abstract concepts may be lost.
  3. Fleek: Extracts structured triples and uses another LLM for fact checking. It can perform explicit fact verification, but it relies on multiple LLM inference passes and is computationally expensive.
  4. DefAn: Evaluates the robustness and consistency of LLMs by providing multiple restatements of each question-answering data point. Multi-prompt evaluation improves the robustness of the assessment, but it requires a large amount of labeled data and compute.

Question 3: How effective are the knowledge integration methods mentioned in the paper at improving the consistency and accuracy of LLM output, and what challenges remain?

  1. Effect: Incorporating KG information into LLMs can significantly improve the consistency and accuracy of the output. For example, the performance of the Ernie 3.0 model on sentiment analysis tasks improved markedly after large-scale knowledge-enhanced pre-training.
  2. Challenges: Existing knowledge integration methods still face difficulties with rapid knowledge updates and prompt brittleness. For example, prompt-based methods rely on manually designed templates and are sensitive to format and content constraints. In addition, multi-stage pipeline methods have limited robustness and scalability and depend heavily on LLM prompting.