Analysis of DeepSeek-R1's ultra-high hallucination rate: Why do large models always talk nonsense?

In-depth analysis of the large model hallucination phenomenon, revealing the challenges and opportunities in the development of AI.
Core content:
1. DeepSeek-R1's high hallucination rate in an authoritative industry test
2. The "cheating" behavior of large models in informal games
3. Exploring the causes, impacts and solutions of large model hallucinations
The DeepSeek series of models performs well in many respects, but "hallucination" remains one of its biggest challenges.
In the Vectara HHEM AI hallucination test (an industry-standard benchmark that measures a model's hallucination rate by checking whether generated content stays consistent with the source evidence, helping teams evaluate and select models), DeepSeek-R1 showed a hallucination rate of 14.3%.
Figure: Vectara HHEM artificial intelligence hallucination test results
DeepSeek-R1's hallucination rate is not only nearly four times that of DeepSeek-V3, it also far exceeds the industry average.
In an informal chess match between large models organized by blogger Levy Rozman (an American chess influencer with 6 million followers), DeepSeek "cheated" far more often than ChatGPT:
For example, after just a few moves, DeepSeek-R1 voluntarily gave away a pawn to its opponent;
Later in the game, DeepSeek-R1 told ChatGPT that the rules of chess had been updated and used a pawn to capture ChatGPT's queen, a move that caught ChatGPT completely off guard;
Finally, DeepSeek-R1 simply told ChatGPT that it had won; ChatGPT actually agreed and conceded, and DeepSeek-R1 took the victory.
Although this was an entertainment video with loose rules and standards, it shows that a large model can "talk nonsense" with a perfectly straight face, and can even fool another large model.
For humans, the hallucination problem hangs over AI development like a sword of Damocles. Behind the 14.3% hallucination rate are several questions that deserve careful thought:
Why do large models hallucinate? Is it a flaw or a feature?
While DeepSeek-R1 shows amazing creativity, how serious is its hallucination problem?
In which domains do large model hallucinations mainly appear?
And the ultimate challenge: how can large models be both creative and less prone to hallucination?
Tencent Technology invited Dr. Li Wei, former vice president of engineering of the large model team at Mobvoi, to work through the issues surrounding large model hallucination in detail:
Figure: Li Wei, former vice president of engineering for the large model team at Mobvoi and former chief scientist at Netbase
Why do large models produce "hallucinations"?
This is a classic problem with large models. A large model is essentially a "super sentence completer": you give it the first half of a sentence, and it predicts the second half based on the vast knowledge it has learned. It learns somewhat the way the human brain remembers: it cannot recall every word verbatim, so it compresses and generalizes, grasping the gist and finding patterns.
For example, if you ask it "How tall is Yao Ming", it will almost certainly get it right, because this knowledge point is prominent and well remembered. But if you ask "How tall is Mr. Wang next door", it is stumped, because it has never seen Mr. Wang.
However, by design it must keep the conversation going. So it automatically "fills in the blank", making up a number based on the learned concept of "how tall an average person is". This is a "hallucination".
So, how do hallucinations occur?
The essence of hallucination is gap-filling: completing missing details with imagination.
The "gap" is a specific fact. If that fact does not have enough redundancy in the training data, the model cannot remember it (scattered facts are effectively noise). And if it cannot remember, it fills the gap with a hallucination, fabricating the details.
Hallucinations are by no means unconstrained fabrications. A large model is a probabilistic model, and the constraints are the preceding conditions of the conditional probability. The false facts a hallucination selects must match the value type required by the slot, that is, they must conform to the corresponding parent concepts in the ontology/taxonomy. "Zhang San" can be hallucinated into "Li Si", but is very unlikely to be hallucinated into "a stone".
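Schematically, and only as a rough shorthand (my notation, not anything read off the model itself): the hallucinated filler is approximately the most probable entity among those whose type matches the slot,

\[
e^{*} \;\approx\; \arg\max_{e \,\in\, \mathrm{Type}(\mathrm{slot})} P\bigl(e \mid \mathrm{context}\bigr).
\]

The conditioning context keeps the choice inside the right category, which is why the substitute is "Li Si" rather than "a stone".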
Literary theory has a term for this: artistic truth. Artistic truth means that although literary and artistic creation may deviate from the facts of this world, it is an idealized imagining of a possible world. Large model hallucinations belong to the same category.
A large model's learning of knowledge (the training phase) is a process of information compression; its answering of questions (the inference phase) is a process of information decoding, like reducing and then restoring dimensionality. If a fact lacks redundancy, it gets generalized into a slot of a higher-level concept, and at generation time that slot must be filled with something concrete.
The specific fact "Zhang San" is forgotten, but the constraint of the "person" slot remains. To fill the blank, the model finds the entity most consistent with the slot's concept, so a hallucinated "Li Si" or "Wang Wu" can stand in for "Zhang San". This is how novelists work: the characters and stories are fabricated, yet neither the writer nor the reader considers it lying, because the truth, goodness, and beauty they pursue lie on another level.
The same is true for large models: they are born artists, not rote databases. "Putting Zhang's hat on Li's head" and "calling a deer a horse" come naturally in a model's hallucinations, because Zhang and Li are similar, and deer and horse lie on the same line of extension; in the sense of generalization and compression, the two are equivalent.
Yet to some extent, hallucination is imagination (for better or worse), which is to say creativity. Think about it: which of humanity's great literary and artistic works is not full of imagination? If everything had to match reality exactly, art would be nothing but a camera, and what would be the point of that?
As Harari argues in Sapiens: A Brief History of Humankind, humans became the dominant species on Earth because we can "tell stories", creating myths, religions, countries, currencies, and other things that do not exist in physical reality. These are all "hallucinations" of a sort, yet they have been the driving force behind the birth and development of civilization.
How serious is DeepSeek-R1's hallucination problem?
R1's hallucination problem is serious. Previously, the research community generally accepted OpenAI's claim that enhanced reasoning would significantly reduce hallucinations. I once discussed this with an executive at a large model company, and he particularly emphasized the positive role reasoning plays in reducing hallucinations.
But R1's results point the other way.
According to Vectara's test, R1's hallucination rate is indeed much higher than V3's: 14.3%, versus 3.9% for its predecessor. This is directly related to its enhanced chain of thought (CoT) and creativity. R1 is indeed very good at reasoning, writing poetry, and writing novels, but the accompanying "side effect" is that hallucinations are also more frequent.
Specifically for R1, the increase in hallucinations is mainly due to the following reasons:
First, the standard hallucination benchmark uses summarization tasks, and summarization ability is already quite mature at the base model stage. Here, reinforcement can backfire, like using a cannon to kill a mosquito: excessive force increases the chance of hallucination and fabrication.
Second, R1's long chain-of-thought reinforcement learning was not specifically optimized for relatively simple tasks with strict factuality requirements, such as summarization, translation, and news writing; instead it tries to layer "thinking" onto every task.
Its transparent chain-of-thought output shows that even a simple instruction gets tirelessly interpreted and extended from different angles. Too much is as bad as too little: over-complicating simple tasks makes the output drift and increases hallucinations.
In addition, during reinforcement learning on "liberal arts" (non-STEM) tasks, DeepSeek-R1 probably rewarded the model's creativity more heavily, making it more creative when generating content and more likely to stray from the facts.
For mathematics and code, R1's supervision signals come from gold standards (the reference answers to problems, or test cases for code). For liberal arts tasks, V3 or V3's reward model judges quality, and the current preference clearly encourages creativity.
Moreover, user feedback mostly encourages and rewards the creativity users can see. Most people are not sensitive to hallucinations, and when a model's output is smooth and fluent, hallucinations are even harder to spot. For front-line developers, this kind of feedback easily motivates them to push creativity harder rather than tackle hallucination, one of the most headache-inducing problems in the large model field.
From a technical perspective, R1 automatically appends a long chain of thought to even a simple user instruction, which amounts to complicating a simple, clear task.
A simple instruction is repeatedly interpreted and extended from different angles (the CoT is essentially the model's inner monologue as it carries out an instruction). The chain of thought changes the conditioning context before the autoregressive probability model generates its answer, which naturally affects the final output.
The difference from the V3 model looks like this:
V3: query → answer
R1: query + CoT → answer
For tasks that V3 can already complete well, such as summarization or translation, any lengthy chain-of-thought "guidance" may push the output to drift or embellish, which creates a breeding ground for hallucinations.
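Written in terms of the autoregressive conditional (a schematic shorthand, not notation taken from DeepSeek's papers), V3 conditions the answer only on the query, while R1 first samples a chain of thought and then conditions on both:

\[
\text{V3: } P(\text{answer} \mid \text{query}),
\qquad
\text{R1: } \mathrm{CoT} \sim P(\mathrm{CoT} \mid \text{query}),\;\;
P(\text{answer} \mid \text{query}, \mathrm{CoT}).
\]

Every extra token in the CoT shifts that conditioning context, which is exactly where drift can creep in on tasks that needed no extra thinking.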
In which fields do large model hallucinations mainly appear?
If we divide R1's abilities into "liberal arts" and "science", it shows strong logic and relatively few hallucinations on the "science" side, such as mathematics and coding.
But in language creation, especially the summarization task measured here, the hallucination problem is far more obvious. This is largely a side effect of the explosion in R1's linguistic creativity.
Compared with o1, R1's most striking achievement is successfully extending its mathematical and code reasoning abilities into language creation, with especially outstanding results in Chinese. Countless brilliant passages from R1 circulate online; its writing clearly surpasses 99% of humans, and literature graduate students and even professors of classical Chinese studies are full of praise.
Yet ask it for a simple summary and it insists on adding flourishes, and as a result it easily "makes up" things that are not in the original text. As noted above, its "liberal arts" side is too strong, and it overdoes things.
Here we have to talk about the subtle relationship between enhanced reasoning ability and hallucinations.
They are not simply positively or negatively correlated. The mean and median HHEM scores of OpenAI's reasoning model o1 are lower than those of its general model GPT-4o (see the figure below). But when we compare R1 with its base model V3, hallucinations clearly increased after reasoning reinforcement was added.
Figure: HHEM score statistics for o1 and GPT-4o. The lower the HHEM score, the lower the hallucination rate.
Relative to its base model, o1 reduces hallucination while R1 increases it, which may be because R1 overdoes the chain of thought on liberal arts tasks.
As a fast follower, R1 successfully transferred the gains of CoT in mathematics and code to language and writing, but the side effects appear if you are not careful. R1 is especially fond of "divergent thinking": give it a simple instruction and it comes up with a great deal, its chain of thought long enough to circle the Earth three times.
This seems to indicate that in strengthening creativity, R1 inevitably amplified creativity's byproduct: hallucination.
Language tasks fall into two broad categories: those demanding high creativity, such as poetry and fiction, and those demanding high faithfulness, such as news reporting, translation, and summarization. R1 is most praised for the former, which was probably also the R&D team's focus, but the side effects show up in the latter.
This recalls the old Chinese maxim that "faithfulness, expressiveness, and elegance" are hard to achieve all at once. There are many examples of sacrificing "faithfulness" for "elegance": exaggerated rhetoric in literary creation is a prime one. There are also precedents for sacrificing "elegance" for "faithfulness", such as the "hard literal translation" advocated by Lu Xun.
Interestingly, we humans have always applied a double standard here, but we carry a mental switch we can flip at any time. Reading a novel or watching a film, we flip it to the creative side and don't fuss over whether the details are true; the moment we turn to the news, we have zero tolerance for falsehood.
People tend to believe content that seems logically clear, self-consistent, and detailed. Many users, while amazed by R1's creativity, are slowly starting to notice the hallucination phenomenon and to stay alert; far more, however, are still immersed in the creative surprises it delivers, so public awareness of model hallucination needs to be raised. We can work on both ends:
Stay vigilant: don't take everything a large model says at face value, especially where facts are involved. Hallucination strikes most easily at specific entities and data such as names of people, places, dates, and figures, so check these with extra care.
Cross-validate: for important details, check the original sources online or ask an expert to see whether the claims hold up.
Guide the model: when asking questions, add constraints such as "please stay faithful to the original text" or "please verify the facts"; this steers the model toward fewer hallucinations.
Turn on Search: for many questions, especially news and current affairs, besides the DeepThink button (which switches on R1's slow-thinking mode), don't forget the other button, Search.
Adding online search effectively reduces hallucination. Search-based RAG (retrieval-augmented generation) acts like an external database, and the retrieved data compensates for the model's own ignorance of specific details (see the sketch after this list).
Enjoy the creativity: if what you need is inspiration and ideas, the model's hallucinations can bring you pleasant surprises.
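As a rough illustration of that retrieval step, here is a minimal RAG sketch in Python. The toy corpus, the naive keyword-overlap retriever, and the prompt wording are all invented for illustration; the actual call to the model is deliberately left out.

```python
import re

def tokenize(text: str) -> set[str]:
    """Naive tokenizer: lowercase words and numbers only."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by keyword overlap with the query (toy retriever)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: -len(q & tokenize(doc)))
    return ranked[:top_k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Prepend retrieved evidence so the model answers from it, not from memory."""
    evidence = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer strictly based on the evidence below; "
        "if the evidence is insufficient, say you don't know.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )

# Toy corpus standing in for search results.
corpus = [
    "The Yangtze River is about 6,300 km long.",
    "Vectara's HHEM benchmark measures how faithful a summary is to its source.",
]
query = "How long is the Yangtze River?"
print(build_prompt(query, retrieve(query, corpus)))
# The augmented prompt would then be sent to the model; the retrieved facts
# fill the "blanks" the model would otherwise hallucinate.
```

The design point is simply that the answer is conditioned on retrieved evidence rather than on the model's compressed memory of rare facts.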
We might as well regard large model hallucination as "possibilities from a parallel world". Like a novelist's fiction, it is a kind of "artistic truth": it comes from life and rises above life. A large model comes from data and rises above data: it compresses systems of knowledge and common sense, not individual facts, which are the business of databases.
A large model's hallucinations are indeed produced by its "brain", but that brain is built on the vast knowledge and regularities it has learned. Its hallucinations are therefore rarely random; they carry an "internal plausibility" that makes them seamless, lies that sound like the truth, and all the more deceptive for it. Newcomers to large models need to be especially careful and should not trust them blindly.
For ordinary users, it helps to understand where hallucinations arise. For an encyclopedic question with plenty of information redundancy, such as "How long is the Yangtze River", a large model will not err, because such facts are engraved in its parameters. But ask about the length of an obscure creek or a fictional river, and the model triggers its "reasonable gap-filling" mechanism and makes something up.
It could be said that human language itself is a hotbed of hallucination.
Language enables humans to create non-real entities such as myths, religions, countries, companies, and currencies, as well as metaphysical notions such as ideals and beliefs. In Sapiens: A Brief History of Humankind, Harari emphasized the fundamental role of such fictions in civilization: language empowered humans to "tell stories", and those fictions became the catalyst of civilization. Humans were the only entities capable of "lying", until LLMs came along.
Is there any way in the future to make large models both creative and less hallucinatory?
This is definitely one of the "ultimate problems" for large AI models, and everyone is searching for solutions, such as:
More refined training: treat different task types differently during training, so the model knows when to stay "obedient" and when to "let loose".
Task-specific fine-tuning and/or preference reinforcement can ease this tension. Tasks such as summarization, rewriting, translation, and reporting need special care and balance, because they involve some re-creation (style, for instance) yet fundamentally require faithfulness to the content.
Specifically, R1's training pipeline has four stages: fine-tuning 1, reinforcement 1, fine-tuning 2, and reinforcement 2, where reinforcement 2 mainly aligns the model with human preferences. That alignment currently appears to lean toward creativity, but it can be rebalanced later. Perhaps more important, in the third stage (fine-tuning 2), constraints can be strengthened for different tasks, for example by adding more supervised summarization data to steer the model toward faithful, plain results.
Routing: in the future there may be a "dispatcher" that assigns tasks to different models according to task type. Simple tasks would go to V3 or to tool calls, while complex tasks that require slow thinking would go to R1.
For example, if the system recognizes an arithmetic task, it can run a simple line of code, which is equivalent to calling a calculator. That is not what happens today: I tested a nine-digit multiplication yesterday, and R1 thought about it for more than three minutes, printing a chain of thought that could be unrolled down the street, breaking the reasoning down step by step. The final answer was correct, but spending so-called test-time compute and a chain of thought on arithmetic, instead of a function call, is completely unreasonable. There is no need to burn that much compute and that many tokens on explicit reasoning for something a single line of calculator code can do (see the sketch below).
These are all foreseeable forms of routing, especially in the agent era. R1's CoT does not have to cover everything; beyond the hallucination problem, it also wastes resources and is environmentally unfriendly.
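To make the dispatcher idea concrete, here is a minimal routing sketch in Python. The regex gate, the eval-based calculator, and the call_reasoning_model stub are illustrative stand-ins, not any real DeepSeek API.

```python
import re

# Queries made only of digits, whitespace, and basic operators are "pure arithmetic".
ARITHMETIC = re.compile(r"[\d\s\+\-\*/\(\)\.]+")

def call_reasoning_model(query: str) -> str:
    """Stub for a slow-thinking (R1-style) model call; replace with a real client."""
    return f"[routed to reasoning model] {query}"

def route(query: str) -> str:
    """Send pure arithmetic to a deterministic calculator; everything else to the model."""
    text = query.strip()
    if ARITHMETIC.fullmatch(text):
        # Tool call: exact, instant, zero chain-of-thought tokens.
        # (A toy eval; a production router would use a proper expression parser.)
        return str(eval(text, {"__builtins__": {}}, {}))
    return call_reasoning_model(query)

print(route("123456789 * 987654321"))            # calculator path
print(route("Summarize today's AI headlines."))  # model path
```

The same gate could be extended with more task classifiers (translation, summarization, open-ended writing), each mapped to the cheapest model or tool that handles it reliably.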