Can AI answers be wrong? How does Mininglamp Technology use factual data to counter AI hallucinations?

In the AI era, how can data authenticity be ensured? Mininglamp Technology explains the truth behind AI hallucinations.
Core content:
1. How the misreported mortality figures for the post-80s generation expose the risk of AI misinterpreting data
2. Why AI generates erroneous information: a prediction mechanism built on training data
3. How enterprises can guard against AI hallucinations and ensure data authenticity
Some time ago, a set of figures about the mortality rate of people born in the 1980s attracted widespread attention online. Many self-media accounts, chasing traffic and popularity, kept hyping the figures, which helped these outlandish numbers spread. Recently, authoritative outlets such as CCTV News, along with relevant experts, refuted the rumors, pointing out that the figures are seriously inconsistent with the facts.
Experts noted that the seventh national census was conducted in 2020, so its results obviously cannot reflect mortality in 2024. Moreover, census data only publish the mortality rate for the corresponding reference period: the seventh national census in 2020 covers deaths from November 1, 2019 to October 31, 2020, and there are no cumulative death statistics for specific cohorts (such as people born in the 1980s). The "5.2%" mortality figure circulating online is clearly wrong, because in professional statistics mortality is usually expressed per thousand, not as a percentage. The related content also contains other obvious lapses in professional common sense, such as confused definitions.
With the number of Chinese Internet users exceeding 1.1 billion, about 250 million of them have become users of generative AI. While AI brings dividends, it also brings risks and challenges.
Science Popularization China once pointed out that, just as we try to guess the answer from what we already know when we hit an exam question we cannot answer, AI fills in the gaps and makes inferences based on its own "experience" (its training data) when information is missing or uncertain. This is not because AI wants to deceive us, but because it is trying to complete the task with its own model of understanding.
A large model's knowledge comes from data, which in turn comes from public datasets, data crawled from the Internet, and proprietary or third-party data.
However, owing to factors such as insufficient training corpora and limited data sources, large AI models have cognitive blind spots and inevitably generate erroneous or fabricated information, commonly referred to in the industry as "hallucinations".
"The main reason is that the fundamental principle of the big model is to predict the next token. Since it is a prediction, it means choosing the path with the highest probability for reasoning, and this path does not include "facts" and "logical reasoning." Relevant experts from Minglu Technology pointed out.
With the explosive popularity of DeepSeek, a new nation-scale hit app, AI has broken into the mainstream, and enterprises are increasingly demanding AI-enabled business. However, professional fields place stricter requirements on the authenticity and accuracy of the information AI outputs. So in real business scenarios, how can enterprises play to AI's strengths, avoid its weaknesses, and make it serve their business better?
Experts from Mininglamp Technology said that AI has different application scenarios: some call for divergence and imagination, while others call for rigor and convergence. In most enterprise scenarios, the answers must be rigorous, well-founded, and simply cannot be wrong.
Companies can mitigate AI hallucinations in three main ways: choosing a specific model, providing the required materials, and adding instruction guidance:
Select a specific model
Models that perform well on instruction following and citation-based summarization were trained to prefer answers that quote the original text. Users therefore see more of the source content in the AI's answers, rather than the results of the AI's free improvisation.
Provide the required materials
By supplying materials and information relevant to the question, the AI can judge how the materials relate to the question and will tend to answer using the given materials.
Add instruction guidance
Spell out constraints: tell the AI to answer based on existing facts and not to make assumptions, and to mark uncertain or ambiguous information with labels such as "this is speculation". A combined sketch of the last two approaches follows this list.
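A minimal sketch of combining "provide the required materials" with "add instruction guidance" might look as follows; the prompt wording is illustrative, and `call_llm` stands in for whatever chat-completion client an enterprise actually uses (a hypothetical name, not a specific product API):

```python
# Ground the model in supplied materials and constrain it with explicit instructions.

REFERENCE_MATERIAL = """\
The seventh national census (2020) reports mortality only for the period
November 1, 2019 to October 31, 2020; it publishes no cumulative death
statistics for specific cohorts such as people born in the 1980s.
"""

INSTRUCTIONS = (
    "Answer strictly based on the reference material above. "
    "Do not make assumptions beyond it. "
    "Mark any unsupported statement with 'this is speculation'. "
    "If the material does not contain the answer, say so."
)

def build_prompt(question: str) -> str:
    # Place the materials before the question so the model answers from them.
    return f"Reference material:\n{REFERENCE_MATERIAL}\n{INSTRUCTIONS}\n\nQuestion: {question}"

if __name__ == "__main__":
    prompt = build_prompt("What is the cumulative mortality rate of people born in the 1980s?")
    print(prompt)                # inspect the grounded, constrained prompt
    # answer = call_llm(prompt)  # send to the chat model of your choice
```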
From a model perspective, models such as GPT-4 hallucinate less for three reasons: first, improved quality and diversity of training data, which amounts to covering more user scenarios with high-quality corpora; second, post hoc verification and correction mechanisms; third, the use of more sophisticated constraints and rules.
Among these, the first point is the most critical: most questions have already been covered by corresponding training corpora. For scenarios and questions the model has never seen, however, fabrication persists in the absence of such corpora. It is therefore essential to prevent hallucinations at the "nutrient" of large AI models, the data side.
Mininglamp Technology believes that for enterprises:
On the one hand, in specific industry scenarios, enterprises need to counter AI hallucinations with factual data, selecting authoritative data sources to effectively compensate for the large model's lack of proprietary knowledge;
On the other hand, enterprises should strengthen the construction of their knowledge bases and make full use of retrieval-augmented generation (RAG) technology.
RAG is like equipping the large model with a super plug-in that lets it pull reliable information from trusted materials at any time, thereby producing more reliable answers. A minimal sketch of the idea follows.
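As a rough illustration of the RAG idea (the keyword-overlap retrieval and the knowledge-base entries below are simplified assumptions for the example, not Mininglamp's product), the model is asked to answer only from passages retrieved out of a trusted knowledge base:

```python
# Toy retrieval-augmented generation: retrieve the most relevant knowledge-base
# passages, then build a prompt that restricts the model to that context.

KNOWLEDGE_BASE = [
    "The seventh national census was conducted in 2020.",
    "The 2020 census reflects mortality from November 1, 2019 to October 31, 2020 only.",
    "In professional statistics, mortality rates are usually expressed per thousand.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring; real systems typically use vector embeddings.
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (
        f"Context:\n{context}\n\n"
        "Answer the question using only the context above; "
        "if the context is insufficient, say you do not know.\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    print(build_rag_prompt("What period does the 2020 census mortality data cover?"))
    # In practice, send the prompt to your chat model and return its answer.
```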