How does the search company behind 60% of China's AI applications view the AI hallucination problem? | AI Hallucination Catcher

Explore the truth behind AI hallucinations and see how a domestic AI search provider fights information distortion.
Core content:
1. The AI hallucination phenomenon and its impact on the industry
2. The key role of search technology in AI applications
3. Bocha's strategy and practice in fighting AI hallucinations
The so-called "AI hallucination", in which AI spouts nonsense with complete confidence, has become an industry buzzword. It is usually attributed to inherent flaws in large models' generation mechanism and to limited training data. Yet beyond the technical explanations, one link in the chain is rarely discussed: search.
Web search is now a standard feature of almost every general-purpose AI chat product, responsible for supplying the AI with the "latest knowledge". If answering a question with web access is likened to cooking, the large model is the chef and the search engine is the ingredient supplier. How a dish turns out depends on the chef's skill, but the ingredients matter just as much.
In our earlier tests of AI hallucinations, many problems traced back to the "ingredients": distorted information, second-hand content from self-media accounts, and AI-generated text were cited again and again. Why is it so hard to cite accurate information? By what criteria does AI select information from the Internet? (See: AI fact-checked news 330 times: average accuracy 25%, nearly half the links dead. Is the AI "hallucination" problem getting worse?)
To better understand this link in the chain, we turned to a company headquartered in Hangzhou: Bocha. The startup serves more than 60% of China's AI applications, providing search services for leading AI products from DeepSeek, ByteDance, and Tencent. The team told us that in March this year, Bocha's search API averaged more than 30 million calls per day, about one-third of Microsoft Bing's volume.
We spoke with Bocha CEO Liu Xun and CTO Weng Rouying. As an information gateway for AI, Bocha offers another vantage point on the problem of AI hallucinations.
AI hallucinations can be reduced as much as possible, but are hard to eliminate entirely
21st Century Business Herald: When "AI + web search" first emerged, many expected it to solve AI fabrication at the source, yet the problem remains common today. As a provider of AI search capabilities, how do you view AI hallucinations? What causes them?
Weng Rouying: Essentially, this is a problem with the sources of information. Search on Baidu, Google, or Bing and you will find plenty of false information there.
Although traditional search engines and AI search engines differ in underlying architecture, the "content production - crawl - index" logic is the same, and the authenticity of content is not something AI search can fully control. In other words, when AI searches the web it runs into inaccurate information, just as we do with traditional search engines.
The problem can only be reduced as much as possible; it is hard to eliminate entirely. What we can do now is mainly filter information through technical means.
21st Century Business Herald: What technical means are effective at getting AI to return search results that are as accurate as possible?
Liu Xun: Accuracy and authority have to be judged from multiple angles. The common strategy now is "model + manual review".
First, at the large-model level, we run an adversarial model system. Before web content enters our index, a large model first assesses its credibility. For example, if someone posts on Xueqiu (an investor community we trust) that DeepSeek is a product released by Kai-Fu Lee, fabricating the entire story, the model can identify this and sharply reduce the content's weight.
But there is content the model cannot judge. For example, after Big S passed away, some people claimed that Wang Xiaofei had chartered a plane to fly the body back to Taiwan, and even his mother (Zhang Lan) liked the post on Douyin. Many users believed it, and the model could not reliably tell it was false. That is where manual intervention comes in: once we confirm that a piece of information has been publicly debunked by official media and is a rumor, we proactively remove it.
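The "model + manual" strategy can be pictured as a gate in front of the index. Below is a minimal sketch, assuming a hypothetical `llm_credibility` scorer and an illustrative threshold and weight penalty; Bocha has not disclosed its actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str
    weight: float = 1.0  # ranking weight once indexed

# URLs removed after manual fact-checks confirm an official debunking
BLOCKLIST: set[str] = set()

def llm_credibility(text: str) -> float:
    """Placeholder: a real system would prompt a large model to rate
    credibility in [0, 1], e.g. by checking claims against known facts."""
    return 0.5

def gate(page: Page, threshold: float = 0.3) -> Page | None:
    if page.url in BLOCKLIST:                    # manual intervention
        return None                              # never enters the index
    if llm_credibility(page.text) < threshold:   # model intervention
        page.weight *= 0.1                       # keep, but heavily down-weight
    return page
```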
Weng Rouying: After retrieving the candidate results, we re-rank them. We score each result against Google's E-E-A-T criteria (Experience, Expertise, Authoritativeness, Trustworthiness) and re-rank the search results by weighted score.
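A weighted E-E-A-T re-rank might look like the sketch below. The weights and per-dimension scores are illustrative assumptions, not Bocha's actual values.

```python
EEAT_WEIGHTS = {
    "experience": 0.20,
    "expertise": 0.25,
    "authoritativeness": 0.30,
    "trustworthiness": 0.25,
}

def eeat_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(w * scores.get(dim, 0.0) for dim, w in EEAT_WEIGHTS.items())

results = [
    {"url": "https://example.org/a",
     "eeat": {"experience": 0.6, "expertise": 0.9,
              "authoritativeness": 0.8, "trustworthiness": 0.9}},
    {"url": "https://example.org/b",
     "eeat": {"experience": 0.9, "expertise": 0.4,
              "authoritativeness": 0.3, "trustworthiness": 0.5}},
]
reranked = sorted(results, key=lambda r: eeat_score(r["eeat"]), reverse=True)
```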
21st Century Business Herald: What other selection criteria does AI search use? In earlier tests we found that some self-media content with little readership but comprehensive coverage ranks very low in traditional search engines, yet gets seen and cited by AI. Why might that be?
Liu Xun: Our most heavily weighted metric at the moment is "semantic relevance". The core principle is that the returned content must contain an answer to the user's question.
How should we understand this? When you search through DeepSeek, you ask in a complete sentence. In the past, feeding such a long sentence into a search engine returned nothing, because traditional search engines match keywords, whereas the architecture of AI search engines is "semantic search", which matches results against natural language.
When a large model handles a user's question, it may receive 30 to 50 web pages at once. We score the quality of each page on a scale of 1 to 10, divided into four bands: the higher the score, the more completely the page answers the user's question, perhaps even supplying extra information.
Of course, we are not the AI product itself; we provide a web-search API to AI products, which means we do not have the final say on the output. The AI product runs another round of screening on semantic relevance and finally picks a few items out of these dozens of candidate pages to summarize.
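One way to picture the 1-to-10, four-band scoring is to map query-page semantic similarity to bands. The cosine-similarity measure below is standard, but the band cutoffs and the idea of deriving the score purely from embedding similarity are assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between a query embedding and a page embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def relevance_band(similarity: float) -> int:
    """Map query-page similarity to a 1-10 score across four bands."""
    if similarity >= 0.85:   # fully answers, may add extra information
        return 10
    if similarity >= 0.70:   # answers the core of the question
        return 7
    if similarity >= 0.50:   # partially related
        return 4
    return 1                 # barely relevant
```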
21st Century Business Herald: So even when connected to the same search API, different AI products will vary in the accuracy of their final answers.
Liu Xun: In fact, AI products usually connect to more than one information source, and we are just one of them. Doubao, for example, draws not only on ByteDance sources such as Toutiao and Douyin but also on third-party sources like us. Technically this is called "multi-channel recall": fetching results from multiple content pools at the same time. After multi-channel recall, how to rank the results and what to display first is decided by the AI vendor itself.
Generally speaking, AI vendors give priority to content from their own ecosystems, both because they trust it more and because it is easier to monetize traffic and close the ecosystem loop on their own platforms.
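A sketch of multi-channel recall, assuming each channel exposes a simple search function; the channel names and the first-party boost are illustrative, not any vendor's documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

SearchFn = Callable[[str], list[dict]]

def multi_channel_recall(query: str, channels: dict[str, SearchFn]) -> list[dict]:
    """Fetch from several content pools in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in channels.items()}
        merged = []
        for name, fut in futures.items():
            for result in fut.result():
                result["channel"] = name
                merged.append(result)
    # Final ordering is the vendor's call; first-party content often gets a boost.
    def rank(r: dict) -> float:
        return r.get("score", 0.0) + (0.1 if r["channel"] == "first_party" else 0.0)
    return sorted(merged, key=rank, reverse=True)
```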
GEO is on the rise, and low-quality content is pouring in
21st Century Business Herald: Traditional search engines have long been criticized for problems such as excessive advertising and high-quality content locked inside "walled gardens", closed to the open web. Do these old problems affect AI search? How do you deal with them?
Weng Rouying: The situation is actually not that bad. First, advertising is not a problem with the content itself; search engine companies choose to insert ads into the user interface, which creates the problem you describe. Our positioning is "a search engine for AI", and we have not introduced a paid-ranking mechanism into the business.
Second, poor information quality and the withholding of high-quality content remain technical issues. Traditional search engines are keyword-based, and under that architecture low-quality content can be pushed up the rankings by various means, including simply paying for placement.
21st Century Business Herald: Speaking of paid ranking, SEO (search engine optimization) has grown into a huge industry, and since AI took off a new service called GEO (generative engine optimization) has emerged, which makes a web page's content easier for AI to cite. Have you noticed this phenomenon?
Weng Rouying: I can summarize it in one sentence: know exactly what questions users will ask, then write answers to those questions. That can greatly improve your content's ranking.
Of course, whether for GEO or traditional SEO, high-quality content is the foundation. On that basis, content with a clear structure and explicit answers is easier for AI to retrieve and cite.
Some companies that used to do SEO are already pivoting to GEO, but we are not planning to go down that path for now. We have found that what large models really need are the most authoritative, accurate content sources. If low-quality content is allowed to sneak in through GEO techniques, it will only aggravate the hallucination problem, so we do not encourage an influx of low-quality content.
On the contrary, we hope to establish a brand-new content cooperation mechanism. In the past, people paid for search rankings; in the future we hope to do the opposite: no buying of rankings, but active rewards for good content. If you can provide us with high-quality, well-structured, credible content, we can offer revenue sharing or other forms of cooperative incentives. This is a new model we are exploring.
Liu Xun: Providing high-quality content will remain our principle. But the domestic AI ecosystem is still evolving rapidly, and the final form of AI applications, especially consumer-facing ones, will stay uncertain over the next two to three years. We hope to establish a mature, clear content cooperation mechanism once the shape of the industry becomes clearer.
21st Century Business Herald: The sources behind many AI answers today are actually content generated by another AI. The self-reinforcing loop of "AI citing AI" is becoming more and more common. Are there feasible countermeasures at present?
Weng Rouying: We have been pushing forward on information filtering. The first step is to clean out unlawful content such as pornography, gambling, and drugs; the second step, which is where most of our current investment goes, is to identify and intercept AI-generated content, especially "poisoned" AI-generated content.
This type of content has two telltale characteristics. First, its structure, wording, and semantic style differ from human writing, so we can train specialized large models to recognize it, much like AI plagiarism detection for academic papers. Second, it often contains false details: there may be ten genuine reports of the same event online, while the AI-written one includes fabrications. Cross-comparison lets us eliminate such content.
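The cross-comparison idea can be sketched as follows: a claim in a candidate page is suspect if no independent report on the same event supports it. Both helper functions here are naive stubs standing in for what would be LLM calls (claim extraction and entailment checking) in a real system.

```python
def extract_claims(text: str) -> list[str]:
    # Stub: split on sentences. A real system would extract atomic claims.
    return [s.strip() for s in text.split(".") if s.strip()]

def supported(claim: str, reference: str) -> bool:
    # Stub: substring match. A real system would use textual entailment.
    return claim.lower() in reference.lower()

def fabricated_claims(candidate: str, references: list[str]) -> list[str]:
    """Claims in `candidate` that none of the independent reports back up."""
    return [c for c in extract_claims(candidate)
            if not any(supported(c, ref) for ref in references)]
```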
21st Century Business Herald: We also noticed a latency problem. A media outlet previously reported that the country had purchased 345 million tons of autumn grain. At the time, AI could not find the source of the "345 million tons of autumn grain purchased" figure; only the next day, once more related reports had appeared, did AI pick the information up. Why did this happen?
Liu Xun: As in traditional search engine architecture, when we crawl a web page the data must pass through a series of processing steps, including extraction of the original content, compliance screening (for pornography and the like), content cleaning, and structuring, before it can enter the index. This pipeline takes time. At present, the fastest end-to-end processing we can achieve is about half an hour, which is a technical limitation.
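The pipeline Liu Xun describes is a linear sequence of stages that a page must clear before it becomes searchable; the half-hour floor is the cost of running all of them at scale. A sketch, with trivially simplified stage bodies standing in for the real implementations:

```python
def extract(raw_html: str) -> str:
    return raw_html            # stub: real extraction strips markup and boilerplate

def compliant(text: str) -> bool:
    return True                # stub: porn/gambling/drug screening

def clean(text: str) -> str:
    return text.strip()        # stub: dedupe, strip noise

def structure(text: str) -> dict:
    return {"body": text}      # stub: title, dates, entities, etc.

def ingest(raw_html: str, index: list[dict]) -> None:
    text = extract(raw_html)
    if not compliant(text):
        return                 # rejected before it can ever be retrieved
    index.append(structure(clean(text)))   # only now is the page searchable
```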
In the future, AI search calls may be 5 to 10 times those of humans
21st Century Business Herald: Many of your partners are domestic Internet companies with mature technical teams and deep Internet experience. What do they mainly need when they seek cooperation with Bocha?
Weng Rouying: The core requirement is search quality.
In fact, whether it is a large Internet company or a small or medium-sized one, anyone doing AI search faces a new technical architecture. The industry first applied "semantic search" to AI scenarios when Microsoft began providing search services for ChatGPT through Bing. It was not until May 2023, when ChatGPT connected to Bing for web access, that this architecture drew widespread attention. But overturning decades of accumulated technology and rebuilding the architecture is difficult and costly, so overall progress has been relatively slow.
On the other hand, some customers have no search engine technology of their own: they can do site search but not whole-web search, so they want us to supply that capability from zero to one.
In the past, these customers usually used Microsoft Bing's search API. But Bing has two problems: data flows overseas, which poses security and compliance risks, and it is expensive. Bocha benchmarks itself against Bing, so they chose us as a more secure, more controllable domestic alternative.
21st Century Business Herald: Can you talk about the technology and cost of providing AI search services? Where are the high barriers to entry?
Liu Xun: Take the first step of building a search engine: constructing the "index library", which you can think of simply as the underlying database of content. Google's index runs to roughly a trillion entries, and Bing's is slightly smaller. Even a newcomer in China needs an index of at least tens of billions of entries.
What does that scale mean? We currently support real-time retrieval over tens of billions of entries with millisecond-level response, which requires a very large infrastructure system. For servers alone, we run between 10,000 and 20,000 machines, and keeping the system going is very expensive: the "starting price" is at least tens of millions of yuan per month.
More importantly, our technical architecture is designed entirely around content relevance, with no advertising interference, which is the most basic requirement for AI search. If traditional search engine companies want to pivot to AI search APIs, it means abandoning the original keyword search architecture and rebuilding a vector indexing system. And offering an ad-free API would also hit their existing business model and revenue structure.
21st Century Business Herald: How long does Bocha expect its path to profitability to take? What are your plans for technology optimization and business layout going forward?
Liu Xun: We are not in a hurry to turn a profit. We care more about helping the whole AI ecosystem develop; once the domestic AI application ecosystem matures, we will monetize.
Currently, humans worldwide initiate a total of 10 to 20 billion searches a day (across Google, Bing, WeChat, and other platforms). We believe future demand for AI search will far exceed that level.
For example, when you put a question to a model like DeepSeek, the large model breaks it into multiple sub-questions and issues the searches simultaneously. AI agents like Manus, in particular, often have to call the search interface repeatedly to complete a complex task. We estimate that AI search calls will eventually run 5 to 10 times those of humans, or even higher.
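Why agentic use multiplies call volume can be seen in a small sketch: one user question fans out into several sub-question searches, each a separate API call. The decomposition stub stands in for an LLM call, and the search function is a placeholder, not Bocha's actual API.

```python
import asyncio

def decompose(question: str) -> list[str]:
    # Stub: a real system asks the large model to split the question
    # into, say, 3-5 sub-questions.
    return [question]

async def search(sub_question: str) -> list[dict]:
    await asyncio.sleep(0)   # stand-in for an HTTP call to a search API
    return [{"query": sub_question, "results": []}]

async def answer(question: str) -> list[dict]:
    sub_questions = decompose(question)
    # One user question -> len(sub_questions) search calls, issued in parallel.
    batches = await asyncio.gather(*(search(q) for q in sub_questions))
    return [r for batch in batches for r in batch]
```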
In other words, search will become an indispensable basic module for AI applications, just like maps and payments, and it is the AI applications on top that will pay for it. So we will wait for the domestic AI application ecosystem to take off.
We have always benchmarked ourselves against Google and Bing, and next year we hope to reach at least half of Google's index: 500 billion entries. The next key challenge is infrastructure. Our resources are deployed on the major cloud vendors, and the current costs and constraints still come down to the familiar "troika": algorithms, computing power, and data. So we need the whole infrastructure layer to develop further to support the next stage of expansion and breakthroughs.