What happens if you use the big model directly as the search results page (SERP)?

Written by
Clara Bennett
Updated on: July 15th, 2025
Recommendation

Explore how big models could upend traditional search engines and what future search technology might look like.

Core content:
1. Using big models directly as the search engine results page (SERP): an innovative idea
2. How user preferences are shifting from traditional search to chat-style search interfaces
3. The advantages of big models in terms of massive knowledge reserves and data coverage


Going the other way: the model is the search engine

Since the rise of Retrieval-Augmented Generation (RAG), using Large Language Models (LLMs) to improve search has become an industry consensus. From Perplexity to DeepSearch and DeepResearch, feeding search engine results into the LLM's generation process is now common practice across the industry.

Many users say they rarely use Google anymore, finding traditional page-by-page search tedious and outdated. They prefer a chat-style search interface that directly delivers high-precision, high-recall results, which seems to be where search is heading.

So, what happens if we use the big model directly as the search results page (SERP)?

You can explore the knowledge stored in the LLM just as you would use Google, with pagination, clickable links, and all the familiar elements of traditional search, except everything is generated by the AI. If that sounds abstract, check out the demonstration below.

The titles, links, and summaries are all generated by the LLM. You can try it yourself at jina.ai/llm-serp-demo.

Is this idea plausible?

Before the experts raise their hands to object about hallucination, let us first explain to newcomers why this idea is not so outrageous.

First, there is the massive knowledge reserve. LLMs are trained on enormous web corpora and have "memorized" a large amount of Internet information. Models such as DeepSeek-R1, GPT-4, Claude-3.7, and Gemini-2.0 have all been trained on trillions of tokens from the public Internet.

Second, the data coverage is considerable. By a rough estimate, a leading LLM has learned roughly 1% to 5% of high-quality, publicly accessible web text. That may not sound like much, but if Google's index is taken as the benchmark (100% of the world's user-accessible data), an LLM covers about as much data as a small search engine.

| Search engine | Data coverage (Google = 100%) |
| --- | --- |
| Google | 100% |
| Bing | 30-50% |
| Baidu | 5-10% |
| Yandex | 3-5% |
| Brave Search | <1% |
| LLM (estimated) | 1-5% (high-quality data) |

Finally, there is the memory activation mechanism: through careful prompt engineering, we can activate the memory of LLMs, getting them to behave like search engines and generate results that resemble search engine results pages (SERPs).
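To make the memory-activation idea concrete, here is a minimal sketch of what such a prompt could look like, assuming an OpenAI-compatible chat API. The model name, prompt wording, and output schema are illustrative assumptions, not the actual prompt behind jina.ai/llm-serp-demo.

```python
# Minimal LLM-as-SERP sketch: ask the model to behave like a search engine and
# emit SERP-style results (title, url, snippet) from its training-time memory.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SERP_PROMPT = """You are a search engine results page. For the query below,
return the 10 most relevant results you remember from your training data.
Respond in JSON as {{"results": [{{"title": ..., "url": ..., "snippet": ...}}]}}.
Return only JSON, no commentary.

Query: {query}
Page: {page}"""

def llm_serp(query: str, page: int = 1) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",                        # illustrative model choice
        temperature=0,                         # keep the fake "index" stable
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": SERP_PROMPT.format(query=query, page=page)}],
    )
    return json.loads(response.choices[0].message.content)["results"]

if __name__ == "__main__":
    for hit in llm_serp("retrieval-augmented generation"):
        print(hit["title"], "-", hit["url"])
```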

Admittedly, an LLM also faces challenges as a SERP, the most prominent of which is hallucination: the model may generate false or inaccurate information. But there is reason to believe that as model architectures evolve and training data grows richer, hallucination will gradually be alleviated. Just as people on X/Twitter keep asking each newly released LLM to generate SVG images and keep seeing the output quality improve, we hold the same expectation for LLMs' ability to understand the digital world.

The knowledge cutoff is another important limitation of LLM-as-SERP. An ideal search engine should provide near real-time information, but since LLM weights are frozen after training, the model cannot provide accurate information beyond its cutoff date.

Generally speaking, the closer a query is to this cutoff, the more likely the model is to hallucinate: older information has typically been cited and revised many times and therefore takes up a larger share of the training data. (This assumes information is weighted roughly evenly; in reality, breaking news and headlines, even those close to the cutoff, may receive high weight in the training corpus due to repeated exposure.)

On the other hand, this limitation suggests where to apply the idea: answering questions that fall within the model's knowledge. It is essentially a bet on whether the user's question (or an intermediate question in DeepSearch) falls before the model's knowledge cutoff. If it does, and the model is strong (recall the SVG-generation example above), the answer is more likely to be correct; if it falls after the cutoff, the answer is more likely to be wrong.

What is LLM-as-SERP actually useful for?

In DeepSearch/RAG, or any system that relies on external information, an unavoidable question is: how do we decide whether the current question can be answered from the LLM's own knowledge or must rely on external information? The common practice today is a prompt-based routing strategy, with instructions like these (a sketch follows the list):

- For greetings, small talk, or general knowledge questions, answer directly without retrieving external information.
- For other questions, provide answers verified against external knowledge, citing the exact quote and corresponding URL as references.
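In code, that routing step often amounts to one extra LLM call that classifies the question before anything else happens. This is a hedged sketch, not the exact prompt of any particular DeepSearch implementation; the model name and wording are assumptions.

```python
# Sketch of prompt-based routing: a small classifier call decides whether to
# answer from the model's own knowledge or trigger external retrieval.
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = """Decide how to handle the user question.
- Reply DIRECT for greetings, small talk, or general knowledge questions.
- Reply SEARCH for anything that must be verified against external sources
  (the final answer will need an exact quote and a source URL).
Reply with exactly one word: DIRECT or SEARCH.

Question: {question}"""

def needs_search(question: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",                  # illustrative model choice
        temperature=0,
        messages=[{"role": "user",
                   "content": ROUTER_PROMPT.format(question=question)}],
    )
    return reply.choices[0].message.content.strip().upper() == "SEARCH"

print(needs_search("hello!"))                                  # likely False
print(needs_search("What did the EU AI Act change in 2025?"))  # likely True
```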

However, this approach is error-prone: sometimes it triggers searches unnecessarily, and sometimes it misses critical information needs. In particular, some newer reasoning models only decide halfway through generation whether external data is needed.

So what if we simply search indiscriminately from the start?

We can call a real search engine API and the LLM-as-SERP at the same time. This avoids making the routing decision upfront and postpones it downstream. In later stages we can directly compare the two sets of results: one is fresh data from real search, the other is knowledge bounded by the LLM's training cutoff, possibly mixed with biased information.
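A sketch of what this could look like, reusing the llm_serp() helper from the earlier sketch and assuming Jina's s.jina.ai search endpoint; the exact response fields and headers are assumptions.

```python
# Query a real search engine and the LLM-as-SERP in parallel; the routing
# decision is deferred, and both result sets are handed to the next stage.
import os
import requests
from concurrent.futures import ThreadPoolExecutor

def real_serp(query: str) -> list[dict]:
    resp = requests.get(
        f"https://s.jina.ai/{query}",
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {os.environ['JINA_API_KEY']}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]                # assumed response shape

def gather_results(query: str) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        real = pool.submit(real_serp, query)
        fake = pool.submit(llm_serp, query)   # from the earlier sketch
        return {"real_serp": real.result(), "llm_serp": fake.result()}
```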

The results from LLM-SERP will mix real knowledge with fabricated information (i.e., hallucinations). Since the real search engine contributes fact-based content while the LLM contributes content that is partly real and partly fictional, the overall proportion of real content is higher, so the subsequent reasoning steps can naturally focus on the real knowledge.

Next, the LLM can surface contradictions and consensus between the two result sets through simple summarization, and distill the real information according to the freshness and reliability of each source. Summarization is exactly what LLMs are good at, so we do not need to hard-code this complex logic into the prompt.
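A sketch of that reconciliation step, reusing the client and gather_results() from the earlier sketches; the prompt wording is illustrative.

```python
# Let the LLM compare the two result sets, flag contradictions, and keep what
# both sides agree on, preferring fresher and more authoritative sources.
import json

RECONCILE_PROMPT = """Below are two sets of search results for the same query.
Set A comes from a live search engine; Set B was generated from an LLM's
training-time memory and may contain fabricated entries.

1. List the claims the two sets agree on.
2. List contradictions, and for each, say which side is more likely correct,
   preferring fresher and more authoritative sources.
3. Summarize the verified information only.

Set A: {real_results}
Set B: {llm_results}"""

def reconcile(query: str) -> str:
    results = gather_results(query)           # from the earlier sketch
    prompt = RECONCILE_PROMPT.format(
        real_results=json.dumps(results["real_serp"][:5]),
        llm_results=json.dumps(results["llm_serp"][:5]),
    )
    reply = client.chat.completions.create(   # client from the first sketch
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```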

Of course, we can also use Jina Reader to visit the URL of each search result and verify the information against the actual web page. This verification step should never be skipped: never rely solely on the summary a search engine provides, whether that search engine is real or fake.
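Jina Reader turns any URL into LLM-friendly text by prefixing it with https://r.jina.ai/, which makes this verification step easy to automate. A minimal sketch; the example URL is arbitrary.

```python
# Fetch the live page behind a search result so a later step can check whether
# the snippet is actually supported by the page content.
import requests

def fetch_page(url: str) -> str:
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=30)
    resp.raise_for_status()
    return resp.text

print(fetch_page("https://jina.ai/news")[:500])  # inspect the first 500 chars
```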

Conclusion

By using the LLM as a SERP, we transform the binary question of "is this within the model's knowledge?" into a more reliable evidence-weighing process, and we avoid upfront prompt-based routing in the DeepSearch system. Of course, this application rests on the following chain of assumptions:

  1. A trained large model can be expected to be roughly equivalent to a small, high-quality search engine.

  2. As big models continue to develop, the hallucination problem will be alleviated. Knowledge cutoff and hallucination are relatively independent: an ideal big model should have zero hallucination when answering questions that fall before its knowledge cutoff, and hallucinate only on questions that fall after it.

  3. Information on the Internet is a mix of true and false, and so is the information inside large models.

  4. When doing search grounding, we do not need to worry too much about truth versus falsehood; we only need to maximize recall. Summarization, i.e., improving precision, is left to the big model.