Overturning traditional RAG: Tencent uses generative retrieval to open up a new multimodal frontier

Written by
Jasper Cole
Updated on: June 25th, 2025
Recommendation

Tencent's latest research uses generative retrieval to subvert traditional RAG and usher in a new era of multimodality.

Core content:
1. Limitations and challenges of traditional retrieval in multimodal applications
2. GeMKR: A generative retrieval method based on large language models
3. Experimental results: Performance on multimodal knowledge retrieval datasets

Yang Fangxian
Founder of 53A / Most Valuable Expert of Tencent Cloud (TVP)

This paper is interesting right from the start. It tackles a genuinely thorny problem: in multimodal applications such as VQA (visual question answering) and multimodal dialogue, the information in the images and text alone is often not enough, and an external knowledge base has to be consulted. The traditional approach relies on a collection of retrievers: one for searching text, one for searching images, sometimes even a separate entity retriever, which makes the pipeline long and cumbersome. On top of that, each retriever has to be trained separately, which consumes a lot of data and is costly.

The authors then ask a very practical question: can we build a single, universal multimodal retriever and simplify the whole pipeline, without all the messy separate modules?

The answer is the GeMKR method they propose. The idea is direct and clever: instead of computing similarities and comparing embeddings, let the large language model (LLM) directly generate retrieval clues, and then use those clues to look up the corresponding documents in the knowledge base. You can think of it as the model "thinking up" a key phrase on its own, one that uniquely points to a particular document.

There is also a small but important detail. The authors stress that only the generation step uses the neural network; the subsequent database query is a deterministic operation (backed by an extremely fast data structure, the FM-Index), so the overall pipeline is very efficient.
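
To make the two-stage idea concrete, here is a minimal sketch in Python. All names are illustrative rather than taken from the paper's released code, and a plain substring scan stands in for the FM-Index that GeMKR actually uses for the deterministic lookup.

```python
from typing import Callable, List

def generate_clue(generate_fn: Callable[[str], str], query: str) -> str:
    """Stage 1 (neural): ask the LLM to write a short key phrase ("clue")
    that should appear verbatim inside the target document."""
    prompt = f"Question: {query}\nRelevant knowledge clue:"
    return generate_fn(prompt)

def lookup_documents(clue: str, corpus: List[str]) -> List[str]:
    """Stage 2 (deterministic): exact substring lookup over the knowledge base.
    GeMKR uses an FM-Index here; a linear scan stands in for it in this toy."""
    clue = clue.lower().strip()
    return [doc for doc in corpus if clue in doc.lower()]

# Toy usage with a stubbed generator in place of a real LLM:
corpus = [
    "Motocross is a form of off-road motorcycle racing held on enclosed circuits.",
    "Palm trees are among the most exotic and recognizable foliage in the tropics.",
]
fake_llm = lambda prompt: "motocross is a form of off-road motorcycle racing"
print(lookup_documents(generate_clue(fake_llm, "What sport can you use this for?"), corpus))
```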

To let the model actually use visual information, they also tailor the visual features rather than simply throwing whole pictures at the model. Specifically, they use a trick called Object-aware Prefix Tuning. In short, features of detected objects (the motorcycle, coconut tree, or teddy bear in a picture, for example) are folded into a "prefix" of the visual encoding, so that a visual encoder such as CLIP can perceive fine-grained information without changing the original large-model parameters. The number of trainable parameters is small and training is fast.
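
A rough sketch of how such an object-aware prefix could be wired up, assuming object features come from an off-the-shelf detector and patch tokens from a frozen CLIP encoder; the dimensions and projection layers here are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ObjectAwarePrefix(nn.Module):
    """Project detected-object features into a few "prefix" tokens and prepend
    them to the frozen visual encoder's patch tokens (illustrative sketch)."""

    def __init__(self, obj_dim: int = 256, vis_dim: int = 768, n_prefix: int = 8):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(obj_dim, vis_dim),
            nn.Tanh(),
            nn.Linear(vis_dim, vis_dim),
        )
        self.n_prefix = n_prefix

    def forward(self, obj_feats: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # obj_feats:    (batch, n_objects, obj_dim)  from an object detector
        # patch_tokens: (batch, n_patches, vis_dim)  from a frozen CLIP encoder
        prefix = self.proj(obj_feats)[:, : self.n_prefix]   # (batch, <=n_prefix, vis_dim)
        return torch.cat([prefix, patch_tokens], dim=1)     # object prefix first, then patches
```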

Next, the image features and text features are fed together into the LLaMA large language model for joint processing. This differs from the traditional approach, which usually handles vision on one side and text on the other and only fuses them at the end; GeMKR puts both images and text into the same Transformer for deep interaction.
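
In code, this fusion can be as simple as mapping the visual tokens into the LLM's embedding space and concatenating them in front of the text embeddings, then feeding the result to the model via `inputs_embeds`. This is a sketch under assumptions (a learned linear projection, visual tokens placed first), not the paper's exact interface.

```python
import torch
import torch.nn as nn

def build_fused_inputs(visual_tokens: torch.Tensor,
                       text_ids: torch.Tensor,
                       llm,                      # a Hugging Face causal LM, e.g. LLaMA
                       vis_to_llm: nn.Linear) -> torch.Tensor:
    """Concatenate projected visual tokens and text token embeddings so a single
    Transformer attends over both modalities jointly (illustrative sketch)."""
    text_embeds = llm.get_input_embeddings()(text_ids)   # (batch, T, d_model)
    vis_embeds = vis_to_llm(visual_tokens)                # (batch, V, d_model)
    return torch.cat([vis_embeds, text_embeds], dim=1)    # later: llm(inputs_embeds=...)
```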

There is also a very clever design here: knowledge-guided constrained decoding. How does it work? When generating clue words, at each step the model may only pick continuation tokens that actually follow the current prefix somewhere in the database. A simple example: if the model has generated "palm", the next legal continuations might be "tree", "oil", or "leaf", but not "motorcycle", because the latter never appears after "palm" in any document. This greatly reduces the chance of generating invalid clues and keeps the model from wandering off.
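
One way to implement this kind of constraint with the Hugging Face generation API is `prefix_allowed_tokens_fn`, which restricts the candidate tokens at every step. The index below is a toy in-memory substring table built for illustration; the paper relies on an FM-Index to perform the same check at scale.

```python
from collections import defaultdict

def build_substring_index(tokenized_docs, max_clue_len: int = 20):
    """For every token substring of the corpus (up to max_clue_len tokens),
    record which tokens can legally follow it. Fine for a toy corpus, far too
    memory-hungry for 20M documents, where an FM-Index would be used instead."""
    allowed = defaultdict(set)
    for ids in tokenized_docs:
        for start in range(len(ids)):
            for end in range(start, min(start + max_clue_len, len(ids))):
                allowed[tuple(ids[start:end])].add(ids[end])
    return allowed

def make_prefix_allowed_tokens_fn(allowed, prompt_len: int, eos_id: int):
    """Only the generated clue (tokens after the prompt) is constrained."""
    def fn(batch_id, input_ids):
        clue_so_far = tuple(input_ids[prompt_len:].tolist())
        next_tokens = allowed.get(clue_so_far, set())
        return list(next_tokens) if next_tokens else [eos_id]
    return fn

# Roughly how it would plug into generation:
# model.generate(**inputs,
#                prefix_allowed_tokens_fn=make_prefix_allowed_tokens_fn(
#                    index, inputs["input_ids"].shape[1], tokenizer.eos_token_id))
```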

They also add a small threshold, requiring a minimum number of generated tokens so the clue is distinctive enough. Once the clue uniquely points to a document in the knowledge base, generation stops. With the clue in hand, a direct index lookup returns the document, quick and easy.
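
The stopping rule can be expressed as a simple check. The version below is a guess at the spirit of the method rather than its exact thresholds: the clue must reach a minimum length and must narrow the knowledge base down to a single document before decoding halts.

```python
def should_stop(clue: str, corpus, min_tokens: int = 4) -> bool:
    """Stop generating once the clue is long enough to be distinctive and
    matches exactly one document in the knowledge base (illustrative check)."""
    if len(clue.split()) < min_tokens:
        return False
    matches = sum(1 for doc in corpus if clue.lower() in doc.lower())
    return matches == 1
```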

How does it perform? Overwhelmingly well. The authors ran experiments on three multimodal knowledge retrieval datasets, with knowledge bases ranging from hundreds of thousands to more than 20 million entries. On the largest, OKVQA-WK21M, GeMKR improved P@5 (the proportion of correct results among the top 5) by 14.6% and R@5 (how many of the correct documents are recalled within the top 5) by 8.9%, which is quite impressive. Notably, other models tend to fall apart as the knowledge base grows, but GeMKR holds up.

What's even more impressive is that they used only 20,000 training examples (instruction data) and tuned only 14 million parameters (out of 7.3 billion in total) to reach this level. Training is also very short: about 3 hours on an A6000 GPU with 48 GB of memory, with no need for massive compute.

To figure out which design choices matter most, the authors also ran a set of ablation experiments. The conclusions are clear:

  • Without the object-aware prefix, performance drops by about 2%.
  • Without dual-flow attention (a mechanism that processes the prefix and the hidden states separately), performance drops by nearly 2%.
  • Without LoRA fine-tuning of the LLM (i.e., keeping the LLaMA parameters frozen), the drop is even larger.
  • Removing both the object features and LoRA fine-tuning costs more than 10%.

Clearly, fine-grained handling on the visual side and lightweight optimization (LoRA) on the text side are both core.
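
The text-side "lightweight optimization" can be reproduced in spirit with the PEFT library: wrap a frozen LLaMA checkpoint in LoRA adapters so only a few million parameters are trainable. The rank, target modules, and checkpoint name below are assumptions, not the paper's exact settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed checkpoint; the paper uses LLaMA-7B, but the exact weights are not specified here.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank update dimension (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections only (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # trainable params are a tiny fraction of 7B
```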

There is also an interesting phenomenon: if the image input is removed and the model relies only on the text query plus an image caption (a textual description such as "a man surfing"), performance is still better than most traditional baselines, but clearly worse than using the real visual patches as input. This shows that a simple caption cannot fully replace the information in the image itself.

The authors also ran a small scale-up experiment to see whether switching to a larger LLM, such as LLaMA-13B, would bring a bigger improvement. The result: it does improve slightly, but the marginal benefit is limited. For this kind of retrieval task, LLaMA-7B is sufficient, and blindly moving to a larger model mainly adds computational overhead.

Another particularly intuitive observation: when searching with only text or only images, the generated keywords look quite different. Text queries tend to produce conceptual words such as "racing" and "American", while images tend to produce words describing visual objects such as "snowboarding", "statue", and "toy". In other words, different modalities really do focus on different things, and a single modality alone cannot cover everything, so GeMKR's fusion-based processing is both reasonable and necessary.

Finally, the author did several sets of case studies (actual examples) which were also quite interesting.

For example, given a question like "What sport can you use this for?" and a photo of a motorcycle, GeMKR generated a clue like "motocross is a form of off-road motorcycle racing," and then found a document related to motocross.
Another example: given the question "What plant is this?" and a photo of a coconut tree, it generated the clue "palm trees are among the most exotic and recognizable foliage," a perfect match.

As these examples show, the model does not just match at a coarse level; it also captures details, achieving true "multimodal fine-grained alignment".

To sum up, the highlight of the GeMKR paper is not a fancy model but the radical simplification of the pipeline: small changes with big benefits, highly efficient training, and very robust retrieval results. The core ideas are:

  1. The large model generates knowledge clues by itself, removing the need for a stack of separate retrieval modules.
  2. Visual information is handled carefully, balancing efficiency and granularity.
  3. Constrained generation guarantees a one-to-one correspondence between clues and documents.
  4. The whole pipeline is built around integration, with retrieval playing only an auxiliary role.

This is a classic shift in thinking, from "identify and compare" to "generate and look up", and it is well worth learning from.