RAG without Rerank is like driving without a steering wheel

Why Rerank matters in RAG, and why it is key to improving the performance of question-answering systems.
Core content:
1. The limitations and problems of vector search in RAG
2. The advantages of Rerank as a cross-encoder, and how it works
3. How Rerank balances recall and precision to improve RAG performance
RAG without Rerank is like driving a car without a steering wheel.
I suspect many people build a knowledge base, run RAG on top of it, and get back nothing but hallucinations and a mess.
Worse, the more data there is, the more the model hallucinates and the more sharply accuracy drops.
The reason is simple.
Traditional RAG is a "retrieval + generation" process: extract relevant content from massive documents, feed it to the LLM, and let it spit out the answer.
The first step is vector search: convert the text into embeddings and compare them with cosine similarity to pull back candidates.
In the second step, the LLM takes over and generates the answer.
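As a rough sketch, step one with a bi-encoder might look like the following, using the sentence-transformers library; the model name and the documents here are just placeholders.

```python
# Minimal sketch of bi-encoder retrieval (step one of vanilla RAG).
# Assumes the sentence-transformers package; model name and documents are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model would do

documents = [
    "Introduction to Quantum Mechanics ...",
    "Quantum Computing Algorithms ...",
    # ... the rest of the knowledge base
]

# Pre-compute document embeddings once (this is where detail gets compressed away).
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def vector_search(query: str, top_k: int = 5):
    query_embedding = model.encode(query, convert_to_tensor=True)
    # Cosine similarity between the query and every document embedding.
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.topk(k=min(top_k, len(documents)))
    return [(documents[i], float(s)) for s, i in zip(top.values, top.indices)]

for doc, score in vector_search("What is the core principle of quantum computing?"):
    print(f"{score:.3f}  {doc[:60]}")
```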
But here comes the problem: vector search is fast, but not very reliable. When text is squeezed into a 768- or 1024-dimensional embedding, the information is compressed so much that a lot of detail is lost.
For example: if you ask "What is the core principle of quantum computing?", vector search may rank the article "Introduction to Quantum Mechanics" at the top, while the genuinely hardcore "Quantum Computing Algorithms" gets squeezed out of the top_k, say down to position 15.
Handed a pile of half-baked information, it would be strange if the LLM could answer the question well!
So can we just increase top_k and pull in more documents? Nice idea, but the LLM's context window is not infinite.
Take Claude as an example: 100K tokens sounds fine, but once the window fills up, the LLM's recall plummets, the information in the middle is basically forgotten, and it stops following instructions. Feed it too much and you end up with an assistant suffering from amnesia.
So here comes Rerank.
Rerank is a cross-encoder. In simple terms:
It compares the query and each document one-to-one, computes a relevance score, and re-ranks the results. Unlike the bi-encoder used in vector search, Rerank does not rely on pre-computed embeddings; it analyzes each query-document pair at query time, so very little information is lost.
For example, vector search retrieves 50 documents for you, and Rerank analyzes and selects the 5 most relevant ones.
The "Quantum Computing Algorithms" article that was buried deep down the list can jump straight to first place, the LLM gets exactly the right context, and the quality of the answer shoots up.
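A minimal sketch of that reranking step, again using sentence-transformers (the cross-encoder checkpoint named here is just one common public example, and the candidate list is a placeholder):

```python
# Minimal sketch of reranking with a cross-encoder.
# Assumes sentence-transformers; the checkpoint is one common public example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5):
    # Each (query, document) pair is scored jointly -- no pre-computed embeddings.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

# In practice `candidates` would be the ~50 documents returned by vector search.
best = rerank("What is the core principle of quantum computing?", candidates=[
    "Introduction to Quantum Mechanics ...",
    "Quantum Computing Algorithms ...",
])
for doc, score in best:
    print(f"{score:.3f}  {doc[:60]}")
```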
Why is Rerank so strong?
The problem with the bi-encoder is that it compresses each document into a single vector ahead of time and only matches against that compressed representation when a query arrives; the query-specific context is essentially guessed, a bit like blind men touching an elephant. Rerank is different: it reads each document together with the specific query in real time, digging into what the document actually means for that question, so its accuracy is far better than the bi-encoder's.
Two hard metrics come into play here: recall and precision.
Recall measures how many of the relevant documents you actually retrieved; precision measures how many of the retrieved documents are genuinely useful.
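A toy calculation makes the two metrics concrete; the document IDs below are made up purely for illustration.

```python
# Toy recall/precision calculation; the IDs are made up for illustration.
relevant = {"doc_2", "doc_7", "doc_9"}                      # docs that actually answer the question
retrieved = ["doc_1", "doc_2", "doc_3", "doc_7", "doc_8"]   # what top_k=5 returned

hits = [d for d in retrieved if d in relevant]

recall = len(hits) / len(relevant)      # 2/3 -> one relevant document was missed
precision = len(hits) / len(retrieved)  # 2/5 -> most of the context is noise

print(f"recall={recall:.2f}, precision={precision:.2f}")
```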
With vector search you can push top_k up to 50 or 100 to maximize recall, but the LLM cannot cope: once the context window fills up, performance collapses.
Cut top_k down to 5 instead, and recall suffers; key information gets missed.
The great thing about Rerank is that it reconciles the two.
Step one: vector search casts a wide net (for example, top_k=50) to protect recall.
Step two: Rerank carefully selects a handful (for example, the top 5) to protect precision.
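Put together, the two-stage flow might look like this sketch, reusing the hypothetical vector_search() and rerank() helpers from the earlier snippets.

```python
# Two-stage retrieval: wide vector search for recall, cross-encoder rerank for precision.
# Reuses the hypothetical vector_search() and rerank() helpers sketched above.

def retrieve_for_llm(query: str, recall_k: int = 50, final_n: int = 5) -> list[str]:
    # Stage 1: cast a wide net so relevant documents are unlikely to be missed.
    candidates = [doc for doc, _ in vector_search(query, top_k=recall_k)]
    # Stage 2: let the cross-encoder pick the handful worth putting in the prompt.
    return [doc for doc, _ in rerank(query, candidates, top_n=final_n)]

context = "\n\n".join(retrieve_for_llm("What is the core principle of quantum computing?"))
# `context` is what goes into the LLM prompt alongside the question.
```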
The result: nothing important is missed in the retrieval stage, and nothing overloads the LLM stage. The data backs this up: according to Pinecone's analysis, two-stage retrieval with rerank can improve RAG search quality by 20-30%, especially on question-answering tasks.
Of course, Rerank is not without its flaws. The biggest problem is that it is slow!
What vector search does in 100 milliseconds may take Rerank several seconds or even minutes. With 40 million records, running a rerank pass with even a small BERT could take at least 50 hours on a V100 GPU. By contrast, a bi-encoder plus vector search is essentially instantaneous.
But it is slow for a reason.
Rerank's accuracy comes precisely from computing relevance in real time, which avoids the information loss of the bi-encoder. In enterprise scenarios in particular, such as legal search and medical Q&A, this slow, meticulous work is what delivers top-tier results.
Moreover, with hardware upgrades and model optimizations (such as DistilBERT), the latency of Rerank is gradually decreasing.
Without Rerank, the initial retrieval is unreliable, the quality of the information the LLM receives is uneven, and the answers are either off the mark or full of hallucinations.
Rerank acts like a quality inspector, making sure every piece of information that reaches the LLM is up to standard.
Can we optimize the embedding model to skip Rerank?
In theory, yes, for example by using a more powerful Sentence-BERT, but in practice embedding models tend to break down on large-scale datasets and complex queries.
An article on Medium noted that even the most advanced embedding models still produce unstable top_k rankings when the documents are diverse. Rerank is currently the quickest, most practical patch, and if you are chasing the best possible results there is essentially no alternative.
Using Rerank in Cherry Studio
1. Get an API key: https://cloud.siliconflow.cn/i/eGafyivT
2. Select the model type
3. Configure the interface settings
4. Use Rerank in the knowledge base