Is an Embedding model enough, or do we also need a Rerank model?

Written by Audrey Miles
Updated on: July 8, 2025

The Rerank model is another powerful tool in information retrieval: it improves the semantic matching and accuracy of search results.

Core content:
1. Definition of the Rerank model and its role in information retrieval
2. Working principle and core function of the Rerank model
3. The main differences between the Rerank model and the Embedding model, and a comparison of their application scenarios


What is the Rerank Model?

  The Rerank model is a machine learning model used to optimize the ranking of information retrieval results. It improves the accuracy and semantic match of the final results by re-scoring the relevance of each document to the query. The key points are:

  1. Definition and positioning

  • It is a re-ranking algorithm that performs a second round of screening and sorting on candidate documents after initial retrieval (such as keyword matching or vector-similarity retrieval).

  • In the RAG (retrieval-augmented generation) pipeline, it works together with the Embedding model to form a "coarse screening + fine sorting" mechanism.

  2. Core role

    • Overcomes the limitations of initial retrieval: compensates for the shallow semantic understanding of traditional retrieval methods (such as inverted indexes or Embedding similarity calculations).

    • Improves result quality: re-scores documents along multiple dimensions (such as semantic consistency and contextual relevance) so that highly relevant content is ranked first.

  3. How it works

    • Supervised training: trained on large numbers of correct and incorrect query-document pairs, the model learns to maximize the scores of correct pairs and minimize the scores of incorrect ones.

    • Relevance scoring: given a query and a document as input, the model directly outputs a matching score for the pair, which is then used for sorting (see the sketch after this list).

  4. Typical application scenarios

    • RAG systems: optimizes the ranking of retrieved documents, improving the accuracy of answers generated by large models.

    • Search engines and recommendation systems: fine-tunes the order of results to improve user satisfaction.
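To make the relevance-scoring step concrete, here is a minimal sketch using the CrossEncoder class from the sentence-transformers library. The checkpoint name and example texts are illustrative assumptions, not a prescribed setup.

```python
# Minimal rerank-scoring sketch with a cross-encoder.
# Assumes the sentence-transformers package; the checkpoint is one public
# example, and any query-document scoring model would work the same way.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

query = "How do rerank models improve retrieval?"
candidates = [
    "Rerank models re-score retrieved documents against the query.",
    "The weather in Beijing is sunny today.",
    "Cross-encoders jointly encode query and document for fine-grained matching.",
]

# Score each (query, document) pair jointly; higher means more relevant.
scores = reranker.predict([(query, doc) for doc in candidates])

# Sort candidates by descending relevance score.
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {doc}")
```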

What is the difference between the Rerank model and the Embedding model?

The following table compares the Rerank model and the Embedding model across their core differences and typical applications (a code sketch follows the table):

| Comparison dimension | Embedding model | Rerank model |
| --- | --- | --- |
| Main objective | Maps text into vectors for large-scale, fast semantic retrieval | Refines the ranking of preliminary search results to improve the ranking accuracy of relevant documents |
| Input and output | Input: a single piece of text (query or document). Output: a fixed-length dense vector (e.g., 768 dimensions) | Input: a query + document pair. Output: a relevance score (no fixed range, e.g., 0.85) |
| Typical architecture | Bi-Encoder (e.g., two independent BERT encoding towers) | Cross-Encoder (e.g., BERT jointly encoding query and document) |
| Computation | Encodes texts independently; ranks by vector similarity (e.g., cosine distance) | Jointly encodes query and document, capturing fine-grained semantic interactions, and scores directly |
| Application stage | Front end of retrieval: quickly recalls a candidate set (e.g., Top-100) from massive data | Back end of retrieval: re-sorts the small candidate set (e.g., Top-100) and outputs the final results (e.g., Top-5) |
| Resource consumption | Document vectors can be precomputed offline; online retrieval is cheap (only the query vector is computed) | Each query-document interaction must be computed online in real time; cost grows linearly with the number of candidates |
| Optimization direction | Improves the quality of the semantic space (e.g., uniformity, generalization), but may lose fine-grained semantics | Directly optimizes relevance discrimination through supervised learning to match intent precisely |
| Typical models / tools | Open source: BGE-base-zh, text2vec. Commercial: OpenAI Embedding, Cohere Embed | Open source: bge-reranker-large, bge-reranker-base. Commercial: Cohere Rerank API |
| Applicable scenarios | Rapid candidate screening (e.g., first-round recall in search engines, cold start in recommendation systems) | High-precision ranking (e.g., RAG generation, ad ranking, answer selection in question answering) |
| Pros and cons | ✅ Efficient and scalable. ❌ Coarse-grained semantic matching | ✅ High accuracy and deep semantic understanding. ❌ Slow computation and poor scalability |

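To make the Bi-Encoder vs. Cross-Encoder rows above concrete, here is a hedged side-by-side sketch; the library and checkpoint names are illustrative assumptions.

```python
# Bi-encoder (Embedding) vs. cross-encoder (Rerank) in code.
# Assumes sentence-transformers; the two checkpoints are public examples.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "what is a rerank model"
docs = [
    "A rerank model re-scores retrieval candidates against the query.",
    "Inverted indexes map terms to the documents that contain them.",
]

# Bi-encoder: encode query and documents independently, compare by cosine similarity.
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_vecs = bi_encoder.encode(docs)                 # can be precomputed offline
query_vec = bi_encoder.encode(query)
cosine_scores = util.cos_sim(query_vec, doc_vecs)  # fast but coarse-grained

# Cross-encoder: encode each (query, document) pair jointly and score directly.
cross_encoder = CrossEncoder("BAAI/bge-reranker-base")
rerank_scores = cross_encoder.predict([(query, d) for d in docs])  # slower, finer-grained

print(cosine_scores, rerank_scores)
```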
Typical collaboration in a RAG system (a code sketch follows this list):

1. The Embedding model encodes user queries and the document library into vectors to complete the initial recall.

2. The Rerank model re-ranks the recalled results to improve the accuracy of the answers generated by the LLM.

3. Together they form a complementary "coarse screening + fine sorting" mechanism that balances efficiency and precision.
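A minimal end-to-end sketch of this two-stage pipeline, assuming sentence-transformers for both stages; the tiny corpus, the checkpoint names, and the recall/final cutoffs are illustrative assumptions.

```python
# Two-stage "coarse screening + fine sorting" retrieval sketch.
# The three-document corpus is a stand-in for a real document library.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "Rerank models re-score retrieval candidates with a cross-encoder.",
    "Embedding models map text to dense vectors for fast recall.",
    "The giant panda is a bear species endemic to China.",
]

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")

# Stage 0 (offline): precompute document vectors once.
corpus_vecs = embedder.encode(corpus, convert_to_tensor=True)

def retrieve(query: str, recall_k: int = 100, final_k: int = 5):
    # Stage 1: fast vector recall of a broad candidate set.
    query_vec = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_vec, corpus_vecs, top_k=recall_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: cross-encoder re-scoring of the small candidate set.
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:final_k]  # pass these documents to the LLM as context

print(retrieve("how does reranking work?"))
```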

(Figure: RAG evaluation based on LlamaIndex.)
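For reference, this is roughly how a reranker is wired into the kind of LlamaIndex pipeline such an evaluation compares. A hedged sketch: the import paths follow the LlamaIndex 0.10+ layout and may differ in other versions, the ./data directory is a placeholder, and the default LLM and embedding settings require the corresponding API credentials.

```python
# Hedged sketch: adding a reranking stage to a LlamaIndex query engine.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

# Build a vector index over local documents (embedding recall stage).
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Recall a broad candidate set, then rerank it down to the top 3 nodes.
rerank = SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=3)
query_engine = index.as_query_engine(
    similarity_top_k=10,            # coarse screening
    node_postprocessors=[rerank],   # fine sorting
)

print(query_engine.query("What does a rerank model add on top of embeddings?"))
```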


How to choose the Rerank model?

First, you can consult the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard_legacy

If you don't want to overthink it, the BAAI BGE series (listed below) is the safe default.

For multilingual scenarios, prefer:

• BAAI/bge-reranker-v2-m3

• BAAI/bge-reranker-v2-gemma

| Model | Base model | Language | Layerwise | Feature |
| --- | --- | --- | --- | --- |
| BAAI/bge-reranker-base | xlm-roberta-base | Chinese and English | - | Lightweight reranker model; easy to deploy, with fast inference. |
| BAAI/bge-reranker-large | xlm-roberta-large | Chinese and English | - | Lightweight reranker model; easy to deploy, with fast inference. |
| BAAI/bge-reranker-v2-m3 | bge-m3 | Multilingual | - | Lightweight reranker model with strong multilingual capability; easy to deploy, with fast inference. |
| BAAI/bge-reranker-v2-gemma | gemma-2b | Multilingual | - | Suited to multilingual contexts; performs well in both English proficiency and multilingual capability. |
| BAAI/bge-reranker-v2-minicpm-layerwise | MiniCPM-2B-dpo-bf16 | Multilingual | 8-40 | Suited to multilingual contexts; performs well in both English and Chinese; output layers can be selected freely, enabling faster inference. |
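The BGE model cards document usage through the FlagEmbedding package; here is a hedged sketch along those lines. The example texts are assumptions, and the normalize flag may require a recent FlagEmbedding version.

```python
# Scoring with a BGE reranker through the FlagEmbedding package.
from FlagEmbedding import FlagReranker

# use_fp16 trades a little precision for noticeably faster inference.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

# Score one (query, passage) pair; raw scores are unbounded, higher = more relevant.
score = reranker.compute_score(
    ["what is a panda?", "The giant panda is a bear species endemic to China."]
)
print(score)

# Batch scoring; normalize=True maps scores into 0-1 through a sigmoid.
scores = reranker.compute_score(
    [
        ["what is a panda?", "The giant panda is a bear species endemic to China."],
        ["what is a panda?", "Paris is the capital of France."],
    ],
    normalize=True,
)
print(scores)
```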

Last words

Why the Rerank model remains hard to replace:

| Capability dimension | Value of the Rerank model | Feasibility of replacement by a large model |
| --- | --- | --- |
| Depth of semantic interaction | Cross-encoding enables fine-grained semantic matching between query and document (e.g., ambiguity resolution) | An LLM cannot directly replace this level of semantic discrimination |
| Computational efficiency | Re-sorting a Top-100 candidate set adds only milliseconds of latency | An LLM needs several times more compute to process the same volume of data |
| System architecture | As an independent module it is easy to iterate on (e.g., domain adaptation and fine-tuning) | Debugging an end-to-end solution grows exponentially more complex |

Some recommendations for high-accuracy answers:

| Scenario type | Recommended solution | Theoretical benefit |
| --- | --- | --- |
| High-precision question answering | Rerank + full-parameter LLM | Answer accuracy increases by 18-25% |
| Real-time conversation | Rerank + layer-pruned LLM | Response latency drops by 40%, with <3% accuracy loss |
| Multimodal retrieval | Multimodal Rerank + cross-modal LLM | Cross-modal alignment efficiency increases by 30% |