A big leap in performance! Alibaba open-sources new Qwen3 models that top the text-embedding leaderboards

Alibaba has open-sourced new Qwen3 series models that deliver leading performance and support 119 languages.
Core content:
1. Qwen3-Embedding and Qwen3-Reranker are two models designed for text embedding, retrieval, and ranking tasks
2. They support 119 languages and rank first on multilingual text-embedding benchmarks
3. Each comes in three parameter scales (0.6B, 4B, and 8B) to meet the needs of different scenarios
Early this morning, Alibaba open-sourced two new Qwen3 series models: Qwen3-Embedding and Qwen3-Reranker.
The two models are designed specifically for text embedding, retrieval, and ranking tasks. They are trained on top of the Qwen3 foundation model, inherit Qwen3's strengths in multilingual text understanding, and support 119 languages.
According to the test data, Qwen3-Embedding performs strongly on multilingual text-embedding benchmarks: the 8B model ranks first with a score of 70.58, surpassing many commercial API services such as Google's Gemini-Embedding.
In ranking tasks, the Qwen3-Reranker series also shows strong capability. On basic relevance retrieval, the 8B model scores 69.02 on the multilingual retrieval task, 77.45 on the Chinese retrieval task, and 69.76 on the English retrieval task, significantly outperforming the other baseline models.
Open source address: https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f
https://huggingface.co/collections/Qwen/qwen3-reranker-6841b22d0192d7ade9cdefea
Text embedding and ranking are core tasks in natural language processing and information retrieval, widely used in web search, question-answering systems, recommendation systems, and many other fields. High-quality text embeddings let a model accurately capture the semantic relationships between texts, while an effective ranking mechanism ensures that the most relevant results are presented to users first.
However, it is hard to train a single model that both generalizes well and retrieves and ranks accurately on large-scale data, and this is where the new Qwen3 models pull far ahead of other models.
In terms of model architecture, both are dense variants built on the Qwen3 foundation model and come in three parameter sizes, 0.6B, 4B, and 8B, to meet the performance and efficiency requirements of different scenarios.
For the text embedding model, the researchers use a large language model with causal attention and append an [EOS] token to the end of the input sequence; the semantic representation of the text is then taken from the last layer's hidden state at that token. This design not only strengthens the model's grasp of text semantics, but also lets it be adapted flexibly to different task requirements.
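A minimal sketch of this last-token pooling with Hugging Face transformers is shown below; the checkpoint name, the explicit [EOS] appending, and the left-padding setup are assumptions made for illustration rather than the official inference code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; the 4B and 8B variants would be used the same way.
MODEL_ID = "Qwen/Qwen3-Embedding-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
if tokenizer.pad_token is None:          # make sure batched padding works
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_ID).eval()

texts = ["The capital of France is Paris.", "Paris is the capital of France."]
# Append the [EOS] token whose final hidden state serves as the embedding.
texts = [t + tokenizer.eos_token for t in texts]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# With left padding, position -1 of every sequence is its [EOS] token,
# so the last layer's hidden state there is the text embedding.
embeddings = F.normalize(hidden[:, -1], p=2, dim=1)
print((embeddings[0] @ embeddings[1]).item())   # cosine similarity of the two texts
```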
In addition, to help the model follow instructions and perform well on downstream tasks, the researchers concatenate the instruction and the query text into a single input context, while the document is left unchanged. This lets the model understand and handle complex semantic tasks better and improves its performance on multilingual and cross-lingual tasks.
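As an illustration, instruction-aware queries are often formatted with a simple template like the one below; the exact wording of the template is an assumption here, not a quote from the release:

```python
def format_query(instruction: str, query: str) -> str:
    # The instruction and the query share one input context;
    # documents are embedded as-is, with no instruction attached.
    return f"Instruct: {instruction}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
print(format_query(task, "what is the capital of France"))
```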
The reranking model adopts a single-tower structure that takes text pairs (such as a user query and a candidate document) as input and turns the similarity-assessment task into a binary classification problem through the LLM's chat template. Based on the input instruction, query, and document, the model judges whether the document satisfies the query and outputs a relevance score. This lets the model evaluate the relevance between text pairs more accurately and therefore perform better on ranking tasks.
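A hedged sketch of this kind of yes/no relevance scoring with transformers follows; the checkpoint name, the system prompt, and the use of the generic chat template are illustrative assumptions rather than the exact official recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-Reranker-0.6B"   # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def relevance_score(instruction: str, query: str, document: str) -> float:
    # Frame the pair as a yes/no question, turning similarity assessment
    # into binary classification over the next token.
    messages = [
        {"role": "system", "content": 'Judge whether the Document meets the requirements '
                                      'of the Query and the Instruct. Answer only "yes" or "no".'},
        {"role": "user", "content": f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {document}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    # Relevance score = probability of "yes" among {"yes", "no"}.
    return torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()

print(relevance_score(
    "Given a web search query, judge whether the document answers it",
    "what is the capital of France",
    "Paris is the capital and most populous city of France.",
))
```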
In terms of training paradigm, this series adopts a multi-stage recipe that combines large-scale weakly supervised pre-training, supervised fine-tuning on high-quality data, and a model-merging strategy.
In the weakly supervised pre-training stage, the researchers use the text-generation capability of the Qwen3 foundation model to synthesize large-scale, weakly supervised training data. This data spans a wide range of task types, languages, and domains, giving the model a broad base of learning material.
This data-synthesis approach not only improves control over the data, but also makes it possible to generate high-quality data for low-resource languages and domains. It breaks through the limitation of traditional methods that mine community forums or filter open-source corpora to obtain weakly supervised text pairs, enabling efficient generation of weakly supervised data at scale.
In the supervised fine-tuning phase, the researchers train on small-scale, high-quality annotated data to further improve performance. The data in this phase includes open-source annotated datasets such as MS MARCO, NQ, and HotpotQA, plus a selected portion of the synthetic data: a simple cosine-similarity calculation is used to pick out high-quality pairs from the synthetic pool. This strategy improves the model's generalization ability and yields strong results across multiple benchmarks.
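The report only mentions a simple cosine-similarity check for selecting synthetic pairs; a minimal sketch of such a filter, with a placeholder threshold and random embeddings standing in for real ones, could look like this:

```python
import torch
import torch.nn.functional as F

def filter_synthetic_pairs(query_embs: torch.Tensor,
                           doc_embs: torch.Tensor,
                           threshold: float = 0.7):
    """Keep (query, document) pairs whose embedding cosine similarity exceeds
    a threshold; both tensors have shape (num_pairs, dim)."""
    sims = F.cosine_similarity(query_embs, doc_embs, dim=1)
    keep = sims >= threshold          # threshold is a placeholder, not from the report
    return keep, sims

# Toy example with random embeddings in place of real query/document vectors.
q = F.normalize(torch.randn(4, 8), dim=1)
d = F.normalize(torch.randn(4, 8), dim=1)
mask, sims = filter_synthetic_pairs(q, d, threshold=0.3)
print(mask.tolist(), [round(s, 3) for s in sims.tolist()])
```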
Finally, in the model-merging stage, the researchers apply a merging technique based on spherical linear interpolation (slerp). By merging multiple checkpoints saved during fine-tuning, the resulting model performs better across different data distributions, which noticeably improves its stability and consistency and strengthens its robustness and generalization.
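Spherical linear interpolation between two checkpoints can be sketched generically as below; this illustrates the technique named above, not Qwen's actual merging code:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors at ratio t in [0, 1]."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.arccos(torch.clamp(a_n @ b_n, -1 + eps, 1 - eps))   # angle between weights
    so = torch.sin(omega)
    if so.abs() < eps:                      # nearly parallel: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return merged.reshape(w_a.shape).to(w_a.dtype)

def merge_checkpoints(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    # Merge two fine-tuning checkpoints parameter by parameter.
    return {name: slerp(sd_a[name], sd_b[name], t) for name in sd_a}
```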
Beyond the above technical innovations, the two models are also carefully designed around training-data synthesis. To generate high-quality synthetic data, the researchers use a carefully designed prompting strategy: in the text-retrieval task, data is generated from a multilingual pre-training corpus, and each document is assigned a specific role that simulates a potential user querying that document.
The prompts also specify multiple dimensions, such as query type (keywords, factual, summary, judgment), query length, difficulty, and language, which ensures both the quality and the diversity of the synthetic data. The data generated this way is plentiful enough for large-scale pre-training and of high enough quality to effectively improve the model.
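An illustrative, purely hypothetical prompt payload for this kind of controlled query synthesis, with the role and the dimensions listed above as fields, might look like this:

```python
import json

# Hypothetical configuration for synthesizing one query from one document;
# the field names and values are illustrative, not the official prompt schema.
synthesis_prompt = {
    "role": "a graduate student researching European history",   # simulated user persona
    "document": "Paris is the capital and most populous city of France.",
    "query_type": "factual",          # e.g. keywords / factual / summary / judgment
    "query_length": "short",
    "difficulty": "easy",
    "language": "en",
    "instruction": "Write one search query this user might issue to find the document above.",
}

print(json.dumps(synthesis_prompt, indent=2, ensure_ascii=False))
```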