Use Qwen3 Embedding + Milvus to Build an Enterprise-Grade Knowledge Base

Written by
Clara Bennett
Updated on: June 13, 2025
Recommendation

Alibaba's Qwen3 Embedding models and Milvus join forces to deliver an enterprise-grade knowledge base solution, with performance that matches or exceeds mainstream commercial APIs.

Core content:
1. Breakthrough performance of Qwen3-Embedding and Reranker models
2. Unique advantages of multi-language support and cross-language search
3. Complete tutorial for building a RAG system

Yang Fangxian, Founder of 53A, Tencent Cloud Most Valuable Expert (TVP)
Preface

In recent days, Alibaba has quietly released two new members of the Qwen3 family: Qwen3-Embedding and Qwen3-Reranker (each available in a 0.6B lightweight version, a 4B balanced version, and an 8B high-performance version). Both models are trained on the Qwen3 base model and inherit its strong multilingual understanding, supporting 119 languages that cover mainstream natural languages and programming languages.

I took a quick look at the data and reviews on Hugging Face, and there are a few points worth sharing.

  • Qwen3-Embedding-8B scores 70.58 on the MTEB multilingual leaderboard, surpassing a number of well-known models such as BGE, E5, and even Google Gemini.

  • Qwen3-Reranker-8B scores 69.02 on the multilingual reranking task and 77.45 on Chinese, the best results among existing open-source reranker models.

  • Text vectors from all languages share the same semantic space, so a Chinese question can directly hit English results, which makes the models particularly suitable for multilingual search or customer-service systems (a short sketch follows this list).
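Because query and document vectors share one semantic space, cross-lingual matching needs no extra machinery. Below is a minimal sketch, assuming sentence-transformers and the Qwen/Qwen3-Embedding-0.6B checkpoint are available; the toy corpus is made up for illustration.

from sentence_transformers import SentenceTransformer, util

# Toy example: English documents queried in Chinese
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

docs = [
    "Milvus stores inserted data in object storage as incremental logs.",
    "The weather in Hangzhou is mild in spring.",
]
query = "Milvus 的数据存储在哪里？"  # "Where does Milvus store its data?"

doc_emb = model.encode(docs)
query_emb = model.encode([query], prompt_name="query")

# Cosine similarity between the Chinese query and the English documents
print(util.cos_sim(query_emb, doc_emb))
# The storage-related document should score noticeably higher than the unrelated one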

This means the two models are not just "good for open-source models"; they have caught up with or even surpassed mainstream commercial APIs across the board. For RAG retrieval, cross-language search, and code search, and especially in Chinese scenarios, they are ready to be put directly into production.

So how do you use them to build a RAG system? This article gives a step-by-step tutorial.

01

RAG Construction Tutorial (Qwen3-Embedding-0.6B + Qwen3-Reranker-0.6B)

Tutorial highlights: build a RAG system step by step with the newly released Qwen3 embedding and reranker models. The two-stage retrieval design (recall + reranking) balances efficiency and accuracy.

Environment Preparation

! pip install --upgrade pymilvus openai requests tqdm sentence-transformers transformers

Requires transformers>=4.51.0

Requires sentence-transformers>=2.7.0
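If you are unsure which versions are installed, a quick check such as the following (a minimal sketch) prints them for comparison against the requirements above:

import transformers
import sentence_transformers

# Compare against the minimum versions required above
print("transformers:", transformers.__version__)                    # needs >= 4.51.0
print("sentence-transformers:", sentence_transformers.__version__)  # needs >= 2.7.0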

In this example, we use an OpenAI model as the large language model for answer generation, so you need to provide your API key through the OPENAI_API_KEY environment variable.

import os

os.environ["OPENAI_API_KEY"] = "sk-************"

Data preparation

We use the FAQ pages from the Milvus 2.4.x documentation as the private knowledge in our RAG pipeline; they are a good data source for a basic RAG setup.

Download the zip file and extract the documentation to the folder milvus_docs

! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

We load all the markdown files from the folder milvus_docs/en/faq. For each document, we simply split the file content on "# ", which roughly separates the main sections of each markdown file.
from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")

Prepare LLM and Embedding models

In this example, Qwen3-Embedding-0.6B is used for text embedding, and Qwen3-Reranker-0.6B is used to rerank the search results.

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Initialize OpenAI client for LLM generation
openai_client = OpenAI()

# Load Qwen3-Embedding-0.6B model for text embeddings
embedding_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Load Qwen3-Reranker-0.6B model for reranking
reranker_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
reranker_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()

# Reranker configuration
token_false_id = reranker_tokenizer.convert_tokens_to_ids("no")
token_true_id = reranker_tokenizer.convert_tokens_to_ids("yes")
max_reranker_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = reranker_tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = reranker_tokenizer.encode(suffix, add_special_tokens=False)

Define a function that generates text embeddings with the Qwen3-Embedding-0.6B model. It will be used for both document embeddings and query embeddings.

def emb_text(text, is_query=False):
    """
    Generate text embeddings using the Qwen3-Embedding-0.6B model.

    Args:
        text: Input text to embed
        is_query: Whether this is a query (True) or a document (False)

    Returns:
        List of embedding values
    """
    if is_query:
        # For queries, use the "query" prompt for better retrieval performance
        embeddings = embedding_model.encode([text], prompt_name="query")
    else:
        # For documents, use default encoding
        embeddings = embedding_model.encode([text])
    return embeddings[0].tolist()

Define the reranking functions that improve retrieval quality. They use Qwen3-Reranker to implement a complete reranking pipeline, scoring and reordering candidate documents by their relevance to the query. The role of each function:
  1. format_instruction(): Formats the query, document, and task instructions into the standard input format for the reranking model

  2. process_inputs(): Encode the formatted text and add special tokens for model judgment

  3. compute_logits(): Calculates the relevance score (between 0 and 1) of the query-document pair using the reranking model

  4. rerank_documents(): Reranks documents based on query relevance, returning a list of documents sorted in descending order of relevance score

def format_instruction(instruction, query, doc):
    """Format instruction for reranker input"""
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
    return output


def process_inputs(pairs):
    """Process inputs for reranker"""
    inputs = reranker_tokenizer(
        pairs,
        padding=False,
        truncation='longest_first',
        return_attention_mask=False,
        max_length=max_reranker_length - len(prefix_tokens) - len(suffix_tokens),
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = reranker_tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_reranker_length)
    for key in inputs:
        inputs[key] = inputs[key].to(reranker_model.device)
    return inputs


@torch.no_grad()
def compute_logits(inputs, **kwargs):
    """Compute relevance scores using reranker"""
    batch_scores = reranker_model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores


def rerank_documents(query, documents, task_instruction=None):
    """
    Rerank documents based on query relevance using Qwen3-Reranker

    Args:
        query: Search query
        documents: List of documents to rerank
        task_instruction: Task instruction for reranking

    Returns:
        List of (document, score) tuples sorted by relevance score
    """
    if task_instruction is None:
        task_instruction = 'Given a web search query, retrieve relevant passages that answer the query'

    # Format inputs for reranker
    pairs = [format_instruction(task_instruction, query, doc) for doc in documents]

    # Process inputs and compute scores
    inputs = process_inputs(pairs)
    scores = compute_logits(inputs)

    # Combine documents with scores and sort by score (descending)
    doc_scores = list(zip(documents, scores))
    doc_scores.sort(key=lambda x: x[1], reverse=True)
    return doc_scores
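Before wiring the reranker into the retrieval pipeline, you can sanity-check rerank_documents on a few short passages. This is a minimal sketch with a made-up query and documents, not part of the original tutorial:

# Toy example: rerank three short passages for one query
sample_query = "Where does Milvus store data?"
sample_docs = [
    "Milvus stores inserted data in object storage such as MinIO or S3.",
    "Qwen3 is a family of large language models released by Alibaba.",
    "Metadata in Milvus is stored in etcd.",
]

for doc, score in rerank_documents(sample_query, sample_docs):
    print(f"{score:.4f}  {doc}")
# The storage-related passages should receive the highest scores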
Generate a test vector and print its dimensions and first few elements.

test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

Result example:

1024
[-0.009923271834850311, -0.030248118564486504, -0.011494234204292297, -0.05980192497372627, -0.0026795873418450356, 0.016578301787376404, -0.04073038697242737, 0.03180320933461189, -0.024417787790298462, 2.1764861230622046e-05]

Load data into Milvus

Create a collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

Parameter settings for MilvusClient (a short connection sketch follows this list):
  • Setting the URI to a local file (e.g. ./milvus.db) is the most convenient way, as it automatically stores all data in that file using Milvus Lite.

  • If you have large-scale data, you can build a more powerful Milvus server on Docker or Kubernetes. In this case, please use the server's URI (for example, http://localhost:19530 ) as your URI.

  • If you want to use Zilliz Cloud (Milvus' fully managed cloud service), please adjust the URI and token to correspond to the Public Endpoint and API key in Zilliz Cloud respectively.
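The three deployment options differ only in how the client is constructed. Below is a minimal sketch for comparison; the endpoint and token values are placeholders, not real credentials.

from pymilvus import MilvusClient

# 1) Milvus Lite: all data stored in a local file
client_lite = MilvusClient(uri="./milvus_demo.db")

# 2) Self-hosted Milvus server (Docker / Kubernetes)
client_server = MilvusClient(uri="http://localhost:19530")

# 3) Zilliz Cloud: Public Endpoint and API key from the console (placeholders)
client_cloud = MilvusClient(
    uri="https://<your-public-endpoint>",
    token="<your-api-key>",
)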

Checks if the collection already exists, and removes it if so.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Creates a new collection with the specified parameters.

If no field information is specified, Milvus will automatically create a default ID field as the primary key, and a vector field to store vector data. A reserved JSON field is used to store fields and their values that are not defined in the schema.

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

Insert data

Iterate over the text line by line, create embedding vectors, and then insert the data into Milvus.

The text key below is a field that is not defined in the collection schema. Milvus automatically creates a corresponding text field for it (under the hood it is stored in the reserved dynamic JSON field; you don't need to care about the underlying implementation). A small verification sketch follows the insert output below.

from tqdm import tqdm

data = []
for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)

Output result example:
Creating embeddings: 100%| 8.68it/s]{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
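To confirm that the undeclared text field really was stored through the dynamic JSON field, you can read one row back. A minimal sketch using MilvusClient.query with a filter expression:

# Fetch the first inserted row and check that the dynamic "text" field is present
rows = milvus_client.query(
    collection_name=collection_name,
    filter="id == 0",
    output_fields=["text"],
)
print(rows[0]["text"][:200])  # first 200 characters of the stored chunk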

Retrieve Data

Let's specify a frequently asked question about Milvus.

question = "How is data stored in milvus?" searches the collection for that question and gets the top 10 candidate answers with the highest semantic match, then uses a reranker to select the best 3 matches.
# Step 1: Initial retrieval with a larger candidate set
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question, is_query=True)
    ],  # Use the `emb_text` function with the query prompt to convert the question to an embedding vector
    limit=10,  # Return top 10 candidates for reranking
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

# Step 2: Extract candidate documents for reranking
candidate_docs = [res["entity"]["text"] for res in search_res[0]]

# Step 3: Rerank documents using Qwen3-Reranker
print("Reranking documents...")
reranked_docs = rerank_documents(question, candidate_docs)

# Step 4: Select top 3 reranked documents
top_reranked_docs = reranked_docs[:3]
print(f"Selected top {len(top_reranked_docs)} documents after reranking")

Let's take a look at the reranking results for this query!
import json

# Display reranked results with reranker scores
reranked_lines_with_scores = [(doc, score) for doc, score in top_reranked_docs]
print("Reranked results:")
print(json.dumps(reranked_lines_with_scores, indent=4))

# Also show the original embedding-based results for comparison
print("\n" + "=" * 80)
print("Original embedding-based results (top 3):")
original_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0][:3]
]
print(json.dumps(original_lines_with_distances, indent=4))

Example of output results:

From the results below, we can see that Qwen3-Reranker reorders the candidates noticeably, and the relevance scores are well separated.


Reranked results(top 3):[ [ " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###", 0.9997891783714294 ], [ "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###", 0.9989748001098633 ], [ "Does the query perform in memory? What are incremental data and historical data?\n\nYes. When a query request comes, Milvus searches both incremental data and historical data by loading them into memory. Incremental data are in the growing segments, which are buffered in memory before they reach the threshold to be persisted in storage engine, while historical data are from the sealed segments that are stored in the object storage. Incremental data and historical data together constitute the whole dataset to search.\n\n###", 0.9984032511711121 ]]================================================================================ Original embedding-based results(top 3):[ [ " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###", 0.8306853175163269 ],[ "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. 
If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###", 0.7302717566490173 ], [ "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values ​​are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###", 0.7003671526908875 ]]

Building Retrieval-Augmented Generation (RAG) Responses Using Large Language Models (LLMs)

Convert the retrieved document into a string format.

# Use the top reranked documents as the context for the LLM
context = "\n".join([doc for doc, _ in top_reranked_docs])

Provide the system prompt and user prompt for the large language model. The user prompt is assembled from the documents retrieved from Milvus.
SYSTEM_PROMPT = """Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided."""

USER_PROMPT = f"""Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>"""

Use the OpenAI large language model gpt-4o to generate a response based on the prompts.
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)

Output result:
In Milvus, data is stored in two main forms: inserted data and metadata. Inserted data, which includes vector data, scalar data, and collection-specific schema, is stored in persistent storage as incremental logs. Milvus supports multiple object storage backends for this purpose, including MinIO, AWS S3, Google Cloud Storage, Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage. Metadata for Milvus is generated by its various modules and stored in etcd.

02

Summary

From the tutorial and its output above, it is easy to see that the embedding and reranker models the Tongyi Qianwen team released in the Qwen3 series perform remarkably well. Used together, they provide a fairly complete and practical solution for RAG systems.

In terms of design, the embedding model supports differentiated handling of queries and documents, which reflects a deep understanding of retrieval tasks; the reranker adopts a cross-encoder architecture that captures fine-grained interactions between the query and each document; and the two-stage retrieval design in the tutorial (recall + reranking) balances efficiency and accuracy. In particular, Qwen3-Embedding-0.6B (1024 dimensions) and Qwen3-Reranker-0.6B both use a relatively lightweight parameter scale and support local deployment, which reduces dependence on external APIs. They keep hardware requirements low while maintaining solid performance, making them well suited to small and medium-sized enterprises and individual developers.

In fact, shipping embedding and reranker models alongside the Qwen3 series is neither an isolated case nor a coincidence; it reflects an industry consensus.

The reason is simple: these two modules largely determine whether a large model can be commercialized.

The biggest problems with generative large models are: high uncertainty, difficult evaluation, and high cost.

To address these problems, RAG, LLM memory, and agents all rest on the same premise: semantics must be compressed into vector representations that machines can retrieve and score efficiently.

Embedding and reranking are currently the most practical paths: clear standards, measurable performance, controllable costs, and easy gradual (grayscale) rollout. Embedding determines whether you can "find" candidates at all, and reranking determines whether you can "select" the right ones. That makes them among the first API modules where model commercialization really works: call frequency is high (they run on every search), switching costs are high (they are bound to the index), and commercial value is high (they can serve as underlying infrastructure).