Tutorial | Tongyi Qwen3 + Milvus: the hybrid reasoning model is the best paradigm for optimizing RAG costs

Alibaba's latest Qwen3 series models have been released. The hybrid reasoning design takes performance to a new level and helps enterprises optimize costs.
Core content:
1. The hybrid reasoning capability of the Qwen3 series models greatly improves reasoning and instruction-following performance
2. The eight models fully support multiple languages, reducing enterprise development costs
3. Zilliz engineers provide a step-by-step RAG tutorial to help users find the sweet spot between performance and cost
Are you ready to work overtime on May Day?
Just this morning, Alibaba's Qwen3 series of models was officially released. In just 12 hours, its GitHub stars exceeded 17k, and downloads on Hugging Face peaked at 23,000 per hour.
What's even more exciting is that a total of eight models have been released in the Qwen3 series, all of which are hybrid reasoning models (supporting both fast and slow thinking). In addition to significant enhancements in reasoning, instruction following, tool calling, and multi-language capabilities, they also set new performance records among domestic models and global open-source models.
Don't worry, though: Zilliz engineers got hands-on right away to evaluate the models and put together a step-by-step RAG tutorial, so you can learn by coding along and skip the May Day overtime.
In addition, our Zilliz open-source project DeepSearcher (over 5K GitHub stars, focused on deep retrieval and report generation) already supports the Qwen3 models, helping users find the sweet spot between performance and cost.
Reference link https://github.com/zilliztech/deep-searcher?tab=readme-ov-file#configuration-details
01
Qwen3 series interpretation: stronger, more choices, lower barrier to entry, better suited for enterprise deployment
To summarize the Qwen3 series models, remember these four groups of keywords:
(1) All models are hybrid models combining reasoning and non-reasoning, satisfying the boss's demand for both low cost and high performance;
(2) Two MoE models + six Dense models; the former are suited to the cloud, while the latter perform better locally;
(3) Capability upgrades at smaller sizes; the full-sized Qwen3 can be deployed on just 4 H20 GPUs;
(4) Support for MCP and multi-language capabilities, lowering development costs.
Let’s start with the first set of keywords – hybrid reasoning model.
The Qwen3 series are all hybrid reasoning models, which means they have both reasoning (slow thinking) and non-reasoning (fast thinking) capabilities.
This improvement is in line with the viewpoint we put forward in our earlier article "DeepSearcher in-depth interpretation: The emergence of Agentic RAG, the twilight of traditional RAG": the mainstream direction of the industry will be the combination of reasoning and non-reasoning models.
The reason is simple: reasoning mode makes the model smarter, but it also significantly increases compute consumption and waiting time. By mixing reasoning and non-reasoning capabilities, users can freely choose to think deeply about complex problems or answer simple questions quickly, striking the best balance between compute cost and output quality.
Interestingly, beyond switching between reasoning and non-reasoning modes, the Qwen3 API also lets you set a "thinking budget" (the maximum number of tokens the model may spend on thinking) as needed, so it performs different levels of thinking and flexibly meets the diverse performance and cost requirements of AI applications in different scenarios.
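For example, using the OpenAI-compatible DashScope endpoint that appears later in this tutorial, a capped thinking budget might look roughly like the sketch below (the thinking_budget field name is our assumption; confirm the exact parameter in the DashScope docs):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# Enable thinking, but cap how many tokens the model may spend on it.
# NOTE: "thinking_budget" is assumed here; check the DashScope documentation
# for the exact field supported by your model version.
stream = client.chat.completions.create(
    model="qwen3-235b-a22b",
    messages=[{"role": "user", "content": "Explain vector search in one paragraph."}],
    extra_body={"enable_thinking": True, "thinking_budget": 1024},
    stream=True,  # thinking mode is typically used with streaming output
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        print(delta.content, end="")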
Next, let’s look at the second set of keywords: two MoE models + six Dense models.
This open-source release includes two MoE models: Qwen3-235B-A22B (over 235 billion total parameters, over 22 billion activated parameters) and Qwen3-30B-A3B (30 billion total parameters, 3 billion activated parameters). The number of experts activated per token defaults to 8, and the total expert pool has been expanded to 128.
And six Dense models: Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B and Qwen3-0.6B.
(Background: in a Dense model, all parameters and compute units participate in every inference or training step, so the larger the model, the higher the training and inference costs; representative examples include GPT-3, BERT, ResNet and other traditional large models. An MoE model introduces the concept of "experts": the model contains multiple sub-networks, the expert networks, and only a subset of the experts is activated at each inference step.)
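To make the MoE idea concrete, here is a toy sketch (not Qwen3's actual routing code) of top-k expert routing, using the same numbers as Qwen3's configuration: 128 experts in the pool, 8 activated per token.

import numpy as np

# Toy illustration of MoE routing: a router scores all experts for each token,
# and only the top-k experts actually run. The numbers mirror Qwen3's config
# (128 experts, 8 activated per token); the routing itself is simplified.
num_experts, top_k, hidden_dim = 128, 8, 16
rng = np.random.default_rng(0)

token = rng.normal(size=hidden_dim)                       # one token's hidden state
router_weights = rng.normal(size=(num_experts, hidden_dim))

scores = router_weights @ token                           # one score per expert
top_experts = np.argsort(scores)[-top_k:]                 # only these 8 experts are activated
gate = np.exp(scores[top_experts])
gate /= gate.sum()                                        # normalized mixing weights

print("activated experts:", sorted(top_experts.tolist()))
print("gate weights:", np.round(gate, 3))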
The third group of keywords is capability upgrades at smaller sizes.
This time, the Qwen3 series training data volume reached 36T tokens, double that of Qwen2.5.
Against this backdrop, the largest of the eight released models is the MoE model Qwen3-235B-A22B, with over 235 billion total parameters and over 22 billion activated parameters. Before this, the largest model in the Tongyi Qianwen (Qwen) series was Qwen1.5-110B, so the parameter count has more than doubled.
However, compared with DeepSeek-R1 and other models with 671B parameters, Tongyi clearly did not go down the route of ever-larger parameter counts. Accordingly, Qwen3's deployment cost is modest: with support for dynamic quantization from FP4 to INT8, only 4 H20 GPUs are needed to deploy the full version of Qwen3.
The fourth group of keywords is MCP and multilingual.
The Qwen3 series provides a lot of engineering support for agents and for global deployment.
On the one hand, Qwen3 supports MCP, allowing external databases, tools and other products to interact better with the large model. The release is also accompanied by the Qwen-Agent project, which can call tools through APIs or be combined with existing tool chains.
On the other hand, the Qwen3 model supports 119 languages and dialects, which can better serve global developers and upper-level applications and facilitate enterprises to expand their global business.
Based on the above capabilities, the Qwen3 series is very well suited to enterprise scenarios, helping developers quickly build products with fully controllable performance and cost.
Next, we will take RAG as an example, combining Milvus and Qwen3 in a step-by-step tutorial, and compare the results and costs of the reasoning and non-reasoning modes.
02
RAG Construction Tutorial
Environment Preparation
! pip install --upgrade pymilvus openai requests tqdm
First, we need to go to the official website of Alibaba Cloud's DashScope (Model Service Lingji) to obtain the API key DASHSCOPE_API_KEY, and add it to the environment variables:

import os

os.environ["DASHSCOPE_API_KEY"] = "sk-************"
Data preparation
We can use the FAQ page in Milvus documentation 2.4.x as private knowledge in RAG, which is a good data source for building a basic RAG.
Download the zip file and extract the documentation to the folder milvus_docs
! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
We load all the markdown files from the folder milvus_docs/en/faq. For each document, we simply use "# " to split the file, which roughly separates the content of each major section of the markdown file.
from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")
Prepare LLM and Embedding Model
DashScope provides an OpenAI-compatible API, so we can initialize an OpenAI client to call the LLM.
from openai import OpenAI

openai_client = OpenAI(
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
)
We define an embedding model with milvus_model to generate text embeddings. Here we use DefaultEmbeddingFunction as an example; it is a pre-trained, lightweight embedding model.
from pymilvus import model as milvus_model

embedding_model = milvus_model.DefaultEmbeddingFunction()
Generate a test embedding and print its dimensions and first few elements.
test_embedding = embedding_model.encode_queries(["This is a test"])[0]
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
768
[-0.04836066 0.07163023 -0.01130064 -0.03789345 -0.03320649 -0.01318448 -0.03041712 -0.02269499 -0.02317863 -0.00426028]
Load data into Milvus and create a collection
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"
About MilvusClient parameter settings (see the connection sketch after this list):
Setting the URI to a local file (e.g. ./milvus.db) is the most convenient way, as it automatically stores all data in that file using Milvus Lite.
If you have large-scale data, you can build a more powerful Milvus server on Docker or Kubernetes. In this case, please use the server's URI (for example, http://localhost:19530 ) as your URI.
If you want to use Zilliz Cloud (Milvus' fully managed cloud service), please adjust the URI and token to correspond to the Public Endpoint and API key in Zilliz Cloud respectively.
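As a quick reference, the three connection options above look like this with MilvusClient (the endpoint and key below are placeholders):

from pymilvus import MilvusClient

# Option 1: Milvus Lite, storing everything in a local file (used in this tutorial)
client_lite = MilvusClient(uri="./milvus_demo.db")

# Option 2: a self-hosted Milvus server running on Docker or Kubernetes
client_server = MilvusClient(uri="http://localhost:19530")

# Option 3: Zilliz Cloud, using your cluster's Public Endpoint and API key (placeholders)
client_cloud = MilvusClient(
    uri="https://<your-public-endpoint>",
    token="<your-api-key>",
)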
Check whether the collection already exists, and drop it if it does.
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)
Create a new collection with the specified parameters.
If no field information is specified, Milvus will automatically create a default ID field as the primary key, and a vector field to store vector data. A reserved JSON field is used to store fields and their values that are not defined in the schema.
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
Insert data into the collection
Iterate over the text line by line, create embedding vectors, and then insert the data into Milvus.
The text field below is not defined in the collection schema. Milvus will automatically create a corresponding text field for it (under the hood it is stored in the reserved JSON dynamic field; you don't need to care about the underlying implementation).
from tqdm import tqdm

data = []
doc_embeddings = embedding_model.encode_documents(text_lines)

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": doc_embeddings[i], "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
Creating embeddings: 100%|72/72 [00:00<00:00, 381300.36it/s]
{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}
Build RAG and retrieve data
Let's specify a frequently asked question about Milvus.
question = "How is data stored in milvus?"
Search the collection for that question and retrieve the top 3 results that are the closest semantic match.
search_res = milvus_client.search(
    collection_name=collection_name,
    data=embedding_model.encode_queries([question]),  # Convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
Let's take a look at the search results for this query.
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        " Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",
        0.6572665572166443
    ],
    [
        "How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",
        0.6312146186828613
    ],
    [
        "How does Milvus handle vector data types and precision?\n\nMilvus supports Binary, Float32, Float16, and BFloat16 vector types.\n\n- Binary vectors: Store binary data as sequences of 0s and 1s, used in image processing and information retrieval.\n- Float32 vectors: Default storage with a precision of about 7 decimal digits. Even Float64 values are stored with Float32 precision, leading to potential precision loss upon retrieval.\n- Float16 and BFloat16 vectors: Offer reduced precision and memory usage. Float16 is suitable for applications with limited bandwidth and storage, while BFloat16 balances range and efficiency, commonly used in deep learning to reduce computational requirements without significantly impacting accuracy.\n\n###",
        0.6115777492523193
    ]
]
Building Retrieval-Augmented Generation (RAG) Responses Using Large Language Models (LLMs)
Convert the retrieved documents into a string format.
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
Define the system and user prompts for the large language model. The user prompt is assembled from the documents retrieved from Milvus.
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
Use Tongyi Qianwen's latest model to generate responses based on prompts.
completion = openai_client.chat.completions.create(
    # Model list: https://help.aliyun.com/zh/model-studio/getting-started/models
    model="qwen-plus-2025-04-28",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    # The Qwen3 model controls the thinking process through the enable_thinking parameter
    # (the open-source version defaults to True, the commercial version defaults to False)
    extra_body={"enable_thinking": False},
)
print(completion.choices[0].message.content)
Output:
In Milvus, data is stored in two main categories: **inserted data** and **metadata**.

- **Inserted Data**: This includes vector data (like Binary, Float32, Float16, and BFloat16 types), scalar data, and collection-specific schema. These are stored in persistent storage as **incremental logs**. Milvus supports various object storage backends for this purpose, including:
  - [MinIO](https://min.io/)
  - [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls)
  - [Google Cloud Storage (GCS)](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes)
  - [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs)
  - [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service)
  - [Tencent Cloud Object Storage (COS)](https://www.tencentcloud.com/products/cos)
- **Metadata**: Metadata generated within Milvus is stored separately. Each Milvus module maintains its own metadata, which is stored in [etcd](https://etcd.io/), a distributed key-value store.

When data is inserted into Milvus, it is first loaded into a message queue. It is not immediately written to disk. A `flush()` operation ensures that all data in the queue is written to persistent storage immediately.
03
Comparison between reasoning and non-reasoning modes
Basic math problem: A and B start from the same place. A walks first for 2 hours at a speed of 5 km/h, then B sets off and catches up at 15 km/h. How long will it take B to catch up?
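For reference, the expected answer is easy to verify with the relative-speed method:

# Quick sanity check of the expected answer (relative-speed method)
head_start_km = 5 * 2          # A walks 2 h at 5 km/h -> 10 km head start
closing_speed_kmh = 15 - 5     # B gains on A at 10 km/h
catch_up_hours = head_start_km / closing_speed_kmh
print(catch_up_hours)          # 1.0 -> B catches up after 1 hour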
Reasoning Mode
Reference code:
import os
import time

from openai import OpenAI

os.environ["DASHSCOPE_API_KEY"] = "sk-******************************"

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

######################################### Think
# Record the start time
start_time = time.time()

stream = client.chat.completions.create(
    # Model list: https://help.aliyun.com/zh/model-studio/getting-started/models
    model="qwen3-235b-a22b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "A and B start from the same place. A walks for 2 hours first, at a speed of 5km/h. B catches up at 15km/h. How long will it take to catch up?",
        },
    ],
    # The Qwen3 model controls the thinking process through the enable_thinking parameter
    # (the open-source version defaults to True, the commercial version defaults to False)
    extra_body={"enable_thinking": True},
    stream=True,
)

answer_content = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        answer_content += delta.content
print(answer_content)

# Record the end time and calculate the total time
end_time = time.time()
print(f"\n\nTotal time: {end_time - start_time:.2f} seconds")

From the answer, we can see that in reasoning mode the model recognizes this as a catch-up problem, analyzes the known conditions, and gives two solution approaches along with the correct answer, showing that it has thought the problem through sufficiently. The final answer is a well-formatted Markdown document in which the formulas are expressed cleanly.
The entire run takes 35.73 seconds.
(The figure below shows the model's Markdown answer after rendering, for the reader's convenience.)
Non-reasoning mode
In the code, just set "enable_thinking": False.
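A minimal sketch of the change, reusing the client defined in the reasoning-mode script above (with thinking disabled, a non-streaming call is used here for simplicity):

# Non-reasoning mode: the same request, with thinking disabled
start_time = time.time()
completion = client.chat.completions.create(
    model="qwen3-235b-a22b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "A and B start from the same place. A walks for 2 hours first, at a speed of 5km/h. B catches up at 15km/h. How long will it take to catch up?",
        },
    ],
    extra_body={"enable_thinking": False},  # the only change vs. the reasoning-mode call
)
print(completion.choices[0].message.content)
print(f"\n\nTotal time: {time.time() - start_time:.2f} seconds")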
Let's look at how the non-reasoning mode performs on this problem:
The non-reasoning mode solves the problem step by step, using the common relative-speed method, and quickly arrives at the correct answer.
The total time consumed is 6.89 seconds, about 1/5 of the reasoning mode. Clearly, the reasoning mode generates far more "thinking" than the non-reasoning mode.
This extra thinking tends to make the answer richer and more logically complete, whereas the non-reasoning mode responds faster. The two modes each have their advantages for different problems, so users can choose according to their own needs.
end
Overall, beyond achieving SOTA results in various benchmarks, the Qwen3 release is clearly focused on engineering practicality, and offers a convincing demonstration of how to balance cost and performance when building RAG or agent applications.
For example, compared with DeepSeek it has fewer parameters, which lowers deployment costs; broader language support makes it easier for global developers to adopt; MCP support strengthens ecosystem links; and the hybrid reasoning design lets users freely choose whether to use reasoning, giving precise control over output costs.
In our recently open-sourced DeepSearcher project (Zilliz's deep search and report generation project), we received a lot of user feedback: using a reasoning model in DeepSearcher not only means long waits but also uncontrollable costs, while using only non-reasoning models greatly reduces the number of retrieval rounds and the level of intelligence, falling short of what an agentic RAG is expected to deliver.
DeepSearcher now supports the Qwen3 models, helping users find the sweet spot between performance and cost. We welcome everyone to try it out.