Tongyi QwQ-32B + Milvus: the era of running large models and RAG on consumer-grade graphics cards is here!

Written by
Silas Grey
Updated on: July 11th, 2025
Recommendation

The rise of the medium-sized reasoning model QwQ-32B brings new opportunities for consumer-grade graphics cards.
Core content:
1. QwQ-32B model performance highlights and applicable scenarios
2. Comparative analysis of dense models and MoE models
3. How to build a RAG system based on QwQ-32B


Preface

Recently, Tongyi's open source QwQ-32B model has become extremely popular.

As a medium-sized reasoning model, QwQ-32B has only 32 billion parameters, yet it has demonstrated excellent reasoning capabilities across multiple benchmarks, coming close to the full-sized DeepSeek R1, and it performs well in mathematical calculation, writing, and code programming.

Most importantly, QwQ-32B is not only powerful but also remarkably accessible. It is compact, fast at inference, and can be deployed on consumer-grade graphics cards; a single RTX 4090 runs it comfortably, making it well suited to individual developers or researchers with limited resources.

However, because QwQ-32B is a dense model, compared with DeepSeek R1 it can lose track of earlier content or hallucinate when performing complex reasoning over long texts.

Therefore, for local scenarios, deploying RAG on top of QwQ-32B is the most effective way to compensate for this shortcoming.

In this article, we will show how to use QwQ-32B and Milvus, on top of the open-source Ollama platform, to build a retrieval-augmented generation (RAG) system efficiently and securely.

01

Selection rationale

(1) How to choose: QwQ-32B vs. DeepSeek R1

As large models focused on reasoning, DeepSeek-R1 adopts an MoE architecture, while QwQ-32B is a typical dense model.

Generally speaking, the MoE model is more suitable for knowledge-intensive scenarios, such as knowledge question-answering systems, information retrieval, and large-scale data processing. For example, when processing large-scale text data and image data, the MoE model can improve processing efficiency by having different experts process different data subsets.

However, the huge parameter count of MoE models means that deploying a full-scale model requires cloud resources or one's own servers.

A dense model is smaller overall, but because every parameter participates in each forward pass, its per-token computational cost is relatively high. It is better suited to scenarios that demand deep, coherent reasoning and are not highly latency-sensitive, such as complex logical reasoning, deep reading comprehension, and intricate algorithm design. Its biggest advantage is that it is well suited to local deployment, although it sometimes produces rambling output.

| Comparison Dimension | Dense Model (QwQ-32B) | MoE Model (DeepSeek-R1) |
| --- | --- | --- |
| Advantages | Training is relatively easy and the process is simple and direct; reasoning is coherent, since all neurons participate in computation and the model can grasp the context as a whole | High computational efficiency, since only some experts are activated during inference; large model capacity that can be expanded by adding more experts |
| Shortcomings | High computational cost: training and inference require substantial compute and memory; limited capacity expansion, prone to overfitting, with high storage and deployment costs | Training is complex: a gating network must be trained and expert load balancing considered; routing overhead: gating-network routing adds extra computation and latency |

Dense models and MoE models each have their own advantages and limitations; neither is absolutely better. In practice, we need to choose the appropriate architecture based on the task requirements, data characteristics, computing resources, budget, and other factors.

However, in my opinion, the future trend may be to combine the advantages of both. For example, in some complex tasks, an MoE model could first perform preliminary knowledge retrieval and coarse-grained processing, and a dense model could then carry out deep reasoning and refinement to achieve better results, as sketched below.
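As a rough illustration only, such a two-stage pipeline might look like the following sketch, which assumes both models are served locally through Ollama; the model tag deepseek-r1 (standing in for an R1-style MoE model), the coarse_then_refine helper, and the prompts are illustrative assumptions, not part of the original setup.

from ollama import Client

client = Client(host="http://localhost:11434")

def coarse_then_refine(question):
    # Stage 1 (MoE): gather relevant knowledge with a broad, retrieval-style pass.
    notes = client.chat(
        model="deepseek-r1",  # illustrative tag; assumes an R1-style MoE model is pulled locally
        messages=[{"role": "user", "content": f"List the key facts needed to answer: {question}"}],
    )["message"]["content"]
    # Stage 2 (dense): let the dense model reason deeply and coherently over those notes.
    answer = client.chat(
        model="qwq",
        messages=[{"role": "user", "content": f"Using these notes:\n{notes}\n\nAnswer the question carefully: {question}"}],
    )["message"]["content"]
    return answer

print(coarse_then_refine("How does Milvus store data?"))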

(2) Vector database selection

Given QwQ-32B's tendency to hallucinate, we can introduce the open-source Milvus vector database as part of its local deployment. Milvus is designed for storing and querying high-dimensional vectors (such as embeddings produced by deep models like XLNet), can handle millions or even billions of vectors, and supports features such as hybrid retrieval and full-text retrieval. It is the most popular vector database on GitHub.

(3) Deployment platform selection

For this deployment, in terms of platform, we can use the open-source Ollama platform. As a solution for running and managing large language models (LLMs) locally, Ollama simplifies deployment and management, letting users deploy models quickly through simple command-line tools and Docker integration, while its Modelfile mechanism simplifies model versioning and reuse. It also provides a rich model library and adapts across platforms and hardware: it supports macOS, Linux, Windows, and Docker container deployment, and automatically detects GPUs and enables acceleration when available. In addition, Ollama offers developer-friendly tools such as a REST API and a Python SDK, making it easy to integrate models into applications.
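As a quick illustration of the REST API mentioned above, the minimal sketch below sends a single generation request to a locally running Ollama instance; it assumes the qwq model has already been pulled and that the requests package is installed, and the prompt text is arbitrary.

import requests  # assumption: the requests package is installed

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's local REST endpoint
    json={"model": "qwq", "prompt": "Briefly introduce Milvus.", "stream": False},
)
print(resp.json()["response"])  # the generated text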

02

Preparation

Environment

!pip install pymilvus ollama

Dataset preparation

We use the FAQ pages from the Milvus 2.4.x documentation as the private knowledge for our RAG, which is a good data source for building a basic RAG pipeline.

Download the zip file and extract the documentation into the folder milvus_docs:


! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

We load all the markdown files from the folder milvus_docs/en/faq. For each document, we simply split on "# ", which roughly separates the content of each major section of the markdown file.

from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()
    text_lines += file_text.split("# ")

Prepare LLM and Embedding Model

Ollama supports multiple models for LLM tasks and embeddings, making it easy to develop retrieval-augmented generation (RAG) applications. For this setup:

  • We will use QwQ(32B) as the LLM for the text generation task.

  • For the embedding model, we will use mxbai-embed-large, a 334M-parameter model optimized for semantic similarity.

Before you begin, make sure both models are downloaded locally:

! ollama pull mxbai-embed-large

! ollama pull qwq

With these models ready, we can start the LLM generation and Embedding-based retrieval workflow.

Generate a test embedding and print its dimensions and first few elements.

import ollama
from ollama import Client

ollama_client = Client(host="http://localhost:11434")

def emb_text(text):
    response = ollama_client.embeddings(model="mxbai-embed-large", prompt=text)
    return response["embedding"]

# test
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])


1024
[0.23217937350273132, 0.42540550231933594, 0.19742339849472046, 0.4618139863014221, -0.46017369627952576, -0.14087969064712524, -0.18214142322540283, -0.07724273949861526, 0.40015509724617004, 0.8331164121627808]
Loading data into Milvus

Create a collection

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

About MilvusClient parameter settings:

  • Setting the URI to a local file (e.g. ./milvus_demo.db) is the most convenient way, as it automatically stores all data in that file using Milvus Lite.

  • If you have large-scale data, you can build a more powerful Milvus server on Docker or Kubernetes. In this case, please use the server's URI (for example, http://localhost:19530) as your URI.

  • If you want to use Zilliz Cloud (Milvus' fully managed cloud service), please adjust the URI and token to correspond to the Public Endpoint and API key in Zilliz Cloud respectively.
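To make these options concrete, here is a minimal sketch of how each URI style is passed to MilvusClient; the endpoint and token strings are placeholders, not real values.

from pymilvus import MilvusClient

# Option 1: Milvus Lite stores everything in a single local file.
lite_client = MilvusClient(uri="./milvus_demo.db")

# Option 2: a standalone Milvus server running on Docker or Kubernetes.
server_client = MilvusClient(uri="http://localhost:19530")

# Option 3: Zilliz Cloud, using the Public Endpoint and API key from the console (placeholders).
cloud_client = MilvusClient(
    uri="https://<your-public-endpoint>",
    token="<your-api-key>",
)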

Checks if the collection already exists, and removes it if so.

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Creates a new collection with the specified parameters.

If no field information is specified, Milvus automatically creates a default ID field as the primary key and a vector field to store the vector data. A reserved JSON field is used to store fields and values that are not defined in the schema.

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

Inserting Data

Iterate over the text line by line, create embedding vectors, and then insert the data into Milvus.

Below, text is a new field that is not defined in the collection schema. Milvus will create it for us automatically (under the hood it is stored in the reserved JSON dynamic field), so you don't need to care about its underlying implementation.


from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)


Creating embeddings: 100% [00:06<00:00, 11.86it/s]
{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}

03

Building RAG

Retrieve Data

Let's specify a frequently asked question about Milvus.

question = "How is data stored in milvus?"

Search the collection for the question and retrieve the top 3 semantically closest results.

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

Let's take a look at the search results for this query.

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[[" Where does Milvus store data?\n\nMilvus deals with two types of data, inserted data and metadata. \n\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\n\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\n\n###",231.9922637939453],["How does Milvus flush data?\n\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\n\n###",226.54090881347656],["What is the maximum dataset size Milvus can handle?\n\n\nTheoretically, the maximum dataset size Milvus can handle is determined by the hardware it is run on, specifically system memory and storage:\n\n- Milvus loads all specified collections and partitions into memory before running queries. Therefore, memory size determines the maximum amount of data Milvus can query.\n- When new entities and and collection-related schema (currently only MinIO is supported for data persistence) are added to Milvus, system storage determines the maximum allowable size of inserted data.\n\n###",210.63682556152344]]

Building Retrieval-Augmented Generation (RAG) Responses Using Large Language Models (LLMs)

Convert the retrieved documents into a single string.


context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])

Define the system and user prompts for the large language model. The user prompt is assembled from the documents retrieved from Milvus.

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""

USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

Use the QwQ (32B) model provided by Ollama to generate responses based on the prompts.

from ollama import chat
from ollama import ChatResponse

response: ChatResponse = ollama_client.chat(
    model="qwq",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response["message"]["content"])


<think>Okay, I need to answer the question "How is data stored in Milvus?" using the provided context. Let me start by reading through the context carefully.
First, looking at the first context section titled "Where does Milvus store data?". It mentions that Milvus handles two types of data: inserted data and metadata.
Inserted data includes vectors, scalar data, and collection schema, stored as incremental logs in persistent storage. The supported backends are listed, like MinIO, S3, GCS, etc. So I should note that inserted data goes to these object storages via incremental logs.
Then metadata is stored in etcd for each Milvus module. That's straightforward: metadata in etcd.
The second context about flushing data explains that data first goes to the message queue and then to persistent storage when flushed. But this might be more about the process rather than where it's stored, so maybe mention that insertion is into the message queue first before being written out.
Third section on maximum dataset size mentions hardware constraints: memory for queryable data and storage for persistence. This could relate to how data is managed in terms of memory vs disk. The user asked "how is data stored," which might include both where (like object stores, etcd) and the process (message queue then persistent logs).
Putting it all together, the answer should mention:
- Two types: inserted data and metadata.
For inserted data:
- Stored as incremental logs in persistent storage using backends like MinIO, S3, etc.
- Initially loaded into message queues, then written to disk. flush() forces immediate writing if needed.
Metadata:
- Stored in etcd for each module.
Also, the hardware part says that memory limits queryable data size and storage (like the object stores) limit total dataset size. So maybe mention these factors as part of how data is stored, considering capacity constraints.
I should structure this clearly, perhaps bullet points or sections on inserted vs metadata, then processes like flushing, and finally limitations based on hardware.</think>
Data in Milvus is stored in two main ways:
1. **Inserted Data**:
   - Includes vector data, scalar data, and collection schema.
   - Stored as **incremental logs** in **persistent storage**, supported by object storage backends such as MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS).
   - Initially loaded into a **message queue** when inserted. Milvus returns success upon insertion into the queue but does not immediately write to disk. A background process later writes this data to persistent storage. The `flush()` method can force immediate writing of all queued data to storage.

2. **Metadata**:
   - Generated internally by Milvus modules (e.g., collection configurations, partitions).
   - Stored in **etcd**, a distributed key-value store.

**Hardware Considerations**:
   - **Memory**: The amount of data Milvus can query is limited by system memory, since it loads the specified collections/partitions into memory for queries.
   - **Storage Capacity**: The maximum dataset size is constrained by the underlying storage backend (e.g., object storage), which stores all inserted data and schema incrementally.

The above is the complete process of building retrieval-augmented generation (RAG) with QwQ-32B, Milvus, and Ollama.
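To wrap up, the retrieval and generation steps above can be combined into a single helper. The following is only a convenience sketch that reuses the objects defined earlier in this article (ollama_client, milvus_client, emb_text, collection_name, and SYSTEM_PROMPT); the answer_with_rag name and the top_k parameter are illustrative.

def answer_with_rag(question, top_k=3):
    # 1. Retrieve the most semantically similar passages from Milvus.
    results = milvus_client.search(
        collection_name=collection_name,
        data=[emb_text(question)],
        limit=top_k,
        search_params={"metric_type": "IP", "params": {}},
        output_fields=["text"],
    )
    context = "\n".join(hit["entity"]["text"] for hit in results[0])
    # 2. Ask QwQ to answer using only the retrieved context.
    user_prompt = f"<context>{context}</context><question>{question}</question>"
    response = ollama_client.chat(
        model="qwq",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response["message"]["content"]

print(answer_with_rag("How is data stored in milvus?"))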