The new king of open source embedding has arrived! Qwen3-Embedding local deployment guide + Dify recall test record

Written by Audrey Miles · Updated on June 13, 2025
Recommendation

The Qwen3-Embedding series surpasses mainstream competitors in multilingual tasks and long-context processing, setting a new benchmark for open-source embeddings.

Core content:
1. Qwen3-Embedding series model evaluation performance and version introduction
2. All-round comparison of Qwen3-Embedding and BGE-M3
3. Qwen3-Embedding local deployment guide and Dify recall test record

Yang Fangxian, Founder of 53A / Tencent Cloud Most Valuable Expert (TVP)

A few days ago, the Qwen3-Embedding series launched by Tongyi Qianwen (available in three sizes: 8B, 4B, and 0.6B) turned in a stunning performance on authoritative benchmarks. It surpasses mainstream competitors in multilingual tasks and long-context processing, making it the new king of open-source embedding models.

An all-round player in multiple sizes, it completely crushes BGE-M3!

Performance dominance, leading in all sizes

Qwen3-8B topped the leaderboard with an overall score of 70.58 (surpassing Gemini-001's 68.37) and ranked first in 12 of the 16 evaluation items, standing out in key tasks such as retrieval accuracy (MSMARCO 57.65) and question answering (NQ 10.06).

Even the smallest Qwen3-0.6B (only 595M parameters), with an overall score of 64.34, still clearly beats 7B-class competitors (such as SFR-Mistral at 60.9). Small models also pack great power!

Comparison with BGE-M3: All-round generational advantage

Metric                      Qwen3-8B   BGE-M3   Advantage
Overall score               70.58      59.56    +11.02
Context length              32K        8K       4x longer
Retrieval (MSMARCO)         57.65      40.88    +41%
Open-domain QA (NQ)         10.06      -3.11    negative -> positive
Multilingual understanding  28.66      20.10    +42%

While leading on the vast majority of leaderboard items, Qwen3 completely redraws the performance boundary of embedding models with far more parameters (8B vs. BGE-M3's 568M) and four times the context length!

Same-size comparison: crushing models of the same class

At the 7B-8B level: Qwen3-8B outperforms Linq-Embed-Mistral (61.47) and SFR-Mistral (60.9) by more than 15%.

On the lightweight battlefield: Qwen3-0.6B (64.34) is far ahead of comparable small models such as multilingual-e5-large (63.22) and BGE-M3 (59.56), demonstrating the efficiency of the Tongyi Qianwen architecture.

Local deployment of Qwen3-Embedding

GPUStack Local Deployment

Deploy GPUStack by following the official documentation; an official Docker image is provided for quick deployment.

In the GPUStack Models interface, click Deploy Model -> ModelScope and search for qwen3-embedding. The platform automatically detects your hardware and recommends quantized model versions it can run.

gpustack deploy qwen3-embedding

We selected the Q8_0 quantized version of qwen3-embedding-8b and waited for the download to finish; a "Running" status indicates the model has been deployed.

qwen3-embedding model deployment is successful
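Once the model is running, GPUStack serves it over an OpenAI-compatible embeddings API, which you can sanity-check before wiring it into Dify. The sketch below is a minimal illustration, not GPUStack's documented client: the base URL, endpoint path, API key placeholder, and model name are assumptions you should replace with the values from your own deployment.

```python
import json
import urllib.request

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def embed(texts,
          base_url="http://localhost/v1",          # assumed server address
          api_key="YOUR_GPUSTACK_API_KEY",         # placeholder credential
          model="qwen3-embedding-8b"):             # name chosen at deployment
    # POST to an OpenAI-compatible /embeddings endpoint and
    # return one vector per input text.
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]

# Example (requires a running GPUStack server):
#   vecs = embed(["What is an embedding?", "An embedding maps text to a vector."])
#   print(cosine(vecs[0], vecs[1]))
```

If the call returns vectors and semantically similar sentences score noticeably higher than unrelated ones, the deployment is working.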

Testing in Dify

Next, find GPUStack in Dify's plugin marketplace and install it. Once the plugin is installed, proceed to model configuration.

Configure QWEN3-EMBEDDING for local GPUStack deployment in Dify

Create a knowledge base and, under Embedding model, select the model we just deployed.

Dify Knowledge Base Creation

For testing, we loaded past articles from our official account into the knowledge base.

Upload Documents

Select Dify's parent-child chunking strategy. Since the documents are in Markdown, each top-level section should become a parent block, so the segmentation delimiter is "#".

Configuration parsing method
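The parent-child idea can be sketched in plain Python: split the Markdown on "#" headings to get parent blocks, then cut each parent into smaller child chunks; at query time the match happens on a child, but the whole parent is returned for context. This is a simplified illustration of the concept, not Dify's actual implementation, and the chunk size is an arbitrary choice.

```python
def split_parents(markdown_text):
    """Split a Markdown document into parent blocks at '#' headings."""
    parents, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            parents.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        parents.append("\n".join(current).strip())
    return [p for p in parents if p]

def split_children(parent, max_len=200):
    """Cut a parent block into fixed-size child chunks for embedding."""
    return [parent[i:i + max_len] for i in range(0, len(parent), max_len)]

doc = "# Intro\nSome intro text.\n# Details\nLonger body text here."
parents = split_parents(doc)
# Each child remembers which parent it came from.
children = [(i, c) for i, p in enumerate(parents) for c in split_children(p)]
```

Embedding the small children keeps retrieval precise, while returning the parent block gives the LLM enough surrounding context to answer well.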

Test the recall

Recall test
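Under the hood, a recall test like this boils down to three steps: embed the query, embed the chunks, and rank chunks by cosine similarity. A minimal sketch of that ranking step, using toy vectors as stand-ins for real Qwen3-Embedding output:

```python
def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy vectors standing in for real embeddings:
query = [1.0, 0.0, 0.0]
chunks = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k(query, chunks, k=2))  # → [0, 2]
```

In the real test, Dify performs this ranking against the knowledge base and surfaces the top-scoring parent blocks, which is what the recall screenshot above shows.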
(End)

Follow the official account to get more exciting content