Qwen3 small models, hands-on: from 4B to 30B, which one can talk to Obsidian smoothly over MCP?

Written by
Clara Bennett
Updated on: June 29, 2025

A hands-on test of how the Qwen3 series of small models interacts with Obsidian-MCP, revealing the performance differences between model sizes.

Core content:
1. Test results of the interaction between Qwen3 series models (4B/8B/14B) and Obsidian-MCP
2. How each model performs on tool calling, content drift, and context limits
3. The improvement trend of Qwen3 small models and the hardware threshold for smooth interaction

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

 

This article tests how the Qwen3 series of local models (4B/8B/14B) interacts with an Obsidian-MCP knowledge base, and finds that the small models suffer from failed tool calls, hallucinated responses, and context limits. The 4B version loses instruction comprehension due to quantization; the 8B version can call tools but drifts in content; 14B and up can converse normally. Local small models are steadily becoming more usable, but am I still one 16GB graphics card away from smooth interaction?


I heard that Qwen3 was released last night, with optimized Agent and coding capabilities and strengthened support for MCP.

Qwen3: Think deeply and move quickly
https://qwenlm.github.io/zh/blog/qwen3/

These claims in the introduction caught my eye:

`The small MoE model Qwen3-30B-A3B has only 10% of the activated parameters of QwQ-32B, yet performs better. A small model like Qwen3-4B can also match the performance of Qwen2.5-72B-Instruct.`

I was very excited, so after work I went home, pulled the model with Ollama on my NAS server, enabled obsidian-mcp in Cherry Studio, and started testing. The results slapped me in the face.

Test content:

  1. "Check my Obsidian knowledge base for changes in the last day": the model answered at random and never hit the tool.

  2. "Use Obsidian's MCP tool obsidian_get_recent_changes to query changes in my knowledge base over the last day": even with the tool named explicitly in the prompt, the model still answered at random.
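Before blaming the model alone, it helps to see what a tools-enabled request looks like on the wire. The sketch below builds the kind of payload an MCP client such as Cherry Studio effectively sends to Ollama's /api/chat endpoint; the tool schema is my hand-written mirror of obsidian_get_recent_changes, using the OpenAI-style function-calling format that Ollama accepts. Nothing is sent over the network here.

```python
import json

# Build (but do not send) a tools-enabled chat request for Ollama's /api/chat.
# The tool definition below is an illustrative mirror of the obsidian-mcp
# get_recent_changes tool, not its authoritative schema.
payload = {
    "model": "qwen3:4b",
    "messages": [
        {"role": "user",
         "content": "Check my Obsidian knowledge base for changes in the last day"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "obsidian_get_recent_changes",
            "description": "Get recently modified files in the vault",
            "parameters": {
                "type": "object",
                "properties": {
                    "limit": {"type": "integer", "description": "Max files to return"},
                    "days": {"type": "integer", "description": "Look-back window in days"},
                },
            },
        },
    }],
    "stream": False,
}

# Inspect the schema the model is supposed to see.
print(payload["tools"][0]["function"]["name"])
```

If the model still ignores a schema like this, the failure is on the model side (comprehension or quantization), not in the client's wiring.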

qwen3 model

Model evaluation benchmarks

| Benchmark | Description | Key takeaway |
| --- | --- | --- |
| ArenaHard | Human comparative evaluation of overall conversational ability, focusing on hard scenarios | A high score means natural, logical dialogue |
| AIME'24/'25 | Math competition problems testing mathematical reasoning, sequences, geometry, etc. | GPT-4o scores low because "thinking mode" is off in this benchmark; Qwen3's result is more representative |
| LiveCodeBench | Code generation tasks, verified by real-time code execution | Qwen3-4B performs close to GPT-4o, indicating strong coding ability for a small model |
| CodeForces (Elo Rating) | Elo-style ranking on competitive programming problems; higher is stronger | Qwen3-4B > GPT-4o, meaning it beats GPT-4o on solving speed plus accuracy |
| GPQA | High-quality question-answering set (academic-style QA) testing multi-hop reasoning | The Qwen series keeps its lead, balancing knowledge and reasoning |
| LiveBench | Real-time dialogue tasks, including multi-turn context and factual requirements | GPT-4o's low score (52.2) suggests it is not the best at every task |
| BFCL | Instruction-following and dialogue-coherence test; Qwen is assessed in FC format | GPT-4o performs best; Qwen3-4B is slightly weaker but close |
| MultiIF (8 Languages) | Multilingual instruction-following evaluation | Qwen3-4B generalizes better across languages than GPT-4o (especially in non-English scenarios) |

Obsidian-MCP

Obsidian-MCP is commonly used for the following tasks:

  • Semantic retrieval and summarization of log/note content (embedding + question answering)
  • Self-dialogue (multi-turn historical context)
  • Context-based "thinking enhancement" such as task suggestions and card associations
  • Memory recall from a private knowledge base (streamable / SSE long connections)
  • Local embedding + lightweight reasoning, with no reliance on public-network LLMs

 

These tasks mainly require:

  • Instruction-following ability
  • Context awareness (only a small amount of context is needed)
  • Moderate reasoning ability
  • Fast responses from a small, easy-to-deploy model

Obsidian API Tools List

 


| Tool | Functional description | Parameters |
| --- | --- | --- |
| list_files_in_vault | Get the list of files in the vault | none |
| list_files_in_dir | Get the file list of a specified directory | dirpath |
| get_file_contents | Get the contents of a single file | filepath |
| get_batch_file_contents | Get the contents of multiple files in batch | filepaths |
| search | Perform a simple search | query, context_length |
| search_json | Perform a complex (JSON-format) search | query |
| append_content | Append content to a file | filepath, content |
| patch_content | Modify a specified content block of a file | filepath, operation, target_type, target, content |
| delete_file | Delete a file/directory | filepath |
| get_periodic_note | Get the content of a periodic note | period |
| get_recent_periodic_notes | Get the list of recent periodic notes | period, limit, include_content |
| get_recent_changes | Get recently modified files | limit, days |
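The tool table above can also serve as a client-side safety net: a tiny validator that catches calls with missing arguments before they touch the vault. This is an illustrative sketch; the required/optional split below is my assumption, not the authoritative obsidian-mcp schema.

```python
# Required-parameter map transcribed from the tool table above.
# Which parameters are truly required vs. optional is an assumption here.
REQUIRED_PARAMS = {
    "list_files_in_vault": [],
    "list_files_in_dir": ["dirpath"],
    "get_file_contents": ["filepath"],
    "get_batch_file_contents": ["filepaths"],
    "search": ["query"],        # context_length treated as optional (assumption)
    "search_json": ["query"],
    "append_content": ["filepath", "content"],
    "patch_content": ["filepath", "operation", "target_type", "target", "content"],
    "delete_file": ["filepath"],
    "get_periodic_note": ["period"],
    "get_recent_periodic_notes": ["period"],
    "get_recent_changes": [],   # limit/days treated as optional (assumption)
}

def validate_call(tool: str, args: dict) -> list[str]:
    """Return the required parameters missing from a proposed tool call."""
    if tool not in REQUIRED_PARAMS:
        raise KeyError(f"unknown tool: {tool}")
    return [p for p in REQUIRED_PARAMS[tool] if p not in args]

# Example: a hallucinated call that forgot the filepath.
print(validate_call("get_file_contents", {}))  # -> ['filepath']
```

Rejecting malformed calls before execution matters most for destructive tools like delete_file and patch_content.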

Test whether Qwen3-4B's capabilities match the above requirements

qwen3:4b talks fast and its answers are well structured, but they are off-topic, and it does not even recognize that a tool needs to be called.
So I looked at the model's tokenizer_config.json on Hugging Face, and tool_call handling is indeed defined there. Why doesn't this layer work? Is Q4 quantization causing severe capability loss?
I thought the small GPU in my NAS would finally earn its keep, but it seems I have to wait a little longer.
I wanted to try 8B next, but local VRAM was insufficient, so I switched to the OpenRouter service to test 8B, 14B, and 30B.
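One way to check the tool_call layer without loading the model at all is to inspect the chat template shipped in tokenizer_config.json. The snippet below uses a made-up stand-in template for illustration; to check a real model, download the actual file from its Hugging Face repo.

```python
import json

# A made-up stand-in for a model's tokenizer_config.json. Real templates are
# much longer; the point is only that tool-aware templates reference `tools`.
tokenizer_config = json.loads("""
{
  "chat_template": "{% if tools %}<|tool_list|>{{ tools }}{% endif %}{% for m in messages %}{{ m.content }}{% endfor %}"
}
""")

def template_supports_tools(cfg: dict) -> bool:
    # Heuristic: templates that render tool schemas mention a `tools` variable.
    template = cfg.get("chat_template", "")
    return "tools" in template

print(template_supports_tools(tokenizer_config))  # -> True
```

If the template does route tools through (as it did for qwen3:4b), a failed call points at the weights themselves, e.g. quantization damage, rather than at the prompt plumbing.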

Test whether Qwen3-8B's capabilities match the above requirements

Testing qwen3:8b with Cherry Studio: it can call the tool, but the answer hallucinates, and the returned note name is altered.

Qwen3-4B-Local Model + Obsidian-MCP's `Local Q&A`.md

became

01Project/Blog/draft/Qwen3-4B-Local Model + Obsidian-MCP's `Local Issues`.md

This is where keeping the notes under git sync pays off. When you organize notes locally via MCP and something goes wrong, you can roll back to the last committed version at any time!
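That rollback workflow can be sketched with plain git commands driven from Python. A throwaway temp directory stands in for the real Obsidian vault here, and the file names are illustrative.

```python
import pathlib
import subprocess
import tempfile

# A temp directory stands in for the Obsidian vault.
vault = pathlib.Path(tempfile.mkdtemp())

def git(*args: str) -> None:
    # Run git inside the vault with a throwaway identity for the commit.
    subprocess.run(
        ["git", "-C", str(vault),
         "-c", "user.email=vault@example.com", "-c", "user.name=vault-bot",
         *args],
        check=True, capture_output=True,
    )

note = vault / "note.md"
note.write_text("original note content\n")

git("init", "-q")
git("add", "note.md")
git("commit", "-q", "-m", "baseline before MCP session")  # the safety net

note.write_text("hallucinated rewrite from the model\n")  # simulated bad MCP edit
git("checkout", "--", "note.md")                          # roll the file back

print(note.read_text().strip())
```

Committing once before each MCP editing session is enough; anything the model mangles afterwards is one `git checkout` away from recovery.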


This 8B model is basically only good for chatting; in my scenario it looks useful but is not.

Test whether Qwen3-14B's capabilities match the above requirements

Testing with the qwen3:14b model via OpenRouter.


It looks good and returns results normally.
But when I tried to test it in more depth, it reported insufficient tokens. According to official data, the qwen3:14b model has a maximum context of 128K tokens, roughly 150,000 characters, which I figured was plenty to analyze a note.
Yet when I asked it to read a note and summarize it, it reported that the tokens exceeded 40K. Why?

The error message makes it clear: the model's current context limit is 40960 tokens, and it was exceeded.
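A rough client-side length estimate lets you hit a 40K deployment limit knowingly instead of mid-request. The characters-per-token ratios below are ballpark assumptions for mixed English/CJK text, not the model's real tokenizer.

```python
# Estimate token counts before sending a note, so oversized requests can be
# chunked up front. Ratios are rough assumptions, not Qwen3's real tokenizer.
CONTEXT_LIMIT = 40_960  # the limit reported by this OpenRouter deployment

def rough_token_estimate(text: str) -> int:
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    # ~1.5 chars/token for CJK, ~4 chars/token for English (assumed averages)
    return int(cjk / 1.5 + other / 4)

note = "word " * 40_000  # a long note: 200,000 characters
estimate = rough_token_estimate(note)
print(estimate, estimate > CONTEXT_LIMIT)  # -> 50000 True
```

When the estimate exceeds the limit, split the note and summarize it chunk by chunk rather than letting the request fail.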

I suspected it was a limit of OpenRouter's own deployment, so I tried the official Qwen3 demo:

https://huggingface.co/spaces/Qwen/Qwen3-Demo

There, the same text was summarized normally, and 128K tokens were enough. It seems that 8B, 14B, and 32B can still be used locally.

In conclusion

The knowledge base interaction test of Qwen3 with Obsidian-MCP leads to the following conclusions:

Version 4B: quantization compression leads to aphasia

  • Tool-calling ability is completely lost; it is indifferent even to an explicit obsidian_get_recent_changes instruction
  • Token capacity is 32K, so long sessions may be hard to process completely

Version 8B: seemingly useful but actually risky

  • It can recognize tool calls, but the returned file paths have a high error rate
  • Summaries hallucinate and rewrite content, even changing note names
  • An accidental delete through the MCP API with no git backup would be even more dangerous

Version 14B+: genuinely delivers

  • Its 128K token capacity fits the knowledge base scenario perfectly, and it called the Obsidian API accurately during testing
  • However, local deployment requires 16GB of VRAM, which is prohibitive for most NAS users

Until my 16GB graphics card arrives, I have to mind privacy protection: for now I use a cloud-based large model + MCP, reading only a non-sensitive data directory as context for question answering.
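The "read only the non-sensitive directory" idea can be sketched as a simple path filter applied to the vault listing before it is handed to a cloud model. The directory names below are examples, not any Obsidian or obsidian-mcp convention.

```python
import pathlib

# Directories treated as private; anything under them never leaves the NAS.
# These names are illustrative examples.
SENSITIVE_DIRS = {"journal", "finance", "private"}

def shareable_files(paths: list[str]) -> list[str]:
    """Keep only vault paths that do not pass through a sensitive directory."""
    keep = []
    for p in paths:
        parts = pathlib.PurePosixPath(p).parts
        if not any(part in SENSITIVE_DIRS for part in parts):
            keep.append(p)
    return keep

vault_listing = [
    "blog/qwen3-test.md",
    "journal/2025-06-28.md",
    "projects/nas-setup.md",
    "finance/taxes.md",
]
print(shareable_files(vault_listing))  # -> ['blog/qwen3-test.md', 'projects/nas-setup.md']
```

Applying the filter to the output of list_files_in_vault before building context keeps private notes out of cloud prompts entirely.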

After all, to be a tech expert, you must know how to find the optimal solution within realistic constraints.