Ollama v0.6.6 is released: doubled reasoning ability and 50% faster downloads. How does it compare with vLLM and LMDeploy?

Written by
Iris Vance
Updated on: June 30, 2025

Ollama v0.6.6 is here: doubled reasoning capability and 50% faster downloads make it a new option for AI developers!

Core content:
1. Introducing two new models, Granite 3.3 and DeepCoder, to enhance reasoning and code generation capabilities
2. Download speed is significantly improved, memory leaks are fixed, and the operation is more stable
3. API and compatibility improvements, with across-the-board gains in ease of use, inference speed, and memory usage

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


Ollama v0.6.6 major update: stronger reasoning, faster download, more stable memory

Attention AI developers! Ollama v0.6.6 is officially released, bringing major optimizations: new model support, faster downloads, memory-leak fixes, and more, making local large-model inference more efficient and stable!

Core update highlights

1. Two new models are launched

  • Granite 3.3 (2B & 8B): 128K ultra-long context, with improved instruction following and logical reasoning, suited to complex tasks.
  • DeepCoder (14B & 1.5B): a fully open-source code model with performance comparable to o3-mini, letting developers deploy high-quality code-generation AI at low cost!
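Assuming the two models are published in the Ollama library under tags matching their names (the exact tags below are an assumption, not confirmed by the release notes), trying them locally is a one-liner each:

```shell
# Pull and run the new models locally (model tags are assumed, not verified)
ollama run granite3.3:8b     # Granite 3.3, 8B variant with 128K context
ollama run granite3.3:2b     # smaller 2B variant
ollama run deepcoder:14b     # DeepCoder code model
ollama run deepcoder:1.5b    # lightweight 1.5B variant
```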

2. Download speed is greatly improved

  • Experimental new downloader: enable it by starting the server with OLLAMA_EXPERIMENT=client2 ollama serve for faster, more stable downloads!
  • Safetensors import optimization: ollama create is significantly faster when importing models.
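The experimental downloader is opt-in per server session via an environment variable. A minimal sketch, using the OLLAMA_EXPERIMENT=client2 flag named in the release notes above (the model tag in the pull command is illustrative):

```shell
# Start the Ollama server with the experimental downloader enabled
OLLAMA_EXPERIMENT=client2 ollama serve

# In another terminal, subsequent pulls use the new download path
ollama pull granite3.3
```

Because the flag is experimental, it only applies to the serve process it is set on; omit it to fall back to the default downloader.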

3. Critical bug fixes

  • Fixed a memory leak affecting Gemma 3 and Mistral Small 3.1, making long-running sessions more stable.
  • Mitigated out-of-memory (OOM) crashes by reserving more memory at startup.
  • Fixed data corruption during Safetensors import, ensuring model integrity.

4. API and compatibility improvements

  • Tool function parameters can now declare an array of types (such as string | number[]), making the API more flexible.
  • Added support for the OpenAI-Beta CORS header, easing front-end integration.
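As a sketch of what an array of parameter types could look like in a tool definition sent to Ollama's /api/chat endpoint (the get_weather tool, its parameter names, and the model tag are all hypothetical, chosen only to illustrate the type-array feature):

```shell
# Illustrative request: the "type" of one tool parameter is an array of types
curl http://localhost:11434/api/chat -d '{
  "model": "granite3.3",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" },
          "unit_or_code": { "type": ["string", "number"] }
        },
        "required": ["location"]
      }
    }
  }]
}'
```

Accepting an array in the "type" field follows JSON Schema convention, so a parameter can validate as either a string or a number without needing two separate tool definitions.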

Ollama vs. vLLM vs. LMDeploy: who is the king of local deployment?

| Comparison dimension | Ollama v0.6.6 | vLLM | LMDeploy |
|---|---|---|---|
| Ease of use | ⭐⭐⭐⭐⭐ (one-click install, suited to individual developers) | ⭐⭐⭐ (requires Docker/complex configuration) | ⭐⭐⭐⭐ (01.AI-optimized, suited to enterprises) |
| Inference speed | ⭐⭐⭐ (suited to small and medium models) | ⭐⭐⭐⭐⭐ (PagedAttention optimization, high throughput) | ⭐⭐⭐⭐ (TurboMind engine, low latency) |
| Memory optimization | ⭐⭐⭐ (automatic CPU/GPU switching) | ⭐⭐⭐⭐⭐ (continuous batching, high memory utilization) | ⭐⭐⭐⭐ (W4A16 quantization saves VRAM) |
| Model support | ⭐⭐⭐⭐ (GGUF quantization, rich community library) | ⭐⭐⭐ (model formats need manual conversion) | ⭐⭐⭐ (mainly the InternLM ecosystem) |
| Applicable scenarios | Personal development / lightweight applications | High-concurrency production environments | Enterprise real-time conversation / edge computing |