Why can't vLLM do this? Decoding the "black magic" behind Ollama's cross-tier deployment, with DeepSeek-R1-8B as an example

Written by
Silas Grey
Updated on: July 13, 2025
Recommendation

Explore the technical differences between vLLM and Ollama in large-scale model deployment, and reveal the memory optimization technology behind it.

Core content:
1. Performance comparison between vLLM and Ollama in 70B model deployment
2. Analysis of DeepSeek-R1-8B memory requirements and Ollama's memory optimization technology
3. vLLM's memory management strategy and its impact on performance


Recently I have been experimenting with deploying a 70B model with vLLM on a server with dual RTX 4090 GPUs and nearly 200 GB of RAM, serving a RAGFlow knowledge base.

Based on my earlier experience with Ollama, I assumed deploying 70B would be no problem, but then I hit one pitfall after another.

Let me summarize what happened first:

  •  Deploying the 30B model went smoothly, but its reasoning ability is indeed weak.
  •  Deploying the 70B model worked after continuous parameter tuning; a maximum of 4096 tokens was no problem (a bge-m3 embedding model was also deployed).
  •  When used inside RAGFlow, however, it often reported OOM for exceeding the maximum token count, so I filed an issue with the RAGFlow team. They fixed it very quickly, which was great.

The more I struggled, the more questions I had, so I dug into the underlying details and wrote this article.

Let's use an RTX 3060 with 12 GB of VRAM as the example and break things down.

Anatomy of DeepSeek-R1-8B's VRAM footprint

Raw video memory requirements

  •  Weights (direct cost): 8B parameters × 2 bytes (FP16) = 16 GB
  •  KV cache (context cache):
    •  Grows dynamically: 2 × layers × heads × head dimension × sequence length × batch size × bytes per value
    •  Taking 2048 tokens as an example: 2 (K/V) × 32 layers × 32 heads × 128 head dimension × 2048 sequence length × 2 bytes ≈ 1.07 GB
  •  Total requirement: 16 + 1.07 ≈ 17.07 GB, far exceeding the RTX 3060's 12 GB (the sketch below reproduces this arithmetic)
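
The arithmetic is easy to reproduce in a few lines of Python. The layer count, head count, and head dimension below are the figures assumed above, not values read from a model config:

```python
# Back-of-the-envelope VRAM estimate for an 8B model in FP16,
# using the figures assumed above (32 layers, 32 heads, head dim 128).
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_val: int = 2) -> float:
    # The leading 2 accounts for the separate K and V tensors in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_val

GB = 1e9
weights = weight_bytes(8e9, 2)                    # FP16 weights: ~16 GB
kv = kv_cache_bytes(32, 32, 128, seq_len=2048)    # ~1.07 GB at 2048 tokens
print(f"weights {weights / GB:.2f} GB + KV cache {kv / GB:.2f} GB "
      f"= {(weights + kv) / GB:.2f} GB")          # ~17.07 GB, more than a 12 GB RTX 3060
```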

Ollama's "slimming magic"

  •  4-bit quantization: weight memory drops to 8B × 0.5 bytes = 4 GB
  •  Dynamic offloading: some weights are temporarily moved to CPU RAM, bringing measured VRAM usage down to 6.2 GB (the sketch below shows how to steer this split from the client side)
  •  Cost: the CPU-GPU transfers cut token generation from 45 tokens/s to 28 tokens/s
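
If you want to control that split yourself, the number of layers kept on the GPU is exposed as the `num_gpu` option of Ollama's local REST API. A minimal sketch; the model tag and option values here are illustrative placeholders:

```python
import requests

# Ask a locally running Ollama server (default port 11434) to generate text
# while keeping only part of the model's layers resident on the GPU.
# "num_gpu" = number of layers offloaded to VRAM; the rest stay in system RAM.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:8b",        # placeholder tag; use whatever is pulled locally
        "prompt": "Explain the KV cache in one paragraph.",
        "stream": False,
        "options": {
            "num_gpu": 24,                # keep 24 of 32 layers in VRAM (illustrative value)
            "num_ctx": 2048,              # context window, matching the example above
        },
    },
    timeout=600,
)
print(resp.json()["response"])
```

Lowering `num_gpu` trades generation speed for VRAM headroom, which is exactly the 45 → 28 tokens/s trade-off measured above.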

The craziest part is that a former colleague of mine used Ollama to run a 70B model on a MacBook Air with 8 GB of RAM.

vLLM's "Memory Cleanliness"

  •  Design principle: vLLM pursues maximum throughput and rejects any dynamic offloading that might hurt performance.
  •  Hard VRAM threshold: weights + KV cache must be fully resident on the GPU, so DeepSeek-R1-8B fails to start on a 12 GB card (see the startup sketch below).
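
The "hard threshold" shows up at engine startup: vLLM pre-allocates weights plus KV cache inside `gpu_memory_utilization × VRAM` and aborts if they do not fit. A hedged sketch of the relevant knobs (the model id and limits are illustrative):

```python
from vllm import LLM, SamplingParams

# vLLM pre-allocates weights + KV cache on the GPU at startup.
# On a 12 GB card an FP16 8B model fails here with an out-of-memory error;
# shrinking max_model_len or gpu_memory_utilization only trims the KV cache budget,
# it cannot move weights off the GPU.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # illustrative model id
    dtype="float16",
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM is allowed to claim
    max_model_len=2048,            # caps the per-sequence KV cache size
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```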

Three core techniques behind Ollama's cross-tier deployment

Mixed precision quantization (flexibility crushes vLLM)

  •  Layer-sensitive quantization: MLP layers use 4-bit, while the top-level Attention layers keep 6-bit (reducing precision loss)
  •  Measured comparison (DeepSeek-R1-8B generation task):
| Quantization scheme | VRAM usage | PPL (perplexity) |
| --- | --- | --- |
| Ollama mixed precision | 6.2 GB | 7.1 |
| vLLM official INT8 quantization | 10.5 GB | 6.9 |
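
The VRAM saving from mixing bit widths is easy to approximate. In the sketch below, the 30/70 split of parameters between attention and MLP blocks is an assumption chosen for illustration, not a measured figure:

```python
# Approximate weight memory for a mixed-precision scheme:
# attention kept at 6-bit, MLP quantized to 4-bit.
TOTAL_PARAMS = 8e9
ATTN_SHARE = 0.30          # assumed fraction of parameters in attention blocks
MLP_SHARE = 1 - ATTN_SHARE

def gb(params: float, bits: float) -> float:
    return params * bits / 8 / 1e9

mixed = gb(TOTAL_PARAMS * ATTN_SHARE, 6) + gb(TOTAL_PARAMS * MLP_SHARE, 4)
int8 = gb(TOTAL_PARAMS, 8)
print(f"mixed 6/4-bit weights: ~{mixed:.1f} GB")   # ~4.6 GB before KV cache and overhead
print(f"uniform INT8 weights:  ~{int8:.1f} GB")    # ~8.0 GB before KV cache and overhead
```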

VRAM-RAM tiered storage (off-limits for vLLM)

  •  Strategy: the FP16 Attention weights stay resident in VRAM, while the MLP weights are dynamically loaded in from system memory.
  •  Technical cost:
    •  Each forward pass adds 5-8 ms of PCIe transfer latency (a timing sketch follows this list).
    •  In exchange, the VRAM requirement drops by about 40% (from 10.5 GB to 6.2 GB).
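
The per-step PCIe cost is easy to sanity-check: time how long one MLP block's worth of FP16 weights takes to reach the GPU. A minimal PyTorch sketch (the tensor shape roughly matches one MLP block of an 8B Llama-style model; it requires a CUDA GPU and is an illustration, not Ollama's code):

```python
import torch

# Roughly one MLP block of an 8B Llama-style model in FP16:
# three projections of shape (4096, 14336) ≈ 176M params ≈ 0.35 GB.
mlp_weights = torch.empty(3, 4096, 14336, dtype=torch.float16).pin_memory()

torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
on_gpu = mlp_weights.to("cuda", non_blocking=True)   # host-to-device copy over PCIe
end.record()
torch.cuda.synchronize()

print(f"transfer time: {start.elapsed_time(end):.1f} ms")
```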

Adaptive sequence slicing

  •  Long-text handling: when the input exceeds 512 tokens, it is automatically split into multiple segments processed in parallel (see the splitting sketch below).
  •  VRAM savings: with a 2048-token input, peak VRAM drops by 32%.
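
The splitting step itself is straightforward. A hedged sketch: the 512-token chunk size follows the text, while the overlap value is an illustrative choice rather than anything Ollama documents:

```python
from typing import List

def slice_tokens(token_ids: List[int], chunk: int = 512, overlap: int = 64) -> List[List[int]]:
    """Split a long token sequence into overlapping chunks of at most `chunk` tokens,
    so each segment's KV cache stays small and peak VRAM is bounded."""
    if len(token_ids) <= chunk:
        return [token_ids]
    step = chunk - overlap
    return [token_ids[i:i + chunk] for i in range(0, len(token_ids), step)]

segments = slice_tokens(list(range(2048)))
print([len(s) for s in segments])   # [512, 512, 512, 512, 256]
```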

Why does vLLM "rather die than deploy across tiers"?

A fundamental conflict of design goals

  •  vLLM's core mission: high throughput and low latency in service scenarios (such as an API hit by hundreds of users at once).
  •  Why it rejects dynamic offloading:
    •  CPU-GPU data transfers would severely slow down the handling of concurrent requests.
    •  VRAM fragmentation could break the contiguous block allocation that vLLM's PagedAttention relies on (a toy allocator sketch follows this list).
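
To see why fragmentation is a deal-breaker, recall that PagedAttention hands out the KV cache in fixed-size blocks from a pool reserved at startup. A toy allocator in that spirit (not vLLM's actual code) illustrates why nothing else is allowed to nibble at that reserved region:

```python
class PagedKVPool:
    """Toy fixed-size block allocator in the spirit of PagedAttention (not vLLM code).

    The whole pool is reserved up front; sequences borrow and return whole blocks,
    so the pool never fragments -- which is why vLLM cannot tolerate weights being
    swapped in and out of that reserved region at runtime."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # all blocks reserved at startup
        self.allocated = {}                          # seq_id -> list of block ids

    def allocate(self, seq_id: str, n_blocks: int) -> list:
        if len(self.free_blocks) < n_blocks:
            raise MemoryError("KV pool exhausted")   # vLLM would refuse or preempt here
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.allocated.setdefault(seq_id, []).extend(blocks)
        return blocks

    def free(self, seq_id: str) -> None:
        self.free_blocks.extend(self.allocated.pop(seq_id, []))

pool = PagedKVPool(num_blocks=128)
pool.allocate("request-1", 16)
pool.free("request-1")
```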

Limitations in quantization support

  •  Only static INT8: it cannot mix 4-bit and 6-bit the way Ollama does, so the VRAM compression ratio is lower.
  •  Fixed calibration data: vLLM requires offline quantization, while Ollama supports dynamic adjustment at runtime.

Hardware compatibility differences

  •  Ollama's "art of compromise":
    •  To stay compatible with consumer cards (such as the RTX 3060), it is willing to sacrifice speed for VRAM.
    •  It even supports simulating VRAM with system memory (slower, but it still runs).
  •  vLLM's "elitism":
    •  Optimized primarily for data-center (Tesla-class) GPUs such as the A100/H100, relying on high-bandwidth memory.
    •  On consumer cards its performance can actually trail Ollama's (measured ~15% lower throughput on an RTX 4090).

Hands-on test: a life-and-death duel on the RTX 3060

Test environment

  •  Graphics card: NVIDIA RTX 3060 12 GB
  •  Test task: DeepSeek-R1-8B generating a 512-token answer

Results comparison

| Framework | VRAM usage | Generation speed | Availability |
| --- | --- | --- | --- |
| vLLM | Errors out and exits | - | Completely unavailable |
| Ollama | 6.2 / 12 GB | 22 tokens/s | Runs smoothly |
| Raw Hugging Face (Transformers) | 17.1 GB needed / 12 GB available | Errors out and exits | Unavailable |

Key conclusions

  •  Ollama's survival logic: through quantization + dynamic offloading, it compresses the memory requirement to under 60% of the card's VRAM.
  •  vLLM's philosophical flaw: in pursuit of industrial-grade performance, it gives up on adapting to resource-constrained scenarios.

Developer Selection Guide

Scenarios for choosing Ollama

  •  Individual developers or small teams with limited hardware (≤24 GB VRAM)
  •  Need to validate model behavior quickly and can tolerate higher latency
  •  Long-text generation workloads (the slicing strategy reduces peak VRAM)

Scenarios for choosing vLLM

  •  Enterprise-grade API services that must sustain high concurrency (≥100 QPS)
  •  Access to professional GPUs such as the A100/H800, chasing maximum throughput
  •  Need to integrate with an existing Kubernetes cluster scheduling system

Ultimate Tips for Avoiding Pitfalls

  •  Beware of "false upgrades": some tools claim to support low-memory operation but actually cut model parameters significantly (e.g. a DeepSeek-8B quietly trimmed down to 6B).
  •  Verify quantization integrity: use an llm-int8 check to confirm that the Attention layers really retain higher precision.
  •  Stress-test before relying on it: Ollama needs long-generation memory-leak testing (some versions have cumulative VRAM usage bugs); a soak-test sketch follows this list.
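
For the last point, a simple soak test is usually enough to spot creeping VRAM usage: generate in a loop and log what `nvidia-smi` reports. A hedged sketch; the model tag, prompt, and iteration count are placeholders:

```python
import subprocess
import requests

def vram_used_mib() -> int:
    # Query the first GPU's used memory via nvidia-smi, in MiB.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"])
    return int(out.decode().splitlines()[0].strip())

for i in range(50):  # repeated long generations against a local Ollama server
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "deepseek-r1:8b",
              "prompt": "Write a 1000-word essay on retrieval-augmented generation.",
              "stream": False},
        timeout=1200,
    )
    print(f"iteration {i:02d}: {vram_used_mib()} MiB used")  # should plateau, not climb
```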

Conclusion: There are no myths, only trade-offs

The essence of Ollama's "leapfrogging" is the democratization of technology: letting more people run large models, even at the cost of speed. vLLM's "coldness" is a choice driven by commercial reality. The two may converge in the future (for example, vLLM adding dynamic offloading), but until then developers need to recognize their own requirements and pick the battlefield that suits them.

Related terms

Memory (RAM) vs. Video Memory (VRAM)

  •  System memory (RAM): the main memory managed by the CPU, commonly just called "memory"; it plugs into motherboard slots and is shared by all system processes.
  •  Video memory (VRAM): GPU-dedicated memory that communicates with the CPU over the PCIe bus and is designed for high-throughput parallel computing.
  •  Key differences:
    •  The CPU does not contain memory itself; it accesses RAM through a memory controller.
    •  GPU VRAM is separate hardware, physically distinct from CPU-side RAM.
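
The distinction is visible directly from code, because RAM and VRAM are reported through different APIs. A small sketch, assuming `psutil` and a CUDA build of PyTorch are installed:

```python
import psutil
import torch

# System memory (RAM): owned by the OS, accessed by the CPU via the memory controller.
ram = psutil.virtual_memory()
print(f"RAM:  {ram.total / 1e9:.1f} GB total, {ram.available / 1e9:.1f} GB available")

# Video memory (VRAM): physically on the GPU board, reached over PCIe.
if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"VRAM: {total_b / 1e9:.1f} GB total, {free_b / 1e9:.1f} GB free")
```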

The essence of Ollama video memory optimization: CPU-GPU heterogeneous memory swapping

When Ollama claims to "move some weights to CPU memory", what is technically happening is this: weight data in GPU VRAM that is not needed right now is transferred to system RAM over the PCIe bus, then dynamically loaded back into VRAM when it is needed again. This process involves the following core techniques:

(1) Memory Tiering

  •  Hot data: the weights needed for the current computation (e.g. the Attention layer currently executing) stay in VRAM.
  •  Cold data: weights needed only in later steps (e.g. the next layer's MLP parameters) are parked in system RAM.
  •  Swap granularity: usually per layer; for example, DeepSeek-R1-8B's 32 layers can be split into several groups and loaded on demand.

(2) Prefetching & Caching

  •  Preloading mechanism: while the current layer is computing, the next layer's weights are asynchronously transferred from system RAM into VRAM.
  •  Caching strategy: copies of frequently used weights (such as the Embedding layer) are kept permanently in VRAM.

(3) Hardware-accelerated transmission

  •  Pinned memory: torch.cuda.Stream is used together with pinned (page-locked) memory to reduce CPU-GPU transfer latency.
  •  Direct Memory Access (DMA): the GPU's DMA engine drives the transfer directly, bypassing the CPU; measured bandwidth can approach the ~32 GB/s of PCIe 4.0 x16. (A combined prefetch sketch follows.)
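
Putting the prefetching and pinned-memory points together, here is a minimal PyTorch sketch of overlapping the next layer's weight upload with the current layer's compute on a side stream. The layer sizes are illustrative, it requires a CUDA GPU, and it is not Ollama's implementation:

```python
import torch

copy_stream = torch.cuda.Stream()          # dedicated stream for weight uploads

# Pinned (page-locked) host copies of two "layers" of weights (illustrative sizes).
layers_cpu = [torch.randn(4096, 4096, dtype=torch.float16).pin_memory() for _ in range(2)]
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")

current = layers_cpu[0].to("cuda", non_blocking=True)
for i in range(len(layers_cpu)):
    # Kick off the next layer's host-to-device copy on the side stream (async DMA).
    nxt = None
    if i + 1 < len(layers_cpu):
        with torch.cuda.stream(copy_stream):
            nxt = layers_cpu[i + 1].to("cuda", non_blocking=True)

    x = x @ current.T                      # compute with the current layer on the default stream

    if nxt is not None:
        torch.cuda.current_stream().wait_stream(copy_stream)  # ensure the copy finished
        current = nxt

print(x.shape)
```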