Recommendation
Choosing a deployment tool for large models does not have to be difficult. This article uses the DeepSeek-R1 32B model as an example to walk through how to choose between Ollama and llama.cpp.
Core content:
1. The background and difference between Ollama and llama.cpp as large model deployment tools
2. The technical relationship and underlying implementation of Ollama and llama.cpp
3. Ollama and llama.cpp performance evaluation and deployment practice based on the DeepSeek-R1 32B model
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Ollama and llama.cpp are both common tools for deploying large models locally: even an ordinary laptop can run a large model with them. Because both names contain the word "llama", it can be hard to decide between them. This article aims to help you make a quick choice with the help of a practical example.

Let me state the conclusion first: if you only want to deploy locally and do not care much about performance, choose Ollama; if you want extreme optimization and performance above all, choose llama.cpp. Below, we deploy DeepSeek-R1 32B with both Ollama and llama.cpp to show how this conclusion is reached.

2. The relationship between Ollama and llama.cpp

Both names contain "llama", the familiar open-source model family from Meta. At first, Ollama and llama.cpp were both built to serve llama; later they developed into two independent projects, each with its own community. What I want to emphasize here is that Ollama uses llama.cpp as the underlying implementation for model inference. This can be seen in Ollama's source code: the llama subdirectory of the Ollama repository contains the llama.cpp code, and its C++ interface is exported into Go space through the llama.go file. From the source-code perspective, llama.cpp can therefore be regarded as the bottom layer of Ollama.

3. Evaluating Ollama and llama.cpp

The model format recommended by llama.cpp is GGUF. To keep the comparison fair, Ollama is given the same GGUF model. The model used in this experiment is DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf.

3.1 Ollama deploys DeepSeek-R1 32B

Ollama does not load a bare GGUF file directly; you import it through a Modelfile. The steps are as follows:

- Create a file named deepseek-r1-32b.gguf (despite the extension, this is a Modelfile) with the following content:
FROM ./bartowski/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf
- Execute the following command:
ollama create my-deepseek-r1-32b-gguf -f .\deepseek-r1-32b.gguf
This imports the DeepSeek-R1 32B GGUF model into Ollama for use.
- Execute the following command to start the model:
ollama run my-deepseek-r1-32b-gguf:latest
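Once ollama run has started the model, you can also exercise it programmatically through Ollama's local HTTP API, which listens on port 11434 by default. Below is a minimal sketch in Go that sends a non-streaming request to the /api/generate endpoint; the model name is the one created above, and the prompt is only a placeholder.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes for Ollama's /api/generate endpoint.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type generateResponse struct {
	Response string `json:"response"`
}

func main() {
	// Model name created with `ollama create` above; adjust if yours differs.
	body, _ := json.Marshal(generateRequest{
		Model:  "my-deepseek-r1-32b-gguf:latest",
		Prompt: "Briefly introduce yourself.",
		Stream: false, // ask for a single JSON object instead of a stream
	})

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Response)
}

This step is optional; the rest of the comparison only relies on the ollama commands shown above.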
Now it can be loaded normally. Through the ollama ps command, you can see the process information as follows:

NAME                              ID              SIZE     PROCESSOR          UNTIL
my-deepseek-r1-32b-gguf:latest    ad9f11c41b7a    25 GB    87%/13% CPU/GPU    3 minutes from now
As you can see, the entire model is about 25 GB; 87% of it is loaded into CPU memory and 13% into GPU memory. In actual use, inference is very slow, but it is still usable.

3.2 llama.cpp deploys DeepSeek-R1 32B

I use Git Bash, so I followed the llama.cpp build instructions for that environment: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#git-bash-mingw64
You can also follow whichever build instructions match your own environment. After building, run the model with the following command:

build/bin/Release/llama-cli -m "/path/to/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf" -ngl 100 -c 16384 -t 10 -n -2 -cnv
The result is as follows:

ggml_vulkan: Device memory allocation of size 1025355776 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
llama_model_load: error loading model: unable to allocate Vulkan0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:/llm/Model/bartowski/DeepSeek-R1-Distill-Qwen-32B-Q5_K_M.gguf'
main: error: unable to load model
It reports an error directly.

4. Why did the llama.cpp deployment fail?

At this point, I believe you already know how to choose. But since we know llama.cpp is the bottom layer of Ollama, why does llama.cpp do worse than Ollama here? The reason is an optimization Ollama makes around llama.cpp's ngl parameter: with llama.cpp, the ngl value is whatever you hard-code on the command line, while Ollama calculates it dynamically based on the model file.

The ngl parameter controls how many layers are offloaded to the GPU. My laptop's GPU has only 4 GB of video memory, so there is no way to fit the 25 GB DeepSeek model into it. The -ngl 100 used above is therefore clearly wrong (100 layers essentially means offloading all layers to the GPU), but with only llama.cpp's command line it is hard to estimate what value of -ngl would let the model deploy successfully.

So how does Ollama do it? The answer is in Ollama's source file memory.go. The following function in that file calculates the ngl value based on the model:

// Given a model and one or more GPU targets, predict how many layers and bytes we can load, and the total size
// The GPUs provided must all be the same Library
func EstimateGPULayers(gpus []discover.GpuInfo, f *ggml.GGML, projectors []string, opts api.Options) MemoryEstimate {
	// Graph size for a partial offload, applies to all GPUs
	var graphPartialOffload uint64

	// Graph size when all layers are offloaded, applies to all GPUs
	var graphFullOffload uint64

	// Final graph offload once we know full or partial
	var graphOffload uint64
	...
Due to space limitations, I have not listed the code in full; if you are interested, take a look for yourself. It is this function that lets Ollama calculate the ngl value dynamically, split the model "87% into CPU memory, 13% into GPU memory", and finally deploy DeepSeek-R1 32B successfully (a simplified sketch of the idea appears at the end of this article). Honestly, it is quite surprising that an ordinary laptop can deploy a 32B model at all.

As a quick test, I also gave it the task that Foreign Minister Wang Yi recently assigned to DeepSeek: translating "Let him be strong, the breeze blows over the hills; let him be arrogant, the bright moon shines on the river." The result is a bit odd: it produces a partial translation but misunderstands the task, which may have something to do with other parameters that Ollama tunes.
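Finally, here is the simplified sketch promised above of the idea behind dynamic layer offloading. To be clear, this is not Ollama's actual EstimateGPULayers implementation: it assumes a uniform per-layer size and ignores the KV cache and compute-graph overhead that the real code accounts for, and the layer count and sizes below are only rough illustrative numbers.

package main

import "fmt"

// estimateGPULayers is a hypothetical, simplified version of the idea behind
// Ollama's dynamic offloading: given free VRAM and a per-layer size, decide
// how many layers fit on the GPU (the rest stay in CPU memory).
func estimateGPULayers(freeVRAMBytes, layerBytes uint64, totalLayers int) int {
	if layerBytes == 0 {
		return 0
	}
	fit := int(freeVRAMBytes / layerBytes)
	if fit > totalLayers {
		fit = totalLayers
	}
	return fit
}

func main() {
	// Rough numbers for illustration only: a ~25 GB model with 64 layers
	// and a 4 GB GPU, similar to the setup described in this article.
	const gib = uint64(1024 * 1024 * 1024)
	totalLayers := 64
	layerBytes := 25 * gib / uint64(totalLayers)
	freeVRAM := 4 * gib

	ngl := estimateGPULayers(freeVRAM, layerBytes, totalLayers)
	fmt.Printf("offload %d of %d layers to the GPU\n", ngl, totalLayers)
}

With numbers in this ballpark, the sketch suggests offloading roughly 10 of 64 layers, which is in the same spirit as the 87%/13% CPU/GPU split that ollama ps reported; a llama.cpp user would then pass the resulting value manually, for example as -ngl 10.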