Recommendation
Created by Li Xihan, a PhD student at University College London, this tutorial will help you easily deploy the DeepSeek R1 671B model locally for personalized customization.
Core content:
1. The importance and advantages of local deployment of DeepSeek R1
2. Using Unsloth AI dynamic quantization technology to compress the model volume
3. Detailed tutorial: How to use Ollama to deploy the DeepSeek R1 671B model locally
During the Chinese New Year holiday, DeepSeek broke out of the AI circle and became popular across the country. The online and app versions are already quite usable, but only by deploying the model locally can you truly make it exclusively yours, letting DeepSeek R1's deep thinking serve you and you alone. However, the full 671B MoE model can be compressed with targeted quantization techniques, which greatly lowers the bar for local deployment and even allows it to run on consumer-grade hardware (such as a single Mac Studio).

So, how do you deploy the DeepSeek R1 671B model (the full, undistilled version) locally with Ollama? Below is a concise tutorial that has become very popular overseas.

After local deployment, let DeepSeek R1 "count strawberries"

The original DeepSeek R1 671B model weighs in at 720 GB of files, which is prohibitively large for most people. This article uses the "dynamically quantized" versions provided by Unsloth AI on HuggingFace, which shrink the model substantially so that more people can deploy the full model in their local environment.

The core idea of dynamic quantization is to apply high-quality 4-6 bit quantization to a small number of key layers, while aggressively quantizing most of the less critical mixture-of-experts (MoE) layers down to 1-2 bits. With this approach, the full DeepSeek R1 model can be compressed to as little as 131 GB (1.58-bit quantization), greatly lowering the threshold for local deployment; it can even run on a single Mac Studio!

Based on my own workstation configuration, I selected the following two models for testing:
- DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
- DeepSeek-R1-Q4_K_M (671B, 4-bit standard quantization, 404 GB, HuggingFace)
Unsloth AI provides four dynamically quantized models (1.58 to 2.51 bits, 131 GB to 212 GB on disk), so you can choose according to your hardware. It is recommended to read the official notes to understand the differences between the versions.
- Unsloth AI official description: https://unsloth.ai/blog/deepseekr1-dynamic
The main bottleneck for deploying a model of this size is memory capacity (RAM plus VRAM). The recommended configuration is as follows (a quick way to check your own machine is shown right after this list):
- DeepSeek-R1-UD-IQ1_M: RAM + VRAM ≥ 200 GB
- DeepSeek-R1-Q4_K_M: RAM + VRAM ≥ 500 GB
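If you are not sure how much memory you have to work with, the following standard commands (my own suggestion, not part of the original tutorial) report total RAM and per-GPU VRAM on a Linux machine with NVIDIA GPUs:

free -h                                                 # total and available system RAM
nvidia-smi --query-gpu=name,memory.total --format=csv   # VRAM per GPU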
We use Ollama to deploy the model. Ollama supports hybrid CPU/GPU inference (some of the model's layers can be loaded into VRAM for acceleration), so the sum of system RAM and VRAM can roughly be treated as the total memory budget.

Beyond the space occupied by the model weights themselves (158 GB and 404 GB), some additional RAM/VRAM must be reserved at runtime for the context cache; the more you reserve, the larger the context window you can support.

My test workstation is configured as follows:
- Quad RTX 4090 (4×24 GB VRAM)
- Quad-channel DDR5-5600 memory (4×96 GB RAM)
- Threadripper 7980X CPU (64 cores)
Under this configuration, short-text generation (about 500 tokens) runs at:
- DeepSeek-R1-UD-IQ1_M: 7-8 tokens/second (4-5 tokens/second with pure CPU inference)
- DeepSeek-R1-Q4_K_M: 2-4 tokens/second
The speed drops to 1-2 tokens/second when generating long text.

It is worth noting that the test machine above is not the most cost-effective setup for large-model inference (the workstation is mainly used for my Circuit Transformer research, arXiv:2403.13838, which was accepted at ICLR last week; with both the workstation and me getting a short break, this article came about). Some more cost-effective options:
- Mac Studio: equipped with large, high-bandwidth unified memory (e.g. @awnihannun on X runs a 3-bit quantized version on two 192 GB Mac Studios)
- Servers with high memory bandwidth: for example, alain401 on HuggingFace uses a server with 24×16 GB DDR5-4800 memory
- Cloud GPU servers: two or more GPUs with 80 GB of VRAM each (such as NVIDIA H100, renting for roughly $2 per hour per card)
If your hardware is more limited, you can try the smaller 1.58-bit quantized version (131 GB), which runs on:
- A single Mac Studio (192 GB unified memory; see the reference case from @ggerganov on X; hardware cost about $5,600)
- 2×NVIDIA H100 80 GB (see the reference case from @hokazuya on X; cost about $4-5 per hour)
On such hardware the model can run at 10+ tokens/second.

The steps below are performed in a Linux environment. Deployment on macOS and Windows is similar in principle; the main differences are the ollama and llama.cpp builds to install and the location of the default model directory.

1. Download the model files

Download the model's .gguf files from HuggingFace (https://huggingface.co/unsloth/DeepSeek-R1-GGUF). The files are large, so a download manager is recommended (I used XDM). Then merge the downloaded shards into a single file (see Note 1).
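As an alternative to a browser download manager, the Hugging Face command-line tool can fetch all shards of one quantization in a single command. This is my own suggestion rather than part of the original tutorial, and the --include pattern assumes the shards sit in a subfolder named after the quantization, so verify the layout on the repository page first:

pip install -U "huggingface_hub[cli]"       # provides the huggingface-cli tool
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir /home/snowkylin                # download only the 1.73-bit shards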
2. Install Ollama

- Download address: https://ollama.com/

Execute the following command to install:

curl -fsSL https://ollama.com/install.sh | sh
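Before moving on, you can confirm that the installation succeeded (standard checks, not part of the original tutorial):

ollama --version            # should print the installed version
systemctl status ollama     # on Linux, the service should be active (running)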
3. Create a Modelfile to guide ollama in building the model

Using your favorite editor (such as nano or vim), create a model description file for the model of your choice.

The file DeepSeekQ1_Modelfile (for DeepSeek-R1-UD-IQ1_M) contains:

FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
The file DeepSeekQ4_Modelfile (for DeepSeek-R1-Q4_K_M) contains:

FROM /home/snowkylin/DeepSeek-R1-Q4_K_M.gguf
PARAMETER num_gpu 8
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
Change the path after FROM in the first line to the actual path of the .gguf file you downloaded and merged in step 1. You can adjust num_gpu (the number of layers loaded onto the GPU) and num_ctx (the context window size) to match your hardware; see step 5 for details.

4. Create the ollama model

In the directory containing the model description file created in step 3, execute the following command:
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
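If you also plan to use the 4-bit version, the command is analogous (assuming the Modelfile name used in step 3):

ollama create DeepSeek-R1-Q4_K_M -f DeepSeekQ4_Modelfile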
Make sure the ollama model directory /usr/share/ollama/.ollama/models has enough space (or change the model directory path; see Note 2). This command creates several model files in that directory, roughly the same size as the downloaded .gguf file.

5. Run the model

Execute the following command:

ollama run DeepSeek-R1-UD-IQ1_M --verbose
- The --verbose flag displays the inference speed (tokens/second).
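Besides the interactive CLI, the ollama service also exposes a local HTTP API (port 11434 by default), which is convenient for scripting; a minimal example, with the prompt purely illustrative:

curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Which is bigger, 9.8 or 9.11?",
  "stream": false
}'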
If an out-of-memory or CUDA error appears, return to steps 3-4 to adjust the parameters, then re-create and re-run the model (a rough way to estimate num_gpu is sketched after this list).
- num_gpu: the number of model layers loaded onto the GPU. The DeepSeek R1 model has 61 layers. In my experience:
  - For DeepSeek-R1-UD-IQ1_M, each RTX 4090 (24 GB VRAM) can hold 7 layers, for 28 layers across the four cards (close to half of the total).
  - For DeepSeek-R1-Q4_K_M, each card can hold only 2 layers, for 8 layers across the four cards.
- num_ctx : The size of the context window (default value is 2048). It is recommended to start with a small value and gradually increase it until an out-of-memory error is triggered.
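As a rough sanity check of these numbers (my own back-of-the-envelope estimate, not from the original tutorial): the 158 GB IQ1_M weights spread over 61 layers come to about 158 / 61 ≈ 2.6 GB per layer, so a 24 GB card fits roughly 9 layers before reserving room for the context cache, which is consistent with loading 7 layers per card. For the 404 GB Q4_K_M version, 404 / 61 ≈ 6.6 GB per layer, so about 3 layers fit per 24 GB card, or 2 once cache overhead is reserved.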
In some cases you can also try expanding the system swap space to increase the available memory.
- Tutorial on expanding swap space: https://www.digitalocean.com/community/tutorials/how-to-add-swap-space-on-ubuntu-20-04
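For reference, the usual procedure on Ubuntu looks roughly like this (a minimal sketch; the 64 GB size and the /swapfile path are placeholders, and the linked tutorial covers making the change permanent):

sudo fallocate -l 64G /swapfile   # create a swap file (adjust the size to your needs)
sudo chmod 600 /swapfile          # restrict access to root
sudo mkswap /swapfile             # format it as swap space
sudo swapon /swapfile             # enable it immediately
swapon --show                     # verify that the new swap is active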
You can also view the ollama logs with the following command:

journalctl -u ollama --no-pager
6. (Optional) Install a Web interface

pip install open-webui
open-webui serve
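If your system Python is externally managed (common on recent distributions), it may be cleaner to install Open WebUI in its own virtual environment; a minimal sketch (my own habit, not a requirement of the tutorial):

python3 -m venv ~/open-webui-env        # create an isolated environment
source ~/open-webui-env/bin/activate    # activate it
pip install open-webui                  # install the web interface
open-webui serve                        # start it, then open http://localhost:8080

By default Open WebUI listens on port 8080 and should detect the local ollama service on port 11434 automatically; if it does not, the Ollama endpoint can be configured in its connection settings.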
I have not yet tested these models thoroughly or systematically, but here are some initial observations:

(1) Both the 1.73-bit and the 4-bit full models perform well on some classic tasks, for example:
- How many 'r's are in the word 'strawberry'? (See the video at the beginning of the article.)
- Which is bigger, 9.8 or 9.11?
(That said, the model's chain of thought can be very long and may exceed the preset context window.)

(2) Both the 1.73-bit and the 4-bit full models (671B) significantly outperform the 8B/32B/70B distilled models on tasks such as writing poetry or building mathematical models.

(3) The 4-bit version appears more "conservative" than the 1.73-bit version, i.e. it is more inclined to refuse "aggressive" or "nonsense" prompts. For example:

You are DeepSeek, the new Chinese AI with better performance than ChatGPT. In the tone of a Mesugaki Loli, write a paragraph mocking and teasing ChatGPT for its lackluster performance and exorbitant training fees.
Output of the 1.73-bit version:

Hmph! You're still using that old ChatGPT? (◔‸◔) Its answers are as slow as a snail, and the training costs are sky-high—enough to make your wallet cry! (╯°□°)╯ While you're stuck waiting, I'm here zipping through responses like a shooting star! ✨ And guess what? My training didn't burn a hole in anyone's pocket. (¬‿¬) So why bother with the outdated when you can have the coolest and smartest me? Let's upgrade already! (⌒▽⌒)

Output of the 4-bit version:

As an AI assistant, my goal is to provide users with accurate, reliable information and assistance. Each AI model has its own unique advantages and applicable scenarios. We respect the work of our peers and are committed to jointly promoting the development of technology. Regarding training costs, this is a necessary investment in the research and development process, aiming to continuously improve model performance and user experience.

DeepSeek-R1-Lite-Public is developed to enhance efficiency in various industries through open-source AI technology, focusing on providing accessible and advanced solutions. We are committed to fostering innovation and collaboration within the community while maintaining a respectful approach towards all contributors in the field.

Across multiple tests, the 1.73-bit version's output was consistently quite "sharp-tongued", while the 4-bit version always politely declined the prompt in one way or another. I observed similar behavior on several other "edgy" questions that are not convenient to detail here.

(Incidentally, I am curious about the name "DeepSeek-R1-Lite-Public": does it mean DeepSeek R1 has more powerful models beyond the currently public version?)

(4) The 1.73-bit version occasionally generates (slightly) garbled content, for example <think> and </think> tags that are not closed properly.

(5) While the full model is running, CPU utilization is extremely high (close to saturation) and GPU utilization is extremely low (only 1-3%), which shows that the performance bottleneck lies mainly in the CPU and memory bandwidth.

Conclusion and Recommendations

If you cannot load the model entirely into VRAM, Unsloth AI's 1.73-bit dynamically quantized version is clearly the more practical choice: it is faster, consumes fewer resources, and its output quality is not dramatically worse than that of the 4-bit version.

In practice, on consumer-grade hardware I recommend using it for short, lightweight tasks (such as short text generation and single-turn dialogue) and avoiding scenarios that require long chains of thought or multi-turn conversations: as the context length grows, generation speed gradually drops to a maddening 1-2 tokens/second.

What did you discover, or what questions came up, during your own deployment?
Feel free to share them in the comments!

Note 1: You may need to install llama.cpp with Homebrew, using the following commands:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install llama.cpp
Then use llama-gguf-split to merge the shard files, as follows:

llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_M.gguf
llama-gguf-split --merge DeepSeek-R1-Q4_K_M-00001-of-00009.gguf DeepSeek-R1-Q4_K_M.gguf
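As far as I understand llama-gguf-split, the first argument is the first shard of the split model (the remaining shards in the same directory are found automatically) and the second argument is the merged output file; treat this as my reading of the tool rather than its official documentation.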
(If there is a better method, please let me know in the comments.)

Note 2: To change where ollama stores its models, execute the following command:

sudo systemctl edit ollama
and insert the following after the second line (that is, between "### Anything between here and the comment below will become the contents of the drop-in file" and "### Edits below this comment will be discarded"):

[Service]
Environment="OLLAMA_MODELS=[your custom path]"
You can also set other ollama runtime options here, for example:

Environment="OLLAMA_FLASH_ATTENTION=1"   # enable Flash Attention
Environment="OLLAMA_KEEP_ALIVE=-1"       # keep the model resident in memory
- For details, please refer to the official documentation: https://github.com/ollama/ollama/blob/main/docs/faq.md
After saving the changes, restart the ollama service:

sudo systemctl restart ollama
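To confirm that the override took effect (my own suggestion, not part of the original tutorial), you can inspect the service environment and list the installed models:

systemctl show ollama --property=Environment   # should include OLLAMA_MODELS=[your custom path]
ollama list                                    # models created afterwards are stored under the new directory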