DeepSeek-R1-671B-Q8 deployment solution for less than RMB 40,000

Written by
Audrey Miles
Updated on: July 12, 2025
Recommendation

An innovative solution that significantly reduces the deployment cost of DeepSeek-R1, a boon for technology enthusiasts.

Core content:
1. Tencent Xuanwu Lab optimized the DeepSeek-R1 deployment solution to cost less than 40,000 yuan
2. Software and hardware optimizations raise long-text generation speed by about 25%, peak output speed by about 15%, and prefill speed by about 20%
3. Hardware selection priorities and recommended optimizations for efficient, low-cost deployment


Although DeepSeek-R1 is an open-source model that, in theory, any technology enthusiast can deploy at home, its total parameter count of 671B means a typical private deployment requires eight 141GB H20 GPUs, costing more than 1.5 million yuan.

After DeepSeek-R1 was released, Rasim Nadzhafov and others showed that it could be deployed on CPU-based hardware. Building on many such efforts published online, Tencent Xuanwu Lab researched the approach in depth and optimized it at the hardware, system, and inference-framework levels. While using cheaper, lower-power hardware, this yields roughly a 25% increase in long-text generation speed, a 15% increase in peak output speed, and a 20% increase in prefill speed. With Xuanwu Lab's hardware and software optimizations, DeepSeek-R1-671B-Q8 can be deployed on hardware costing less than RMB 40,000, with a peak generation speed of 7.17 tokens/s, or about 10 Chinese characters per second, while the whole machine's power consumption and noise are comparable to a home desktop PC.

According to our research, in a CPU inference setup, memory bandwidth directly determines generation speed; the number of CPU cores affects prefill and concurrent output speed; SSD read/write speed affects model-loading speed and prompt-cache read/write speed; and CPU clock speed has little impact on performance. The hardware budget should therefore be allocated according to the following priorities:

Memory bandwidth > Number of CPU cores > SSD read/write speed > CPU clock speed
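If you want a rough sanity check of a candidate platform's memory bandwidth, one option is sysbench's memory test (a sketch, assuming the sysbench package is installed; a STREAM benchmark gives more rigorous numbers, and the figures are mainly useful for comparing platforms against each other):

# Sequential reads across 16 threads; compare the reported MiB/sec between platforms
sysbench memory --memory-block-size=1M --memory-total-size=64G --memory-oper=read --threads=16 run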

We also found that a single instance should not be run across two CPUs: dual-socket NUMA conflicts cause severe memory-bandwidth degradation, and every workaround for NUMA conflicts consumes valuable memory capacity.

In addition, all 12 memory channels must be fully populated to obtain the full bandwidth the CPU supports. Each DIMM should be 64GB: with 12 x 64GB (768GB in total), the memory left over after loading the Q8-quantized model weights is still enough to hold a 22K-token context as KV cache.

When choosing a motherboard, avoid boards with 2DPC (2 DIMMs Per Channel) slots; if you do use one, make sure each channel holds only one DIMM. Otherwise the motherboard will downclock the channel, for example from 5600MHz to 4800MHz, which significantly reduces overall bandwidth.

The CPU can be air-cooled, but memory cooling is critical: DIMMs that run hot for long periods may downclock, and downclocked memory can lose up to 20% of generation speed.
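Since the MZ33-AR1 recommended below carries a BMC, one way to keep an eye on DIMM temperatures under load is to query it (a sketch, assuming ipmitool is installed and the BMC exposes per-DIMM sensors; lm-sensors is an alternative if the kernel's DDR5 SPD-hub driver detects the modules):

# List the temperature sensors reported by the board's BMC, including DIMM sensors if exposed
sudo ipmitool sdr type Temperature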

Based on the above findings, we planned a build around AMD's 5th-generation EPYC 9005 series processors (prices are current retail market prices):

MZ33-AR1 motherboard: 5,950 yuan
EPYC 9115: 5,400 yuan (or EPYC 9135: 7,900 yuan)
DDR5 5600MHz 64GB x 12: 22,800 yuan
1TB SSD: 338 yuan
850W power supply: 349 yuan
CPU cooler: 294 yuan
Memory cooler: 368 yuan
Chassis: 187 yuan

Total: 35,686 yuan (38,186 yuan if you choose the EPYC 9135)

If you want better scalability, you can swap the motherboard for the dual-socket MZ73-LM1. The cost stays within 40,000 yuan, and you can later add a second CPU and its memory to run two instances at the same time.
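If you do populate the second socket later, each instance should still be confined to its own socket for the NUMA reasons above. A minimal sketch with numactl (ports and model path are illustrative; llama-server is the inference command described later in this article):

# Instance 1: restrict CPU scheduling and memory allocation to NUMA node 0 (socket 0)
numactl --cpunodebind=0 --membind=0 ./llama-server -m ./model-Q8_0.gguf --port 8008
# Instance 2: restrict CPU scheduling and memory allocation to NUMA node 1 (socket 1)
numactl --cpunodebind=1 --membind=1 ./llama-server -m ./model-Q8_0.gguf --port 8009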

In terms of hardware optimization, the most important point is the memory cooling mentioned above. Second, since both the CPU and the motherboard support 6000MHz, the memory can be slightly overclocked, raising the frequency from the default 5600MHz to 6000MHz. In the BIOS, the setting is at: AMD CBS -> UMC Common Options -> Enforce PDR -> Memory Target Speed -> DDR6000.
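After saving the BIOS setting and rebooting, one way to confirm from Linux that the DIMMs are actually running at the new speed (assuming root access and the dmidecode package; older dmidecode versions label the field "Configured Clock Speed"):

# Should report 6000 MT/s on every populated slot after the overclock
sudo dmidecode -t memory | grep -i 'configured memory speed'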

In terms of system optimization, the main step is configuring the system to use 1GB huge pages (HugePages) and pre-allocating 671 of them. Add the following settings to the GRUB configuration file:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash default_hugepagesz=1G hugepagesz=1G hugepages=671"

After a reboot, the system will enable 1GB huge pages and reserve enough memory to hold the Q8-precision weight files.
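A minimal sketch of applying and verifying this on a Debian/Ubuntu-style system (other distributions regenerate the GRUB configuration with grub2-mkconfig instead of update-grub):

# Regenerate the GRUB config so the new kernel parameters take effect, then reboot
sudo update-grub && sudo reboot
# After the reboot, HugePages_Total should show 671 and Hugepagesize 1048576 kB
grep -i huge /proc/meminfo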

In addition to the hardware- and system-level optimizations, the inference framework itself needs changes: llama-mmap.cpp in llama.cpp must be modified to map the model weights into the reserved 1GB huge pages to improve performance.

Our modified llama-mmap.cpp can be obtained from the following address:

https://github.com/XuanwuLab/llama.cpp_deepseek/blob/main/llama-mmap.cpp
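A possible sequence for dropping the file into a llama.cpp checkout and rebuilding (a sketch: the src/llama-mmap.cpp location and the build/bin output directory reflect the current llama.cpp layout and may differ in older checkouts):

# From the root of a llama.cpp checkout: overwrite the stock llama-mmap.cpp
wget https://github.com/XuanwuLab/llama.cpp_deepseek/raw/main/llama-mmap.cpp -O src/llama-mmap.cpp
# Rebuild; the llama-server binary is produced under build/bin/
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j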

After replacing the corresponding file in llama.cpp with the modified llama-mmap.cpp and recompiling, execute the following command to load the model weights and start the server:

./llama-server -m ./DeepSeek-R1-Zero-Q8_K_M/DeepSeek-R1-Zero-BF16-256x20B-Q8_0-00001-of-00016.gguf --host 0.0.0.0 --port 8008 --temp 0.6 --cache-type-k q8_0 -t 16 -tb 32 --ctx-size 4096 -np 1 --jinja --chat-template-file ../../models/templates/llama-cpp-deepseek-r1.jinja --reasoning-format deepseek
The --jinja, --chat-template-file ./llama.cpp/models/templates/llama-cpp-deepseek-r1.jinja, and --reasoning-format deepseek parameters force the model to perform deep thinking; if forced thinking is not required, they can be omitted.

The -t 16 and -tb 32 parameters set the number of threads used for generation and prefill, respectively. This avoids the system overhead of competing for CCD bandwidth while still exploiting the extra compute that hyperthreading provides: in general, hyperthreading is a negative optimization for generation, but for prefill it increases speed.
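These values match the 16 physical cores and 32 hardware threads of the EPYC 9115 in the parts list. On a different CPU, a quick way to check the topology before choosing -t and -tb (assuming a standard Linux install):

# Physical cores per socket and threads per core; set -t to physical cores, -tb to total threads
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'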

FAQ:

Q: Why can a large model with parameters as high as 671B be inferred using a CPU?

A: DeepSeek-R1 is a highly sparse MoE (Mixture of Experts) model with 256 experts per layer, of which only 8 are activated for each generated token. Because of this on-demand activation, although the model has 671B total parameters, only about 37B (roughly 5.5% of the total) actually participate in computing each token. The compute required for inference is therefore greatly reduced, which makes CPU-only deployment of a model of this scale feasible.

Q: Since DeepSeek-R1 can be run on the CPU, does that mean the GPU is not important?

A: For individual technology enthusiasts, this CPU-based solution delivers reasonably smooth output at the price of a high-end gaming PC, so privately deploying DeepSeek-R1 is no longer out of reach. However, the CPU solution has inherent drawbacks: speed drops significantly under high concurrency and with long inputs, and its cost per million tokens is also higher than a GPU's. The CPU solution therefore has its use cases, but it cannot replace GPUs such as the H20.

Q: Why quantize to Q8?

A: The native precision of DeepSeek-R1 is FP8. Since CPUs have no dedicated FP8 hardware instructions, while modern SIMD instruction sets such as AVX-512 accelerate integer processing, DeepSeek-R1 needs to be quantized to Q8 for CPU inference. In our tests, the reasoning ability of Q8 and FP8 differs very little.

Q: Why not quantize to Q4?

A: Although Q4 uses less memory and generates tokens faster than Q8, Q8 still has a clear advantage in actual reasoning ability, and on CPUs the SIMD instruction sets natively support 8-bit integer dot products. More importantly, we found that Q4's chains of thought are on average 45% longer than Q8's, meaning 45% more wasted tokens; so even though Q4 generates tokens faster, it can actually take longer to finish a task. That is why we ultimately chose Q8.

Q: In addition to DeepSeek-R1, can this solution also be used for DeepSeek-V3?

A: Yes, this solution also works for DeepSeek-V3. In theory, it can be used for any MoE model with no more parameters than DeepSeek-R1.

Q: Is there anywhere I can try out the effect of deploying DeepSeek-R1 on a CPU?

A: We have published a similar optimization setup on the Tencent Cloud Native Build (CNB) platform, where you can quickly experience the pure-CPU deployment: https://cnb.cool/ai-models/deepseek-ai/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0