A Practical Guide to Deploying the DeepSeek R1 671B Model Locally for Less Than 100,000 Yuan

Written by
Audrey Miles
Updated on: July 3, 2025
Recommendation

A detailed practical guide for low-cost local deployment of DeepSeek R1 671B large models.

Core content:
1. Selecting the appropriate server configuration for low-cost deployment
2. Using the Ktransformers framework for efficient deployment
3. Detailed installation steps and environment configuration requirements

Recently, I helped a friend deploy the 671B version of DeepSeek R1 locally. The requirement was a fully local deployment without spending too much, and there was no concurrency requirement, so the ktransformers framework was a natural fit.
Regarding the machine configuration, after careful selection and evaluation, I settled on a server built around an RTX 4090 with 24 GB of VRAM and roughly 500 GB of DDR5-4800 memory.
The total cost of this setup is under 100,000 yuan, which is far more cost-effective than spending several million yuan on the full-precision DeepSeek R1, or 500,000 to 600,000 yuan on a DeepSeek 70B all-in-one machine. Not to mention that the 70B model is not the real DeepSeek R1, and its quality is not even as good as the 32B QwQ. Upgrading an all-in-one machine is also a hassle: once you buy it, it is essentially bound to that model, and it is hard to upgrade when a new model comes out.
Speaking of which, when I gave a training session on large models to a government department last month, I heard a bit of gossip: a company had spent several million yuan to have a big vendor privately deploy a customized model, but after DeepSeek R1 came out they abandoned it outright and bought a DeepSeek all-in-one machine instead.
Moreover, after buying the all-in-one machines, these companies still didn't know what to do with them. They just put up an interface and a page for employees to ask questions, and nothing more. All I can say is that these companies are really rich.
Basic Configuration
Let me state the conclusion first: using the ktransformers framework, the server configuration above reaches about 5 tokens/s. Considering that ktransformers does not currently support concurrency, and the deployment is local and private for use by a handful of people, this speed is just about acceptable.
I have already written an article on the installation method from the ktransformers official documentation, with detailed notes on ktransformers deployment, so I will not repeat it here.
Here I would like to introduce a new tutorial, r1-ktransformers-guide, that I found during this installation. It provides a pre-built environment based on uv, which avoids many dependency errors.
Also note that Ubuntu must be version 22 or above, and Python must be version 3.11.
The NVIDIA driver version is 570.86.1x and the CUDA version is 12.8.
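You can quickly verify that a machine meets these requirements before going any further:

# Check the OS release, Python, GPU driver and CUDA toolkit versions
lsb_release -a
python3 --version
nvidia-smi
nvcc --version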
ktransformers should be version 0.2.2; the latest 0.3 release still has many bugs.
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
git checkout 7a19f3b
git rev-parse --short HEAD  # should display 7a19f3b

Note that git submodule update is mainly used to pull the projects under third_party.
If your network is unreliable, you can instead download these projects directly from GitHub and place them in the third_party folder, at the paths defined in .gitmodules (clone commands below):
[submodule "third_party/llama.cpp"] path = third_party/llama.cpp url = https://github.com/ggerganov/llama.cpp.git[submodule "third_party/pybind11"] path = third_party/pybind11 url = https://github.com/pybind/pybind11.git


Download the model

Then download the quantized DeepSeek R1 model. Here I download the int4 (Q4_K_M) quantized version. Because of network issues, I use Alibaba's ModelScope to download the model.
modelscope download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-Q4_K_M/*" --cache_dir /home/user/new/models
--cache_dir /home/user/new/models specifies the directory the model is downloaded into.
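After the download finishes, it is worth confirming that all GGUF shards arrived; the path below assumes modelscope mirrors the repository layout under --cache_dir, which matches the model path used later in this article:

# List the downloaded GGUF shards and check the total size (the Q4_K_M quantization is roughly 400 GB)
ls -lh /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/
du -sh /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/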


UV installation

uv is a modern Python package management tool written in Rust, sometimes called "Python's Cargo". It is a fast replacement for traditional tools such as pip, pip-tools, and virtualenv. It is faster than pip and also supports pip's advanced features such as editable installs, git dependencies, local dependencies, and source distributions.
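As a quick illustration (generic examples, not part of this deployment; the package names and paths are placeholders), those pip-style features look like this with uv:

# Editable install of the project in the current directory
uv pip install -e .
# Install a dependency directly from a git repository
uv pip install "git+https://github.com/pybind/pybind11.git"
# Install from a local source distribution (placeholder file name)
uv pip install ./dist/example_pkg-0.1.0.tar.gz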

Install the uv toolchain
curl -LsSf https://astral.sh/uv/install.sh | sh

Creating a Virtual Environment
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate

Then we can install everything with uv, as described in the tutorial.


Install the precompiled versions with uv
flashinfer-python is a high-performance GPU acceleration library designed for large language model (LLM) inference serving. Install it first:
uv pip install flashinfer-python
Next, install the precompiled ktransformers wheel:
export TORCH_CUDA_ARCH_LIST="8.6"
uv pip install https://github.com/ubergarm/ktransformers/releases/download/7a19f3b/ktransformers-0.2.2rc1+cu120torch26fancy.amd.ubergarm.7a19f3b.flashinfer-cp311-cp311-linux_x86_64.whl
Then install the precompiled flash_attn library:
uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.5/flash_attn-2.6.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
The precompiled wheels here were built directly by the author of that guide, and the tutorial notes that they target an RTX 3090 Ti with 24 GB of VRAM, 96 GB of DDR5-6400 memory, and a Ryzen 9950X processor.
However, on my RTX 4090 with 24 GB of VRAM and roughly 500 GB of DDR5-4800 memory, the precompiled wheels install successfully as well. If they install cleanly, you avoid many potential errors caused by mismatched versions.
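To confirm the wheels actually landed in the virtual environment, a quick check like the following is enough (a minimal sketch; the import test assumes your CUDA setup is working):

# Show the installed packages and their versions
uv pip show ktransformers flash_attn flashinfer-python
# Verify that the compiled extensions import without errors
python -c "import ktransformers, flash_attn; print('imports OK')"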

Run ktransformers from source


If the precompiled wheels above do not work and you do not want to compile and install ktransformers yourself, you can also run it directly from the source code. The command is as follows:


# Supports multiple GPU configurations and finer-grained memory offload settings via --optimize_config_path
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
    --gguf_path /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/ \
    --model_path deepseek-ai/DeepSeek-R1 \
    --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
    --cpu_infer 16 \
    --max_new_tokens 8192 \
    --cache_lens 32768 \
    --total_context 32768 \
    --cache_q4 true \
    --temperature 0.6 \
    --top_p 0.95 \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-R1-Chat.yaml \
    --force_think \
    --use_cuda_graph \
    --host 127.0.0.1 \
    --port 8080
Yes, even without compiling ktransformers you can still run it: just download the project, prepare the corresponding files, and run the command above, and it will work normally.
If you want to go ahead and compile ktransformers, follow the steps below:
# Install additional build dependencies, including the CUDA toolchain, for example:
# sudo apt-get install build-essential cmake ...
source venv/bin/activate
uv pip install -r requirements-local_chat.txt
uv pip install setuptools wheel packaging

# It is recommended to skip the optional website app and use alternatives such as `open-webui` or `litellm`
cd ktransformers/website/
npm install @vue/cli
npm run build
cd ../..

# If you have sufficient CPU cores and memory, parallel jobs speed up the build significantly
# $ export MAX_JOBS=8
# $ export CMAKE_BUILD_PARALLEL_LEVEL=8

# Install flash_attn
uv pip install flash_attn --no-build-isolation

# Optional: experimental use of flashinfer instead of triton
# Not recommended unless you are an advanced user who has already used it successfully
# Install with the following command:
# $ uv pip install flashinfer-python

# Only applicable in the following cases:
# Intel dual-socket CPUs with >1TB memory that can hold two full copies of the model (one per CPU)
# AMD EPYC NPS0 dual-socket platforms may not need this setting
# $ export USE_NUMA=1

# Install ktransformers
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation
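After the build, a simple smoke test is the command-line chat script that requirements-local_chat.txt above targets (a sketch; the paths follow the ones used elsewhere in this article, so adjust them to your own layout):

# Start an interactive local chat session against the quantized model
python ktransformers/local_chat.py \
    --model_path deepseek-ai/DeepSeek-R1 \
    --gguf_path /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
    --cpu_infer 16 \
    --max_new_tokens 1000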

If you want to compile it yourself for use in other environments, you can build a wheel with the following command:
KTRANSFORMERS_FORCE_BUILD=TRUE uv build

Then copy the generated wheel to the other environment and install it with the following command:
uv pip install ./dist/ktransformers-0.2.2rc1+cu120torch26fancy-cp311-cp311-linux_x86_64.whl

Running ktransformers
Command to launch the API service:
ktransformers --model_path /home/user/new/ktran0.2.2/ktransformers/models/deepseek-ai/DeepSeek-R1 --gguf_path /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 8080
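Once the service is up, you can test it over HTTP. The sketch below assumes the server exposes an OpenAI-compatible chat completions endpoint on the port configured above; adjust the path and model name if your version differs:

# Send a test request to the local API service
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Hello, who are you?"}]}'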


If you want to run the web version, just add the --web True parameter at the end.
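For example, the same launch command with the web flag appended:

ktransformers --model_path /home/user/new/ktran0.2.2/ktransformers/models/deepseek-ai/DeepSeek-R1 --gguf_path /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 8080 --web True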


Other issues

If your CPU does not support the AMX/AVX-VNNI instructions, you may encounter the following error during compilation:
/tmp/cc8uoJt1.s:23667: Error: no such instruction: `vpdpbusd %ymm3,%ymm15,%ymm1'
  File "<string>", line 327, in build_extension
  File "/usr/local/python3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--verbose', '--parallel=128']' returned non-zero exit status 1.
[end of output]


This is because the CPU architecture does not support these instructions: the -march=native flag used during compilation makes the compiler generate code optimized for the current CPU, including specific instruction sets, and if the CPU does not support the AVX-VNNI instruction set, this error occurs.
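You can check which of these instruction-set extensions your CPU actually reports before compiling:

# Look for AMX and AVX-VNNI flags in the CPU feature list (empty output means unsupported)
lscpu | grep -oE 'amx\S*|avx_vnni' | sort -u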
Solution: add the following options to your CMake configuration:
-DLLAMA_NATIVE=OFF -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF -DLLAMA_AVXVNNI=OFF
This keeps the build from emitting AMX/AVX-VNNI instructions, thus avoiding the error.
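How these options reach the build depends on how ktransformers invokes CMake; one common convention for pip-driven CMake builds (assumed here, not verified for this exact version) is the CMAKE_ARGS environment variable:

# Pass the llama.cpp feature flags through to CMake, then rebuild
export CMAKE_ARGS="-DLLAMA_NATIVE=OFF -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF -DLLAMA_AVXVNNI=OFF"
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation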

In addition, if you need the requirements.txt from my environment, you can reply "330" to our official account to get it.

Last words


Here in 2025, AI innovations keep pouring out, with new technologies appearing almost every day. As a technician who has been through three waves of AI, I firmly believe that AI is not here to replace humans, but to free us from repetitive work so we can focus on more creative things. Follow our official account, Pocket Big Data, to explore the infinite possibilities of putting large models into practice!