# Download the DeepSeek-R1 Q4_K_M GGUF model from ModelScope
modelscope download unsloth/DeepSeek-R1-GGUF --include "DeepSeek-R1-Q4_K_M/*" --cache_dir /home/user/new/models
A Practical Guide to Local Deployment of the DeepSeek R1 671B Large Model for Under ¥100,000

Updated on: July 3rd, 2025
Recommendation
A detailed practical guide for low-cost local deployment of DeepSeek R1 671B large models.
Core content:
1. Selecting the appropriate server configuration for low-cost deployment
2. Using the Ktransformers framework for efficient deployment
3. Detailed installation steps and environment configuration requirements
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
With NVIDIA driver version 570.86.1x and CUDA version 12.8, KTransformers should be used at version 0.2.2; the latest 0.3 release still has many bugs.
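Before pinning the KTransformers version, it helps to confirm what driver and CUDA toolkit the machine actually has. A minimal sketch, assuming `nvidia-smi` and `nvcc` are on PATH when the driver/toolkit are installed; the `version_ge` helper is my own, not part of any NVIDIA tool:

```shell
# version_ge A B -> succeeds if dotted version A >= B (relies on GNU sort -V)
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Query the driver and CUDA toolkit only if the tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    version_ge "$driver" "570" || echo "driver $driver is older than 570.x" >&2
fi
if command -v nvcc >/dev/null 2>&1; then
    cuda=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
    version_ge "$cuda" "12.8" || echo "CUDA $cuda is older than 12.8" >&2
fi
```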
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule init
git submodule update
git checkout 7a19f3b
git rev-parse --short HEAD  # should display 7a19f3b
Note that `git submodule update` is mainly used to fetch the projects under `third_party`. If the network is unreliable, you can download these projects directly from GitHub and place them in the `third_party` folder; their paths and URLs come from `.gitmodules`:
[submodule "third_party/llama.cpp"]
    path = third_party/llama.cpp
    url = https://github.com/ggerganov/llama.cpp.git
[submodule "third_party/pybind11"]
    path = third_party/pybind11
    url = https://github.com/pybind/pybind11.git
Set up the environment
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate
uv pip install flashinfer-python
export TORCH_CUDA_ARCH_LIST="8.6"
uv pip install https://github.com/ubergarm/ktransformers/releases/download/7a19f3b/ktransformers-0.2.2rc1+cu120torch26fancy.amd.ubergarm.7a19f3b.flashinfer-cp311-cp311-linux_x86_64.whl
uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.5/flash_attn-2.6.3+cu124torch2.6-cp311-cp311-linux_x86_64.whl
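These prebuilt wheels target CPython 3.11 (`cp311`) on x86_64 Linux, which is why the venv above pins Python 3.11. A quick sanity check before installing; the filename-matching logic here is my own illustration, not part of uv or pip:

```shell
# The wheel filename encodes the required interpreter tag (cp311 here).
wheel=ktransformers-0.2.2rc1+cu120torch26fancy.amd.ubergarm.7a19f3b.flashinfer-cp311-cp311-linux_x86_64.whl
pytag="cp$(python3 -c 'import sys; print("%d%d" % sys.version_info[:2])')"
case "$wheel" in
    *"-$pytag-"*) echo "wheel matches local interpreter ($pytag)" ;;
    *) echo "wheel expects cp311; local interpreter is $pytag" >&2 ;;
esac
```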
Run ktransformers from source
If the precompiled wheel above does not work, or you prefer not to install ktransformers as a package, you can also run it directly from the source tree. The command is as follows:
Multiple-GPU configurations and finer-grained memory offload can be set via `--optimize_config_path`.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
  --gguf_path /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/ \
  --model_path deepseek-ai/DeepSeek-R1 \
  --model_name unsloth/DeepSeek-R1-UD-Q2_K_XL \
  --cpu_infer 16 \
  --max_new_tokens 8192 \
  --cache_lens 32768 \
  --total_context 32768 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-R1-Chat.yaml \
  --force_think \
  --use_cuda_graph \
  --host 127.0.0.1 \
  --port 8080
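Once the server is up, it exposes an OpenAI-compatible HTTP API on the host and port given above, so it can be exercised with plain `curl`. A sketch, assuming the `--model_name` value from the command above as the `model` field; composing and validating the JSON body first avoids chasing quoting mistakes through server logs:

```shell
# Compose the request body first so it can be validated before sending.
payload='{"model": "unsloth/DeepSeek-R1-UD-Q2_K_XL",
          "messages": [{"role": "user", "content": "Hello"}],
          "stream": false}'
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload is valid JSON"

# With the server running:
# curl http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```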
# Install additional build dependencies, including the CUDA toolchain, for example:
# sudo apt-get install build-essential cmake ...
source venv/bin/activate
uv pip install -r requirements-local_chat.txt
uv pip install setuptools wheel packaging

# The optional website app can be skipped in favor of alternatives such as
# `open-webui` or `litellm`; to build it anyway:
cd ktransformers/website/
npm install @vue/cli
npm run build
cd ../..

# If you have sufficient CPU cores and memory, parallel builds are much faster:
# export MAX_JOBS=8
# export CMAKE_BUILD_PARALLEL_LEVEL=8

# Install flash_attn
uv pip install flash_attn --no-build-isolation

# Optional: experimental use of flashinfer instead of triton.
# Not recommended unless you are an advanced user who has already used it successfully:
# uv pip install flashinfer-python

# Only applicable when an Intel dual-socket CPU with >1TB memory can hold two
# copies of the full model weights (one copy per CPU); an AMD EPYC dual-socket
# platform in NPS0 mode may not need this setting:
# export USE_NUMA=1

# Install ktransformers
KTRANSFORMERS_FORCE_BUILD=TRUE uv pip install . --no-build-isolation
KTRANSFORMERS_FORCE_BUILD=TRUE uv build
uv pip install ./dist/ktransformers-0.2.2rc1+cu120torch26fancy-cp311-cp311-linux_x86_64.whl
ktransformers --model_path /home/user/new/ktran0.2.2/ktransformers/models/deepseek-ai/DeepSeek-R1 --gguf_path /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M --port 8080
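A frequent stumbling block is pointing `--gguf_path` at the wrong directory level: it must be the folder that directly contains the `.gguf` shard files. A small pre-flight check; the `check_gguf_dir` helper name is mine, not a ktransformers command:

```shell
# Succeeds only if the directory directly contains at least one .gguf shard.
check_gguf_dir() {
    ls "$1"/*.gguf >/dev/null 2>&1
}

check_gguf_dir /home/user/new/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M \
    || echo "no .gguf files found; check the modelscope download path" >&2
```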
If the build fails with an assembler error like the following, the toolchain tried to emit the AVX-VNNI instruction `vpdpbusd`, which your assembler or CPU does not support:

/tmp/cc8uoJt1.s:23667: Error: no such instruction: `vpdpbusd %ymm3,%ymm15,%ymm1'
  File "<string>", line 327, in build_extension
  File "/usr/local/python3/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--verbose', '--parallel=128']' returned non-zero exit status 1.
[end of output]

In that case, rebuild with the unsupported instruction sets disabled:

-DLLAMA_NATIVE=OFF -DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF -DLLAMA_AVXVNNI=OFF
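The CPU side of this can be checked up front by reading `/proc/cpuinfo` on Linux, rather than discovered through a failed compile (note a too-old binutils can still reject the instruction even when the CPU supports it). A sketch; the `vnni_flag` helper is illustrative, not part of the build system:

```shell
# Map the CPU's reported feature flags to the llama.cpp cmake option.
vnni_flag() {
    case "$1" in
        *avx_vnni*|*avx512_vnni*) echo "-DLLAMA_AVXVNNI=ON" ;;
        *) echo "-DLLAMA_AVXVNNI=OFF" ;;
    esac
}

# On Linux, feed it the real flags line (empty input falls back to OFF):
vnni_flag "$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)"
```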
Final thoughts