Running the full DeepSeek R1:671B model on a single RTX 4090

Written by
Iris Vance
Updated on: July 9th, 2025
Recommendation

Explore the performance limits of the NVIDIA RTX 4090 graphics card with the DeepSeek R1:671B model.

Core content:
1. Hardware configuration requirements: NVIDIA GeForce RTX 4090 graphics card and 382G memory
2. Environment configuration guide: CUDA and conda environment setup, and dependency installation
3. Model download and optimization: downloading the DeepSeek-R1:671B Q4_K_M quantized version and configuring domestic mirrors



Required configuration

Video memory: 24G

Memory: 382G

Model file: the Q4_K_M quantized version of deepseek-r1:671b

Hardware Configuration


Graphics card: NVIDIA GeForce RTX 4090 24G


Memory: 64G * 8 DDR5 4800


CPU: Intel(R) Xeon(R) Gold 6430
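
Before continuing, it can help to confirm the machine actually matches this configuration. This is a minimal check using standard Linux tools (not part of the original instructions); output formats vary by distribution:

# Check the GPU model and its 24G of video memory
nvidia-smi --query-gpu=name,memory.total --format=csv
# Check total system memory (should report roughly 500G for 64G * 8)
free -h
# Check the CPU model and core count
lscpu | grep -E "Model name|^CPU\(s\)"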


Environment Configuration


1. CUDA environment: the version needs to be 12.4 or above.

Official website link: https://developer.nvidia.com/cuda-toolkit-archive


wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
sudo sh cuda_12.6.0_560.28.03_linux.run
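
After the installer finishes, the toolkit typically lands in /usr/local/cuda-12.6 (an assumption; adjust if you chose a different install path). A minimal sketch for exposing it to the shell and verifying the version:

# Add the CUDA toolkit to the current shell (append to ~/.bashrc to make it permanent)
export PATH=/usr/local/cuda-12.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH
# Should report release 12.6
nvcc --version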


2. Install a conda environment (optional):


wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
bash ./Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
conda create --name ktransformers python=3.11
conda activate ktransformers


3. Install necessary dependencies:


sudo apt-get update && sudo apt-get install gcc g++ cmake ninja-build
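
As a quick sanity check that the toolchain is in place (the exact versions are not critical, as long as the packages installed successfully):

gcc --version
cmake --version
ninja --version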


4. Install ktransformers:


## Install flash_attn
pip install flash_attn -i https://mirrors.aliyun.com/pypi/simple/
## Install ktransformers from source
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
## Pull the submodule code
git submodule init
git submodule update
## Run the build script
bash install.sh


If the dependency downloads are too slow, modify the build script (install.sh) to use a domestic mirror:


pip install -r requirements-local_chat.txt -i https://mirrors.aliyun.com/pypi/simple/
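
Once install.sh completes, a quick way to confirm the package is importable from the ktransformers conda environment (just an import check, not part of the original instructions):

# Should exit silently if the build and installation succeeded
python -c "import ktransformers"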


Model Download


The model file is the Q4_K_M quantized version of deepseek-r1:671b. Because the file is very large, downloading is relatively slow.


Download using modelscope:


Official website: https://www.modelscope.cn/models


pip install modelscope
modelscope download --model unsloth/DeepSeek-R1-GGUF --local_dir /path/to/models --include DeepSeek-R1-Q4_K_M-* --max-workers 108


Command parameter explanation:


model: specifies the model repository on the ModelScope community


local_dir: the download path for the files (created automatically if it does not exist)


include: specifies which files to download (DeepSeek-R1-Q4_K_M-* matches all files with the prefix DeepSeek-R1-Q4_K_M-)


max-workers: the number of connections used for downloading. Generally this can be set to the number of CPU cores minus 2; this machine reports 112 CPU cores, so 108 is used here. The larger the value, the faster the download.
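
When the download finishes, the shards can be checked before launching. The exact subdirectory layout depends on how modelscope mirrors the repository, so the paths below are assumptions based on the --local_dir used above; the total size should be on the order of the 382G memory requirement noted earlier:

# List the downloaded GGUF shards and confirm the total size
ls -lh /path/to/models/DeepSeek-R1-Q4_K_M*
du -sh /path/to/models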


Model Run 


After activating the configured conda environment, execute the following command:


python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-R1 --gguf_path /path/to/model --cpu_infer 48 --force_think true --max_new_tokens 128


Command parameter explanation:


model_path: the repository path on the ModelScope community, used to remotely pull the necessary JSON configuration files


gguf_path: the path where the downloaded gguf model files are located


cpu_infer: the number of CPU cores used for inference. This machine has 64 CPU cores, so it is set to 48 here. The default value is 10, which makes inference slower, so it can be increased appropriately, but it should not exceed the number of CPU cores minus 2.


force_think: the model's thinking process is only shown when this is set to true; otherwise the thinking process is not output by default.


max_new_tokens: the maximum number of new tokens to generate


The initial model load takes about 10 minutes. The model is loaded into memory (buff/cache), and if memory is insufficient the model cannot run successfully. Once loading completes, the final running output appears in the terminal.
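
Since the load can take a while, a simple way to monitor progress is from a second terminal using standard Linux tools (not part of the original instructions):

# Watch buff/cache grow as the GGUF shards are read into memory
watch -n 5 free -h
# Confirm the 4090 is active and check its video memory usage
nvidia-smi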