Running the full DeepSeek R1:671B model on a single RTX 4090

Written by
Iris Vance
Updated on: July 9th, 2025
Recommendation

Explore the performance limits of the NVIDIA RTX 4090 graphics card with the DeepSeek R1:671B model.

Core content:
1. Hardware configuration requirements: NVIDIA GeForce RTX 4090 graphics card and 382G memory
2. Environment configuration guide: CUDA and conda environment setup, and dependency installation
3. Model download and optimization: downloading the DeepSeek-R1:671B Q4_K_M quantized version and configuring domestic mirrors



Required configuration

Video memory: 24G

Memory: 382G

Model file: the Q4_K_M quantized version of deepseek-r1:671b

Hardware Configuration


Graphics card: NVIDIA GeForce RTX 4090 24G


Memory: 64G * 8 DDR5 4800


CPU: Intel(R) Xeon(R) Gold 6430
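
Before continuing, it can help to confirm the machine actually matches this configuration. This is a minimal check using standard Linux tools (not part of the original instructions); output formats vary by distribution:

# Check the GPU model and its 24G of video memory
nvidia-smi --query-gpu=name,memory.total --format=csv
# Check total system memory (should report roughly 500G for 64G * 8)
free -h
# Check the CPU model and core count
lscpu | grep -E "Model name|^CPU\(s\)"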


Environment Configuration


1. CUDA environment: the version needs to be 12.4 or above.

Official website link: https://developer.nvidia.com/cuda-toolkit-archive


wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
sudo sh cuda_12.6.0_560.28.03_linux.run
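
After the installer finishes, the toolkit typically lands in /usr/local/cuda-12.6 (an assumption; adjust if you chose a different install path). A minimal sketch for exposing it to the shell and verifying the version:

# Add the CUDA toolkit to the current shell (append to ~/.bashrc to make it permanent)
export PATH=/usr/local/cuda-12.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH
# Should report release 12.6
nvcc --version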


2. Install a conda environment (optional):


wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
bash ./Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
conda create --name ktransformers python=3.11
conda activate ktransformers


3. Install necessary dependencies:


sudo apt-get update && sudo apt-get install gcc g++ cmake ninja-build
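
As a quick sanity check that the toolchain is in place (the exact versions are not critical, as long as the packages installed successfully):

gcc --version
cmake --version
ninja --version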


4. Install ktransformers:


## Install flash_attn
pip install flash_attn -i https://mirrors.aliyun.com/pypi/simple/
## Install ktransformers from source
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
## Pull the submodule code
git submodule init
git submodule update
## Run the build script
bash install.sh


If the dependency downloads are too slow, modify the build script (install.sh) to use a domestic mirror:


pip install -r requirements-local_chat.txt -i https://mirrors.aliyun.com/pypi/simple/
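
Once install.sh completes, a quick way to confirm the package is importable from the ktransformers conda environment (just an import check, not part of the original instructions):

# Should exit silently if the build and installation succeeded
python -c "import ktransformers"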


Model Download


The model file is the Q4_K_M quantized version of deepseek-r1:671b. Because the file is very large, downloading is relatively slow.


Download using modelscope:


Official website: https://www.modelscope.cn/models


pip install modelscope
modelscope download --model unsloth/DeepSeek-R1-GGUF --local_dir /path/to/models --include DeepSeek-R1-Q4_K_M-* --max-workers 108


Command parameter explanation:


model: specifies the model repository on the ModelScope community


local_dir: the download path for the files (created automatically if it does not exist)


include: specifies which files to download (DeepSeek-R1-Q4_K_M-* matches all files with the prefix DeepSeek-R1-Q4_K_M-)


max-workers: the number of connections used for downloading. Generally this can be set to the number of CPU cores minus 2; this machine reports 112 CPU cores, so 108 is used here. The larger the value, the faster the download.
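
When the download finishes, the shards can be checked before launching. The exact subdirectory layout depends on how modelscope mirrors the repository, so the paths below are assumptions based on the --local_dir used above; the total size should be on the order of the 382G memory requirement noted earlier:

# List the downloaded GGUF shards and confirm the total size
ls -lh /path/to/models/DeepSeek-R1-Q4_K_M*
du -sh /path/to/models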


Model Run 


After activating the configured conda environment, execute the following command:


python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-R1 --gguf_path /path/to/model --cpu_infer 48 --force_think true --max_new_tokens 128


Command parameter explanation:


model_path: the repository path on the ModelScope community, used to remotely pull the necessary JSON configuration files


gguf_path: the path where the downloaded gguf model files are located


cpu_infer: the number of CPU cores used for inference. This machine has 64 CPU cores, so it is set to 48 here. The default value is 10, which makes inference slower, so it can be increased appropriately, but it should not exceed the number of CPU cores minus 2.


force_think: the model's thinking process is only shown when this is set to true; otherwise the thinking process is not output by default.


max_new_tokens: the maximum number of new tokens to generate


The initial model load takes about 10 minutes. The model is loaded into memory (buff/cache), and if memory is insufficient the model cannot run successfully. Once loading completes, the final running output appears in the terminal.
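
Since the load can take a while, a simple way to monitor progress is from a second terminal using standard Linux tools (not part of the original instructions):

# Watch buff/cache grow as the GGUF shards are read into memory
watch -n 5 free -h
# Confirm the 4090 is active and check its video memory usage
nvidia-smi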