Running the full DeepSeek-R1 671B model on a single RTX 4090

Explore the performance limits of the NVIDIA RTX 4090 graphics card on the DeepSeek-R1 671B model.
Core content:
1. Hardware requirements: an NVIDIA GeForce RTX 4090 graphics card and 382G of memory
2. Environment setup guide: CUDA, conda environment configuration, and dependency installation
3. Model download and optimization: downloading the Q4_K_M quantized version of DeepSeek-R1 671B and configuring domestic (China) mirrors
Required Configuration
Video memory: 24G
Memory: 382G
Hardware Configuration
Graphics card: NVIDIA GeForce RTX 4090 24G
Memory: 64G * 8 DDR5 4800
CPU: Intel(R) Xeon(R) Gold 6430
Environment Configuration
1. CUDA environment; the version must be 12.4 or higher.
Official website: https://developer.nvidia.com/cuda-toolkit-archive
wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda_12.6.0_560.28.03_linux.run
sudo sh cuda_12.6.0_560.28.03_linux.run
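After the installer finishes, it is worth confirming that the toolkit is visible to the shell. A minimal check, assuming the runfile's default install prefix of /usr/local/cuda:
# Add the toolkit to the current shell (default runfile prefix assumed)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Both should succeed; nvcc should report release 12.4 or newer
nvidia-smi
nvcc --version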
2. Install the conda environment (optional):
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
bash ./Miniconda3-py310_24.11.1-0-Linux-x86_64.sh
conda create --name ktransformers python=3.11
conda activate ktransformers
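A quick check that the new environment is actually active before proceeding:
# The interpreter should be Python 3.11 and live inside the ktransformers env
python --version
which python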
3. Install necessary dependencies:
sudo apt-get update && sudo apt-get install gcc g++ cmake ninja-build
4. Install ktransformers:
## flash_attn installation
pip install flash_attn -i https://mirrors.aliyun.com/pypi/simple/
## ktransformers installation from source
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
## Pull submodule code
git submodule init
git submodule update
## Run the compilation script
bash install.sh
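install.sh pip-installs the package into the active environment, so a minimal sanity check, assuming the script completed without errors, is that both packages import cleanly:
# Both imports should succeed without errors
python -c "import flash_attn, ktransformers; print('ok')"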
If the download is too slow, modify the compilation script (install.sh) to use a domestic mirror:
pip install -r requirements-local_chat.txt -i https://mirrors.aliyun.com/pypi/simple/
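An alternative to editing install.sh is to set the mirror globally for pip, so every pip invocation inside the script picks it up automatically:
# Persist the Aliyun mirror for all subsequent pip installs
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/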
Model Download
The model file is the Q4_K_M quantized version of deepseek-r1:671b. Because the file is very large, downloading it takes a long time.
Download using modelscope
Official website: https://www.modelscope.cn/models
pip install modelscope
modelscope download --model unsloth/DeepSeek-R1-GGUF --local_dir /path/to/models --include DeepSeek-R1-Q4_K_M-* --max-workers 108
Command parameter explanation:
model: the model repo on the ModelScope community
local_dir: the download path for the files (created automatically if it does not exist)
include: the files to download (DeepSeek-R1-Q4_K_M-* matches every file with the prefix DeepSeek-R1-Q4_K_M-)
max-workers: the number of download connections to open. A good rule of thumb is the number of CPU cores minus 2; this machine has 112 cores, so 108 is specified here. The larger the value, the faster the download.
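After the download completes, a quick check confirms the shards are present and shows the total size on disk. This uses the same placeholder path as above; depending on the repo layout, the files may land in a subdirectory under it:
# List the downloaded GGUF shards and report their combined size
ls -lh /path/to/models
du -sh /path/to/models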
Model Run
After activating the configured conda environment, run:
python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-R1 --gguf_path /path/to/model --cpu_infer 48 --force_think true --max_new_tokens 128
Command parameter explanation:
model_path: the repo path on the ModelScope community, used to remotely fetch the required JSON config files
gguf_path: the local path containing the downloaded GGUF model files
cpu_infer: the number of CPU cores used for inference. This machine has 64 cores, so it is set to 48 here. The default is 10, which makes inference noticeably slower, so the value can be raised, but it should not exceed the number of CPU cores minus 2 (see the sketch after this list).
force_think: only when set to true is the model's thinking process shown; by default it is not output.
max_new_tokens: the maximum number of tokens to generate
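Rather than hard-coding 48, the core-count rule above can be computed at launch time. A minimal sketch of the same command, with cpu_infer derived from nproc (note that nproc reports logical CPUs, which may be double the physical core count on hyperthreaded machines):
# Use (logical cores - 2) for cpu_infer, per the rule of thumb above
CORES=$(nproc)
python -m ktransformers.local_chat \
  --model_path deepseek-ai/DeepSeek-R1 \
  --gguf_path /path/to/model \
  --cpu_infer $((CORES - 2)) \
  --force_think true \
  --max_new_tokens 128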
The initial model load takes about 10 minutes. The model is loaded into memory's buff/cache; if memory is insufficient, the model cannot be run successfully.
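While the model loads, you can watch the weights accumulate in buff/cache from another terminal:
# buff/cache should grow toward the model's size as loading proceeds
watch -n 5 free -h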