Deploying the full-blooded DeepSeek-R1 on a single RTX 4090 with KTransformers

Written by
Jasper Cole
Updated on: July 16, 2025

2025 opened with new breakthroughs in AI: DeepSeek has taken the world by storm, and everyone is excited about it. It turns out that even an RTX 4090 graphics card can deploy the full DeepSeek.

Core content:
1. DeepSeek has become popular worldwide, with explosive growth on both the product and R&D sides
2. DeepSeek has triggered a nationwide craze in China
3. A hands-on account of deploying the full DeepSeek-R1 on an RTX 4090 graphics card

At the beginning of 2025, DeepSeek arrived and became popular all over the world, as the following picture shows:

(Excerpt from AI Product List)

The apps DeepSeek is compared with in the picture are all among the most popular global super apps. That is how it feels on the product side. Now look at another picture:

At first glance you might think the chart shows dify's GitHub stars growing very fast, overtaking openai-cookbook. Look closer, though: the blue line hugging the right edge is not a coordinate axis but a curve, the star growth curve of DeepSeek-R1. It is easy to mistake it for an axis, since many charts use dual y-axes with a second axis on the right. The growth rate is simply astonishing; the curve climbs almost vertically, like an arrow piercing the clouds. That is the visceral feeling on the R&D side.

Both the product side and the R&D side were still, in a sense, inside the technical circle. It was only after the Spring Festival that DeepSeek took off in China and triggered a nationwide craze. Even the leaders of manufacturing plants founded in the last century mentioned DeepSeek in their speeches; even the good-morning pictures my father-in-law sends to the family group chat were replaced with DeepSeek ones. That is real popularity.

So companies large and small rushed to integrate DeepSeek, and individuals were eager to try it too. Some people dug out old desktops with graphics cards that had been gathering dust in a corner for years, hoping to run DeepSeek on them. Recently it was widely reported that the KTransformers project from Tsinghua University can deploy the full-blooded DeepSeek on consumer-grade graphics cards, and a customer of mine happened to be very interested in exactly that. Until now, the full-blooded DeepSeek could only run on hardware costing millions, such as a 16-card H100 server. In an era when AI is advancing daily, from research to application, and true and false information flies everywhere, what is the real situation? As it happens, a few days ago I ran the full-blooded DeepSeek-R1 on an RTX 4090 using KTransformers. This article shares that hands-on process.

The process mainly follows the official document: Installation Guide - KTransformers, https://kvcache-ai.github.io/ktransformers/en/install.html

Model Download

The official documentation lists the DeepSeek models supported by KTransformers:

Model Name             Model Size  VRAM  Minimum DRAM  Recommended DRAM
DeepSeek-R1-q4_k_m     377G        14G   382G          512G
DeepSeek-V3-q4_k_m     377G        14G   382G          512G
DeepSeek-V2-q4_k_m     133G        11G   136G          192G
DeepSeek-V2.5-q4_k_m   133G        11G   136G          192G
DeepSeek-V2.5-IQ4_XS   117G        10G   107G          128G

We choose the DeepSeek-R1-q4_k_m model, i.e. the INT4 (Q4_K_M) quantized model. It is available in the unsloth organization on Hugging Face, download address: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-Q4_K_M

Download command:

nohup huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --revision main --include "DeepSeek-R1-Q4_K_M/*" \
    --local-dir deepseek-ai > download-R1-Q4_K_M.log 2>&1 &

In addition, because the DeepSeek-R1-Q4_K_M repository contains only the gguf files, inference also requires the configuration files from the original DeepSeek-R1 model. Download address:

https://huggingface.co/deepseek-ai/DeepSeek-R1/tree/main

Note that you only need the small files; skip the .safetensors model weight files. These configuration files will be used later when running inference on DeepSeek-R1-Q4_K_M.
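For reference, a download command of roughly the following shape should work (my own sketch, not from the official guide; it assumes your huggingface-cli version supports the --exclude flag):

nohup huggingface-cli download deepseek-ai/DeepSeek-R1 \
    --exclude "*.safetensors" \
    --local-dir deepseek-ai/DeepSeek-R1 > download-R1-config.log 2>&1 &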

After downloading, the model directory should look like this:

# ll /data/ai/models/deepseek-ai/DeepSeek-R1-Q4_K_M
total 394951472
-rw-r--r-- 1 root root 48339779936 Feb 20 02:55 DeepSeek-R1-Q4_K_M-00001-of-00009.gguf
-rw-r--r-- 1 root root 49429396320 Feb 20 03:02 DeepSeek-R1-Q4_K_M-00002-of-00009.gguf
-rw-r--r-- 1 root root 49527312640 Feb 20 03:19 DeepSeek-R1-Q4_K_M-00003-of-00009.gguf
-rw-r--r-- 1 root root 48272509536 Feb 20 02:08 DeepSeek-R1-Q4_K_M-00004-of-00009.gguf
-rw-r--r-- 1 root root 49422027488 Feb 20 02:17 DeepSeek-R1-Q4_K_M-00005-of-00009.gguf
-rw-r--r-- 1 root root 48272509536 Feb 20 03:16 DeepSeek-R1-Q4_K_M-00006-of-00009.gguf
-rw-r--r-- 1 root root 49429396320 Feb 20 08:33 DeepSeek-R1-Q4_K_M-00007-of-00009.gguf
-rw-r--r-- 1 root root 46938773696 Feb 20 15:15 DeepSeek-R1-Q4_K_M-00008-of-00009.gguf
-rw-r--r-- 1 root root 14798482144 Feb 20 02:53 DeepSeek-R1-Q4_K_M-00009-of-00009.gguf

# ll /data/ai/models/deepseek-ai/DeepSeek-R1
-rw-r--r-- 1 root root    1729 Feb 21 03:41 config.json
-rw-r--r-- 1 root root   10556 Feb 21 03:41 configuration_deepseek.py
-rw-r--r-- 1 root root     171 Feb 21 03:41 generation_config.json
-rw-r--r-- 1 root root   75769 Feb 21 03:41 modeling_deepseek.py
-rw-r--r-- 1 root root 8898324 Feb 21 03:41 model.safetensors.index.json
-rw-r--r-- 1 root root    3584 Feb 21 03:41 tokenizer_config.json
-rw-r--r-- 1 root root 7847602 Feb 21 03:41 tokenizer.json

Image building

Write a Dockerfile with the following content:

# The base image is Python 3.11.10
FROM pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel

# Copy the ktransformers project code (including third-party dependencies) into the image
COPY ktransformers /home/ktransformers

# Set the working directory
WORKDIR /home/ktransformers

# Set the pip source to the Alibaba Cloud mirror
ENV PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple/

# Install basic tools
RUN --mount=type=cache,target=/var/cache/apt \
    apt update && \
    apt install -y apt-utils cmake build-essential ninja-build && \
    apt install -y vim net-tools telnet curl wget netcat lsof git && \
    apt install -y libstdc++6

# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# pip3 install packaging ninja cpufeature numpy
RUN --mount=type=cache,target=/root/.cache/pip ls -l /root/.cache/pip && \
    # Already available in other base images
    pip3 install cpufeature && \
    pip3 install flash-attn --no-build-isolation
    # pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

# Install ktransformers
# Set the CUDA paths
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
ENV CUDA_PATH=/usr/local/cuda:${CUDA_PATH}
## import torch
# device = torch.cuda.current_device()
# major, minor = torch.cuda.get_device_capability(device)
# print(f"Current GPU compute capability: {major}.{minor}")
# RTX 4090 is 8.9
ENV TORCH_CUDA_ARCH_LIST="8.9"

# RUN export USE_NUMA=1 bash install.sh
RUN bash install.sh

# Clear the ENTRYPOINT of the parent image
ENTRYPOINT []
CMD []

Save it as ktransformers.dockerfile. Yes, we still deploy through a Docker container, because a container isolates the test environment cleanly and, once built, the image can be reused at any time.

A few points about this Dockerfile deserve explanation:

Step 1: I chose pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel as the base image to be compatible with the NVIDIA/CUDA driver in my environment. You can choose a different version from this image series according to the driver in your environment.
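As a quick compatibility check (my own habit, not from the official guide), look at the driver version and the highest CUDA version it supports on the host. In my environment nvidia-smi reports driver 535.161.07 with CUDA 12.2, so a cuda12.1 base image works:

# The "CUDA Version" printed here is the highest CUDA runtime the driver supports;
# pick a base image whose CUDA version is not newer than this.
nvidia-smi | head -n 4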

Step 2: Note that apt install -y libstdc++6 differs from what the official documentation suggests:

conda install -c conda-forge libstdcxx-ng # Anaconda provides a package called `libstdcxx-ng` that includes a newer version of `libstdc++`, which can be installed via `conda-forge`.

They are not the same: libstdcxx-ng is the package name used by conda-forge, while the corresponding package in the system package manager (apt) is libstdc++6. The main differences are:

  • Package source and management method: The conda package manager maintains its own software packages and dependencies. libstdcxx-ng is a version it provides, which is usually compiled, optimized and adapted to ensure the compatibility of various packages in the conda environment. In Debian/Ubuntu and other systems, apt uses the version maintained in the official software repository. This package is usually named libstdc++6.

  • Version and update frequency: libstdcxx-ng often provides a newer version of the GNU Standard C++ Library that supports more new features. This version may be more suitable for some applications that rely on newer C++ features; the update cycle of the libstdc++6 version in the system repository is limited by the stability requirements of the distribution and may be older.
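If you want to verify which C++ runtime versions are actually available, a quick sanity check of my own (the library path may differ on your distribution) is to list the GLIBCXX symbols exported by the system libstdc++:

# Newer GLIBCXX_3.4.x entries mean a newer libstdc++, which is what the
# official docs' conda-forge step is trying to guarantee.
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep ^GLIBCXX | sort -V | tail -n 5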

Step 3: The image build becomes very slow when it reaches pip3 install flash-attn --no-build-isolation; this step takes a long time to complete:

#9 18.82 Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.11/site-packages (from jinja2->torch->flash-attn) (3.0.2)
#9 18.82 Building wheels for collected packages: flash-attn
# There will be no output for a long time.
#9 18.82 Building wheel for flash-attn (setup.py): started
#9 8603.3 Building wheel for flash-attn (setup.py): still running...
#9 9028.1 Building wheel for flash-attn (setup.py): still running...
#9 9885.2 Created wheel for flash-attn: filename=flash_attn-2.7.4.post1-cp311-cp311-linux_x86_64.whl size=187620450 sha256=68d5a9bf12ffdd88f72304ed9d082f4b4cbef3f7566189b097320976a6fc2da5
#9 9885.2 Stored in directory: /root/.cache/pip/wheels/be/a6/20/b07aa001ad33e149303e10dce671e5a6018ee17976b1f3661f
#9 9885.2 Successfully built flash-attn
#9 9885.3 Installing collected packages: einops, flash-attn
#9 9888.1 Successfully installed einops-0.8.1 flash-attn-2.7.4.post1
#9 9888.1 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
#9 DONE 9890.6s

As you can see, this build step (including all pip installs) took about 2 hours and 45 minutes, most of it spent building and installing flash-attn. The bottleneck is most likely network speed.

If you want to speed up the image build, you can download the flash-attn wheel manually. The best approach is to download it on an overseas server first (it takes about 2.2 seconds there, versus as long as 10 hours from inside China) and then copy it to the build server. Download address: https://github.com/Dao-AILab/flash-attention/releases. Choose the file that matches your environment; for mine I picked

  • flash_attn-2.7.4.post1+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
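On the overseas machine, a download command along these lines should work (a sketch of mine; the exact release tag and filename depend on your CUDA, PyTorch and Python versions):

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl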

This package installs successfully. Put the wheel in the ktransformers source directory, then replace this line in the Dockerfile:

    pip3 install flash-attn --no-build-isolation

with:

    pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

The build can be completed quickly, saving nearly 3 hours of build time.

Then download the ktransformers source code in the same directory as ktransformers.dockerfile:

# git clone https://github.com/kvcache-ai/ktransformers.git
# cd ktransformers/
# git submodule init
Submodule 'third_party/llama.cpp' (https://github.com/ggerganov/llama.cpp.git) registered for path 'third_party/llama.cpp'
Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11'
# git submodule update
Cloning into '/root/githubs/ktransformers/third_party/llama.cpp'...
Cloning into '/root/githubs/ktransformers/third_party/pybind11'...
Submodule path 'third_party/llama.cpp': checked out 'a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f'
Submodule path 'third_party/pybind11': checked out 'bb05e0810b87e74709d9f4c4545f1f57a1b386f5'
# ll third_party/
drwxr-xr-x 24 root root 4096 Feb 19 13:52 llama.cpp/
drwxr-xr-x  2 root root 4096 Feb 19 13:51 llamafile/
drwxr-xr-x  8 root root 4096 Feb 19 13:52 pybind11/

Go back to the directory where ktransformers.dockerfile is located, and you can see the following hierarchy:

├── ktransformers
│   ├── ...
│   └── third_party
│       ├── llama.cpp
│       ├── llamafile
│       └── pybind11
└── ktransformers.dockerfile

Execute the build command:

nohup docker build -f ktransformers.dockerfile -t ktransformers:20250220_cu121 . > build.log 2>&1 &
tail -100f build.log
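Once the build finishes, an optional quick check that the image exists:

docker images | grep ktransformers
# Expect a line like: ktransformers   20250220_cu121   <image id>   <created>   <size>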

Start the container

Start the container using the built image:

docker run --name ktransformers -itd --gpus '"device=7"' -p 10002:10002 \
    -v /data/ai/models:/models \
    -v /data/ai/datasets:/datasets \
    -v /data/ai/workspace/ktransformers:/workspace \
    ktransformers:20250220_cu121 bash

In this command:

  • --gpus '"device=7"' mounts the 8th GPU (index 7) into the container; the index matches what nvidia-smi shows on the host machine

  • -v The mounted directories /data/ai/models, /data/ai/datasets, and /data/ai/workspace/ktransformers are my local model directory, dataset directory, and ktransformers working directory. You can replace them with your own.

  • -p 10002:10002 exposes port 10002 in the container to the same port on the host
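Before going further, it does no harm to confirm that the selected GPU is actually visible inside the container (my own sanity check):

docker exec -it ktransformers nvidia-smi
# Should list exactly one GPU: the RTX 4090 passed in with --gpus '"device=7"'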

After starting the container, use the docker exec command to enter the container and view the versions of the main dependent packages:

# docker exec -it ktransformers bash
root@3baa1be463fc:/home/ktransformers# pip list | grep -E 'torch|transfor|cpufeature|packaging|ninja|cpufeature|numpy|flash|triton'
cpufeature      0.2.1
flash_attn      2.7.4.post1
ktransformers   0.2.1.post1+cu121torch25avx2
ninja           1.11.1.1
numpy           1.26.4
packaging       24.1
torch           2.5.1+cu121
torchaudio      2.5.1+cu121
torchelastic    0.2.2
torchvision     0.20.1+cu121
transformers    4.43.2
triton          3.1.0

Start inference

Because our /data/ai/models directory is mounted to /models in the container, the model paths inside the container become:

  • Quantized gguf model path: /models/deepseek-ai/DeepSeek-R1-Q4_K_M

  • The path of the original model: /models/deepseek-ai/DeepSeek-R1

Run a chat directly from the local command line (not recommended; the session ends when the shell exits, so this is only for testing):

python -m ktransformers.local_chat --model_path /models/deepseek-ai/DeepSeek-R1 --gguf_path /models/deepseek-ai/DeepSeek-R1-Q4_K_M --cpu_infer 96

Start the inference service (no web interface):

nohup ktransformers --model_path /models/deepseek-ai/DeepSeek-R1 --gguf_path /models/deepseek-ai/DeepSeek-R1-Q4_K_M --cpu_infer 96 --port 10002 > ktransformers.log 2>&1 &

Start the inference service and the web interface at the same time:

nohup ktransformers --model_path /models/deepseek-ai/DeepSeek-R1 --gguf_path /models/deepseek-ai/DeepSeek-R1-Q4_K_M --cpu_infer 96 --port 10002 --web True > ktransformers.log 2>&1 &

To use the last command, you must first install the web UI following https://kvcache-ai.github.io/ktransformers/en/api/server/website.html. Because the web UI has many dependencies, I did not experiment with it.

Log from starting local inference directly:

root@cf9c7d743f44:/workspace# python -m ktransformers.local_chat --model_path /models/deepseek-ai/DeepSeek-R1 --gguf_path /models/deepseek-ai/DeepSeek-R1-Q4_K_M --cpu_infer 96
flashinfer not found, use triton for linux
using custom modeling_xxx.py.
using default_optimize_rule for DeepseekV3ForCausalLM
Injecting model as ktransformers.operators.models.KDeepseekV2Model
Injecting model.embed_tokens as default
Injecting model.layers as default
Injecting model.layers.0 as default
Injecting model.layers.0.self_attn as ktransformers.operators.attention.KDeepseekV2Attention
Injecting model.layers.0.self_attn.q_a_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.self_attn.q_a_layernorm as default
Injecting model.layers.0.self_attn.q_b_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.self_attn.kv_a_proj_with_mqa as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.self_attn.kv_a_layernorm as default
Injecting model.layers.0.self_attn.kv_b_proj as default
Injecting model.layers.0.self_attn.o_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.self_attn.rotary_emb as ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
Injecting model.layers.0.mlp as default
Injecting model.layers.0.mlp.gate_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.mlp.up_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.mlp.down_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.0.mlp.act_fn as default
Injecting model.layers.0.input_layernorm as default
Injecting model.layers.0.post_attention_layernorm as default
...
Injecting model.layers.60 as default
Injecting model.layers.60.self_attn as ktransformers.operators.attention.KDeepseekV2Attention
Injecting model.layers.60.self_attn.q_a_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.self_attn.q_a_layernorm as default
Injecting model.layers.60.self_attn.q_b_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.self_attn.kv_a_proj_with_mqa as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.self_attn.kv_a_layernorm as default
Injecting model.layers.60.self_attn.kv_b_proj as default
Injecting model.layers.60.self_attn.o_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.self_attn.rotary_emb as ktransformers.operators.RoPE.YarnRotaryEmbeddingV3
Injecting model.layers.60.mlp as ktransformers.operators.experts.KDeepseekV3MoE
Injecting model.layers.60.mlp.experts as ktransformers.operators.experts.KTransformersExperts
Injecting model.layers.60.mlp.gate as ktransformers.operators.gate.KMoEGate
Injecting model.layers.60.mlp.shared_experts as default
Injecting model.layers.60.mlp.shared_experts.gate_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.mlp.shared_experts.up_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.mlp.shared_experts.down_proj as ktransformers.operators.linear.KTransformersLinear
Injecting model.layers.60.mlp.shared_experts.act_fn as default
Injecting model.layers.60.input_layernorm as default
Injecting model.layers.60.post_attention_layernorm as default
Injecting model.norm as default
Injecting lm_head as default
loading token_embd.weight to cpu
loading token_embd.weight with CPU
/opt/conda/lib/python3.11/site-packages/ktransformers/util/custom_gguf.py:523: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
  data = torch.from_numpy(data)
loading blk.0.attn_q_a_norm.weight to cuda:0
loading blk.0.attn_kv_a_norm.weight to cuda:0
loading blk.0.attn_kv_b.weight to cuda:0
loading blk.0.attn_norm.weight to cuda:0
loading blk.0.ffn_norm.weight to cuda:0
...
loading blk.60.attn_q_a_norm.weight to cuda:0
loading blk.60.attn_kv_a_norm.weight to cuda:0
loading blk.60.attn_kv_b.weight to cuda:0
loading blk.60.attn_norm.weight to cuda:0
loading blk.60.ffn_norm.weight to cuda:0
loading output_norm.weight to cuda:0
loading output.weight to cuda:0
Chat: Who are you?
loc ( "/opt/conda/lib/python3.11/site-packages/ktransformers/operators/triton_attention.py" : 95 : 16 ):  error : operation scheduled before its operands<think>
< /think>
Hello! I am DeepSeek-R1, an intelligent assistant developed by China's DeepSeek company. If you have any questions, I will do my best to help you.prompt eval count: 5 token(s)prompt eval duration: 75.26080298423767sprompt eval rate: 0.066435645139837 tokens/seval count :            42 tokens (s)  eval duration :         196. 99316954612732s eval rate :             0.2132053618750746  tokens/s 

System resources:

ATOP - my4090host   2025/02/21 07:57:12   --------------   10s elapsed
PRC | sys 1m53s    | user 5m59s   | #proc   1999 | #trun     29 | #tslpi  12e3 | #tslpu    69 | #zombie    1 | clones  1103 | #exit    320 |
CPU | sys     928% | user   3570% | irq       5% | idle   5952% | wait   2351% | steal     0% | guest     0% | ipc      0.79 | curf 2.93GHz |
CPL | avg1   89.71 | avg5   66.40 | avg15  35.28 | csw  1294839 | intr 1454401 | numcpu   128 |
MEM | tot   503.7G | free    4.3G | cache 433.7G | dirty   0.1M | buff    2.1G | slab   16.1G | slrec  12.3G | shmem   7.2G |
SWP | tot     0.0M | free    0.0M | vmcom 151.6G | vmlim 251.9G |
PAG | scan 2253726 | steal 1382e3 | stall      0 | swin       0 | swout      0 |
PSI | cs     3/2/2 | ms     0/0/0 | mf     0/0/0 | is  24/23/16 | if  24/23/16 |
LVM | vg0-lv0      | busy     74% | read   23230 | write    426 | KiB/r     96 | KiB/w      4 | MBr/s  218.2 | MBw/s    0.2 | avq    41.38 | avio 0.30 ms |
LVM | tu--vg-lv--0 | busy      3% | read       0 | write   1462 | KiB/r      0 | KiB/w     10 | MBr/s    0.0 | MBw/s    1.4 | avq    10.46 | avio 0.20 ms |
DSK | sdb          | busy     63% | read   15757 | write     30 | KiB/r    114 | KiB/w      2 | MBr/s  176.3 | MBw/s    0.0 | avq    26.62 | avio 0.39 ms |
DSK | sda          | busy     14% | read    3259 | write    136 | KiB/r    131 | KiB/w     12 | MBr/s   42.0 | MBw/s    0.2 | avq    29.75 | avio 0.39 ms |
DSK | sdc          | busy      3% | read       0 | write   1217 | KiB/r      0 | KiB/w     12 | MBr/s    0.0 | MBw/s    1.4 | avq    10.36 | avio 0.24 ms |
NET | transport    | tcpi    3600 | tcpo    3753 | udpi      16 | udpo      16 | tcpao     71 | tcppo     26 | tcprs      0 | tcpor     18 |
NET | network      | ipi     5325 | ipo      5333 | ipfrw   1185 | deliv   3617 | icmpi      1 | icmpo      0 |
...

Process status:

NPROCS  SYSCPU  USRCPU   VSIZE   RSIZE  PSIZE  SWAPSZ  RDDSK  WRDSK  RNET  SNET    CPU  CMD
     1  32.42s   5m23s  415.3G  224.0G     0K      0K   2.3G     0K     0     0  3603%  python
     1  25.86s   8.53s   49.1G  205.6M     0K      0K     0K     0K     0     0   348%  kubelet
     1   1.29s   6.95s  954.4M  66128K     0K      0K     0K     0K     0     0    83%  cri-dockerd
     1   1.49s   5.29s   18.2G  298.1M     0K      0K     0K    60K     0     0    69%  dockerd
   117   1.07s   0.65s  138.2G    1.2G     0K      0K     0K    20K     0     0    17%  containerd-shim
     1   0.82s   0.33s    1.2G  18796K     0K      0K     0K     0K     0     0    12%  node_exporter
     1   1.00s   0.00s      0K      0K     0K      0K     0K     0K     0     0    10%  kswapd1
     1   0.98s   0.00s      0K      0K     0K      0K     0K     0K     0     0    10%  kcompactd1

Now start the inference API service instead, still using 96 CPU cores:

root@e2a9f48fba71:/workspace# nohup ktransformers --model_path /models/deepseek-ai/DeepSeek-R1 --gguf_path /models/deepseek-ai/DeepSeek-R1-Q4_K_M --cpu_infer 96 --port 10002 > ktransformers.log 2>&1 &
[1] 25
root@e2a9f48fba71:/workspace# tail -f ktransformers.log
nohup: ignoring input
2025-02-21 11:19:56,043 INFO /opt/conda/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-21 11:19:56,064 INFO /opt/conda/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
flashinfer not found, use triton for linux
Injecting model as ktransformers.operators.models.KDeepseekV2Model
Injecting model.embed_tokens as default
Injecting model.layers as default
Injecting model.layers.0 as default
...
loading blk.32.attn_kv_a_norm.weight to cuda:0
loading blk.32.attn_kv_b.weight to cuda:0
2025-02-21 11:38:02,897 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/context_manager.py[21]: Creating Context Manager
INFO:     Started server process [25]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:10002 (Press CTRL+C to quit)

After starting, you can test it through the curl command:

Question 1

Warm up:

curl -X POST http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Who are you?"
            }
        ]
    }'
The response (original return, reformatted):
{ "id": "61fa8236-4a27-46ce-8cce-cc56856864dc", "object": "chat.completion", "created": 1740138262, "model": "not implmented", "system_fingerprint": "not implmented", "usage": { "completion_tokens": 1, "prompt_tokens": 1, "total_tokens": 2 }, "choices": [ { "index": 0, "message": { "content": "<think></think>Hello! I am DeepSeek-R1, an intelligent assistant developed by China's DeepSeek company. If you have any questions, I will do my best to help you.", "role": "assistant", "name": null }, "logprobs": null, "finish_reason": null } ]}

Question 2

curl -X POST http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "What is the Tang Dynasty poem that best reflects the atmosphere of the Tang Dynasty?"
            }
        ]
    }'

The raw response:

{"id":"e12812d4-ed8d-4000-84c6-fe73ed0c8cf9","object":"chat.completion","created":1740138507,"model":"not implmented","system_fingerprint":"not implmented","usage":{"completion_tokens":1,"prompt_tokens":1,"total_tokens":2},"choices":[{"index":0,"message":{"content":"<think>\n\n</think>\n\nThe Tang Dynasty is the golden age of Chinese classical poetry, and many masterpieces show the broad and majestic \"prosperous Tang atmosphere\". The most representative work is Wang Zhihuan's \"Climbing the Stork Tower\":\n\nThe sun sets behind the mountains, and the Yellow River flows into the sea.\nIf you want to see further, you must climb to a higher level.\n\nThis five-character poem integrates the vastness of natural scenery with the sublimation of the realm of life through the magnificent vision of climbing high and looking far. The grand imagery of \"white day\" and \"Yellow River\" and the enterprising spirit of \"a thousand-mile vision\" and \"climbing higher\" complement each other, perfectly interpreting the Tang people's broad mind of embracing the world and their high-spirited vigour. This way of expression that integrates personal ambition into the realm of heaven and earth is the most typical artistic presentation of the prosperous Tang Dynasty.","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}

Background log:

2025-02-21 11:48:27,219 WARNING /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py[166]: change system to user
2025-02-21 11:48:27,220 WARNING /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py[172]: merge two adjacent user messages
2025-02-21 11:48:27,231 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/transformers.py[197]: get input ids of shape torch.Size([1, 19])
2025-02-21 11:48:27,232 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py[132]: input_ids: torch.Size([1, 19])
2025-02-21 11:48:27,236 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py[158]: same prefix len:
2025-02-21 11:48:27,237 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py[167]: input_ids: torch.Size([1, 10])
2025-02-21 11:48:27,237 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py[169]: generate_ids: torch.Size([1, 9])
2025-02-21 11:48:27,237 DEBUG /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/interfaces/ktransformers.py[180]: cache position: 9 to 19
<think>
</think>
The Tang Dynasty was the golden age of Chinese classical poetry, and many masterpieces show the broad and majestic "prosperous Tang atmosphere". The most representative work is Wang Zhihuan's "Climbing the Stork Tower":
The sun sets behind the mountains, and the Yellow River flows into the sea. If you want to see further, you must climb to a higher level.
This five-character quatrain integrates the vastness of natural scenery with the sublimation of life through the magnificent vision of looking far from a height. The grand imagery of "white day" and "Yellow River" and the enterprising spirit of "a thousand-mile vision" and "going to higher levels" complement each other, perfectly interpreting the Tang people's broad-mindedness and high-spirited vigour. This way of expression that integrates personal ambitions into the realm of heaven and earth is the most typical artistic presentation of the prosperous Tang Dynasty.
2025-02-21 11:48:56,161 INFO /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/base.py[64]: Performance(T/s): prefill 8.755977285376156, decode 5.326085739842548. Time(s): tokenize 0.012891054153442383, prefill 1.1420769691467285, decode 27.787761449813843
INFO:     172.17.0.1:1686 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Question 3

curl -X POST http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Write an article \"On America\" based on \"On Qin\""
            }
        ]
    }'

The following is the reply. For ease of reading, the main content of json is extracted and formatted:

<think>

Well, the user asked me to write an article titled "On America" ​​in the style of "On the Qin Dynasty". First, I need to recall the structure and content of "On the Qin Dynasty". Jia Yi's "On the Qin Dynasty" mainly analyzes the reasons for the rapid demise of the Qin Dynasty, emphasizing that the offensive and defensive situation is different when benevolence and righteousness are not practiced. In terms of article structure, it first describes the rise of the Qin State, then describes its prosperity, and finally analyzes the reasons for its failure.

Next, I want to determine how to apply this structure to the United States. Users may want to explore the historical rise and fall of the United States, analyzing the reasons for its strength and potential factors for its decline. It needs to cover the founding of the United States, its expansion, its economic and military achievements, and then turn to point out its problems, such as hegemony, social contradictions, and institutional rigidity.

Pay attention to imitating the language style of ancient Chinese, using four-character idioms, parallel sentences, etc. It may be necessary to divide it into several paragraphs, first describing its strength, then discussing its decline, and finally summarizing the lessons. At the same time, users may hope to hint at the challenges of contemporary America through historical comparison, echoing the warning role of "On the Qin Dynasty".

It is necessary to ensure that the content is accurate and the historical events are correct, such as the War of Independence, the Westward Movement, the Cold War, etc. At the same time, it is necessary to avoid being too biased and maintain objective analysis, but it must also conform to the characteristics of ancient literature and have a certain argumentative nature.

The challenge may be how to translate modern concepts into classical Chinese, such as finding appropriate classical Chinese expressions for words like "hegemonism" and "economic globalization". In addition, the structure should be compact, the logic clear, and the theme should be pointed out at the end, echoing the core argument of "On the Qin Dynasty", that is, failure to practice benevolence and righteousness leads to decline.

You need to check whether it meets the needs of users and whether it conveys a profound analysis while imitating the ancient times, rather than just a superficial historical narrative. You may also need to add some comparisons, such as the analogy of the Roman Empire, to enhance persuasiveness.

Finally, make sure the writing flows smoothly, uses accurate words, and fits the style of classical Chinese, while making the meaning and criticism understandable to modern readers.

</think>

OK, I will write an ancient argumentative essay according to your requirements. In order to make the article more vivid and profound, I will first sort out the historical context and then analyze the core arguments. Please see if the following content meets your expectations. The historical events and viewpoints involved in the article are all for literary creation. If there is anything wrong, please correct me.

Article ideas

  1. Comparison of historical trajectories : Using the century-long development history of the United States to correspond to the rise and fall of the Qin Empire, highlighting the inherent laws of the rise of civilization

  2. Analysis of the Institutional Dilemma : Revealing the Inevitability of the Decline of Modern Empires through the Contradiction between Capital Monopoly and Democratic Alienation

  3. A warning for the development of civilization : Using the past to illustrate the present to emphasize the importance of a multipolar world and highlight the modern significance of cultural tolerance


【Part I】

When the United States was founded, it inherited the legacy of Britain and took advantage of the richness of the New World. The thirteen states tore their crowns and destroyed their crowns to seek independence; the Mayflower Compact punished criminals and comforted the people to advocate freedom. At that time, the disputes on the European continent had not yet ended, and the great powers were competing fiercely. They gathered displaced people, absorbed wisdom from all over the world, melted various ethnic groups, and expanded their territory. Since Washington's uprising and Linden's victory, it has taken 200 years for the United States to suddenly become a giant. Isn't it the advantage of the time?

【Part II】

Ironclad ships roam the world, greenbacks circulate in the vast land, and everyone bows down to the Star-Spangled Banner. Wall Street's gold mines gather the world's wealth; Silicon Valley's wisdom gathers the world's elite. Aircraft carriers cruise like running dogs, and missiles deter like hawks. However, they rely on their strength to bully the weak, using finance as a spear and democracy as a banner. The changes in Eastern Europe, the chaos in the Middle East, and the difficulties in Latin America all show the shadow of their maneuvers. I still remember that after the Soviet Union collapsed, the new Rome looked down on the world and said that history would end here. Who could resist their ambition?

【Part III】

However, money gradually erodes the foundation of freedom, and party disputes eventually destroy the foundation of constitutional government. Wall Street tycoons devour the middle class, and the elders on Capitol Hill become traitors to the people. Gunshots shake the campus, but the law is not enforced, and poisonous fog fills the streets, but there is no way to stop it. The red-blue struggle is more intense than the Civil War, and the racial divide is deeper than when the melting pot was first forged. The Mayflower Pledge of the past is now a mere letter; the bells of Independence Hall are only echoes in vain. Moreover, the world has changed dramatically, the trend of multi-polarity has taken shape, the hegemony of the US dollar is declining, and the technological barriers are gradually collapsing. Still relying on its own strength and not thinking about reform, haven't you heard that the Qin people had no time to mourn themselves, but later generations mourned them?

【Conclusion】

Therefore, it is said that the way to govern a country is to manifest virtue, to be close to the people, and to stop at the highest good. Hegemony can be successful for a while, but it will eventually rise and fall. Looking at the rise and fall of the United States over the past 200 years, isn't it that financial oligarchs replaced feudal monarchs, and cultural colonization replaced military conquest? The reason for its rapid demise is that it lost its way! Future generations should learn from this and know that destiny is unpredictable, and only virtue can assist. If you rely on strength and abandon justice, even the Empire State will eventually become a ruin.


Notes and interpretation

  1. Wall Street’s Gold Hole : refers to the social contradictions caused by capital monopoly. The 2008 financial crisis revealed the shocking reality that 1% of the rich controlled 40% of the wealth. This imbalance in the economic structure has become a hidden danger of social unrest.

  2. The red-blue dispute is more intense than the Civil War : The degree of political polarization in contemporary America has surpassed that of the 1860s. According to data from the Pew Research Center, the opposition index of supporters of the two parties on core issues has soared from 15 points in 1994 to 36 points in 2020.

  3. The trend towards multipolarity is already taking shape : According to IMF data, China's GDP has surpassed that of the United States in terms of purchasing power parity, India has jumped to third place, and emerging economies have contributed 75% of global growth, marking the end of a unipolar world.

This imitation of ancient text reveals the development paradox of contemporary superpowers through the mirror of history. Its value lies in interpreting modern dilemmas with traditional wisdom. The warnings buried in the text are more profound in today's reality, such as the conflict between Russia and Ukraine and the shaky hegemony of the US dollar. When AI algorithms begin to replace Wall Street traders and when TikTok breaks the monopoly of Western discourse, history is verifying the power of ancient wisdom in a new form.


Data collected from background logs:

2025-02-21 11:57:02,966 INFO /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/base.py[64]: Performance(T/s): prefill 11.489579189090895, decode 5.333968008750722. Time(s): tokenize 0.011757373809814453, prefill 1.4796016216278076, decode 233.22224617004395
  • prefill 11.489579189090895: Indicates that during the prefill phase, the system can process 11.489579189090895 tokens per second. This metric reflects the processing speed of the system during preprocessing and initializing the model state.

  • decode 5.333968008750722: Indicates that in the decode phase, the system processes 5.333968008750722 tokens per second. This metric reflects the speed of text generation. Decode is normally slower than prefill because tokens are generated one at a time, autoregressively, so each output token requires a full forward pass, whereas prefill processes the whole prompt in parallel.
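As a rough back-of-the-envelope check of my own (not output from KTransformers), the reported rates and durations are self-consistent:

# tokens processed ≈ rate (T/s) × duration (s)
awk 'BEGIN {
    printf "prompt tokens    ≈ %.0f\n", 11.489579 * 1.479602;   # ≈ 17 prompt tokens
    printf "generated tokens ≈ %.0f\n",  5.333968 * 233.222246; # ≈ 1244 generated tokens
}'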

Peak GPU consumption:

Fri Feb 21 11:56:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07     CUDA Version: 12.2     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0   Off |                  Off |
| 30%   38C    P2              72W / 450W |  13440MiB / 24564MiB   |     16%      Default |
+-----------------------------------------+------------------------+----------------------+

Perhaps too much of the work was on the CPU and the single GPU was far from fully utilized, so I changed --cpu_infer to 64 and tried again. The output for the same question was:

<think>

OK, the user asked me to write an essay titled "On America" ​​based on "On the Qin Dynasty". First, I need to recall the structure and main content of "On the Qin Dynasty". Jia Yi's "On the Qin Dynasty" mainly analyzes the rise and fall of the Qin Dynasty, pointing out that the reason for its failure was that benevolence and righteousness were not practiced and the offensive and defensive forces were different. Therefore, I need to apply this analytical structure to the history and current situation of the United States.

Next, I need to determine the theme of "On America". Although the history of the United States is not long, as a superpower, its rise and current challenges can be compared to the rise and fall of the Qin Dynasty. It may be necessary to analyze the United States from its founding, expansion, and prosperity to its current internal problems, and finally point out its problems, such as internal contradictions, hegemonism, racial issues, etc.

Then, I need to write in classical Chinese, which may be a bit challenging for me. I have to make sure that the words are accurate, the sentences are in line with the style of classical Chinese, and the content is relevant to the specific situation in the United States. I may need to divide it into several paragraphs, first describing the rise of the United States, then talking about its achievements during its heyday, and finally analyzing the reasons for its decline.

The user's needs may not only be simple imitation, but also hope to deeply analyze the problems of the United States through the form of ancient texts, similar to the critical nature of "On the Qin Dynasty". Therefore, the analysis part needs to be in-depth, such as mentioning racial discrimination, political polarization, economic inequality, and the disgust caused by hegemonic behavior in diplomacy.

At the same time, we should pay attention to avoid using modern words that are too straightforward, and replace them with corresponding expressions in classical Chinese. For example, "hegemony" can be called "hegemony", and "racial discrimination" may need to be described in a more classical way. In addition, the structure may need to be first positive and then negative, first talking about the strength of the United States, then turning to its problems, and finally summarizing the lessons.

You also need to consider the user's background. The user may have some knowledge of classical Chinese and want to see accurate allusions and structures, as well as in-depth content and thorough analysis. You may need to quote key events in American history, such as the Revolutionary War, the Westward Movement, and the Cold War, but briefly summarize them in classical Chinese.

Finally, check whether it conforms to the style of "On the Qin Dynasty", such as using parallel sentences and contrast techniques, and ending with "Why?" to draw the conclusion and point out the central idea. Ensure that the logic is clear, the argument is clear, and the rhythm and rhythm of classical Chinese are maintained.

</think>

OK, I will write a classical Chinese article imitating "On the Qin Dynasty" according to your requirements. In order to make the article more profound and appealing, I will first sort out the historical context, and then integrate metaphors and speculation. Please see if the following content meets your expectations.

On America

1. Wilderness and Frontier Development

In the past, the British colonized the oceans, riding three-masted ships across the raging waves, carrying firearms to slaughter the red plains. Only one in ten native peoples of North America survived, and the fertile lands of thousands of miles were turned into white sails. The Mayflower Compact was first established, and the thirteen states were set ablaze. The desperate people of Europe gathered to imitate the Roman Republic. However, before the chains of the black slaves were cold, the president had already erected a monument in Washington.

2. Iron-blooded domination

The two oceans are natural barriers that cut off the east and west, and the Monroe Doctrine blocks the north and the south. Wall Street's gold cave swallows up the seven seas, and Silicon Valley's optical cables interweave the nine heavens. The flags of Normandy have not been unfurled, but the oil tankers in the Persian Gulf have been controlled. B2 hides in thunderclouds and frightens all nations, and aircraft carriers are linked in a star chain to suppress the world. When the Pentagon document is issued, small countries tremble, and civilization is silent when the lights and shadows of Hollywood pass by.

3. Disadvantages of Hegemony

However, the two eagles are torn apart by the two parties, and black and white blood is splattered in the streets. Floyd's neck is broken, and the mob rushes to the Capitol. Shootings kill children in the day, and opiates bury young people at night. The Statue of Liberty is rusted, and the Constitution is worm-eaten. The light of the former beacon is gradually dimming, and the city on the hill is covered with fog.

4. The Way of Rise and Fall

The United States has been established for 200 years, and its technology and military strength are unprecedented in ancient and modern times. However, it cannot stop the internal division of ethnic groups, and it is difficult to stop the rebellion of various countries. The gold and dollar democracy has finally become a plaything of the plutocracy, and universal values ​​are actually a banner for conquest. Qin lost its power because of its harsh laws, and the United States lost its tripod because of its hegemonic tactics. If the occupant of the White House had been more benevolent and the Pentagon had been less involved in the war, why would there be so many chaos today?

5. A new chapter in civilization

Today, the Mississippi River is surging, and the remaining snow in the Rockies is reflecting the sunset. The new Roman Empire is about to fall, but it still sings that the Mandate of Heaven is clear. Haven't you heard the ancient Chinese saying: A country may be great, but it will perish if it loves war; the world may be peaceful, but it will be in danger if it forgets war. This is the cycle of nature, and it cannot be reversed by nuclear-armed aircraft carriers.


Server statistics log:

2025-02-21 13:56:22,982 INFO /opt/conda/lib/python3.11/site-packages/ktransformers/server/backend/base.py[64]: Performance(T/s): prefill 11.667566049316758, decode 5.689817879379478. Time(s): tokenize 0.011485099792480469, prefill 1.4570305347442627, decode 175.04953956604004
  • prefill 11.667566049316758: Indicates that during the prefill phase, the system can process 11.667566049316758 tokens per second. This metric reflects the processing speed of the system during preprocessing and initializing the model state.

  • decode 5.689817879379478: Indicates that in the decode phase, the system processes 5.689817879379478 tokens per second. This metric reflects the speed of text generation. As before, decode is slower than prefill because tokens are produced one at a time, autoregressively, rather than in parallel over the whole prompt.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07     CUDA Version: 12.2     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0   Off |                  Off |
| 30%   37C    P2              73W / 450W |  13440MiB / 24564MiB   |     17%      Default |
+-----------------------------------------+------------------------+----------------------+

The token speed and resource usage have not changed much.



Summary


A single 4090 GPU can indeed run the full version of Deepseek-R1. However, there are a few points to note:


1. Strictly speaking, it is not the full-blooded version. The original 671B full-blooded model is FP8; what we are running is the INT4 (Q4_K_M) quantized version. The parameter count is still 671B, and INT4 quantization is very cost-effective, saving a great deal of memory with little loss of quality.


2. A single 4090 card uses about 14G of video memory and less than 20% of its compute, while the run consumes more than 400G of system memory and over 60 CPU cores (counting hyperthreads). The CPU used in the test is an AMD EPYC 7542.


3. The quality of the generation is quite good, and you can feel that it is indeed a full-blooded version.


4. Throughput is basically 5 to 6 tokens per second, which feels slow and is only suitable for personal use or for batch-processing tasks that do not require real-time responses.