The full version of DeepSeek-R1-671B has been successfully deployed; consulting and deployment services are now available

A success-case write-up: deploying the full version of DeepSeek-R1-671B.
Core content:
1. Successful deployment of DeepSeek-R1-671B on four servers
2. Detailed instructions for consulting and deployment services
3. Deployment process optimization and successful case presentation
I have successfully deployed the full version of DeepSeek-R1-671B on four servers; the details are given below. I am now available for consulting or deployment engagements. The deployment process is still being optimized, and I welcome exchanging experience with others. The rest of this post shows the system after the successful deployment.
1. Demonstration of the full DeepSeek-R1-671B deployment
Ray Cluster Status
Production Metrics
(self-llm) deepseek@deepseek2:~$ curl http://10.119.85.138:8000/metrics
...
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 37427.0
python_gc_objects_collected_total{generation="1"} 14232.0
python_gc_objects_collected_total{generation="2"} 16818.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 3033.0
python_gc_collections_total{generation="1"} 267.0
python_gc_collections_total{generation="2"} 315.0
...
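The `/metrics` output above is in the Prometheus text format, so it can be read programmatically. The following is a minimal sketch, not part of the original deployment: `parse_metrics` is a hypothetical helper, and it assumes each sample line has the form `name{labels} value` with no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text-format samples into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


# Sample lines copied from the metrics output shown above.
METRICS_SAMPLE = """\
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 3033.0
python_gc_collections_total{generation="1"} 267.0
python_gc_collections_total{generation="2"} 315.0
"""

metrics = parse_metrics(METRICS_SAMPLE)
total_collections = sum(
    value for (name, _), value in metrics.items()
    if name == "python_gc_collections_total"
)
print(total_collections)
```

In practice the text would come from the `curl` call above rather than from an embedded sample string.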
OpenAI-compatible API test
# 10.119.85.138 is the IP address of the InfiniBand (IB) network interface on the deepseek2 node
(self-llm) deepseek@deepseek2:~$ curl 10.119.85.138:8000/v1/models -H "Authorization: Bearer zY0MrQwXV9Oo3g==" | jq
#The output is as follows
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 523 100 523 0 0 105k 0 --:--:-- --:--:-- --:--:-- 127k
{
  "object": "list",
  "data": [
    {
      "id": "DeepSeek-R1-671B",
      "object": "model",
      "created": 1740405511,
      "owned_by": "vllm",
      "root": "/root/.cache/huggingface/hub/models/unsloth/DeepSeek-R1-BF16/",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-ced685e8156b4618b593580109205165",
          "object": "model_permission",
          "created": 1740405511,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
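The model list above can also be checked from a script, for example to confirm the served model name and context length before sending requests. This is a small sketch under the assumption that the response keeps the shape shown above; `summarize_models` is a hypothetical helper, and the sample body is a trimmed copy of that response.

```python
import json


def summarize_models(body):
    """Return (model id, max_model_len) pairs from a /v1/models response body."""
    payload = json.loads(body)
    return [(m["id"], m.get("max_model_len")) for m in payload.get("data", [])]


# Trimmed copy of the response shown above.
MODELS_SAMPLE = (
    '{"object": "list", "data": [{"id": "DeepSeek-R1-671B", '
    '"object": "model", "owned_by": "vllm", "max_model_len": 32768}]}'
)

print(summarize_models(MODELS_SAMPLE))
```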
At the same time, you will see the following output in the window where the vllm serve command is executed
Service Function Verification
(self-llm) deepseek@deepseek2:~$ curl -X POST "http://10.119.85.138:8000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer zY0MrQwXV9Oo3g==" -d '{ "model": "DeepSeek-R1-671B", "messages": [{"role": "user", "content": "Hello"}]}'
(self-llm) deepseek@deepseek2:~$ curl -X POST "http://10.119.85.138:8000/v1/chat/completions" -H "Content-Type: application/json" -H "Authorization: Bearer zY0MrQwXV9Oo3g==" -d '{ "model": "DeepSeek-R1-671B", "messages": [{"role": "user", "content": "Please prove the Pythagorean theorem"}]}'
# Response (truncated)
{"id":"chatcmpl-11ae1ddf321343af848b5c683e67b72d","object":"chat.completion","created":1740411348,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nWell, the user asked me to prove the Pythagorean theorem. The Pythagorean theorem is a very basic but important theorem in mathematics, and there must be many different ways to prove it. First recall that the Pythagorean theorem states that in a right triangle, the square of the hypotenuse is equal to the sum of the squares of the two legs, that is, a² + b² = c². Now I have to choose a suitable proof method, which may be geometric or algebraic. \n\nThe first thing that comes to mind is the splicing method in geometric proof, which is to put four right triangles together to form a large square and then compare the areas. Should I try this method? For example, take four congruent right triangles, let their legs be a and b, and let the hypotenuse be c. If we put them together, we should form a square with a side length of (a+b). The space in the middle might be a small square with a side length of c or something else? \n\nNo, it should form a square with a side length of c, or this? Wait, maybe I need to draw a picture carefully to imagine it. Suppose we put four triangles together. If the right angles of each shape face outward, the hypotenuses will form a square inside.
...
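The same chat request can be built from Python, and the reasoning can be separated from the final answer. In the response above `reasoning_content` is null and the reasoning appears inside `content` after a `<think>` tag. The sketch below is illustrative only: `build_chat_request` and `split_reasoning` are hypothetical helpers, `API_KEY` is a placeholder, and it assumes the complete response closes the reasoning with `</think>` (the excerpt above is truncated before that point).

```python
import json
import urllib.request

BASE_URL = "http://10.119.85.138:8000"  # IB address of the deepseek2 node, as above
API_KEY = "YOUR_API_KEY"                # placeholder; use the key configured for the server


def build_chat_request(prompt, model="DeepSeek-R1-671B"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


def split_reasoning(content):
    """Split '<think>reasoning</think>answer' into (reasoning, answer).

    Assumption: the reasoning block ends with '</think>'. If that tag is
    missing, the whole text is returned as the answer.
    """
    head, sep, tail = content.partition("</think>")
    if not sep:
        return "", content.strip()
    return head.replace("<think>", "", 1).strip(), tail.strip()


request = build_chat_request("Please prove the Pythagorean theorem")
print(request.get_method(), request.full_url)
print(split_reasoning("<think>\nchoose a proof\n</think>\n\nHere is the proof."))
```

To send the request, pass it to `urllib.request.urlopen` and decode the returned body with `json.loads`; the message text is then under `choices[0].message.content`, as in the response shown above.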
# While the question is being answered, the window running the vllm serve command shows the average token generation throughput
INFO 02-24 17:21:12 metrics.py:455] Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-24 17:21:17 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-24 17:21:22 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
...
# Higher aggregate generation throughput with several concurrent requests
INFO 02-24 23:32:00 metrics.py:455] Avg prompt throughput: 442.9 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 02-24 23:32:05 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 102.4 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 02-24 23:32:07 async_llm_engine.py:179] Finished request chatcmpl-03add50cba264c84afe98fd6cce9907f.
INFO 02-24 23:32:10 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 79.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
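To track these figures over time, the throughput values can be pulled out of the log text. The following is a rough sketch that assumes the log line format stays exactly as shown above; `generation_throughputs` is a hypothetical helper, and the sample lines are shortened copies of the single-request log entries.

```python
import re
from statistics import mean

# Matches the generation throughput field of the vLLM metrics log lines above.
GEN_RE = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")


def generation_throughputs(log_text):
    """Return every generation throughput value (tokens/s) found in the log."""
    return [float(m.group(1)) for m in GEN_RE.finditer(log_text)]


# Shortened copies of the single-request log lines shown above.
LOG_SAMPLE = """\
INFO 02-24 17:21:12 metrics.py:455] Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs
INFO 02-24 17:21:17 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 37.2 tokens/s, Running: 1 reqs
INFO 02-24 17:21:22 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.5 tokens/s, Running: 1 reqs
"""

rates = generation_throughputs(LOG_SAMPLE)
print(rates, round(mean(rates), 1))
```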
# apt install nvtop
(self-llm) deepseek@deepseek1:~/installPkgs$ nvtop
#The following is the output of the `nvtop` command
open-webui session interface
# 10.119.85.138 is the IP address of the InfiniBand (IB) network interface on the deepseek2 node
(self-llm) deepseek@deepseek2:~$ curl http://10.119.85.138:18080
# Alternatively, open the address above in a browser. The first user to register becomes the administrator by default. After registering, log in and start asking questions.
2. Hardware and software used for successful deployment
Server Information
Note:
(1) The 10 Gigabit network card was not used during the deployment process.
(2) The NVIDIA A800 GPU details are as follows:
deepseek@deepseek1:~$ nvidia-smi
Fri Feb 21 09:25:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  |   00000000:3D:00.0 Off |                    0 |
| N/A   33C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  |   00000000:42:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          On  |   00000000:61:00.0 Off |                    0 |
| N/A   30C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          On  |   00000000:67:00.0 Off |                    0 |
| N/A   33C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A800-SXM4-80GB          On  |   00000000:AD:00.0 Off |                    0 |
| N/A   32C    P0             57W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A800-SXM4-80GB          On  |   00000000:B1:00.0 Off |                    0 |
| N/A   29C    P0             61W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A800-SXM4-80GB          On  |   00000000:D0:00.0 Off |                    0 |
| N/A   30C    P0             62W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A800-SXM4-80GB          On  |   00000000:D3:00.0 Off |                    0 |
| N/A   32C    P0             60W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
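The GPU memory figures above allow a rough capacity check. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes all four servers have the same eight 80 GB A800 cards as deepseek1, takes 671e9 as the nominal parameter count from the model name, and uses 2 bytes per parameter for BF16 (the served checkpoint path is DeepSeek-R1-BF16). KV cache, activations, and framework overhead are not counted.

```python
GPUS_PER_NODE = 8      # from the nvidia-smi output above
MIB_PER_GPU = 81920    # from the nvidia-smi output above
NODES = 4              # assumption: all four servers match deepseek1
PARAMS = 671e9         # assumption: nominal parameter count from the model name
BYTES_PER_PARAM = 2    # BF16 stores each parameter in 2 bytes

GIB = 1024 ** 3


def cluster_memory_gib(nodes=NODES, gpus=GPUS_PER_NODE, mib=MIB_PER_GPU):
    """Total GPU memory across the cluster, in GiB."""
    return nodes * gpus * mib / 1024


def weights_gib(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the model weights alone, in GiB."""
    return params * bytes_per_param / GIB


total = cluster_memory_gib()
weights = weights_gib()
print(f"cluster GPU memory: {total:.0f} GiB")
print(f"BF16 weights:       {weights:.0f} GiB")
print(f"headroom:           {total - weights:.0f} GiB")
```

Under these assumptions the weights take roughly half of the combined GPU memory, and a single 8-GPU node (640 GiB) could not hold them on its own.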
Software Information
Physical server operating system: Ubuntu 22.04.4 LTS-x86_64
NVIDIA driver version: 550.90.07
CUDA runtime version: 12.1.105 (in the node container), V12.4.99 (on the physical server)
nvidia-fabricmanager version: 550.90.07
NVLink: 3.0
NVSwitch: 2.0
PyTorch version: 2.5.1+cu124
CUDA used to build PyTorch: 12.4
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
CMake version: version 3.31.4
Libc version: glibc-2.35
Python version: 3.12.9 (main, Feb 5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA_MODULE_LOADING set to: LAZY
Is XNNPACK available: True
CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz, 112 cores
numpy==1.26.4
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
triton==3.1.0
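When reproducing this environment, the package versions listed above can be compared with what is installed locally. This is a minimal sketch using only the standard library; `parse_pins`, `installed_version`, and `mismatches` are hypothetical helpers. The local build suffix (for example `+cu124`) is stripped before comparing, because the PyTorch version above is reported as 2.5.1+cu124 while the package list shows torch==2.5.1.

```python
from importlib import metadata

# Package versions copied from the list above.
PINS = """\
numpy==1.26.4
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
triton==3.1.0
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:
            pins[name] = version
    return pins


def installed_version(name):
    """Installed version without the local build suffix, or None if missing."""
    try:
        return metadata.version(name).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def mismatches(pins):
    """Return {package: (wanted, installed)} for every package that differs."""
    found = {name: installed_version(name) for name in pins}
    return {n: (w, found[n]) for n, w in pins.items() if found[n] != w}


for package, (wanted, found) in mismatches(parse_pins(PINS)).items():
    print(f"{package}: wanted {wanted}, found {found}")
```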