Performance and inference test of DeepSeek-V3-0324 running on 8 H20 cards

How do the performance and reasoning ability of DeepSeek-V3-0324 (685B) hold up on an 8-card H20 server?
Core content:
1. 8-card H20 server configuration and DeepSeek-V3-0324 deployment
2. Performance comparison of DeepSeek-V3-0324 (685B) and DeepSeek-R1-AWQ (671B)
3. DeepSeek-V3-0324 performance on math problems
Recently, I deployed DeepSeek-R1-AWQ (671B) and the latest DeepSeek-V3-0324 (685B) on an 8-card H20 machine and tested their performance and their scores on math problems. The server is provided by Volcano Engine. Let's look at the machine configuration first:
8-card H20 machine configuration
GPU:
NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2
8x NVIDIA H20, 97871 MiB memory each; all cards idle (0 MiB used, 0% utilization, 29-33C, 71-74W / 500W)
I hit a snag here: the original driver version was problematic. It worked fine on an RTX 4090, but running DeepSeek-R1-AWQ on the H20 kept crashing no matter which configurations and software versions I tried. After switching to the driver version NVIDIA officially recommends for the H20, Driver Version 550.144.03 (CUDA 12.4), it worked without any other configuration changes.
Inter-card interconnection:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
Memory:
# free -g
              total        used        free      shared  buff/cache   available
Mem:           1929          29        1891           0           9        1892
Swap:             0           0           0
Disk:
vda       252:0    0   100G  0 disk
├─vda1    252:1    0   200M  0 part /boot/efi
└─vda2    252:2    0  99.8G  0 part /
nvme3n1   259:0    0   3.5T  0 disk
nvme2n1   259:1    0   3.5T  0 disk
nvme0n1   259:2    0   3.5T  0 disk
nvme1n1   259:3    0   3.5T  0 disk
OS
# uname -a
Linux H20 5.4.0-162-generic #179-Ubuntu SMP Mon Aug 14 08:51:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Start inference
Use vLLM v0.8.2 to start the inference service, serving each of the following two models in turn (a quick sanity check of the running service is sketched after the model list):
DeepSeek-R1-AWQ: https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ
DeepSeek-V3-0324: https://modelscope.cn/models/deepseek-ai/DeepSeek-V3-0324
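The exact vLLM launch command is not reproduced here. Once the service is up, one quick way to confirm it is reachable is to list the served models through the OpenAI-compatible API. This is only a minimal sketch: port 7800 and the placeholder API key come from the benchmark command further below, and everything else is an assumption.

# Minimal sanity check of the deployed vLLM OpenAI-compatible service.
# Port 7800 and the placeholder API key come from the benchmark command below; the rest is assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

# vLLM exposes the served model name(s) under /v1/models.
for model in client.models.list():
    print(model.id)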
H20 Performance Review
Start performance testing:
nohup python3 -u simple-bench-to-api.py --url http://localhost:7800/v1 \
    --model DeepSeek-R1 \
    --concurrencys 1,10,20,30,40,50 \
    --prompt "Introduce the history of China" \
    --max_tokens 100,1024,16384,32768,65536,131072 \
    --api_key sk-xxx \
    --duration_seconds 30 \
    > benth-DeepSeek-R1-AWQ-8-H20.log 2>&1 &
This command runs batch tests with max_tokens values of 100, 1024, 16384, 32768, 65536, and 131072 at concurrency levels of 1, 10, 20, 30, 40, and 50. Each max_tokens value produces one table covering the different concurrency levels. The stress-test script simple-bench-to-api.py and the detailed parameter descriptions are in the earlier article "Concurrency Performance of DeepSeek-R1 Small Model Deployed on a Single 4090"; readers who need the script can grab it there.
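The actual simple-bench-to-api.py is not reproduced here. The following is only a rough sketch, under my own assumptions, of what such a sweep does: fire N concurrent streaming requests at the OpenAI-compatible endpoint and record each request's first-token delay, total latency, and generated token count. The endpoint, model name, prompt, and API key match the command above; the function name and the chunk-counting token proxy are illustrative.

# Rough sketch of a concurrency sweep (NOT the actual simple-bench-to-api.py).
# Endpoint, model name, prompt and API key match the command above; the rest is assumed.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

def one_request(max_tokens):
    """Send one streaming request; return (first-token delay, total latency, tokens generated)."""
    start = time.time()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model="DeepSeek-R1",
        messages=[{"role": "user", "content": "Introduce the history of China"}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.time()
            n_tokens += 1  # counting stream chunks as a rough token proxy
    total = time.time() - start
    ttft = (first_token_at if first_token_at is not None else time.time()) - start
    return ttft, total, n_tokens

for concurrency in (1, 10, 20, 30, 40, 50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, [1024] * concurrency))
    print(f"concurrency={concurrency}, sample result={results[0]}")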
Stress test results:
Performance test of DeepSeek-R1-AWQ deployed on 8 H20 cards
----- max_tokens=100 Stress test results summary-----
A few concepts need explaining first:
"Latency": the time from sending a request to receiving the last token/character (it includes the first-token delay).
"P90 latency": the latency at the 90th percentile. Sort the latencies from smallest to largest, take the largest latency within the first 90% and the next latency value, and linearly interpolate between the two.
"First-token delay": the time from when a request is sent until the first character is received.
"Single-concurrency throughput": from the perspective of each concurrent user/channel, the token-generation speed after the first token comes back; the timing excludes the first-token delay. That is, a channel's throughput = tokens generated by the channel / generation time excluding the first-token delay. Personally, I think this indicator together with the average first-token delay best reflects the real user experience.
The meaning of specific indicators:
Average latency: the average latency across all channels (including the first-token delay).
Average first-token delay: the average of the first-token delays across all channels.
Single-concurrency minimum throughput: the throughput of the slowest channel among all concurrent channels (excluding the first-token delay).
Single-concurrency maximum throughput: the throughput of the fastest channel among all concurrent channels (excluding the first-token delay).
Single-concurrency average throughput: the average throughput across all concurrent channels (excluding the first-token delay).
Overall throughput: the total number of tokens generated by all channels during the stress test divided by the time from the start to the end of the stress test.
P90 latency: 90% of request latencies are below this value.
P95 latency: 95% of request latencies are below this value.
P99 latency: 99% of request latencies are below this value.
For details, please refer to the earlier article "Concurrency Performance of DeepSeek-R1 Small Model Deployed on a Single 4090".
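To make the definitions above concrete, here is a small illustrative computation of the summary indicators from per-request measurements (first-token delay, total latency, generated tokens, plus the wall-clock duration of the run). The function and variable names are my own, not taken from the actual script; the P90/P95/P99 values follow the linear-interpolation rule described above.

# Illustrative computation of the summary indicators above from per-request measurements.
# Names are my own; this is not code from the actual stress-test script.
def percentile(values, p):
    """Percentile with linear interpolation, e.g. p=90 for P90 latency."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def summarize(first_token_delays, total_latencies, token_counts, wall_clock_seconds):
    # Per-channel throughput excludes the first-token delay, as defined above.
    per_channel = [
        n / (t - f)
        for f, t, n in zip(first_token_delays, total_latencies, token_counts)
        if t > f
    ]
    return {
        "avg_latency": sum(total_latencies) / len(total_latencies),
        "avg_first_token_delay": sum(first_token_delays) / len(first_token_delays),
        "per_channel_min_throughput": min(per_channel),
        "per_channel_max_throughput": max(per_channel),
        "per_channel_avg_throughput": sum(per_channel) / len(per_channel),
        "overall_throughput": sum(token_counts) / wall_clock_seconds,
        "p90_latency": percentile(total_latencies, 90),
        "p95_latency": percentile(total_latencies, 95),
        "p99_latency": percentile(total_latencies, 99),
    }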
----- max_tokens=1024 Stress test results summary-----
----- max_tokens=16384 (16k) Stress test results summary-----
----- max_tokens=32768 (32k) Stress test results summary-----
----- max_tokens=65536 (64k) Stress test results summary-----
----- max_tokens=131072 (128k) Stress test results summary-----
Performance test of DeepSeek-V3-0324 deployed on 8 H20 cards
----- max_tokens=100 Stress test results summary-----
----- max_tokens=1024 Stress test results summary-----
----- max_tokens=16384 (16k) Stress test results summary-----
----- max_tokens=32768 (32k) Stress test results summary-----
----- max_tokens=65536 (64k) Stress test results summary-----
Peak resource usage during the stress test:
NVIDIA-SMI 550.144.03    Driver Version: 550.144.03    CUDA Version: 12.4
8x NVIDIA H20, about 95070-95096 MiB / 97871 MiB used per card, 176-184W / 500W, 39-46C, GPU utilization mostly 95-98%
Peak KV cache usage:
INFO 03-31 23:22:50 [loggers.py:80] Avg prompt throughput: 45.0 tokens/s, Avg generation throughput: 166.9 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:20 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:30 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:40 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:50 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
Mathematical data set benchmarking
We used lighteval (https://github.com/huggingface/lighteval) to run math test sets against DeepSeek-R1-AWQ and DeepSeek-V3-0324 deployed on the 8 H20 cards. We modified a small amount of lighteval code so that it does not run model inference itself but instead calls the OpenAI-compatible API of the already-deployed model, roughly as sketched below. The test results follow.
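The modified lighteval code is not shown here; in essence, instead of loading weights itself, it sends each evaluation prompt to the OpenAI-compatible endpoint of the already-deployed model. The sketch below illustrates such a call under my own assumptions: the endpoint, model name, and API key follow the deployment above, while the prompt and sampling parameters are only placeholders.

# Rough sketch of what the modified lighteval does: call the deployed model's
# OpenAI-compatible API instead of running inference locally.
# The prompt and parameters are placeholders only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

resp = client.chat.completions.create(
    model="DeepSeek-R1",  # or DeepSeek-V3-0324, depending on which service is running
    messages=[{"role": "user", "content": "Solve: what is 7 * 8?"}],
    max_tokens=2048,
    temperature=0.0,
)
print(resp.choices[0].message.content)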
Benchmark of DeepSeek-R1-AWQ deployed on 8 H20 cards
Math500 Assessment
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0"
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|--------------------|------:|----------------|----:|---|-----:|
|all |
Benchmark of DeepSeek-V3-0324 deployed on 8 H20 cards
Math500 Assessment
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0" --max-samples 20
To save time, only 20 questions were used.
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|--------------------|------:|----------------|----:|---|-----:|
|all | |extractive_match| 0.95|± | 0.05|
|lighteval:math_500:0| 1|extractive_match| 0.95|± | |
Peak resource consumption during the test:
NVIDIA-SMI 550.144.03    Driver Version: 550.144.03    CUDA Version: 12.4
8x NVIDIA H20, about 97022-97048 MiB / 97871 MiB used per card, 159-164W / 500W, 36-42C, GPU utilization mostly 91-97%
AIME25 evaluation
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|aime25|0|0" --max-samples 20
To save time, only 20 questions were used.
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|------------------|------:|----------------|----:|---|-----:|
|all | |extractive_match| 0.4|± |0.1124|
|lighteval:aime25:0| 1|extractive_match| 0.4|± |0.1124|
AIME25 is relatively new, but this score seems lower than the scores others have published. It may be an issue with the evaluation method, or the context may have been truncated during evaluation, which would affect the results.