Performance and inference test of DeepSeek-V3-0324 running on 8 H20 cards

Written by Audrey Miles
Updated on: June 30, 2025

How well does DeepSeek-V3-0324 (685B) perform on an 8-card H20 server, and how strong is its reasoning?

Core content:
1. 8-card H20 server configuration and DeepSeek-V3-0324 deployment
2. Performance comparison of DeepSeek-V3-0324 (685B) and DeepSeek-R1-AWQ (671B)
3. DeepSeek-V3-0324 performance on math problems

Yang Fangxian
Founder of 53AI, Tencent Cloud Most Valuable Expert (TVP)

Recently, I deployed DeepSeek-R1-AWQ (671B) and the latest DeepSeek-V3-0324 (685B) on an 8-card H20 machine and tested both performance and math-benchmark scores. The server is provided by Volcano Engine. First, the machine configuration:

8-card H20 machine configuration

GPU:

NVIDIA-SMI 535.161.08   Driver Version: 535.161.08   CUDA Version: 12.2
8 x NVIDIA H20 (Persistence-M: On), 97871 MiB memory per card;
idle before the test: 0 MiB used, 0% GPU-Util, 29-33C, 71-74 W / 500 W.

I hit a pitfall here: the preinstalled driver version had problems. It worked fine on an RTX 4090, but crashed when running DeepSeek-R1-AWQ on the H20 no matter which configurations and software versions I tried. After switching to the driver version recommended for the H20 on NVIDIA's official site, 550.144.03 (CUDA 12.4), it worked without changing anything else.

Inter-card interconnection:

       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0    X    OK    OK    OK    OK    OK    OK    OK
GPU1    OK   X     OK    OK    OK    OK    OK    OK
GPU2    OK   OK    X     OK    OK    OK    OK    OK
GPU3    OK   OK    OK    X     OK    OK    OK    OK
GPU4    OK   OK    OK    OK    X     OK    OK    OK
GPU5    OK   OK    OK    OK    OK    X     OK    OK
GPU6    OK   OK    OK    OK    OK    OK    X     OK
GPU7    OK   OK    OK    OK    OK    OK    OK    X

Legend:
  X   = Self
  OK  = Status Ok
  CNS = Chipset not supported
  GNS = GPU not supported
  TNS = Topology not supported
  NS  = Not supported
  U   = Unknown

Memory:

# free -g
              total  used  free  shared  buff/cache  available
Mem:           1929    29  1891       0           9       1892
Swap:             0     0     0

Disk:

vda       252:0  0   100G  0 disk
├─vda1    252:1  0   200M  0 part /boot/efi
└─vda2    252:2  0  99.8G  0 part /
nvme3n1   259:0  0   3.5T  0 disk
nvme2n1   259:1  0   3.5T  0 disk
nvme0n1   259:2  0   3.5T  0 disk
nvme1n1   259:3  0   3.5T  0 disk

OS:

# uname -a
Linux H20 5.4.0-162-generic #179-Ubuntu SMP Mon Aug 14 08:51:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

Start inference

Use vLLM v0.8.2 to start the inference service and serve the following two models in turn (a sketch of a typical launch command follows the list):

  • DeepSeek-R1-AWQ: https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ
  • DeepSeek-V3-0324: https://modelscope.cn/models/deepseek-ai/DeepSeek-V3-0324
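
The exact vLLM launch flags were not recorded here. Purely as a reference, a minimal 8-card tensor-parallel launch might look like the following sketch; the port 7800 and served model name DeepSeek-R1 match the benchmark command below, everything else is an assumption:

vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
  --tensor-parallel-size 8 \
  --served-model-name DeepSeek-R1 \
  --port 7800 \
  --trust-remote-code

For DeepSeek-V3-0324, swap in the model path downloaded from ModelScope.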

H20 Performance Review

Start performance testing:

nohup python3 -u simple-bench-to-api.py --url http://localhost:7800/v1 \
  --model DeepSeek-R1 \
  --concurrencys 1,10,20,30,40,50 \
  --prompt "Introduce the history of China" \
  --max_tokens 100,1024,16384,32768,65536,131072 \
  --api_key sk-xxx \
  --duration_seconds 30 \
  > benth-DeepSeek-R1-AWQ-8-H20.log 2>&1 &

This command runs batch tests at concurrency levels 1, 10, 20, 30, 40 and 50 for each max_tokens value of 100, 1024, 16384, 32768, 65536 and 131072. Each max_tokens value produces one table across the concurrency levels. The stress-test script simple-bench-to-api.py and the detailed parameter meanings are in the previous article, "Concurrency Performance of a Small DeepSeek-R1 Model Deployed on a Single RTX 4090"; readers who need them can find them there. A simplified sketch of what such a stress test does follows.
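
As an illustration of the mechanics only (this is not the actual script), a stripped-down stress test in the same spirit might look like this. It fires N concurrent requests with the openai Python package, measures only total latency and overall throughput, and skips streaming and first-token timing:

import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Same endpoint and served model name as the benchmark command above.
client = AsyncOpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

async def one_channel(prompt: str, max_tokens: int) -> tuple[float, int]:
    # One concurrent channel: send a request, return (latency_s, completion tokens).
    t0 = time.perf_counter()
    resp = await client.chat.completions.create(
        model="DeepSeek-R1",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return time.perf_counter() - t0, resp.usage.completion_tokens

async def bench(concurrency: int, max_tokens: int) -> None:
    t0 = time.perf_counter()
    results = await asyncio.gather(*[
        one_channel("Introduce the history of China", max_tokens)
        for _ in range(concurrency)
    ])
    wall = time.perf_counter() - t0
    total_tokens = sum(tokens for _, tokens in results)
    avg_latency = sum(latency for latency, _ in results) / concurrency
    print(f"concurrency={concurrency} max_tokens={max_tokens} "
          f"avg_latency={avg_latency:.2f}s overall={total_tokens / wall:.2f} tokens/s")

asyncio.run(bench(10, 1024))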

Stress test results:

Performance test of deploying DeepSeek-R1-AWQ on 8 H20 cards

----- max_tokens=100 stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 4 | 40 | 80 | 120 | 160 | 200 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 7.8265s | 8.1742s | 8.3271s | 8.6902s | 8.7426s | 9.0815s |
| Maximum latency | 7.9687s | 8.2911s | 8.4582s | 9.0513s | 9.0191s | 9.4417s |
| Minimum latency | 7.7197s | 8.1062s | 8.1941s | 8.4626s | 8.4411s | 8.7822s |
| P90 latency | 7.9226s | 8.2208s | 8.4206s | 8.9813s | 8.9725s | 9.2873s |
| P95 latency | 7.9456s | 8.2801s | 8.4312s | 9.0094s | 8.9932s | 9.3191s |
| P99 latency | 7.9641s | 8.2879s | 8.4574s | 9.0323s | 9.0047s | 9.4240s |
| Average first-token latency | 7.8265s | 8.1742s | 8.3271s | 8.6902s | 8.7426s | 9.0815s |
| Total tokens generated | 400 | 4000 | 8000 | 12000 | 16000 | 20000 |
| Min per-channel throughput | 12.55 tokens/s | 12.06 tokens/s | 11.82 tokens/s | 11.05 tokens/s | 11.09 tokens/s | 10.59 tokens/s |
| Max per-channel throughput | 12.95 tokens/s | 12.34 tokens/s | 12.20 tokens/s | 11.82 tokens/s | 11.85 tokens/s | 11.39 tokens/s |
| Avg per-channel throughput | 12.78 tokens/s | 12.23 tokens/s | 12.01 tokens/s | 11.51 tokens/s | 11.44 tokens/s | 11.01 tokens/s |
| Overall throughput | 12.75 tokens/s | 121.90 tokens/s | 238.84 tokens/s | 343.09 tokens/s | 454.13 tokens/s | 545.88 tokens/s |

A few concepts need explaining:

  • "Latency": the time from sending a request to receiving the last token/character (it includes the first-token latency).
  • "P90 latency": the latency at the 90th percentile. Sort the latencies from smallest to largest; P90 lies between the largest latency in the bottom 90% and the next latency, computed by linear interpolation (see the sketch after this list).
  • "First-token latency" (first word delay): the time from sending a request to receiving the first returned character.
  • "Per-channel throughput": from the perspective of each concurrent user/channel, the speed of token generation after the first token is returned; the measured time excludes the first-token latency. That is, a channel's throughput = tokens generated by the channel / generation time excluding the first-token latency. In my view, this metric together with the average first-token latency best reflects the real user experience.

The specific metrics:

  • Average latency: the average latency across all channels (includes first-token latency)
  • Average first-token latency: the average of the first-token latencies across all channels
  • Min per-channel throughput: the throughput of the slowest concurrent channel (first-token latency excluded)
  • Max per-channel throughput: the throughput of the fastest concurrent channel (first-token latency excluded)
  • Avg per-channel throughput: the average throughput across all concurrent channels (first-token latency excluded)
  • Overall throughput: the total number of tokens generated by all channels during the stress test / the time from the start to the end of the stress test
  • P90 latency: 90% of request latencies are below this value
  • P95 latency: 95% of request latencies are below this value
  • P99 latency: 99% of request latencies are below this value


For details, please refer to the previous article, "Concurrency Performance of a Small DeepSeek-R1 Model Deployed on a Single RTX 4090".


----- max_tokens=1024 stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 11 | 20 | 32 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 80.4809s | 64.4957s | 69.2813s | 60.0941s | 64.3626s | 74.9057s |
| Maximum latency | 80.4809s | 81.5464s | 84.0396s | 83.1977s | 85.0927s | 91.6753s |
| Minimum latency | 80.4809s | 27.0671s | 34.2130s | 28.8989s | 33.0757s | 36.8664s |
| P90 latency | 80.4809s | 80.1078s | 83.9624s | 76.2109s | 82.3774s | 91.6048s |
| P95 latency | 80.4809s | 80.8271s | 83.9756s | 80.3737s | 83.5347s | 91.6487s |
| P99 latency | 80.4809s | 81.4025s | 84.0268s | 83.1274s | 85.0485s | 91.6665s |
| Average first-token latency | 80.4809s | 64.4957s | 69.2813s | 60.0941s | 64.3626s | 74.9057s |
| Total tokens generated | 1024 | 8700 | 16900 | 23560 | 30844 | 41068 |
| Min per-channel throughput | 12.72 tokens/s | 12.17 tokens/s | 12.18 tokens/s | 12.11 tokens/s | 11.91 tokens/s | 10.68 tokens/s |
| Max per-channel throughput | 12.72 tokens/s | 12.46 tokens/s | 12.22 tokens/s | 12.42 tokens/s | 12.05 tokens/s | 11.19 tokens/s |
| Avg per-channel throughput | 12.72 tokens/s | 12.25 tokens/s | 12.20 tokens/s | 12.24 tokens/s | 11.97 tokens/s | 10.93 tokens/s |
| Overall throughput | 12.72 tokens/s | 90.65 tokens/s | 200.95 tokens/s | 265.79 tokens/s | 362.07 tokens/s | 447.64 tokens/s |

----- max_tokens=16384 (16k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 53.7487s | 62.1833s | 59.5736s | 66.6164s | 63.7078s | 72.2051s |
| Maximum latency | 53.7487s | 85.7138s | 80.2841s | 87.5017s | 89.1299s | 94.0724s |
| Minimum latency | 53.7487s | 36.8215s | 37.6174s | 52.0516s | 35.3799s | 60.3701s |
| P90 latency | 53.7487s | 83.6419s | 75.6695s | 84.9264s | 81.5069s | 86.5969s |
| P95 latency | 53.7487s | 84.6779s | 79.7058s | 86.3211s | 83.7799s | 88.3755s |
| P99 latency | 53.7487s | 85.5066s | 80.1685s | 87.3039s | 87.1454s | 93.0178s |
| Average first-token latency | 53.7487s | 62.1833s | 59.5736s | 66.6164s | 63.7078s | 72.2051s |
| Total tokens generated | 692 | 7747 | 14729 | 24515 | 30655 | 38963 |
| Min per-channel throughput | 12.87 tokens/s | 12.42 tokens/s | 12.33 tokens/s | 12.23 tokens/s | 11.88 tokens/s | 10.59 tokens/s |
| Max per-channel throughput | 12.87 tokens/s | 12.50 tokens/s | 12.43 tokens/s | 12.34 tokens/s | 12.17 tokens/s | 11.17 tokens/s |
| Avg per-channel throughput | 12.87 tokens/s | 12.45 tokens/s | 12.36 tokens/s | 12.27 tokens/s | 12.01 tokens/s | 10.77 tokens/s |
| Overall throughput | 12.86 tokens/s | 90.32 tokens/s | 183.34 tokens/s | 279.89 tokens/s | 343.62 tokens/s | 413.93 tokens/s |

----- max_tokens=32768 (32k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 74.4107s | 73.1775s | 60.8819s | 68.1447s | 65.5262s | 71.1695s |
| Maximum latency | 74.4107s | 88.0205s | 87.1197s | 86.6508s | 91.1330s | 98.0503s |
| Minimum latency | 74.4107s | 52.6583s | 38.6691s | 52.4571s | 35.7134s | 34.2791s |
| P90 latency | 74.4107s | 84.6266s | 74.6224s | 83.2444s | 86.5026s | 88.7393s |
| P95 latency | 74.4107s | 86.3236s | 76.9170s | 84.9372s | 87.1154s | 89.7969s |
| P99 latency | 74.4107s | 87.6811s | 85.0792s | 86.3908s | 89.6305s | 94.0741s |
| Average first-token latency | 74.4107s | 73.1775s | 60.8819s | 68.1447s | 65.5262s | 71.1695s |
| Total tokens generated | 890 | 9204 | 15316 | 25457 | 31817 | 39101 |
| Min per-channel throughput | 11.96 tokens/s | 12.53 tokens/s | 12.52 tokens/s | 12.42 tokens/s | 11.93 tokens/s | 10.70 tokens/s |
| Max per-channel throughput | 11.96 tokens/s | 12.62 tokens/s | 12.68 tokens/s | 12.51 tokens/s | 12.28 tokens/s | 11.44 tokens/s |
| Avg per-channel throughput | 11.96 tokens/s | 12.57 tokens/s | 12.57 tokens/s | 12.45 tokens/s | 12.11 tokens/s | 10.95 tokens/s |
| Overall throughput | 11.95 tokens/s | 104.49 tokens/s | 175.70 tokens/s | 293.52 tokens/s | 348.63 tokens/s | 398.29 tokens/s |

----- max_tokens=65536 (64k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 41 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 44.1485s | 63.6202s | 62.0807s | 63.1362s | 64.5397s | 71.4495s |
| Maximum latency | 44.1485s | 83.4623s | 132.1258s | 86.3368s | 93.9798s | 96.6099s |
| Minimum latency | 44.1485s | 32.3361s | 37.1413s | 33.7265s | 24.4006s | 40.7544s |
| P90 latency | 44.1485s | 78.2377s | 73.5106s | 81.1197s | 82.5298s | 88.7146s |
| P95 latency | 44.1485s | 80.8500s | 77.1583s | 84.0214s | 83.8858s | 92.7252s |
| P99 latency | 44.1485s | 82.9398s | 121.1323s | 86.3070s | 92.4763s | 96.0186s |
| Average first-token latency | 44.1485s | 63.6202s | 62.0807s | 63.1362s | 64.5397s | 71.4495s |
| Total tokens generated | 587 | 8084 | 15619 | 23501 | 31612 | 38887 |
| Min per-channel throughput | 13.30 tokens/s | 12.62 tokens/s | 12.52 tokens/s | 12.36 tokens/s | 11.76 tokens/s | 10.63 tokens/s |
| Max per-channel throughput | 13.30 tokens/s | 12.76 tokens/s | 12.86 tokens/s | 12.49 tokens/s | 12.15 tokens/s | 11.31 tokens/s |
| Avg per-channel throughput | 13.30 tokens/s | 12.70 tokens/s | 12.56 tokens/s | 12.40 tokens/s | 11.93 tokens/s | 10.85 tokens/s |
| Overall throughput | 13.28 tokens/s | 96.78 tokens/s | 118.15 tokens/s | 272.05 tokens/s | 336.11 tokens/s | 401.98 tokens/s |

----- max_tokens=131072 (128k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 21 | 30 | 42 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 61.9497s | 68.6144s | 57.8482s | 66.2845s | 63.5500s | 70.3486s |
| Maximum latency | 61.9497s | 81.8154s | 80.4513s | 86.5205s | 98.3918s | 94.1867s |
| Minimum latency | 61.9497s | 50.9891s | 28.8903s | 35.9238s | 27.5084s | 31.2229s |
| P90 latency | 61.9497s | 79.8821s | 68.2121s | 81.7377s | 80.3188s | 87.7278s |
| P95 latency | 61.9497s | 80.8488s | 75.1345s | 82.2849s | 82.2353s | 90.8710s |
| P99 latency | 61.9497s | 81.6221s | 79.3879s | 85.2935s | 93.4738s | 93.3895s |
| Average first-token latency | 61.9497s | 68.6144s | 57.8482s | 66.2845s | 63.5500s | 70.3486s |
| Total tokens generated | 817 | 8420 | 14970 | 24307 | 31916 | 38895 |
| Min per-channel throughput | 13.19 tokens/s | 12.23 tokens/s | 12.22 tokens/s | 12.00 tokens/s | 11.81 tokens/s | 10.65 tokens/s |
| Max per-channel throughput | 13.19 tokens/s | 12.32 tokens/s | 12.39 tokens/s | 12.33 tokens/s | 12.26 tokens/s | 11.39 tokens/s |
| Avg per-channel throughput | 13.19 tokens/s | 12.27 tokens/s | 12.32 tokens/s | 12.21 tokens/s | 11.94 tokens/s | 11.01 tokens/s |
| Overall throughput | 13.18 tokens/s | 102.85 tokens/s | 185.89 tokens/s | 280.62 tokens/s | 297.08 tokens/s | 412.63 tokens/s |

Performance test of DeepSeek-V3-0324 deployed on 8 H20 cards

----- max_tokens=100 stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 3 | 30 | 60 | 90 | 120 | 150 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 13.9587s | 13.9900s | 14.0511s | 14.0769s | 14.1673s | 14.2916s |
| Maximum latency | 14.7636s | 14.1010s | 14.1825s | 14.2707s | 14.5726s | 14.5179s |
| Minimum latency | 13.4980s | 13.8632s | 13.8544s | 13.8677s | 13.9031s | 13.9850s |
| P90 latency | 14.5338s | 14.0850s | 14.1607s | 14.2467s | 14.4279s | 14.4478s |
| P95 latency | 14.6487s | 14.0952s | 14.1649s | 14.2566s | 14.5099s | 14.4803s |
| P99 latency | 14.7407s | 14.0994s | 14.1749s | 14.2640s | 14.5641s | 14.5124s |
| Average first-token latency | 13.9587s | 13.9900s | 14.0511s | 14.0769s | 14.1673s | 14.2916s |
| Total tokens generated | 300 | 3000 | 6000 | 9000 | 12000 | 15000 |
| Min per-channel throughput | 6.77 tokens/s | 7.09 tokens/s | 7.05 tokens/s | 7.01 tokens/s | 6.86 tokens/s | 6.89 tokens/s |
| Max per-channel throughput | 7.41 tokens/s | 7.21 tokens/s | 7.22 tokens/s | 7.21 tokens/s | 7.19 tokens/s | 7.15 tokens/s |
| Avg per-channel throughput | 7.18 tokens/s | 7.15 tokens/s | 7.12 tokens/s | 7.10 tokens/s | 7.06 tokens/s | 7.00 tokens/s |
| Overall throughput | 7.16 tokens/s | 71.40 tokens/s | 142.02 tokens/s | 212.27 tokens/s | 280.99 tokens/s | 347.65 tokens/s |

----- max_tokens=1024 stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 95.4234s | 96.8941s | 97.4570s | 105.0299s | 107.1363s | 99.7274s |
| Maximum latency | 95.4234s | 107.9135s | 125.9989s | 132.9541s | 136.2208s | 122.7872s |
| Minimum latency | 95.4234s | 83.9967s | 80.7756s | 86.1851s | 81.2474s | 82.7827s |
| P90 latency | 95.4234s | 106.9436s | 117.0284s | 124.7368s | 119.3310s | 111.3582s |
| P95 latency | 95.4234s | 107.4286s | 120.1523s | 128.7807s | 123.0959s | 115.2739s |
| P99 latency | 95.4234s | 107.8165s | 124.8296s | 132.1840s | 132.3656s | 120.8836s |
| Average first-token latency | 95.4234s | 96.8941s | 97.4570s | 105.0299s | 107.1363s | 99.7274s |
| Total tokens generated | 718 | 6968 | 14059 | 22408 | 30259 | 35405 |
| Min per-channel throughput | 7.52 tokens/s | 7.18 tokens/s | 7.20 tokens/s | 7.09 tokens/s | 7.03 tokens/s | 7.09 tokens/s |
| Max per-channel throughput | 7.52 tokens/s | 7.21 tokens/s | 7.23 tokens/s | 7.14 tokens/s | 7.11 tokens/s | 7.13 tokens/s |
| Avg per-channel throughput | 7.52 tokens/s | 7.19 tokens/s | 7.21 tokens/s | 7.11 tokens/s | 7.06 tokens/s | 7.10 tokens/s |
| Overall throughput | 7.52 tokens/s | 64.56 tokens/s | 111.55 tokens/s | 168.47 tokens/s | 222.03 tokens/s | 288.12 tokens/s |

----- max_tokens=16384 (16k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 94.8628s | 99.1652s | 98.3011s | 102.2118s | 99.5501s | 101.8411s |
| Maximum latency | 94.8628s | 117.8686s | 106.8626s | 114.9650s | 123.4567s | 126.0541s |
| Minimum latency | 94.8628s | 83.2503s | 85.4619s | 82.4278s | 83.1481s | 75.9468s |
| P90 latency | 94.8628s | 109.6080s | 105.4161s | 111.5839s | 110.3189s | 112.1986s |
| P95 latency | 94.8628s | 113.7383s | 105.6092s | 112.9895s | 111.6643s | 114.0535s |
| P99 latency | 94.8628s | 117.0425s | 106.6119s | 114.6945s | 122.8847s | 123.3202s |
| Average first-token latency | 94.8628s | 99.1652s | 98.3011s | 102.2118s | 99.5501s | 101.8411s |
| Total tokens generated | 703 | 7094 | 14089 | 22235 | 28772 | 36390 |
| Min per-channel throughput | 7.41 tokens/s | 7.14 tokens/s | 7.15 tokens/s | 7.24 tokens/s | 7.21 tokens/s | 7.13 tokens/s |
| Max per-channel throughput | 7.41 tokens/s | 7.19 tokens/s | 7.18 tokens/s | 7.27 tokens/s | 7.23 tokens/s | 7.18 tokens/s |
| Avg per-channel throughput | 7.41 tokens/s | 7.15 tokens/s | 7.17 tokens/s | 7.25 tokens/s | 7.23 tokens/s | 7.15 tokens/s |
| Overall throughput | 7.41 tokens/s | 60.17 tokens/s | 131.80 tokens/s | 193.31 tokens/s | 232.93 tokens/s | 288.61 tokens/s |

----- max_tokens=32768 (32k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 80.5510s | 93.5289s | 97.1551s | 100.7830s | 99.8265s | 99.5300s |
| Maximum latency | 80.5510s | 107.8886s | 133.6073s | 156.9135s | 116.2559s | 115.6441s |
| Minimum latency | 80.5510s | 79.7242s | 84.5335s | 82.8031s | 81.1707s | 81.6779s |
| P90 latency | 80.5510s | 105.1389s | 112.7804s | 111.3159s | 112.8292s | 109.1424s |
| P95 latency | 80.5510s | 106.5138s | 114.0461s | 115.8762s | 115.5501s | 110.0651s |
| P99 latency | 80.5510s | 107.6136s | 129.6950s | 145.9792s | 116.1772s | 113.7739s |
| Average first-token latency | 80.5510s | 93.5289s | 97.1551s | 100.7830s | 99.8265s | 99.5300s |
| Total tokens generated | 607 | 6822 | 14068 | 21898 | 28614 | 35499 |
| Min per-channel throughput | 7.54 tokens/s | 7.29 tokens/s | 7.23 tokens/s | 7.22 tokens/s | 7.14 tokens/s | 7.12 tokens/s |
| Max per-channel throughput | 7.54 tokens/s | 7.30 tokens/s | 7.29 tokens/s | 7.30 tokens/s | 7.20 tokens/s | 7.15 tokens/s |
| Avg per-channel throughput | 7.54 tokens/s | 7.29 tokens/s | 7.24 tokens/s | 7.24 tokens/s | 7.16 tokens/s | 7.13 tokens/s |
| Overall throughput | 7.53 tokens/s | 63.21 tokens/s | 105.25 tokens/s | 139.52 tokens/s | 246.08 tokens/s | 306.83 tokens/s |

----- max_tokens=65536 (64k) stress test results summary -----

| Metric \ Concurrency | 1 | 10 | 20 | 30 | 40 | 50 |
|---|---:|---:|---:|---:|---:|---:|
| Total requests | 1 | 10 | 20 | 30 | 40 | 50 |
| Success rate | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Average latency | 81.7039s | 90.8889s | 99.1065s | 99.7213s | 99.2848s | 99.0839s |
| Maximum latency | 81.7039s | 112.5239s | 113.0623s | 125.9377s | 130.2727s | 113.6320s |
| Minimum latency | 81.7039s | 78.5028s | 83.0163s | 81.5086s | 80.9710s | 85.9351s |
| P90 latency | 81.7039s | 99.3878s | 108.6772s | 113.1816s | 111.2980s | 110.5696s |
| P95 latency | 81.7039s | 105.9558s | 112.0033s | 118.0436s | 114.1228s | 112.7986s |
| P99 latency | 81.7039s | 111.2103s | 112.8505s | 124.2411s | 124.3386s | 113.4573s |
| Average first-token latency | 81.7039s | 90.8889s | 99.1065s | 99.7213s | 99.2848s | 99.0839s |
| Total tokens generated | 593 | 6538 | 14244 | 21620 | 28389 | 34942 |
| Min per-channel throughput | 7.26 tokens/s | 7.17 tokens/s | 7.18 tokens/s | 7.21 tokens/s | 7.13 tokens/s | 7.04 tokens/s |
| Max per-channel throughput | 7.26 tokens/s | 7.23 tokens/s | 7.19 tokens/s | 7.25 tokens/s | 7.20 tokens/s | 7.08 tokens/s |
| Avg per-channel throughput | 7.26 tokens/s | 7.19 tokens/s | 7.19 tokens/s | 7.23 tokens/s | 7.15 tokens/s | 7.05 tokens/s |
| Overall throughput | 7.26 tokens/s | 58.09 tokens/s | 125.95 tokens/s | 171.59 tokens/s | 217.80 tokens/s | 307.44 tokens/s |

Peak resource usage during stress testing:

NVIDIA-SMI 550.144.03   Driver Version: 550.144.03   CUDA Version: 12.4
8 x NVIDIA H20, 95070-95096 MiB / 97871 MiB used per card;
GPU-Util mostly 95-98% (one card sampled at 23%), 39-46C, 176-184 W / 500 W.

Peak KV cache usage:

INFO 03-31 23:22:50 [loggers.py:80] Avg prompt throughput: 45.0 tokens/s, Avg generation throughput: 166.9 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:20 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:30 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:40 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:50 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%

Math benchmark evaluation

We used lighteval (GitHub - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends) to run math test sets against DeepSeek-R1-AWQ and DeepSeek-V3-0324 deployed on the 8 H20 cards. We modified a small amount of lighteval code so that instead of running model inference itself, it calls the OpenAI-compatible API of the deployed model; a rough sketch of that kind of call is shown below. The test results follow.
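
As an illustration only (this is not the actual lighteval patch), the modified code essentially issues OpenAI-compatible chat requests against the local vLLM endpoint, along these lines; the served model name is an assumption:

from openai import OpenAI  # pip install openai

# vLLM exposes an OpenAI-compatible API; the key is a placeholder for a local server.
client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

resp = client.chat.completions.create(
    model="DeepSeek-V3-0324",  # served model name; an assumption here
    messages=[{"role": "user", "content": "Solve: if 2x + 3 = 11, what is x?"}],
    max_tokens=1024,
)
print(resp.choices[0].message.content)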

Benchmark run of DeepSeek-R1-AWQ deployed on 8 H20 cards

Math500 evaluation

Modified evaluation command:

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0"

Evaluation results:

| Task | Version | Metric | Value | | Stderr |
|--------------------|------:|----------------|----:|---|-----:|
| all |

Benchmark run of DeepSeek-V3-0324 deployed on 8 H20 cards

Math500 evaluation

Modified evaluation command:

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0" --max-samples 20

To save time, only 20 problems were run.

Evaluation results:

| Task | Version | Metric | Value | | Stderr |
|--------------------|------:|----------------|----:|---|-----:|
| all | | extractive_match | 0.95 | ± | 0.05 |
| lighteval:math_500:0 | 1 | extractive_match | 0.95 | ± | 0.05 |

Peak resource consumption during the test:

8 x NVIDIA H20, 97022-97048 MiB / 97871 MiB used per card;
GPU-Util mostly 91-97% (one card sampled at 21%), 36-42C, 159-164 W / 500 W.

AIME25 evaluation

Modified evaluation command:

(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|aime25|0|0" --max-samples 20

To save time, only 20 problems were run.

Evaluation results:


| Task | Version | Metric | Value | | Stderr |
|------------------|------:|----------------|----:|---|-----:|
| all | | extractive_match | 0.4 | ± | 0.1124 |
| lighteval:aime25:0 | 1 | extractive_match | 0.4 | ± | 0.1124 |

AIME25 is relatively new, but this score seems lower than the scores others have published. It may be a problem with the evaluation method, or the context may have been truncated during evaluation, affecting the results.