Performance and inference test of DeepSeek-V3-0324 running on 8 H20 cards

How do the performance and reasoning ability of DeepSeek-V3-0324 (685B) hold up on an 8-card H20 server?
Core content:
1. 8-card H20 server configuration and DeepSeek-V3-0324 deployment
2. Performance comparison of DeepSeek-V3-0324 (685B) and DeepSeek-R1-AWQ (671B)
3. DeepSeek-V3-0324 performance on math problems
Recently, I deployed DeepSeek-R1-AWQ (671B) and the latest DeepSeek-V3-0324 (685B) on an 8-card H20 machine and tested their performance and their scores on math problems. The server is provided by Volcano Engine. Let's look at the machine configuration first:
8-card H20 machine configuration
GPU:
NVIDIA-SMI 535.161.08    Driver Version: 535.161.08    CUDA Version: 12.2
8x NVIDIA H20, 97871 MiB memory each; all cards idle (0 MiB used, 0% utilization, 29-33C, 71-74W / 500W)
I hit a snag here: the original driver version was problematic. It worked fine on an RTX 4090, but running DeepSeek-R1-AWQ on the H20 kept crashing no matter which configurations and software versions I tried. After switching to the driver version NVIDIA officially recommends for the H20, Driver Version 550.144.03 (CUDA 12.4), it worked without any other configuration changes.
Inter-card interconnection:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
GPU2 OK OK X OK OK OK OK OK
GPU3 OK OK OK X OK OK OK OK
GPU4 OK OK OK OK X OK OK OK
GPU5 OK OK OK OK OK X OK OK
GPU6 OK OK OK OK OK OK X OK
GPU7 OK OK OK OK OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
Memory:
# free -g
              total        used        free      shared  buff/cache   available
Mem:           1929          29        1891           0           9        1892
Swap:             0           0           0
Disk:
vda       252:0    0   100G  0 disk
├─vda1    252:1    0   200M  0 part /boot/efi
└─vda2    252:2    0  99.8G  0 part /
nvme3n1   259:0    0   3.5T  0 disk
nvme2n1   259:1    0   3.5T  0 disk
nvme0n1   259:2    0   3.5T  0 disk
nvme1n1   259:3    0   3.5T  0 disk
OS
# uname -a
Linux H20 5.4.0-162-generic #179-Ubuntu SMP Mon Aug 14 08:51:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Start inference
Use vLLM v0.8.2 to start the inference service, serving each of the following two models in turn (a quick sanity check of the running service is sketched after the model list):
DeepSeek-R1-AWQ: https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ
DeepSeek-V3-0324: https://modelscope.cn/models/deepseek-ai/DeepSeek-V3-0324
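The exact vLLM launch command is not reproduced here. Once the service is up, one quick way to confirm it is reachable is to list the served models through the OpenAI-compatible API. This is only a minimal sketch: port 7800 and the placeholder API key come from the benchmark command further below, and everything else is an assumption.

# Minimal sanity check of the deployed vLLM OpenAI-compatible service.
# Port 7800 and the placeholder API key come from the benchmark command below; the rest is assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

# vLLM exposes the served model name(s) under /v1/models.
for model in client.models.list():
    print(model.id)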
H20 Performance Review
Start performance testing:
nohup python3 -u simple-bench-to-api.py --url http://localhost:7800/v1 \
    --model DeepSeek-R1 \
    --concurrencys 1,10,20,30,40,50 \
    --prompt "Introduce the history of China" \
    --max_tokens 100,1024,16384,32768,65536,131072 \
    --api_key sk-xxx \
    --duration_seconds 30 \
    > benth-DeepSeek-R1-AWQ-8-H20.log 2>&1 &
This command runs batch tests with max_tokens values of 100, 1024, 16384, 32768, 65536, and 131072 at concurrency levels of 1, 10, 20, 30, 40, and 50. Each max_tokens value produces one table covering the different concurrency levels. The stress-test script simple-bench-to-api.py and the detailed parameter descriptions are in the earlier article "Concurrency Performance of DeepSeek-R1 Small Model Deployed on a Single 4090"; readers who need the script can grab it there.
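The actual simple-bench-to-api.py is not reproduced here. The following is only a rough sketch, under my own assumptions, of what such a sweep does: fire N concurrent streaming requests at the OpenAI-compatible endpoint and record each request's first-token delay, total latency, and generated token count. The endpoint, model name, prompt, and API key match the command above; the function name and the chunk-counting token proxy are illustrative.

# Rough sketch of a concurrency sweep (NOT the actual simple-bench-to-api.py).
# Endpoint, model name, prompt and API key match the command above; the rest is assumed.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

def one_request(max_tokens):
    """Send one streaming request; return (first-token delay, total latency, tokens generated)."""
    start = time.time()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model="DeepSeek-R1",
        messages=[{"role": "user", "content": "Introduce the history of China"}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.time()
            n_tokens += 1  # counting stream chunks as a rough token proxy
    total = time.time() - start
    ttft = (first_token_at if first_token_at is not None else time.time()) - start
    return ttft, total, n_tokens

for concurrency in (1, 10, 20, 30, 40, 50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, [1024] * concurrency))
    print(f"concurrency={concurrency}, sample result={results[0]}")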
Stress test results:
Performance test of DeepSeek-R1-AWQ deployed on 8 H20 cards
----- max_tokens=100 Stress test results summary-----
A few concepts need explaining first:
"Latency": the time from sending a request to receiving the last token/character (it includes the first-token delay).
"P90 latency": the latency at the 90th percentile. Sort the latencies from smallest to largest, take the largest latency within the first 90% and the next latency value, and linearly interpolate between the two.
"First-token delay": the time from when a request is sent until the first character is received.
"Single-concurrency throughput": from the perspective of each concurrent user/channel, the token-generation speed after the first token comes back; the timing excludes the first-token delay. That is, a channel's throughput = tokens generated by the channel / generation time excluding the first-token delay. Personally, I think this indicator together with the average first-token delay best reflects the real user experience.
The meaning of specific indicators:
Average latency: the average latency across all channels (including the first-token delay).
Average first-token delay: the average of the first-token delays across all channels.
Single-concurrency minimum throughput: the throughput of the slowest channel among all concurrent channels (excluding the first-token delay).
Single-concurrency maximum throughput: the throughput of the fastest channel among all concurrent channels (excluding the first-token delay).
Single-concurrency average throughput: the average throughput across all concurrent channels (excluding the first-token delay).
Overall throughput: the total number of tokens generated by all channels during the stress test divided by the time from the start to the end of the stress test.
P90 latency: 90% of request latencies are below this value.
P95 latency: 95% of request latencies are below this value.
P99 latency: 99% of request latencies are below this value.
For details, please refer to the earlier article "Concurrency Performance of DeepSeek-R1 Small Model Deployed on a Single 4090".
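To make the definitions above concrete, here is a small illustrative computation of the summary indicators from per-request measurements (first-token delay, total latency, generated tokens, plus the wall-clock duration of the run). The function and variable names are my own, not taken from the actual script; the P90/P95/P99 values follow the linear-interpolation rule described above.

# Illustrative computation of the summary indicators above from per-request measurements.
# Names are my own; this is not code from the actual stress-test script.
def percentile(values, p):
    """Percentile with linear interpolation, e.g. p=90 for P90 latency."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def summarize(first_token_delays, total_latencies, token_counts, wall_clock_seconds):
    # Per-channel throughput excludes the first-token delay, as defined above.
    per_channel = [
        n / (t - f)
        for f, t, n in zip(first_token_delays, total_latencies, token_counts)
        if t > f
    ]
    return {
        "avg_latency": sum(total_latencies) / len(total_latencies),
        "avg_first_token_delay": sum(first_token_delays) / len(first_token_delays),
        "per_channel_min_throughput": min(per_channel),
        "per_channel_max_throughput": max(per_channel),
        "per_channel_avg_throughput": sum(per_channel) / len(per_channel),
        "overall_throughput": sum(token_counts) / wall_clock_seconds,
        "p90_latency": percentile(total_latencies, 90),
        "p95_latency": percentile(total_latencies, 95),
        "p99_latency": percentile(total_latencies, 99),
    }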
----- max_tokens=1024 Stress test results summary-----
----- max_tokens=16384 (16k) Stress test results summary-----
----- max_tokens=32768 (32k) Stress test results summary-----
----- max_tokens=65536 (64k) Stress test results summary-----
----- max_tokens=131072 (128k) Stress test results summary-----
Performance test of DeepSeek-V3-0324 deployed on 8 H20 cards
----- max_tokens=100 Stress test results summary-----
----- max_tokens=1024 Stress test results summary-----
----- max_tokens=16384 (16k) Stress test results summary-----
----- max_tokens=32768 (32k) Stress test results summary-----
----- max_tokens=65536 (64k) Stress test results summary-----
Peak resource usage during the stress test:
NVIDIA-SMI 550.144.03    Driver Version: 550.144.03    CUDA Version: 12.4
8x NVIDIA H20, about 95070-95096 MiB / 97871 MiB used per card, 176-184W / 500W, 39-46C, GPU utilization mostly 95-98%
Peak KV cache usage:
INFO 03-31 23:22:50 [loggers.py:80] Avg prompt throughput: 45.0 tokens/s, Avg generation throughput: 166.9 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:20 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:30 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:40 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:23:50 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 355.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.9%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:00 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 360.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
INFO 03-31 23:24:10 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 350.0 tokens/s, Running: 50 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.6%, Prefix cache hit rate: 0.0%
Mathematical data set benchmarking
We used lighteval (https://github.com/huggingface/lighteval) to run math test sets against DeepSeek-R1-AWQ and DeepSeek-V3-0324 deployed on the 8 H20 cards. We modified a small amount of lighteval code so that it does not run model inference itself but instead calls the OpenAI-compatible API of the already-deployed model, roughly as sketched below. The test results follow.
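The modified lighteval code is not shown here; in essence, instead of loading weights itself, it sends each evaluation prompt to the OpenAI-compatible endpoint of the already-deployed model. The sketch below illustrates such a call under my own assumptions: the endpoint, model name, and API key follow the deployment above, while the prompt and sampling parameters are only placeholders.

# Rough sketch of what the modified lighteval does: call the deployed model's
# OpenAI-compatible API instead of running inference locally.
# The prompt and parameters are placeholders only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7800/v1", api_key="sk-xxx")

resp = client.chat.completions.create(
    model="DeepSeek-R1",  # or DeepSeek-V3-0324, depending on which service is running
    messages=[{"role": "user", "content": "Solve: what is 7 * 8?"}],
    max_tokens=2048,
    temperature=0.0,
)
print(resp.choices[0].message.content)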
Benchmark of DeepSeek-R1-AWQ deployed on 8 H20 cards
Math500 Assessment
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0"
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|--------------------|------:|----------------|----:|---|-----:|
|all |
Benchmark of DeepSeek-V3-0324 deployed on 8 H20 cards
Math500 Assessment
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|math_500|0|0" --max-samples 20
To save time, only 20 questions were used.
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|--------------------|------:|----------------|----:|---|-----:|
|all | |extractive_match| 0.95|± | 0.05|
|lighteval:math_500:0| 1|extractive_match| 0.95|± | |
Peak resource consumption during the test:
NVIDIA-SMI 550.144.03    Driver Version: 550.144.03    CUDA Version: 12.4
8x NVIDIA H20, about 97022-97048 MiB / 97871 MiB used per card, 159-164W / 500W, 36-42C, GPU utilization mostly 91-97%
AIME25 evaluation
Modified evaluation command:
(benchmark) root@H20:/data/code/lighteval# lighteval endpoint litellm model_args="http://localhost:7800" tasks="lighteval|aime25|0|0" --max-samples 20
To save time, only 20 questions were used.
Evaluation results:
| Task |Version| Metric |Value| |Stderr|
|------------------|------:|----------------|----:|---|-----:|
|all | |extractive_match| 0.4|± |0.1124|
|lighteval:aime25:0| 1|extractive_match| 0.4|± |0.1124|
AIME25 is relatively new, but this score seems lower than the scores others have published. It may be an issue with the evaluation method, or the context may have been truncated during evaluation, which would affect the results.