Deploy DeepSeek with vLLM in a production environment: half the computing power, ten times the throughput!

vLLM is a new option for private deployment of DeepSeek-R1: half the computing power, ten times the throughput!
Core content:
1. Comparative analysis of private deployment of DeepSeek-R1
2. Performance and deployment differences between vLLM and Ollama
3. Deployment advantages of vLLM and performance test results
Background: I have previously used Ollama to deploy the deepseek-r1:32b model. It is quick and convenient, and well suited to rapid personal deployment, but what should be used in an enterprise production environment? vLLM and SGLang are generally used; this article uses vLLM to deploy the DeepSeek-R1 model. The differences between Ollama and vLLM are as follows:
vLLM is a fast and easy-to-use library for LLM inference and serving. Built around a new algorithm (PagedAttention), vLLM redefines the state of the art for LLM serving: compared with HuggingFace Transformers, it delivers up to 24x the throughput without requiring any changes to the model architecture.
Half the computing power, ten times the throughput: the vLLM team compared its throughput with the most popular LLM library, HuggingFace Transformers (HF), and with HuggingFace Text Generation Inference (TGI), which previously held the SOTA throughput. The experiments covered two settings: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results show that vLLM's throughput is up to 24x that of HF and up to 3.5x that of TGI.
vLLM documentation: https://docs.vllm.ai/en/latest/index.html
Source code: https://github.com/vllm-project/vllm
Performance test: https://blog.vllm.ai/2024/09/05/perf-update.html
You don't have to study the charts in detail; the takeaway is that it is fast.
Environment preparation
I purchased Tencent Cloud's high-performance application service with the following configuration:
Environment configuration: Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
Computing power type: Two-card GPU basic type - 2*16GB+ | 16+ TFLOPS SP | CPU - 16 cores | Memory - 64GB
Install Conda
Create a Python environment with conda; paste the script below directly:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b
source /root/miniconda3/bin/activate
conda init
conda config --set auto_activate_base false
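A quick check that Conda installed correctly (standard conda commands, shown here as an optional verification rather than part of the original script):
conda --version
conda env list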
Use vLLM to deploy DeepSeek-R1
Create a dedicated Python environment with conda. The commands are as follows:
conda create -n vllm python=3.12 -y
conda activate vllm
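The shell prompt should now show the (vllm) environment; an optional version check:
python -V  # should report Python 3.12.x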
Install vllm and modelscope with the following command:
pip install vllm modelscope
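To verify the installation (an optional check, assuming the vllm environment is still active):
pip show vllm modelscope
python -c "import vllm; print(vllm.__version__)"  # prints the installed vLLM version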
Use modelscope to download the DeepSeek-R1-Distill-Qwen-1.5B model. The command is as follows:
mkdir -p /data/models && modelscope download --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B' --local_dir '/data/models/DeepSeek-R1-Distill-Qwen-1.5B'
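To confirm the download completed, list the model directory; the safetensors weights, config.json, and tokenizer files should all be present (an optional check):
ls -lh /data/models/DeepSeek-R1-Distill-Qwen-1.5B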
Reference: https://modelscope.cn/docs/models/download
Use vllm to start the DeepSeek model. The command is as follows:
vllm serve "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --served-model-name "DeepSeek-R1" --load-format "safetensors" --gpu-memory-utilization 0.8 --tensor-parallel-size 2 --dtype half --port 8000
If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", simply add the flag the warning suggests, as the command above already does with --dtype half. Notes on the key flags (a quick sanity check of the running server follows this list):
--tensor-parallel-size should match the number of GPUs
--gpu-memory-utilization controls the fraction of GPU memory to use
--served-model-name is the model name exposed by the API
--disable-log-requests disables request logging
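Once the server reports that it is up, a quick sanity check from the same machine (a sketch assuming the default port 8000 used above):
curl http://localhost:8000/health      # returns HTTP 200 when the server is ready
curl http://localhost:8000/v1/models   # should list the served model name "DeepSeek-R1"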
vLLM Linux GPU installation documentation: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html
Engine arguments: https://docs.vllm.ai/en/latest/serving/engine_args.html
While the server is running, check the GPU status with nvidia-smi (screenshot omitted).
Test with Postman
Open http://ip:8000/ in a browser; the interactive API documentation is at http://ip:8000/docs.
Postman call: send a POST request to the OpenAI-compatible endpoint http://ip:8000/v1/chat/completions with the request body shown below (a curl equivalent follows the body):
{
  "model": "DeepSeek-R1",
  "messages": [
    {
      "role": "user",
      "content": "Hi, my name is Xiao Zha Zha. Who are you?"
    }
  ]
}
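The same request can also be sent from the command line with curl instead of Postman (a sketch assuming the server address http://ip:8000 used above):
curl http://ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [
          {"role": "user", "content": "Hi, my name is Xiao Zha Zha. Who are you?"}
        ]
      }'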
Benchmark test
Download the test code; the commands are as follows:
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_utils.py
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_throughput.py
The execution command is as follows:
python benchmark_throughput.py --model "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --backend vllm --input-len 128 --output-len 512 --num-prompts 50 --seed 1100 --dtype half
Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s