Deploy DeepSeek with vLLM in a production environment: half the computing power, ten times the throughput!

vLLM is a new option for private deployment of DeepSeek-R1: half the computing power, ten times the throughput!
Core content:
1. Comparative analysis of private deployment of DeepSeek-R1
2. Performance and deployment differences between vLLM and Ollama
3. Deployment advantages of vLLM and performance test results
Background: I have previously used Ollama to deploy the deepseek-r1:32b model. It is quick and convenient, and well suited to rapid personal deployment, but what should be used in an enterprise production environment? vLLM and SGLang are generally used; this article uses vLLM to deploy the DeepSeek-R1 model. The differences between Ollama and vLLM are as follows:
vLLM is a fast and easy-to-use library for LLM inference and serving. Built around a new algorithm (PagedAttention), vLLM redefines the state of the art for LLM serving: compared with HuggingFace Transformers, it delivers up to 24x the throughput without requiring any changes to the model architecture.
Half the computing power, ten times the throughput: the vLLM team compared its throughput with the most popular LLM library, HuggingFace Transformers (HF), and with HuggingFace Text Generation Inference (TGI), which previously held the SOTA throughput. The experiments covered two settings: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results show that vLLM's throughput is up to 24x that of HF and up to 3.5x that of TGI.
vLLM documentation: https://docs.vllm.ai/en/latest/index.html
Source code: https://github.com/vllm-project/vllm
Performance test: https://blog.vllm.ai/2024/09/05/perf-update.html
You don't have to study the charts in detail; the takeaway is that it is fast.
Environment preparation
I purchased Tencent Cloud's high-performance application service with the following configuration:
Environment configuration: Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
Computing power type: Two-card GPU basic type - 2*16GB+ | 16+ TFLOPS SP | CPU - 16 cores | Memory - 64GB
Install Conda
Create a Python environment with conda; paste the script below directly:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b
source /root/miniconda3/bin/activate
conda init
conda config --set auto_activate_base false
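A quick check that Conda installed correctly (standard conda commands, shown here as an optional verification rather than part of the original script):
conda --version
conda env list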
Use vLLM to deploy DeepSeek-R1
Create a dedicated Python environment with conda. The commands are as follows:
conda create -n vllm python=3.12 -y
conda activate vllm
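The shell prompt should now show the (vllm) environment; an optional version check:
python -V  # should report Python 3.12.x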
Install vllm and modelscope with the following command:
pip install vllm modelscope
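To verify the installation (an optional check, assuming the vllm environment is still active):
pip show vllm modelscope
python -c "import vllm; print(vllm.__version__)"  # prints the installed vLLM version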
Use modelscope to download the DeepSeek-R1-Distill-Qwen-1.5B model. The command is as follows:
mkdir -p /data/models && modelscope download --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B' --local_dir '/data/models/DeepSeek-R1-Distill-Qwen-1.5B'
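To confirm the download completed, list the model directory; the safetensors weights, config.json, and tokenizer files should all be present (an optional check):
ls -lh /data/models/DeepSeek-R1-Distill-Qwen-1.5B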
Reference: https://modelscope.cn/docs/models/download
Use vllm to start the DeepSeek model. The command is as follows:
vllm serve "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --served-model-name "DeepSeek-R1" --load-format "safetensors" --gpu-memory-utilization 0.8 --tensor-parallel-size 2 --dtype half --port 8000
If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", simply add the flag the warning suggests, as the command above already does with --dtype half. Notes on the key flags (a quick sanity check of the running server follows this list):
--tensor-parallel-size should match the number of GPUs
--gpu-memory-utilization controls the fraction of GPU memory to use
--served-model-name is the model name exposed by the API
--disable-log-requests disables request logging
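Once the server reports that it is up, a quick sanity check from the same machine (a sketch assuming the default port 8000 used above):
curl http://localhost:8000/health      # returns HTTP 200 when the server is ready
curl http://localhost:8000/v1/models   # should list the served model name "DeepSeek-R1"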
vLLM Linux GPU installation documentation: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html
Engine arguments: https://docs.vllm.ai/en/latest/serving/engine_args.html
While the server is running, check the GPU status with nvidia-smi (screenshot omitted).
Test with Postman
Open http://ip:8000/ in a browser; the interactive API documentation is at http://ip:8000/docs.
Postman call: send a POST request to the OpenAI-compatible endpoint http://ip:8000/v1/chat/completions with the request body shown below (a curl equivalent follows the body):
{
  "model": "DeepSeek-R1",
  "messages": [
    {
      "role": "user",
      "content": "Hi, my name is Xiao Zha Zha. Who are you?"
    }
  ]
}
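The same request can also be sent from the command line with curl instead of Postman (a sketch assuming the server address http://ip:8000 used above):
curl http://ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [
          {"role": "user", "content": "Hi, my name is Xiao Zha Zha. Who are you?"}
        ]
      }'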
Benchmark test
Download the test code; the commands are as follows:
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_utils.py
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_throughput.py
The execution command is as follows:
python benchmark_throughput.py --model "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --backend vllm --input-len 128 --output-len 512 --num-prompts 50 --seed 1100 --dtype half
Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s