Deploying DeepSeek with vLLM in a production environment: half the computing power, ten times the throughput!

Written by Caleb Hayes
Updated on: July 10, 2025

vLLM is a new option for private deployment of DeepSeek-R1: half the computing power, ten times the throughput!

Core content:
1. Comparative analysis of private deployment of DeepSeek-R1
2. Performance and deployment differences between vLLM and Ollama
3. Deployment advantages of vLLM and performance test results

Yang Fangxian, founder of 53AI and Tencent Cloud Most Valuable Expert (TVP)

Requirement: I previously used Ollama to deploy the deepseek-r1:32b model. It is very convenient and fast, and well suited to quick personal deployments. But what should be used in an enterprise production environment? Typically vLLM or SGLang. This article uses vLLM to deploy the DeepSeek-R1 model. The differences between Ollama and vLLM are as follows:





| Comparison Dimension | Ollama | vLLM |
| --- | --- | --- |
| Core positioning | Lightweight local deployment tool, suited to individual developers and small-scale experiments | Production-grade inference framework focused on high-concurrency, low-latency enterprise scenarios |
| Hardware requirements | Supports CPU and GPU; low memory usage (quantized models used by default) | Requires NVIDIA GPUs; high VRAM usage |
| Model support | Built-in pre-trained model library (1,700+ models); automatically downloads quantized versions (mainly int4) | Original model files (e.g., HuggingFace format) must be downloaded manually; supports a wider range of models |
| Deployment difficulty | One-click installation, ready to use, no programming required | Requires configuring a Python environment and CUDA drivers; assumes technical experience |
| Performance characteristics | Fast single-request inference, but weak concurrency handling | High throughput, with dynamic batching and support for thousands of concurrent requests |
| Resource management | Flexible resource usage; releases VRAM automatically when idle | Fixed VRAM footprint; resources must be reserved for peak loads |


vLLM is a fast and easy-to-use library for LLM inference and serving. Built around a new algorithm (PagedAttention), vLLM redefines the state of the art for LLM serving: compared with HuggingFace Transformers, it delivers up to 24 times the throughput without any changes to the model architecture. Hence "half the computing power, ten times the throughput." The vLLM team compared its throughput against the most popular LLM library, HuggingFace Transformers (HF), and against HuggingFace Text Generation Inference (TGI), which previously held SOTA throughput. Two experimental settings were used: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results show that vLLM's throughput is up to 24 times that of HF and up to 3.5 times that of TGI.

vLLM documentation: https://docs.vllm.ai/en/latest/index.html
Source code: https://github.com/vllm-project/vllm
Performance test: https://blog.vllm.ai/2024/09/05/perf-update.html

Environment preparation

I purchased Tencent Cloud's high-performance application service with the following configuration:

Environment configuration: Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
Compute type: dual-GPU basic type - 2*16GB+ | 16+ TFLOPS SP | CPU - 16 cores | Memory - 64GB

Install Conda

Use conda to create a Python environment; paste the script directly:

  1. wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && chmod +x Miniconda3-latest-Linux-x86_64.sh
  2. ./Miniconda3-latest-Linux-x86_64.sh -b
  3. source /root/miniconda3/bin/activate
  4. conda init
  5. conda config --set auto_activate_base false


Use vLLM to deploy DeepSeek-R1

Use conda to create a Python environment. The commands are as follows:



  1. conda create -n vllm python=3.12 -y
  2. conda activate vllm


Install vllm and modelscope with the following command:

  1. pip install vllm modelscope
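As a quick sanity check that both packages are importable inside the `vllm` environment, the following minimal snippet can be run (a sketch, not part of the original walkthrough):

```python
# Run inside the `vllm` conda environment to confirm the installation succeeded.
import modelscope  # noqa: F401  (only checking that the package imports)
import vllm

print(vllm.__version__)
```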


Use modelscope to download the DeepSeek-R1 model (here, the distilled DeepSeek-R1-Distill-Qwen-1.5B version). The command is as follows:

  1. mkdir -p /data/models && modelscope download --model 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B' --local_dir '/data/models/DeepSeek-R1-Distill-Qwen-1.5B'
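If you prefer to script the download, the ModelScope Python SDK can do the same thing. This is a minimal sketch, assuming a recent modelscope release in which snapshot_download accepts the local_dir argument:

```python
# Download the distilled model with the ModelScope SDK instead of the CLI.
# Assumption: the installed modelscope version supports the local_dir parameter.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    local_dir="/data/models/DeepSeek-R1-Distill-Qwen-1.5B",
)
print(model_dir)  # path the model files were written to
```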


Reference: https://modelscope.cn/docs/models/download

Use vLLM to start the DeepSeek model. The command is as follows:



  1. vllm serve "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --served-model-name "DeepSeek-R1" --load-format "safetensors" --gpu-memory-utilization 0.8 --tensor-parallel-size 2 --dtype half --port 8000




If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", just add the parameter as the warning suggests (this is why --dtype half appears in the command above). Notes on the flags:



  • --tensor-parallel-size should match the number of GPUs (see the check after this list)
  • --gpu-memory-utilization controls the fraction of GPU memory to use
  • --served-model-name sets the model name exposed by the API
  • --disable-log-requests disables request logging
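Before choosing --tensor-parallel-size, you can confirm how many GPUs are actually visible. A minimal check (PyTorch is already installed as a vLLM dependency):

```python
# Print the number of visible CUDA devices; this should match --tensor-parallel-size.
import torch

print(torch.cuda.device_count())
```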



vLLM Linux GPU installation documentation: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html
Engine arguments: https://docs.vllm.ai/en/latest/serving/engine_args.html

You can check GPU status with nvidia-smi.

Test with Postman. Open http://ip:8000/ in a browser; the interface documentation is at http://ip:8000/docs. The Postman request body is shown below:

  1. {
  2.     "model": "DeepSeek-R1",
  3.     "messages": [
  4.         {
  5.             "role": "user",
  6.             "content": "Hi, my name is Xiao Zha Zha. Who are you?"
  7.         }
  8.     ]
  9. }
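The same request can also be sent from Python through vLLM's OpenAI-compatible API. A minimal sketch, assuming the openai client package is installed (pip install openai) and the server above is running on localhost:

```python
# Call the OpenAI-compatible endpoint exposed by `vllm serve`.
# The api_key value is a placeholder, since the server was started without --api-key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="DeepSeek-R1",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hi, my name is Xiao Zha Zha. Who are you?"}],
)
print(response.choices[0].message.content)
```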




Benchmark test

Download the test code; the commands are as follows:



  1. wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_utils.py
  2. wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/benchmark_throughput.py


The execution command is as follows:

  1. python benchmark_throughput.py --model "/data/models/DeepSeek-R1-Distill-Qwen-1.5B" --backend vllm --input-len 128 --output-len 512 --num-prompts 50 --seed 1100 --dtype half


Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s
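These figures are consistent with the benchmark settings: total tokens/s is roughly requests/s × (input length + output length), and output tokens/s is roughly requests/s × output length. A quick check using the reported numbers (approximate, since requests/s is rounded in the output):

```python
# Rough consistency check of the reported benchmark throughput.
requests_per_s = 2.45
input_len, output_len = 128, 512

print(requests_per_s * (input_len + output_len))  # ~1568 total tokens/s (reported: 1569.60)
print(requests_per_s * output_len)                # ~1254 output tokens/s (reported: 1255.68)
```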

(End)