Private vLLM Deployment of the Full DeepSeek-R1-671B Model

This guide covers an efficient method for deploying vLLM privately.
Core content:
1. Server environment and hardware configuration requirements
2. Miniconda installation and environment setting steps
3. Key operations of Docker deployment
Server Environment Preparation
Hardware configuration: 16 H800 GPUs across multiple machines (multi-node, multi-GPU), running bfloat16 + FP8 mixed-precision inference; output speed is roughly 3 seconds per token. The 16 GPUs correspond to the tensor-parallel-size 8 x pipeline-parallel-size 2 layout used in the final vllm serve command.
Install Miniconda
Download the Miniconda installation script
Open the terminal and use the wget command to download the latest version of the Miniconda installation script:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Note: If your system architecture is not x86_64, please visit the official Miniconda download page to get the installer suitable for your system.
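For example, on an ARM64 (aarch64) server the corresponding download would be:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh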
Add execution permissions to the installation script
After the download is complete, grant executable permissions to the script file:
chmod +x Miniconda3-latest-Linux-x86_64.sh
Run the installation script
Execute the script to start the installation:
./Miniconda3-latest-Linux-x86_64.sh
During the installation, you will see some prompts:
License agreement: press Enter to page through the text, then type yes to accept it.
Installation path: accept the default (usually ~/miniconda3) or specify another path.
Initialization: when asked whether to initialize conda (i.e., automatically modify your shell configuration file, such as ~/.bashrc), it is recommended to answer yes.
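Alternatively, the installer supports a non-interactive batch mode, which accepts the license automatically and installs to the given prefix:
./Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
Note that batch mode does not initialize your shell; in that case run ~/miniconda3/bin/conda init bash yourself afterwards.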
Activate conda environment
After the installation is complete, reload the shell configuration file to make the changes take effect (or restart the terminal):
source ~/.bashrc
Now you can check whether conda is installed successfully by running:
conda --version
If you see version information such as conda 4.x (or newer), the installation succeeded.
Create and activate a virtual environment
conda create -n deepseek python=3.10
conda activate deepseek
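To confirm the environment is active, you can check the interpreter version (it should report Python 3.10.x):
python --version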
Docker deployment (server version Ubuntu 20.04 system)
Update package index
sudo apt-get update
Install Necessary Packages
sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
Add Alibaba Cloud official GPG key
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
Setting up Alibaba Cloud Docker repository
sudo sh -c 'echo "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list'
Install Docker
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
Verify Installation
sudo systemctl status docker
If the output shows active (running) in green, the service is started. (Press CTRL+C to exit the status view.)
Check Docker version
docker --version
Set up Docker to start automatically
sudo systemctl enable docker
Verify that automatic start is enabled
(If the output is enabled, autostart is on; if not, further inspection is required.)
sudo systemctl is-enabled docker
Configure Docker image acceleration
Create (or open) the daemon.json file for editing:
sudo nano /etc/docker/daemon.json
The content to paste is as follows (paste it in, press CTRL+O then Enter to save, and CTRL+X to exit):
{
  "registry-mirrors": [
    "https://docker.211678.top",
    "https://docker.1panel.live",
    "https://hub.rat.dev",
    "https://docker.m.daocloud.io",
    "https://do.nark.eu.org",
    "https://dockerpull.com",
    "https://dockerproxy.cn"
  ]
}
Reload the systemd daemon configuration
sudo systemctl daemon-reload
Restart the Docker service
sudo systemctl restart docker
Check the Docker service status again
sudo systemctl status docker
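Optionally, confirm that the mirror configuration was picked up; the Registry Mirrors section of docker info should list the entries from daemon.json:
sudo docker info | grep -A 8 "Registry Mirrors"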
Running a simple Docker container test
docker run hello-world
Install the NVIDIA Container Toolkit (GPU support for Docker)
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-container-toolkit-base_1.13.5-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/nvidia-container-toolkit_1.13.5-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/libnvidia-container-tools_1.13.5-1_amd64.deb
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2204/x86_64/libnvidia-container1_1.13.5-1_amd64.deb
Install on the server:
sudo dpkg -i /home/data01/libnvidia-container1_1.13.5-1_amd64.deb
sudo dpkg -i /home/data01/libnvidia-container-tools_1.13.5-1_amd64.deb
sudo dpkg -i /home/data01/nvidia-container-toolkit-base_1.13.5-1_amd64.deb
sudo dpkg -i /home/data01/nvidia-container-toolkit_1.13.5-1_amd64.deb
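After the packages are installed, the NVIDIA runtime usually still has to be registered with Docker, followed by a restart; a quick smoke test then confirms containers can see the GPUs (the CUDA image tag here is only an example):
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi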
Pull the DeepSeek-R1-671B model
Pull DeepSeek-R1 from the ModelScope community.
Repository address: https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
Remove the rdma-userspace-config-bbc package
If you do not need RDMA services, you can try removing the problematic package:
sudo apt remove rdma-userspace-config-bbc
Then execute the following command to clean up the residual configuration in the system:
sudo apt autoremove
sudo apt clean
Install Git LFS (Large File Storage) before pulling the model files
Make sure Git is installed
First check if Git is installed:
git --version
If it is not installed, you can install it using the following command:
sudo apt update && sudo apt install git -y
Install Git LFS
If Git LFS is not installed, you can install it using the following command:
Ubuntu/Debian
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs -y
Initialize Git LFS
Once the installation is complete, run the following command to initialize Git LFS:
git lfs install
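You can confirm the installation with:
git lfs version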
Pull the DeepSeek-R1-671B project
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
This command will appear to hang while the model downloads; you can confirm progress by checking that free disk space is shrinking.
The model weights total about 1.3 TB, so make sure there is enough disk space:
df -h
Sample output on this server:
Filesystem Size Used Avail Use% Mounted on
tmpfs 1.5T 0 1.5T 0% /dev/shm
tmpfs 605G 13M 605G 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/sda2 91G 14G 77G 16% /
/dev/sda3 1.9G 5.3M 1.9G 1% /boot/efi
/dev/sda4 341G 40K 340G 1% /home
/dev/nvme0n1p1 3.4T 3.7G 3.4T 1% /home/data01
/dev/nvme2n1p1 3.4T 24K 3.4T 1% /home/data03
/dev/nvme1n1p1 3.4T 24K 3.4T 1% /home/data02
/dev/nvme3n1p1 3.4T 24K 3.4T 1% /home/data04
tmpfs 303G 4.0K 303G 1% /run/user/0
(With a ~200 Mbps connection, the download takes about 8-10 hours.)
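To monitor the download continuously instead of re-running df by hand, something like the following works (adjust the path to wherever you are cloning):
watch -n 60 df -h /home/data01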
vLLM official startup script
Distributed deployment reference: https://docs.vllm.ai/en/latest/serving/distributed_serving.html
Locate the run_cluster.sh inference script referenced in the docs:
1. Open the documentation link above and scroll down to the multi-node vLLM section.
2. In the first line of the second paragraph under that section, find the GitHub link.
3. Follow the link and download the script file (run_cluster.sh) to your local computer.
First, pull vLLM from a server or computer with unrestricted Internet access.
Make sure the image exists
First, check whether the vllm/vllm-openai image has already been pulled successfully. Run in the command line:
docker images
This will list all the images that exist locally. If vllm/vllm-openai is not in the list, you need to pull it first.
Pull the image
If the image is not found, you can pull the vllm/vllm-openai image using the following command:
This example pulls the image on a local Windows machine (open a command prompt and make sure local Docker, such as Docker Desktop, is running).
Pull the vLLM image and wait for the download to complete:
docker pull vllm/vllm-openai
Confirm that the pull was successful
Run docker images again and confirm that the vllm/vllm-openai image has appeared in the local image list.
Save the image with docker save
After the image is pulled, run:
docker save -o vllm.tar vllm/vllm-openai
If everything goes well, you should be able to successfully save the image as a .tar file.
At this point, vllm.tar and run_cluster.sh are both on the local computer and must be transferred to the servers manually.
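One simple way to copy both files over, assuming SSH access to the server (placeholders as elsewhere in this guide):
scp vllm.tar run_cluster.sh root@XXX.XXX.XX.XX:/home/data01/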
Server file operations
Each server now needs three items: the model directory, the vllm.tar image archive, and the run_cluster.sh script.
Upload all of the master node's files to the same location on the slave node (the two machines are connected over an intranet; other upload methods also work):
rsync -avz /home/model/DeepSeek-R1 root@XXX.XXX.XX.XX:/XXX/XX/
On each server, change to the directory containing vllm.tar and load the image:
docker load -i vllm.tar
After loading completes, check the IP addresses of the two machines:
ip a
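To narrow the output down to IPv4 addresses only:
ip -4 addr show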
On the master server, run:
bash run_cluster.sh vllm/vllm-openai 192.168.0.113 --head /home/data01/ -v /home/data01/:/models -e VLLM_HOST_IP=192.168.0.113 -e GLOO_SOCKET_IFNAME=eth0 -e NCCL_SOCKET_IFNAME=eth0
On the slave server, run:
bash run_cluster.sh vllm/vllm-openai 192.168.0.113 --worker /home/data01/ -v /home/data01/:/models -e VLLM_HOST_IP=192.168.0.116 -e GLOO_SOCKET_IFNAME=eth0 -e NCCL_SOCKET_IFNAME=eth0
Both run_cluster.sh commands keep running in the foreground, so leave those terminals open. Open a new terminal and list the running containers:
docker ps -a
Enter the container:
docker exec -it <container ID> /bin/bash
Inside the container, check that Ray sees the GPUs (with both nodes joined, it should report all 16 H800s; if fewer appear, recheck the VLLM_HOST_IP and NCCL/GLOO interface settings on each node):
ray status
Start the vLLM service:
vllm serve /models/DeepSeek-R1/ --tensor-parallel-size 8 --pipeline-parallel-size 2 --served-model-name deepseek_r1 --enforce-eager --trust-remote-code --dtype float16
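Once the server finishes loading, you can send a test request to the OpenAI-compatible API (assuming vLLM's default port 8000 and the served model name from the command above):
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek_r1", "messages": [{"role": "user", "content": "Hello"}]}'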
Done!