A Hands-On Guide to Deploying Large Models on a Cluster

A practical guide to deploying large models on a cluster, to sharpen your AI engineering skills.
Core content:
1. Hardware and software environment for deploying large-parameter models
2. Step-by-step walkthrough, from environment preparation to model download and installation
3. Starting the Ray cluster and checking its status
Deploying a large-parameter model (70B+, for example) requires either a single machine with multiple GPUs or a cluster. For cluster deployment, the solution most commonly used in the industry is the distributed computing framework Ray. In this article, we use Ray + vLLM to deploy the DeepSeek-R1-Distill-Llama-70B model.
1. Environment Introduction
1) Hardware
GPU | Cards per node | Nodes | CPU (cores per node) | Memory (per node) | OS version |
---|---|---|---|---|---|
NVIDIA H100 80GB | 2 | 2 | 16 | 256 GB | Ubuntu 22.04.5 LTS |
2) Software
Software | Version | Remarks |
---|---|---|
CUDA | 12.2 | |
MLNX_OFED | 24.10-0.7.0.0 | InfiniBand (IB) driver |
NCCL | 2.21.5 | GPU multi-card communication |
vLLM | 0.7.2 | LLM inference engine |
Ray | 2.42.0 | Distributed computing framework |
2. Environment preparation
1) Install the graphics card driver
sudo apt install nvidia-driver-535
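After the installation (a reboot may be needed for the new driver to load), it is worth confirming that both GPUs are visible. This is an optional sanity check, not part of the original steps:
# Should list two H100 80GB GPUs and a 535-series driver
nvidia-smi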
2) Install CUDA 12.2
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
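The runfile installs the toolkit under /usr/local/cuda-12.2 by default. If you want the CUDA tools on your PATH, the following sketch (assuming the default install prefix) sets them up and verifies the version:
# Assumes the default install prefix /usr/local/cuda-12.2
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
# Should report "release 12.2"
nvcc --version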
3. Model download (ModelScope community)
# Install ModelScope
sudo pip3 install modelscope
# Download the model
sudo mkdir -p /models/
sudo modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --local_dir /models/DeepSeek-R1-Distill-Llama-70B
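The 70B checkpoint is roughly 130-140 GB of sharded safetensors files, so confirm the download finished before moving on. An optional check:
# The directory should contain config.json, tokenizer files, and the sharded *.safetensors weights
ls -lh /models/DeepSeek-R1-Distill-Llama-70B
du -sh /models/DeepSeek-R1-Distill-Llama-70B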
4. Install vLLM, Ray, and dependencies
sudo pip3 install vllm 'ray[default]' openai transformers tqdm
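Both machines should run the same software versions, otherwise Ray and vLLM may fail to cooperate across nodes. A quick optional check on each node:
# Versions should match on both nodes (vllm 0.7.2, ray 2.42.0) and CUDA should be available
python3 -c "import vllm, ray, torch; print(vllm.__version__, ray.__version__, torch.cuda.is_available())"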
5. Start the Ray cluster
1) Use one of the machines as the head node and run:
nohup ray start --block --head --port 6379 &> /var/log/ray.log &
2) Use the other machine as a worker node and run:
nohup ray start --block --address='<head node IP>:6379' &> /var/log/ray.log &
3) Check the cluster status
ray status
Output similar to the following indicates that the cluster is healthy:
======== Autoscaler status: 2025-04-19 09:05:15.452837 ========
Node status
---------------------------------------------------------------
Active:
1 node_10a69f12ecacc9109f72036acbdc3e51731af85034af1f5169f20e60
1 node_9df61c9a75c08d76958a6aa4f0a7685e4cde76923f9e2309a4907f45
Pending:
(no pending nodes)
Recent failures:
(no failures)
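Before starting vLLM, also confirm that Ray sees all four GPUs: the Resources section of ray status should show a total of 4 GPUs across the two nodes. The state CLI below ships with ray[default]; treat it as an optional check:
# Both nodes should be listed with state ALIVE
ray list nodes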
6. Start the model using vLLM
Execute the following on the cluster head node:
# Execute on the cluster head node
nohup bash -c 'NCCL_NVLS_ENABLE=0 vllm serve /models/DeepSeek-R1-Distill-Llama-70B --enable-reasoning --reasoning-parser deepseek_r1 --trust-remote-code --tensor-parallel-size 4 --port=8080 --served-model-name DeepSeek-R1 --gpu-memory-utilization 0.95 --max-model-len 32768 --max-num-batched-tokens 32768 --quantization fp8' &> /var/log/vllm_deepseek_r1.log &
Parameter Description:
NCCL_NVLS_ENABLE=0: Disables the NVLS feature of the NVIDIA Collective Communications Library (NCCL). NVLS stands for NVLink SHARP (in-switch reductions over NVLink); disabling it can work around communication problems in some cluster environments.
--enable-reasoning: Enables reasoning-mode output in vLLM, so the model's step-by-step reasoning is returned separately from the final answer.
--reasoning-parser deepseek_r1: Specifies the reasoning parser for the DeepSeek-R1 family. This parser extracts the reasoning steps and intermediate results from the model output.
--trust-remote-code: Allows loading and executing custom code shipped with the model. This is important for models that use custom modules (such as DeepSeek), but you need to ensure the model comes from a trusted source.
--tensor-parallel-size 4: Sets the tensor parallelism to 4, which means the model will be split to run on 4 GPUs. This split allows very large models to run efficiently on multiple GPUs, with each GPU storing and computing only part of the model.
--gpu-memory-utilization 0.95: Sets the fraction of GPU memory that vLLM may use to 95%.
--max-model-len 32768: Sets the maximum sequence length (number of tokens) that the model can handle to 32,768. This is a very large context window, allowing processing of extremely long input texts.
--max-num-batched-tokens 32768: Sets the maximum number of tokens allowed in a single batch, which is also set to 32,768. This affects the service's ability to handle multiple concurrent requests, and larger values allow more concurrency but require more memory.
--quantization fp8: Uses FP8 (8-bit floating point) quantization to reduce the model's memory requirements. FP8 is an efficient format natively supported on recent NVIDIA GPUs such as the H100, and it can significantly reduce memory usage while maintaining good accuracy compared to FP16 or FP32.
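Loading the 70B model across both nodes can take several minutes. Before running the test in the next step, you can watch the startup log and confirm that the OpenAI-compatible endpoint is up (the exact log wording may differ between vLLM versions):
# Wait until the log shows the API server listening on port 8080
tail -f /var/log/vllm_deepseek_r1.log
# Once the server is up, the model list should include DeepSeek-R1
curl http://127.0.0.1:8080/v1/models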
7. Test that the model is serving requests correctly
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 1024
  }'
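Because the server was started with --enable-reasoning and the deepseek_r1 parser, the response message carries the model's reasoning separately from the final answer; in vLLM this is exposed as a reasoning_content field next to content, though the field name may change in later versions. If jq is installed, one way to inspect both is:
# Assumes jq is installed; field names follow vLLM's reasoning output format
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 1024}' \
  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'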