A Hands-On Guide to Deploying Large Models on a Cluster

A practical guide to deploying large models on a cluster, to sharpen your AI engineering skills.
Core content:
1. Hardware and software environment for deploying large-parameter models
2. Step-by-step walkthrough, from environment preparation to model download and installation
3. Starting the Ray cluster and checking its status
Deploying a large-parameter model (70B+, for example) requires either a single machine with multiple GPUs or a cluster. For cluster deployment, the solution most commonly used in the industry is the distributed computing framework Ray. In this article, we use Ray + vLLM to deploy the DeepSeek-R1-Distill-Llama-70B model.
1. Environment Introduction
1) Hardware
GPU | Cards per node | Nodes | CPU (cores per node) | Memory (per node) | OS version |
---|---|---|---|---|---|
NVIDIA H100 80GB | 2 | 2 | 16 | 256 GB | Ubuntu 22.04.5 LTS |
2) Software
Software | Version | Remarks |
---|---|---|
CUDA | 12.2 | |
MLNX_OFED | 24.10-0.7.0.0 | InfiniBand (IB) driver |
NCCL | 2.21.5 | GPU multi-card communication |
vLLM | 0.7.2 | LLM inference engine |
Ray | 2.42.0 | Distributed computing framework |
2. Environment preparation
1) Install the graphics card driver
sudo apt install nvidia-driver-535
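After the installation (a reboot may be needed for the new driver to load), it is worth confirming that both GPUs are visible. This is an optional sanity check, not part of the original steps:
# Should list two H100 80GB GPUs and a 535-series driver
nvidia-smi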
2) Install CUDA 12.2
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
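The runfile installs the toolkit under /usr/local/cuda-12.2 by default. If you want the CUDA tools on your PATH, the following sketch (assuming the default install prefix) sets them up and verifies the version:
# Assumes the default install prefix /usr/local/cuda-12.2
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
# Should report "release 12.2"
nvcc --version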
3. Model download (ModelScope community)
# Install ModelScope
sudo pip3 install modelscope
# Download the model
sudo mkdir -p /models/
sudo modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --local_dir /models/DeepSeek-R1-Distill-Llama-70B
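The 70B checkpoint is roughly 130-140 GB of sharded safetensors files, so confirm the download finished before moving on. An optional check:
# The directory should contain config.json, tokenizer files, and the sharded *.safetensors weights
ls -lh /models/DeepSeek-R1-Distill-Llama-70B
du -sh /models/DeepSeek-R1-Distill-Llama-70B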
4. Install vLLM, Ray, and dependencies
sudo pip3 install vllm 'ray[default]' openai transformers tqdm
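Both machines should run the same software versions, otherwise Ray and vLLM may fail to cooperate across nodes. A quick optional check on each node:
# Versions should match on both nodes (vllm 0.7.2, ray 2.42.0) and CUDA should be available
python3 -c "import vllm, ray, torch; print(vllm.__version__, ray.__version__, torch.cuda.is_available())"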
5. Start the Ray cluster
1) Use one of the machines as the head node and run:
nohup ray start --block --head --port 6379 &> /var/log/ray.log &
2) Use the other machine as a worker node and run:
nohup ray start --block --address='<head node IP>:6379' &> /var/log/ray.log &
3) Check the cluster status
ray status
Output similar to the following indicates that the cluster is healthy:
======== Autoscaler status: 2025-04-19 09:05:15.452837 ========
Node status
---------------------------------------------------------------
Active:
1 node_10a69f12ecacc9109f72036acbdc3e51731af85034af1f5169f20e60
1 node_9df61c9a75c08d76958a6aa4f0a7685e4cde76923f9e2309a4907f45
Pending:
(no pending nodes)
Recent failures:
(no failures)
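Before starting vLLM, also confirm that Ray sees all four GPUs: the Resources section of ray status should show a total of 4 GPUs across the two nodes. The state CLI below ships with ray[default]; treat it as an optional check:
# Both nodes should be listed with state ALIVE
ray list nodes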
6. Start the model using vLLM
Execute the following on the cluster head node:
# Execute on the cluster head node
nohup bash -c 'NCCL_NVLS_ENABLE=0 vllm serve /models/DeepSeek-R1-Distill-Llama-70B --enable-reasoning --reasoning-parser deepseek_r1 --trust-remote-code --tensor-parallel-size 4 --port=8080 --served-model-name DeepSeek-R1 --gpu-memory-utilization 0.95 --max-model-len 32768 --max-num-batched-tokens 32768 --quantization fp8' &> /var/log/vllm_deepseek_r1.log &
Parameter Description:
NCCL_NVLS_ENABLE=0: Disables the NVLS feature of the NVIDIA Collective Communications Library (NCCL). NVLS stands for NVLink SHARP (in-switch reductions over NVLink); disabling it can work around communication problems in some cluster environments.
--enable-reasoning: Enables reasoning-mode output in vLLM, so the model's step-by-step reasoning is returned separately from the final answer.
--reasoning-parser deepseek_r1: Specifies the reasoning parser for the DeepSeek-R1 family. This parser extracts the reasoning steps and intermediate results from the model output.
--trust-remote-code: Allows loading and executing custom code shipped with the model. This is important for models that use custom modules (such as DeepSeek), but you need to ensure the model comes from a trusted source.
--tensor-parallel-size 4: Sets the tensor parallelism to 4, which means the model will be split to run on 4 GPUs. This split allows very large models to run efficiently on multiple GPUs, with each GPU storing and computing only part of the model.
--gpu-memory-utilization 0.95: Sets the fraction of GPU memory that vLLM may use to 95%.
--max-model-len 32768: Sets the maximum sequence length (number of tokens) that the model can handle to 32,768. This is a very large context window, allowing processing of extremely long input texts.
--max-num-batched-tokens 32768: Sets the maximum number of tokens allowed in a single batch, which is also set to 32,768. This affects the service's ability to handle multiple concurrent requests, and larger values allow more concurrency but require more memory.
--quantization fp8: Uses FP8 (8-bit floating point) quantization to reduce the model's memory requirements. FP8 is an efficient format natively supported on recent NVIDIA GPUs such as the H100, and it can significantly reduce memory usage while maintaining good accuracy compared to FP16 or FP32.
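Loading the 70B model across both nodes can take several minutes. Before running the test in the next step, you can watch the startup log and confirm that the OpenAI-compatible endpoint is up (the exact log wording may differ between vLLM versions):
# Wait until the log shows the API server listening on port 8080
tail -f /var/log/vllm_deepseek_r1.log
# Once the server is up, the model list should include DeepSeek-R1
curl http://127.0.0.1:8080/v1/models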
7. Test that the model is serving requests correctly
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 1024
  }'
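Because the server was started with --enable-reasoning and the deepseek_r1 parser, the response message carries the model's reasoning separately from the final answer; in vLLM this is exposed as a reasoning_content field next to content, though the field name may change in later versions. If jq is installed, one way to inspect both is:
# Assumes jq is installed; field names follow vLLM's reasoning output format
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 1024}' \
  | jq '{reasoning: .choices[0].message.reasoning_content, answer: .choices[0].message.content}'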