Volcano Engine: Deploying the DeepSeek-R1 W4A8 solution on a single machine, cutting deployment costs in half

Volcano Engine has teamed up with NVIDIA to launch an efficient deployment solution for DeepSeek-R1, achieving a double breakthrough in cost and efficiency.
Core content:
1. Overview of DeepSeek-R1 model parameters and deployment challenges
2. Analysis of W4A8 quantization technology in the TensorRT-LLM inference framework
3. ECS deployment guide: three steps to quickly bring up the DeepSeek-R1 model inference service
DeepSeek-R1 created a sensation in the AI market as soon as it launched, thanks to its excellent deep reasoning capabilities.
However, DeepSeek-R1 has 671B parameters. At FP8 precision, the weights alone consume close to 700GB of memory, so even with 96GB GPUs a full deployment requires 16 cards spread across two machines (2 ecs.hpcpni3ln.45xlarge instances), and multi-node inference also faces significant data-transfer efficiency challenges. How to improve deployment efficiency and reduce inference cost while preserving model quality has become a key obstacle to the large-scale deployment and application of the DeepSeek-R1 model.
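As a rough sanity check on these numbers, the sketch below estimates weight memory at FP8 (about 1 byte per parameter) versus 4-bit weights (about 0.5 byte per parameter), ignoring quantization scales and other overhead; the 671B parameter count and the 96GB per-GPU figure come from the text above, everything else is simple arithmetic.
# Back-of-the-envelope weight-memory estimate (weights only; KV cache and
# activations need additional headroom on top of this).
PARAMS = 671e9        # DeepSeek-R1 parameter count
GPU_MEM_GB = 96       # memory per GPU on ecs.hpcpni3ln.45xlarge

for name, bytes_per_param in [("FP8 weights", 1.0), ("4-bit weights", 0.5)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{weight_gb:.0f} GB, i.e. at least "
          f"{weight_gb / GPU_MEM_GB:.1f} GPUs just to hold the weights")
At FP8, the weights alone nearly fill an 8-GPU node, which is why the unquantized deployment ends up on 16 GPUs once KV cache and activations are accounted for, while 4-bit weights fit comfortably within a single 8-GPU instance.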
Volcano Engine and NVIDIA joint solution: NVIDIA TensorRT-LLM quantization reduces inference cost
To address the above challenges, Volcano Engine has worked closely with NVIDIA to launch the DeepSeek-R1 acceleration optimization solution based on the TensorRT-LLM inference framework. It uses W4A8 quantization technology to achieve performance breakthroughs, significantly reducing model storage requirements and computational complexity while retaining model accuracy to the maximum extent possible:
4-bit weight quantization (W4): a mixed-precision dynamic quantization algorithm compresses the weight storage space to about half of its original size while preserving model accuracy;
8-bit activation quantization (A8): activations are kept at FP8 precision to ensure the accuracy of model inference.
Based on the TensorRT-LLM W4A8 optimization, DeepSeek-R1 can be deployed on a single Volcano Engine ecs.hpcpni3ln.45xlarge instance (8 GPUs with 96GB of memory each). Compared with the non-quantized solution, and without degrading model quality (see the MMLU and MATH-500 benchmark comparison below), the quantized solution doubles token throughput while halving the required hardware resources and cost.
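To make the idea concrete, below is a minimal, self-contained sketch of per-group symmetric 4-bit weight quantization, the core of the W4 half of the scheme. It is an illustration only, not the algorithm or kernels TensorRT-LLM actually uses; the group size and tensor shape are arbitrary choices for the example, and the FP8 activation side is only noted in a comment.
import torch

def quantize_weights_int4(w: torch.Tensor, group_size: int = 128):
    # Symmetric per-group 4-bit quantization: each group of `group_size` weights
    # shares one floating-point scale; values are stored as integers in [-8, 7].
    w_groups = w.reshape(-1, group_size)
    scales = w_groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w_groups / scales), -8, 7)
    return q, scales

def dequantize_int4(q: torch.Tensor, scales: torch.Tensor, shape):
    # Reconstruct an approximate floating-point weight tensor from codes and scales.
    return (q * scales).reshape(shape)

# Toy example: quantize a random weight matrix and measure reconstruction error.
# (Activations would additionally be cast to FP8 with calibrated scales; omitted here.)
w = torch.randn(4096, 4096)
q, s = quantize_weights_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
print("mean abs weight error:", (w - w_hat).abs().mean().item())
Storing 4-bit codes plus one scale per group is what cuts the weight footprint roughly in half relative to FP8, while keeping activations at FP8 preserves accuracy in the matrix multiplications.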
ECS deployment guide: three steps to achieve efficient implementation
Step 1: Environment preparation
Before deploying the DeepSeek-R1 W4A8 inference service, you need to create the ECS instance and prepare its environment. First, go to the Volcano Engine ECS console and create a cloud server of the ecs.hpcpni3ln.45xlarge specification (an invitation-only instance type that requires whitelisting; please contact your account manager to apply).
To improve deployment efficiency, we recommend that you select a system image with pre-installed GPU drivers (for example, Ubuntu 22.04 with GPU Driver 535.161.08 and doca).
Log in to the target instance and install Docker and the related container runtime. If you already have a Docker environment, you can skip the Docker and NVIDIA Container Toolkit installation steps below.
Confirm the Docker environment and runtime with the following command:
root@iv-ydt4e4fh1cbw80bh8228:~# docker info | grep -i runtime
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
Installation Steps
sudo apt update
sudo apt install ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://mirrors.ivolces.com/docker/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.ivolces.com/docker/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
curl -s https://mirrors.ivolces.com/nvidia_all/ubuntu2204/x86_64/3bf863cc.pub | sudo apt-key add -
cat <<EOF >/etc/apt/sources.list.d/nvidia.list
deb http://mirrors.ivolces.com/nvidia_all/ubuntu2204/x86_64/ /
EOF
apt update
apt install nvidia-container-toolkit -y
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker pull ai-containers-cn-beijing.cr.volces.com/deeplearning/tensorrt-llm:quant_v3
IMAGE=ai-containers-cn-beijing.cr.volces.com/deeplearning/tensorrt-llm:quant_v3
docker run --gpus all -itd --net=host --shm-size=100g --ulimit memlock=-1 --ulimit stack=67108864 --privileged --ipc=host --security-opt seccomp=unconfined --cap-add=ALL -v /var/run/nvidia-topologyd/:/var/run/nvidia-topologyd/ ${IMAGE} /bin/bash
Step 2: Model Quantization Preparation
oniond download model DeepSeek-R1-W4AFP8
❝oniond is Volcano Engine's self-developed model downloader. You can run oniond list models to view the models currently available for download;
❞
a. Execute the following commands to generate the activation quantization scales using the NVIDIA TensorRT Model Optimizer tool:
oniond download model DeepSeek-R1
PATH_OF_DEEPSEEK_R1=/llm-models/DeepSeek-R1/DeepSeek-R1
#Install ModelOpt
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer/ && cd TensorRT-Model-Optimizer
pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com
#Convert the model checkpoint precision
python inference/convert.py --hf-ckpt-path $PATH_OF_DEEPSEEK_R1 --save-path ds_r1 --n-experts 256 --model-parallel 8 && cd ..
#Prepare the activation calibration file
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path DeepSeek-V3/ds_r1 --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg FP8_DEFAULT_CFG --output_path ds_r1_fp8_per_tensor_calibration
b. Execute the following command to complete the weight quantization and integration of the DeepSeek-R1 model;
#!/bin/bash
# /app/tensorrt_llm/model.sh
HF_MODEL_DIR=/path/to/DeepSeek-R1/
OUTPUT_DIR=/workspace/ckpt/
# act_scales.safetensors is obtained from the DeepSeek-R1-W4AFP8 model downloaded above;
# you can also use the file generated in step a: ACT_SCALES=ds_r1_fp8_per_tensor_calibration
ACT_SCALES=/path/to/DeepSeek-R1-W4AFP8/act_scales.safetensors

if [ ! -d "convert_logs" ]; then
    mkdir convert_logs
fi

# Ranks 0-7 convert their parts in the background; rank 8 runs in the foreground.
pids=()
for i in 0 1 2 3 4 5 6 7
do
    python examples/quantization/quantize_mixed_precision_moe.py --model_dir $HF_MODEL_DIR --output_dir $OUTPUT_DIR --act_scales $ACT_SCALES --parts 9 --rank $i > convert_logs/log_$i 2>&1 &
    pids+=($!)
done
python examples/quantization/quantize_mixed_precision_moe.py --model_dir $HF_MODEL_DIR --output_dir $OUTPUT_DIR --act_scales $ACT_SCALES --parts 9 --rank 8 > convert_logs/log_8 2>&1

for pid in "${pids[@]}"; do
    wait $pid
done
echo "All processes completed!"
Step 3: Performance evaluation and model running
Beyond quantizing and deploying DeepSeek-R1, we also care about model quality after W4A8 quantization. On the MMLU and MATH-500 benchmarks, the accuracy of the W4A8 quantized version of DeepSeek-R1 is within 0.5% of the FP8 version.
After the model conversion is complete, you can quickly verify that the model is usable with the following command:
python examples/pytorch/quickstart_advanced.py --model_dir $CKPT_PATH --tp_size 8 --moe_ep_size 8 --moe_tp_size 1
This article uses the MMLU test set as an example of evaluating model accuracy. Execute the following commands to download the MMLU dataset and run the accuracy evaluation:
#Download the dataset
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
#Accuracy evaluation
trtllm-eval --model $CKPT_PATH --backend pytorch --tp_size 8 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.75 mmlu --dataset_path $PATH_TO_MMLU_DATA
#MMLU weighted average accuracy: 86.88 (14042 samples)
You can also use the following commands to evaluate model throughput and related metrics such as time to first token (TTFT) and time per output token (TPOT):
#Prepare the dataset
DS_R1_MODEL_PATH=/path/to/DeepSeek-R1
python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
    --stdout --tokenizer ${DS_R1_MODEL_PATH} \
    token-norm-dist \
    --input-mean 1000 --output-mean 1000 \
    --input-stdev 0 --output-stdev 0 \
    --num-requests 2000 > /tmp/synthetic_1000_1000.txt
cat << EOF > /workspace/extra-llm-api-config.yml
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  print_iter_log: true
EOF
trtllm-bench --model $CKPT_PATH --model_path $CKPT_PATH throughput --backend pytorch --max_batch_size 128 --max_num_tokens 1127 --dataset /tmp/synthetic_1000_1000.txt --tp 8 --ep 8 --kv_cache_free_gpu_mem_fraction 0.8 --extra_llm_api_options /workspace/extra-llm-api-config.yml --concurrency 128 --num_requests 640 --streaming --report_json /workspace/logs/mtp/bench_log_2node_bs128_ep8_tp16_conc128_aa_latency.json
Execute the following command to start the online model service:
trtllm-serve $CKPT_PATH --host localhost --port 8000 --backend pytorch --max_batch_size 128 --max_num_tokens 3627 --tp_size 8 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.8 --extra_llm_api_options /workspace/extra-llm-api-config.yml
At this point, the quantization and deployment of the DeepSeek-R1 model are complete. You can call the inference service API locally with curl, or from Python as sketched after the curl example.
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/path/to/model", "messages": [{"role": "user", "content": "Please prove the Riemann hypothesis"}], "max_tokens": 100, "temperature": 0.7}'
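Since trtllm-serve exposes an OpenAI-compatible /v1/chat/completions endpoint, you can also call it from Python. The sketch below assumes the service started in Step 3 is listening on localhost:8000, that the openai Python package is installed (pip install openai), and that the model name is the same checkpoint path passed to trtllm-serve; it streams one response and measures TTFT plus a rough client-side TPOT.
import time
from openai import OpenAI

# Assumes the trtllm-serve instance from Step 3 is listening on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_time = None
chunk_times = []

stream = client.chat.completions.create(
    model="/path/to/model",  # same model path passed to trtllm-serve
    messages=[{"role": "user", "content": "Briefly explain W4A8 quantization."}],
    max_tokens=256,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    now = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = now  # first generated text arrives here
        chunk_times.append(now)

ttft = first_token_time - start
# Rough TPOT proxy: average gap between successive streamed chunks
# (chunks may not map one-to-one to tokens).
tpot = (chunk_times[-1] - first_token_time) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, approx TPOT: {tpot * 1000:.1f} ms")
For rigorous numbers, prefer the trtllm-bench report above; this client-side measurement is only a quick end-to-end check of the running service.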
By following the steps above, users can efficiently deploy the W4A8 quantized version of DeepSeek-R1 on Volcano Engine GPU cloud servers, effectively lowering the deployment threshold of the DeepSeek-R1 model. While taking full advantage of the performance benefits of W4A8 quantization, this helps Volcano Engine users quickly apply advanced AI models to real business.
❝In the future, Volcano Engine will continue to build on the hardware acceleration capabilities of its GPU cloud servers to provide customers with full-link acceleration solutions spanning data loading, model training, inference, and network transmission, helping businesses land and deploy quickly on Volcano Engine.
❞