Volcano Engine: Deploy DeepSeek-R1's W4A8 solution on a single machine, reducing deployment costs by half

Written by
Jasper Cole
Updated on: June 13, 2025

Volcano Engine has teamed up with NVIDIA to launch an efficient deployment solution for DeepSeek-R1, achieving a double breakthrough in cost and efficiency.

Core content:
1. Overview of DeepSeek-R1 model parameters and deployment challenges
2. Analysis of the W4A8 quantization technology in the TensorRT-LLM inference framework
3. ECS deployment guide: three steps to quickly stand up a DeepSeek-R1 inference service


DeepSeek-R1 created a sensation in the AI market as soon as it launched, thanks to its excellent deep reasoning capabilities.

However, DeepSeek-R1 contains 671B parameters; even at FP8 precision, the weights alone consume close to 700GB of memory. With 96GB GPU cards, that means a 16-card, multi-machine deployment (2 ecs.hpcpni3ln.45xlarge instances), and cross-machine data transfer during inference poses further challenges. Improving deployment efficiency and reducing inference cost while preserving model performance has therefore become the key obstacle to large-scale deployment of the DeepSeek-R1 model.
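As a rough back-of-envelope check (an illustrative estimate that ignores the KV cache, activation memory, and any layers kept at higher precision), the following snippet shows why halving the weight precision brings the model within reach of a single 8-GPU instance:

# Approximate weight-memory footprint of DeepSeek-R1 (illustrative only)
params = 671e9                          # total parameter count

fp8_weights_gb = params * 1.0 / 1e9     # 1 byte per parameter   -> ~671 GB
w4_weights_gb  = params * 0.5 / 1e9     # 0.5 byte per parameter -> ~336 GB

print(f"FP8 weights: ~{fp8_weights_gb:.0f} GB (16 x 96 GB GPUs = 1536 GB total)")
print(f"W4 weights:  ~{w4_weights_gb:.0f} GB (8 x 96 GB GPUs = 768 GB total)")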

Volcano Engine and NVIDIA joint solution: NVIDIA TensorRT-LLM quantization reduces inference cost

To address these challenges, Volcano Engine has worked closely with NVIDIA to launch a DeepSeek-R1 acceleration and optimization solution based on the TensorRT-LLM inference framework. It uses W4A8 quantization to achieve a performance breakthrough, significantly reducing model storage requirements and computational complexity while preserving model accuracy as much as possible:

  • 4-bit weight quantization (W4): a mixed-precision dynamic quantization algorithm compresses weight storage to about half of its original size while retaining model accuracy;
  • 8-bit activation quantization (A8): activations are kept at FP8 precision to preserve inference accuracy.

Based on the W4A8 optimization in TensorRT-LLM, DeepSeek-R1 can be deployed on a single Volcano Engine ecs.hpcpni3ln.45xlarge instance (8 GPUs, each with 96GB of memory). Compared with the non-quantized solution, and without degrading model quality (see the MMLU and MATH-500 benchmark comparison below), the quantized solution doubles token throughput while halving the required hardware resources and cost.
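To make the W4A8 idea more concrete, the following minimal numpy sketch shows symmetric per-group 4-bit weight quantization. It is purely illustrative: the group size of 128 is an assumption, and this is not the mixed-precision dynamic quantization algorithm that TensorRT-LLM actually implements.

import numpy as np

# Illustrative W4 sketch: 4-bit integer codes plus one scale per group of weights.
def quantize_weights_int4(w, group_size=128):
    """Symmetric per-group INT4 quantization: returns integer codes and scales."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # int4 symmetric range [-7, 7]
    codes = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return codes, scales

def dequantize_weights(codes, scales, shape):
    return (codes.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
codes, scales = quantize_weights_int4(w)
w_hat = dequantize_weights(codes, scales, w.shape)

print("storage per weight: 4 bits + a shared scale per 128 weights")
print("relative reconstruction error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))

The takeaway is that 4-bit codes plus a small number of scales roughly halve weight storage relative to FP8 while keeping the reconstruction error small; activations stay at 8-bit (FP8) precision, which is what the A8 part refers to.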

ECS deployment guide: three steps to an efficient deployment

Step 1: Environment preparation

1. Create a GPU cloud server

Before deploying the DeepSeek-R1 W4A8 inference service, you need to create the ECS instance and prepare its environment. First, open the Volcano Engine ECS console and create a cloud server of the ecs.hpcpni3ln.45xlarge specification (an invitation-only specification that requires whitelisting; contact your account manager to apply);

To improve deployment efficiency, we recommend selecting a system image with the GPU driver pre-installed (for example, Ubuntu 22.04 with GPU driver 535.161.08 and DOCA).

2. Execute the following command to install and configure the Docker environment;

Log in to the target instance and install Docker and the related container runtime components. If you already have a Docker environment, you can skip steps 2 to 4.

Confirm the Docker environment and runtime with the following command:

root@iv-ydt4e4fh1cbw80bh8228:~# docker info | grep -i runtime
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc

Installation Steps

sudo apt update
sudo apt install ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://mirrors.ivolces.com/docker/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.ivolces.com/docker/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-compose-plugin -y
3. Execute the following command to install NVIDIA Container Toolkit;
curl -s https://mirrors.ivolces.com/nvidia_all/ubuntu2204/x86_64/3bf863cc.pub | sudo apt-key add -
cat <<EOF >/etc/apt/sources.list.d/nvidia.list
deb http://mirrors.ivolces.com/nvidia_all/ubuntu2204/x86_64/ /
EOF
apt update
apt install nvidia-container-toolkit -y
4. Execute the following command to configure Docker so that the container can use the GPU resources of the instance;
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
5. Execute the following commands to pull the container image and start the container;
docker pull ai-containers-cn-beijing.cr.volces.com/deeplearning/tensorrt-llm:quant_v3
IMAGE=ai-containers-cn-beijing.cr.volces.com/deeplearning/tensorrt-llm:quant_v3
docker run --gpus all -itd --net=host --shm-size=100g --ulimit memlock=-1 --ulimit stack=67108864 --privileged --ipc=host --security-opt seccomp=unconfined --cap-add=ALL -v /var/run/nvidia-topologyd/:/var/run/nvidia-topologyd/ ${IMAGE} /bin/bash

Step 2: Model Quantization Preparation

1. To improve deployment efficiency, we have pre-quantized the model and published it in the oniond model library. You can download the model file directly with the following oniond command;
oniond download model DeepSeek-R1-W4AFP8

oniond is Volcano Engine's in-house model downloader. You can run oniond list models to see which models are currently available for download;

2. Alternatively, you can generate the W4A8 quantized model of DeepSeek-R1 yourself by following the steps below (optional; if you downloaded the model file directly with the oniond tool, you can skip step 2);

a. Execute the following commands to generate the activation quantization scales with the NVIDIA TensorRT Model Optimizer (ModelOpt) tool;

oniond download model DeepSeek-R1
PATH_OF_DEEPSEEK_R1=/llm-models/DeepSeek-R1/DeepSeek-R1
#Install ModelOpt
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer/ && cd TensorRT-Model-Optimizer
pip install "nvidia-modelopt[all]" -U --extra-index-url https://pypi.nvidia.com
#Convert the model checkpoint
python inference/convert.py --hf-ckpt-path $PATH_OF_DEEPSEEK_R1 --save-path ds_r1 --n-experts 256 --model-parallel 8 && cd ..
#Prepare the activation calibration file
torchrun --nproc-per-node 8 --master_port=12346 ptq.py --model_path DeepSeek-V3/ds_r1 --config DeepSeek-V3/inference/configs/config_671B.json --quant_cfg FP8_DEFAULT_CFG --output_path ds_r1_fp8_per_tensor_calibration
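Conceptually, the calibration step above runs a set of calibration batches through the model, records a per-tensor absolute maximum of the activations, and derives an FP8 scale from it. The sketch below illustrates the idea only; the FP8 E4M3 maximum of 448 is standard, but this is not the ModelOpt implementation:

import numpy as np

FP8_E4M3_MAX = 448.0   # largest magnitude representable in FP8 E4M3

def calibrate_fp8_scale(activation_batches):
    """Per-tensor FP8 calibration: map the observed amax onto the FP8 range."""
    amax = max(float(np.abs(batch).max()) for batch in activation_batches)
    return amax / FP8_E4M3_MAX   # activations are divided by this scale before casting to FP8

rng = np.random.default_rng(0)
batches = [rng.normal(scale=3.0, size=(8, 1024)) for _ in range(16)]
print("per-tensor FP8 scale:", calibrate_fp8_scale(batches))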

b. Execute the following command to complete the weight quantization and integration of the DeepSeek-R1 model;

#/app/tensorrt_llm/model.sh
#!/bin/bash
HF_MODEL_DIR=/path/to/DeepSeek-R1/
OUTPUT_DIR=/workspace/ckpt/
#act_scales.safetensors comes with the pre-quantized model downloaded above
ACT_SCALES=/path/to/DeepSeek-R1-W4AFP8/act_scales.safetensors
#Alternatively, use the calibration output from step a:
#ACT_SCALES=ds_r1_fp8_per_tensor_calibration
if [ ! -d "convert_logs" ]; then
    mkdir convert_logs
fi
pids=()
for i in 0 1 2 3 4 5 6 7
do
    python examples/quantization/quantize_mixed_precision_moe.py --model_dir $HF_MODEL_DIR --output_dir $OUTPUT_DIR --act_scales $ACT_SCALES --parts 9 --rank $i > convert_logs/log_$i 2>&1 &
    pids+=($!)
done
python examples/quantization/quantize_mixed_precision_moe.py --model_dir $HF_MODEL_DIR --output_dir $OUTPUT_DIR --act_scales $ACT_SCALES --parts 9 --rank 8 > convert_logs/log_8 2>&1
pids+=($!)
for pid in ${pids[@]}
do
    wait $pid
done
echo "All processes completed!"

Step 3: Performance evaluation and model running

Beyond quantizing and deploying DeepSeek-R1, we also care about model quality after W4A8 quantization. On the MMLU and MATH-500 benchmarks, the accuracy of the W4A8 quantized version of DeepSeek-R1 stays within 0.5% of the FP8 version.

Once the quantized model is ready, you can quickly verify that it works with the following command:

python examples/pytorch/quickstart_advanced.py --model_dir $CKPT_PATH --tp_size 8 --moe_ep_size 8 --moe_tp_size 1

This article uses the MMLU test set as an example of how to evaluate model accuracy. Execute the following commands to download the MMLU dataset and run the accuracy evaluation;

#Download the dataset
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar
#Accuracy evaluation
trtllm-eval --model $CKPT_PATH --backend pytorch --tp_size 8 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.75 mmlu --dataset_path $PATH_TO_MMLU_DATA
#MMLU weighted average accuracy: 86.88 (14042 samples)

You can also use the following commands to benchmark model throughput and related metrics such as TTFT and TPOT;

#Prepare the dataset
DS_R1_MODEL_PATH=/path/to/DeepSeek-R1
python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
    --stdout --tokenizer ${DS_R1_MODEL_PATH} \
    token-norm-dist \
    --input-mean 1000 --output-mean 1000 \
    --input-stdev 0 --output-stdev 0 \
    --num-requests 2000 > /tmp/synthetic_1000_1000.txt
cat << EOF > /workspace/extra-llm-api-config.yml
enable_attention_dp: false
pytorch_backend_config:
  enable_overlap_scheduler: true
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  print_iter_log: true
EOF
trtllm-bench --model $CKPT_PATH --model_path $CKPT_PATH throughput \
    --backend pytorch --max_batch_size 128 --max_num_tokens 1127 \
    --dataset /tmp/synthetic_1000_1000.txt --tp 8 --ep 8 \
    --kv_cache_free_gpu_mem_fraction 0.8 \
    --extra_llm_api_options /workspace/extra-llm-api-config.yml \
    --concurrency 128 --num_requests 640 --streaming \
    --report_json /workspace/logs/mtp/bench_log_2node_bs128_ep8_tp16_conc128_aa_latency.json

Execute the following command to start the model online service;

trtllm-serve $CKPT_PATH --host localhost --port 8000 --backend pytorch --max_batch_size 128 --max_num_tokens 3627 --tp_size 8 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.8 --extra_llm_api_options /workspace/extra-llm-api-config.yml

At this point, we have completed the quantization and deployment of the DeepSeek-R1 model. We can now call the inference service API locally, for example with curl:

curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/path/to/model", "messages": [ { "role": "user", "content": "Please prove the Riemann hypothesis" } ], "max_tokens": 100, "temperature": 0.7}'
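The same request can also be issued from Python. The sketch below is a minimal example and assumes the trtllm-serve endpoint started above is reachable at localhost:8000 and that the model path matches your deployment:

import requests

# Call the OpenAI-compatible chat completions endpoint exposed by trtllm-serve
# (assumes the service from the previous step is listening on localhost:8000).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "/path/to/model",
        "messages": [{"role": "user", "content": "Please prove the Riemann hypothesis"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])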

By following the steps above, users can efficiently deploy the W4A8 quantized version of DeepSeek-R1 on a Volcano Engine GPU cloud server, effectively lowering the deployment threshold of the DeepSeek-R1 model. Combined with the performance advantages of W4A8 quantization, this helps Volcano Engine users quickly apply advanced AI models to their actual business.

Going forward, Volcano Engine will continue to build on the hardware acceleration capabilities of its GPU cloud servers to provide customers with full-pipeline acceleration solutions, from data loading, model training and inference to network transmission, helping businesses land and scale quickly on Volcano Engine.