DeepSeek-R1 Full Version Deployment Guide for Ascend 910B

A practical guide to deploying on domestic hardware and avoiding common pitfalls along the way.
Core content:
1. A detailed walkthrough of deploying the full DeepSeek-R1
2. Tips for downloading model weights and configuring the whitelist
3. Steps for converting weights from FP8 to BF16
This is a guide to deploying the full DeepSeek-R1 on Ascend NPUs. There are plenty of related tutorials online, but in practice I ran into many pitfalls, so I am recording the deployment process here in the hope that it helps others deploying on domestic hardware.
Many major platforms now offer inference services for the full DeepSeek-R1. I came across an interesting prompt online that claims to detect whether a deployment is really the full model; you can try it yourself.
The answer from the official DeepSeek service:
And the answer from the Ascend deployment:
Ascend has published an official deployment guide, which this article is based on. It has its shortcomings, but it is still a good reference.
https://www.hiascend.com/software/modelzoo/models/detail/68457b8a51324310aad9a0f55c3e56e3
Model weights
The first step is to download the model weights. For the full R1 this is quite a hassle if your network is not fast. I tried several download channels and ended up using the Modelers community (modelers.cn): the peak speed reached about 80 MB/s and downloading all the weights took roughly an hour, which is very impressive. Their official announcement is linked below and it lives up to the claims; recommended.
https://modelers.cn/updates/zh/modelers/20250213-deepseek%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD/
When downloading you need to configure a whitelist first, otherwise specifying a custom location will throw an error:
from openmind_hub import snapshot_download

snapshot_download(
    repo_id="State_Cloud/DeepSeek-R1-origin",
    local_dir="xxx",
    cache_dir="xxx",
    local_dir_use_symlinks=False,
)
Next comes weight conversion: the FP8 weights need to be converted to BF16. You can use the DeepSeek-V3 weight conversion script from Ascend.
The DeepSeek-R1 weights are about 640 GB before conversion and about 1.3 TB after conversion, so plan the storage location in advance to avoid interruptions.
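For reference, this is roughly what the conversion step looks like if you use the fp8_cast_bf16.py reference script from the DeepSeek-V3 repository; the script shipped by Ascend is equivalent, but its name and flags may differ, so treat this as a sketch:
# Convert the FP8 safetensors to BF16 (both paths are placeholders)
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path-to-weights/DeepSeek-R1 \
  --output-bf16-hf-path /path-to-weights/DeepSeek-R1-bf16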
One more note: during deployment I hit an error when loading the weights that suggests soft links are not supported, so turn off soft links at download time by setting local_dir_use_symlinks=False.
As for the Ascend hardware requirements: the BF16 version of R1 needs at least four Atlas 800I A2 (8×64 GB) servers, while the W8A8 quantized version needs at least two Atlas 800I A2 (8×64 GB). For my deployment I used the quantized version on two Atlas 800T A2 servers.
If you want to skip the weight conversion above and deploy the W8A8 quantized version, you can download already-converted weights directly from the community; they have 6k+ downloads and work fine.
After downloading the model weights, set the ownership and permissions so they can be read later:
chown -R 1001:1001 /path-to-weights/DeepSeek-R1
chmod -R 750 /path-to-weights/DeepSeek-R1
Container image
Ascend provides an official image that can be deployed directly, so developers can get started quickly.
Image link: https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f
The MindIE image currently provided comes with the DeepSeek-R1 inference scripts pre-installed, so there is no need to download the model code separately.
Access to the image must be applied for; it can be pulled after approval.
Execute the command:
docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts
After pulling the image, start the container. You can use the following command, which differs slightly from the official tutorial:
docker run -itd --privileged --name=deepseek-r1 --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
-v /usr/local/sbin:/usr/local/sbin \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /etc/hccn.conf:/etc/hccn.conf \
-v xxxxxx/DeepSeek-R1-weight:/workspace \
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts \
bash
--name sets the container name; -v mounts the downloaded model.
Note that the directory containing the downloaded model weights must be mounted into the container at /workspace so it can be used during the later deployment steps.
The other mounts are standard drivers and tools; as long as they work on the host, they generally cause no problems in the container.
For a multi-server deployment, every server needs the same model weights downloaded (the local paths may differ), and each server must run the command above to start its own container, with the mount path adjusted accordingly.
Entering the container
After starting the container, all subsequent operations are performed in the container by default.
The first step is to enter the container. Assume that the container name is deepseek-r1.
docker exec -it deepseek-r1 bash
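Once inside, a quick sanity check is to confirm that all eight NPUs are visible (this assumes the npu-smi mount from the docker run command above worked):
# List the NPUs visible inside the container
npu-smi info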
Container inspection
After entering the container, first check the machine's network status. If something is wrong, check whether the host itself is healthy first; if the host is fine but the container has problems, some directory is probably not mounted correctly. You can also ask Teacher G or Teacher D for help.
# Check physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the link status
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check network health
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# Check whether the configuration of the detection IP is correct
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# Check the consistency of NPU underlying tls verification behavior. It is recommended to set all 0s
for i in {0..7}; do hccn_tool -i $i -tls -g; done | grep switch
# Set the NPU underlying tls verification behavior to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done
Configuring the multi-node, multi-card rank table file
This file is critical: the MindIE inference framework later reads it at startup, and no additional settings are needed beyond it.
The configuration itself is fairly simple. Use this command to record the IP address of each card:
for i in {0..7}; do hccn_tool -i $i -ip -g; done
Run it once on each machine, and decide which machine is the master node.
server_count: total number of servers (nodes); the first server in server_list is the master node
device_id: local index of the card on its machine, in the range [0, number of local cards)
device_ip: IP address of the card, obtained with the hccn_tool command above
rank_id: global index of the card, in the range [0, total number of cards)
server_id: IP address of the node
container_ip: IP address of the container (required for service-oriented deployment); unless specially configured, it is the same as server_id
View the server's IP address
hostname -I
View the IP address of the Docker container
docker inspect container id | grep "IPAddress"
If this returns nothing, the container may be sharing the host's network. Check the container's network mode:
docker inspect <container id> | grep -i '"NetworkMode"'
If the returned value is "NetworkMode": "host", the container is using the host network: it has no IP of its own and uses the host's IP directly.
Below are the configuration files for the two nodes. Just fill in the IP addresses accordingly.
{
"server_count" : "2" ,
"server_list" : [
{
"device" : [
{ "device_id" : "0" , "device_ip" : "xxxx" , "rank_id" : "0" },
{ "device_id" : "1" , "device_ip" : "xxxx" , "rank_id" : "1" },
{ "device_id" : "2" , "device_ip" : "xxxx" , "rank_id" : "2" },
{ "device_id" : "3" , "device_ip" : "xxxx" , "rank_id" : "3" },
{ "device_id" : "4" , "device_ip" : "xxxx" , "rank_id" : "4" },
{ "device_id" : "5" , "device_ip" : "xxxx" , "rank_id" : "5" },
{ "device_id" : "6" , "device_ip" : "xxxx" , "rank_id" : "6" },
{ "device_id" : "7" , "device_ip" : "xxxx" , "rank_id" : "7" }
],
"server_id" : "xxxx" ,
"container_ip" : "xxxx"
},
{
"device" : [
{ "device_id" : "0" , "device_ip" : "xxxx" , "rank_id" : "8" },
{ "device_id" : "1" , "device_ip" : "xxxx" , "rank_id" : "9" },
{ "device_id" : "2" , "device_ip" : "xxxx" , "rank_id" : "10" },
{ "device_id" : "3" , "device_ip" : "xxxx" , "rank_id" : "11" },
{ "device_id" : "4" , "device_ip" : "xxxx" , "rank_id" : "12" },
{ "device_id" : "5" , "device_ip" : "xxxx" , "rank_id" : "13" },
{ "device_id" : "6" , "device_ip" : "xxxx" , "rank_id" : "14" },
{ "device_id" : "7" , "device_ip" : "xxxx" , "rank_id" : "15" }
],
"server_id" : "xxxx" ,
"container_ip" : "xxxx"
}
],
"status" : "completed" ,
"version" : "1.0"
}
Set the communication environment variables:
export ATB_LLM_HCCL_ENABLE=1
export ATB_LLM_COMM_BACKEND="hccl"
export HCCL_CONNECT_TIMEOUT=7200
export WORLD_SIZE=32
export HCCL_EXEC_TIMEOUT=0
In the config.json file in the weights directory, change model_type to deepseekv2 (all lowercase and no spaces)
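If you prefer not to open the file in an editor, a one-liner along these lines does the same thing (the weights path is a placeholder for your own location):
# Rewrite model_type in the weights' config.json to deepseekv2
sed -i 's/"model_type": *"[^"]*"/"model_type": "deepseekv2"/' /path-to-weights/DeepSeek-R1/config.json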
Accuracy test
The official accuracy-test example does not match the directories in the image I downloaded, and running the full_CEval test also reports an error about missing files. The actual location of modeltest inside the image is /usr/local/Ascend/atb-models/tests/modeltest.
Test command:
# Need to be executed on all machines at the same time
bash run.sh pa_bf16 [dataset] ([shots]) [batch_size] [model_name] ([is_chat_model]) [weight_dir] [rank_table_file] [world_size] [node_num] [rank_id_start] [master_address]
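For reference, a full_CEval run on this two-node setup would look something like the following; the shot count 5 and batch size 16 are illustrative values, and the paths are placeholders:
bash run.sh pa_bf16 full_CEval 5 16 deepseekv2 /path/to/weights/DeepSeek-R1 /path/to/xxx/ranktable.json 16 2 0 {master node IP}
# Run it on every machine at the same time, changing rank_id_start per node (0 on the first node, 8 on the second)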
Performance Testing
The performance test lives in the same directory, and it runs successfully.
Run command:
bash run.sh pa_bf16 performance [[256,256]] 16 deepseekv2 /path/to/weights/DeepSeek-R1 /path/to/xxx/ranktable.json 16 2 0 {Master node IP}
# 0 means this machine starts inference from card 0; subsequent machines start from 8, 16, and 24 respectively.
After the run, a CSV file is generated that records the metrics of this test.
The metrics include:
Batch size
Input sequence length (In_seq)
Output sequence length (Out_seq)
Total time
First token time
Non-first token time
Non-first token throughput (Throughput)
End-to-end throughput (E2E Throughput)
Inference deployment
Both of the tests above are optional. You can run the performance test and tune the batch size to see what throughput you can get.
Before starting, you need to configure the container. For each container, execute:
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MIES_CONTAINER_IP=<container IP address>
export RANKTABLEFILE=<path to rank_table_file.json>
export OMP_NUM_THREADS=1
export NPU_MEMORY_FRACTION=0.95
Note that the path above is the path inside the container, and each machine's IP address must be set to its own.
After that, each machine needs to modify the service parameters accordingly, i.e. the deployment configuration file.
Because this file lives inside the container, editing it there with vim is tedious; here is a more convenient approach.
Copy the file to the host machine using the command:
docker cp <container id>:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json <local directory>
This way you can edit the JSON file on the host, which is quick and convenient. Since every machine uses the same configuration, you can make the change once and copy it to each machine. After editing, transfer it back into the container with:
docker cp <local directory>/config.json <container id>:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
That completes the configuration file change. This file will need further adjustments later, so this approach saves a lot of trouble.
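Since every machine uses the same configuration, the edited file can be pushed out from one node with a small loop along these lines (node1/node2 and the container name deepseek-r1 are placeholders for your own hosts and container):
# Copy the edited config.json to each node, then into its container
for host in node1 node2; do
  scp ./config.json ${host}:/tmp/config.json
  ssh ${host} "docker cp /tmp/config.json deepseek-r1:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"
done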
Below is the official configuration. If you want faster inference or longer inputs and outputs, you need to adjust the parameters accordingly; I don't have a definitive recommendation for this part. I currently have it set to 32k, which deploys normally and gives acceptable inference speed.
For a detailed description of the parameters, see:
https://www.hiascend.com/document/detail/zh/mindie/100/mindieservice/servicedev/mindie_service0285.html
{
"Version" : "1.0.0" ,
"LogConfig" :
{
"logLevel" : "Info" ,
"logFileSize" : 20 ,
"logFileNum" : 20 ,
"logPath" : "logs/mindie-server.log"
},
"ServerConfig" :
{
"ipAddress" : "Change to the master node IP" ,
"managementIpAddress" : "Change to the master node IP" ,
"port" : 1025 ,
"managementPort" : 1026 ,
"metricsPort" : 1027 ,
"allowAllZeroIpListening" : false ,
"maxLinkNum" : 1000 , //If there are 4 machines, 300 is recommended
"httpsEnabled" : false ,
"fullTextEnabled" : false ,
"tlsCaPath" : "security/ca/" ,
"tlsCaFile" : [ "ca.pem" ],
"tlsCert" : "security/certs/server.pem" ,
"tlsPk" : "security/keys/server.key.pem" ,
"tlsPkPwd" : "security/pass/key_pwd.txt" ,
"tlsCrlPath" : "security/certs/" ,
"tlsCrlFiles" : [ "server_crl.pem" ],
"managementTlsCaFile" : [ "management_ca.pem" ],
"managementTlsCert" : "security/certs/management/server.pem" ,
"managementTlsPk" : "security/keys/management/server.key.pem" ,
"managementTlsPkPwd" : "security/pass/management/key_pwd.txt" ,
"managementTlsCrlPath" : "security/management/certs/" ,
"managementTlsCrlFiles" : [ "server_crl.pem" ],
"kmcKsfMaster" : "tools/pmt/master/ksfa" ,
"kmcKsfStandby" : "tools/pmt/standby/ksfb" ,
"inferMode" : "standard" ,
"interCommTLSEnabled" : false ,
"interCommPort" : 1121 ,
"interCommTlsCaPath" : "security/grpc/ca/" ,
"interCommTlsCaFiles" : [ "ca.pem" ],
"interCommTlsCert" : "security/grpc/certs/server.pem" ,
"interCommPk" : "security/grpc/keys/server.key.pem" ,
"interCommPkPwd" : "security/grpc/pass/key_pwd.txt" ,
"interCommTlsCrlPath" : "security/grpc/certs/" ,
"interCommTlsCrlFiles" : [ "server_crl.pem" ],
"openAiSupport" : "vllm"
},
"BackendConfig" : {
"backendName" : "mindieservice_llm_engine" ,
"modelInstanceNumber" : 1 ,
"npuDeviceIds" : [[ 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 ]],
"tokenizerProcessNumber" : 8 ,
"multiNodesInferEnabled" : true ,
"multiNodesInferPort" : 1120 ,
"interNodeTLSEnabled" : false ,
"interNodeTlsCaPath" : "security/grpc/ca/" ,
"interNodeTlsCaFiles" : [ "ca.pem" ],
"interNodeTlsCert" : "security/grpc/certs/server.pem" ,
"interNodeTlsPk" : "security/grpc/keys/server.key.pem" ,
"interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt" ,
"interNodeTlsCrlPath" : "security/grpc/certs/" ,
"interNodeTlsCrlFiles" : [ "server_crl.pem" ],
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa" ,
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb" ,
"ModelDeployConfig" :
{
"maxSeqLen" : 10000 ,
"maxInputTokenLen" : 2048 ,
"truncation" : true ,
"ModelConfig" : [
{
"modelInstanceType" : "Standard" ,
"modelName" : "deepseekr1" ,
"modelWeightPath" : "/home/data/dsR1_base_step178000" ,
"worldSize" : 8 ,
"cpuMemSize" : 5 ,
"npuMemSize" : -1 ,
"backendType" : "atb" ,
"trustRemoteCode" : false
}
]
},
"ScheduleConfig" :
{
"templateType" : "Standard" ,
"templateName" : "Standard_LLM" ,
"cacheBlockSize" : 128 ,
"maxPrefillBatchSize" : 8 ,
"maxPrefillTokens" : 2048 ,
"prefillTimeMsPerReq" : 150 ,
"prefillPolicyType" : 0 ,
"decodeTimeMsPerReq" : 50 ,
"decodePolicyType" : 0 ,
"maxBatchSize" : 8 ,
"maxIterTimes" : 1024 ,
"maxPreemptCount" : 0 ,
"supportSelectBatch" : false ,
"maxQueueDelayMicroseconds" : 5000
}
}
}
Start the service
The startup command is also relatively simple
cd /usr/local/Ascend/mindie/latest/mindie-service
nohup ./bin/mindieservice_daemon > /workspace/output.log 2>&1 &
It is best to start the service in the background like this so you can check the log later; otherwise, after the terminal is closed the service keeps running but the log is gone, which makes debugging inconvenient.
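To watch the startup progress, I simply tail the log until the success line below appears:
# Follow the service log (Ctrl-C to stop once you see the success message)
tail -f /workspace/output.log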
After the command is executed, all the parameters used for this startup will be printed first, and then the following output will appear:
Daemon start success!
Once you see this line, the service has started successfully.
At this point, the deployment can be considered successful, and there is one last step of testing:
curl -X POST http://{ip}:{port}/v1/chat/completions \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-R1",
"messages": [{
"role": "user",
"content": "Hello"
}],
"max_tokens": 20,
"presence_penalty": 1.03,
"frequency_penalty": 1.0,
"seed": null,
"temperature": 0.5,
"top_p": 0.95,
"stream": true
}'
Note that HTTPS communication is not enabled in the official tutorial. Use http instead of https for subsequent calls.
Using HTTPS requires configuring the service certificate, private key, and the other certificate files needed for HTTPS communication.
If you can see output for the request above, the deployment is successful.
Finally, you need to adapt to the OpenAI-style inference interface; refer to:
https://www.hiascend.com/document/detail/zh/mindie/10RC3/mindieservice/servicedev/mindie_service0076.html
A few points worth mentioning:
The whole deployment basically follows the official tutorial, but there are pitfalls at every step and all kinds of strange problems, mainly because the logs are hard to find. I went through every directory in the container and found very little substantive error output; /root/mindie is usually the most useful place to look. It is also rare to find these problems discussed online; it would be great if the deployment guide had a discussion section to help people avoid the pitfalls.
In addition, some parts of the official tutorial are inconsistent (or maybe I did something wrong), which is puzzling. For example, the test directory confused me at first, and I checked several times before finding it in a different location.
Some problems I simply could not troubleshoot myself and ended up asking Huawei engineers to resolve. I am very grateful for the quick support, and I hope domestic products keep getting better.
Deployment problem: the tokenizer fails to load. Solution: check that tokenizer.json is consistent with the official release, and check the permissions to make sure it can be read.
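Two quick checks for this (the weights path is a placeholder; compare the hash against the file published on the official model page):
# Verify that tokenizer.json is intact and readable
sha256sum /path-to-weights/DeepSeek-R1/tokenizer.json
ls -l /path-to-weights/DeepSeek-R1/tokenizer.json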
There was another kind of problem where the log alone gave no clue about the cause. The final solution: upgrade the driver!
Similar problems can be caused by hccn; after upgrading the driver to 24.1.0, many of them simply disappeared. If you run into problems that are hard to resolve, don't doubt yourself.
CANN does not need to be upgraded; only the Ascend NPU firmware and driver do.
Select the openEuler build, and note that you can choose the 800I A2 inference server here, which gives faster inference.
After downloading these two files, install them according to the normal process.
https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/softwareinst/instg/instg_0004.html?Mode=PmIns&OS=Ubuntu&Software=cannToolKit