DeepSeek-R1 Full Version (671B) Deployment Guide for Ascend 910B

Written by
Silas Grey
Updated on: July 16th, 2025
Recommendation

A practical guide to deploying on domestic hardware, helping you avoid common pitfalls during deployment.

Core content:
1. Detailed walkthrough of the full-version DeepSeek-R1 deployment process
2. Model weight download tips and whitelist configuration
3. FP8 to BF16 weight conversion steps

— Yang Fangxian, Founder of 53AI / Tencent Cloud Most Valuable Expert (TVP)

This is a guide to deploying the full version of DeepSeek-R1 on Ascend NPUs. There are many related tutorials online, but in practice I ran into quite a few pitfalls, so I am recording the deployment process here in the hope that it helps others deploying on domestic hardware.

Many major platforms now offer inference services for the full version of DeepSeek-R1. I came across an interesting test online: a prompt that supposedly checks whether a service is running the full model. You can try it yourself.

First, the answer from the official DeepSeek service:

Now the answer from the Ascend deployment:


Ascend has published an official deployment guide, which this article is based on. It has its shortcomings, but it is still a good reference.

https://www.hiascend.com/software/modelzoo/models/detail/68457b8a51324310aad9a0f55c3e56e3

Model weights

The first step is to download the model weights. For the full version of R1, this is quite painful if your connection is not fast. I tried multiple download channels and ended up using the MoLe community (modelers.cn). The peak speed reached 80M/s and downloading all the weights took about an hour, which is very impressive. The official introduction is linked below; it is genuinely worth recommending.

https://modelers.cn/updates/zh/modelers/20250213-deepseek%E6%9D%83%E9%87%8D%E4%B8%8B%E8%BD%BD/


When downloading, you need to add the target directory to a whitelist; otherwise, downloading to a custom location will report an error.

from openmind_hub import snapshot_download

snapshot_download(
    repo_id="State_Cloud/DeepSeek-R1-origin",
    local_dir="xxx",
    cache_dir="xxx",
    local_dir_use_symlinks=False,
)

Then comes the weight conversion: the released FP8 weights need to be converted to BF16. You can use the DeepSeek-V3 weight conversion script provided with Ascend's guide.

The DeepSeek-R1 weights are about 640 GB before conversion and about 1.3 TB after conversion, so plan the storage location in advance to avoid the conversion being interrupted.
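For reference, the conversion step looks roughly like the sketch below. It assumes the fp8_cast_bf16.py script from the upstream DeepSeek-V3 repository; the Ascend guide ships its own copy of the conversion script, so follow the version it points to and treat the paths and flags here as illustrative.

# Illustrative invocation of the DeepSeek-V3 FP8 -> BF16 conversion script (paths are placeholders)
# Make sure the output location has roughly 1.3 TB of free space before starting
python fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-bf16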

One more note: during deployment I hit an error while loading the weights, and it appears that symlinks are not supported. So when downloading, turn off symlinks by setting local_dir_use_symlinks=False.

Regarding hardware requirements: the BF16 version of R1 needs at least four Atlas 800I A2 (8×64 GB) servers, and the W8A8 quantized version needs at least two Atlas 800I A2 (8×64 GB) servers. For my deployment I used the quantized version on two Atlas 800T A2 machines.

If you do not want to go through the weight conversion steps above and plan to deploy the W8A8 quantized version, you can download the already-converted weights directly from the community; they have 6k+ downloads and work fine.

After downloading the model weights, set the ownership and permissions so they can be read later:

chown -R 1001:1001 /path-to-weights/DeepSeek-R1
chmod -R 750 /path-to-weights/DeepSeek-R1
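A quick way to confirm the ownership and permission change took effect:

# Numeric listing; entries should be owned by 1001:1001 with 750-style permissions
ls -ln /path-to-weights/DeepSeek-R1 | head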

Container image

Ascend provides an official image that can be deployed directly, so developers can get started with one click.

Image link: https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f

The currently provided MindIE image has a pre-installed DeepSeek-R1 model inference script, so there is no need to download the model code.

You need to apply for access to this image; it can be downloaded after approval.

Execute the command:

docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts

After pulling the image, you need to start the container. You can use the following command, which differs slightly from the official tutorial:

docker run -itd --privileged --name=deepseek-r1 --net=host \
   --shm-size 500g \
   --device=/dev/davinci0 \
   --device=/dev/davinci1 \
   --device=/dev/davinci2 \
   --device=/dev/davinci3 \
   --device=/dev/davinci4 \
   --device=/dev/davinci5 \
   --device=/dev/davinci6 \
   --device=/dev/davinci7 \
   --device=/dev/davinci_manager \
   --device=/dev/hisi_hdc \
   --device /dev/devmm_svm \
   -v /usr/local/dcmi:/usr/local/dcmi \
   -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
   -v /usr/local/sbin:/usr/local/sbin \
   -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
   -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
   -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
   -v /etc/hccn.conf:/etc/hccn.conf \
   -v xxxxxx/DeepSeek-R1-weight:/workspace \
   swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts \
   bash

--name sets the container name; -v mounts the downloaded model into the container.

Note that the directory with the downloaded model weights must be mounted into the container under /workspace so that it can be used in the later deployment steps.

The other mounts are the usual drivers and tools; as long as they run normally on the host, there is generally no problem.

When deploying across multiple servers, each server downloads the same model weights (the local paths can differ), and each one must run the command above to start its container, adjusting the mount path accordingly.

Entering the container

After starting the container, all subsequent operations are performed in the container by default.

The first step is to enter the container. Assume that the container name is deepseek-r1.

docker exec -it deepseek-r1 bash

Container inspection

After entering the container, first check the machine's network status. If there is a problem, check whether the host itself is normal; if the host is fine but the container has issues, some directory was probably not mounted properly. You can also ask an AI assistant (GPT or DeepSeek) for help.

# Check the physical link
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Check the link status
for i in {0..7}; do hccn_tool -i $i -link -g; done
# Check network health
for i in {0..7}; do hccn_tool -i $i -net_health -g; done
# Check whether the detection IP is configured correctly
for i in {0..7}; do hccn_tool -i $i -netdetect -g; done
# Check whether the gateway is configured correctly
for i in {0..7}; do hccn_tool -i $i -gateway -g; done
# Check that the NPUs' low-level TLS verification settings are consistent; setting them all to 0 is recommended
for i in {0..7}; do hccn_tool -i $i -tls -g; done | grep switch
# Set the NPUs' low-level TLS verification to 0
for i in {0..7}; do hccn_tool -i $i -tls -s enable 0; done
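Besides the network checks, it is also worth confirming that the NPUs themselves are visible inside the container (npu-smi is mounted by the docker run command above):

# Should list all eight chips along with their health status and memory usage
npu-smi info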

Configuring the multi-machine, multi-card rank table file

This file is critical: once it is configured, the MindIE inference framework will also refer to it at startup, and no additional settings are required.

The configuration itself is fairly simple. Use the following command to record the IP address of each card.

for i in {0..7}; do hccn_tool -i $i -ip -g; done

Run it once on each machine, and decide which machine will be the master node.

  • server_count: the total number of servers (nodes) used; the first server in server_list is the master node
  • device_id: the local index of the current card, in the range [0, number of local cards)
  • device_ip: the IP address of the current card, obtained with the hccn_tool command above
  • rank_id: the global index of the current card, in the range [0, total number of cards)
  • server_id: the IP address of the current node
  • container_ip: the container's IP address (required for service-oriented deployment); if nothing special is configured, it is the same as server_id

View the server's IP address

hostname -I

View the IP address of the Docker container

docker inspect <container_id> | grep "IPAddress"

If the return value is empty, it may be using the same network as the host. Check the network mode of the container.

docker inspect <container_id> | grep -i '"NetworkMode"'

If the returned value is "NetworkMode": "host", the container is using the host network: it has no IP of its own and uses the host IP directly.

Below are the configuration files for the two nodes. Just fill in the IP addresses accordingly.

{
    "server_count": "2",
    "server_list": [
        {
            "device": [
                { "device_id": "0", "device_ip": "xxxx", "rank_id": "0" },
                { "device_id": "1", "device_ip": "xxxx", "rank_id": "1" },
                { "device_id": "2", "device_ip": "xxxx", "rank_id": "2" },
                { "device_id": "3", "device_ip": "xxxx", "rank_id": "3" },
                { "device_id": "4", "device_ip": "xxxx", "rank_id": "4" },
                { "device_id": "5", "device_ip": "xxxx", "rank_id": "5" },
                { "device_id": "6", "device_ip": "xxxx", "rank_id": "6" },
                { "device_id": "7", "device_ip": "xxxx", "rank_id": "7" }
            ],
            "server_id": "xxxx",
            "container_ip": "xxxx"
        },
        {
            "device": [
                { "device_id": "0", "device_ip": "xxxx", "rank_id": "8" },
                { "device_id": "1", "device_ip": "xxxx", "rank_id": "9" },
                { "device_id": "2", "device_ip": "xxxx", "rank_id": "10" },
                { "device_id": "3", "device_ip": "xxxx", "rank_id": "11" },
                { "device_id": "4", "device_ip": "xxxx", "rank_id": "12" },
                { "device_id": "5", "device_ip": "xxxx", "rank_id": "13" },
                { "device_id": "6", "device_ip": "xxxx", "rank_id": "14" },
                { "device_id": "7", "device_ip": "xxxx", "rank_id": "15" }
            ],
            "server_id": "xxxx",
            "container_ip": "xxxx"
        }
    ],
    "status": "completed",
    "version": "1.0"
}

Enable communication environment variables

export ATB_LLM_HCCL_ENABLE=1
export ATB_LLM_COMM_BACKEND="hccl"
export HCCL_CONNECT_TIMEOUT=7200
export WORLD_SIZE=32
export HCCL_EXEC_TIMEOUT=0

  • In the config.json file in the weights directory, change model_type to deepseekv2 (all lowercase, no spaces); see the sketch below
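A small sketch of that edit, assuming the converted weights are mounted at /workspace/DeepSeek-R1 inside the container (adjust the path to your own mount):

# Hypothetical weight path; change it to wherever the converted weights are mounted
WEIGHTS=/workspace/DeepSeek-R1
cp "$WEIGHTS/config.json" "$WEIGHTS/config.json.bak"   # keep a backup
# Set model_type to deepseekv2 (all lowercase, no spaces), then confirm the change
sed -i 's/"model_type"[[:space:]]*:[[:space:]]*"[^"]*"/"model_type": "deepseekv2"/' "$WEIGHTS/config.json"
grep '"model_type"' "$WEIGHTS/config.json"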

Accuracy test

The official accuracy-test example does not match the directory layout in the image I downloaded, and running the full_CEval test also reports an error about a missing modeltest path. The actual location inside the image is /usr/local/Ascend/atb-models/tests/modeltest.

Test command:

# Need to be executed on all machines at the same time
bash run.sh pa_bf16 [dataset] ([shots]) [batch_size] [model_name] ([is_chat_model]) [weight_dir] [rank_table_file] [world_size] [node_num] [rank_id_start] [master_address]
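For reference, here is a hypothetical filled-in CEval run following that template (5-shot is an assumption; adjust the dataset, paths, world size, node count, rank_id_start and master IP to your cluster):

# Hypothetical accuracy-test invocation; the argument order mirrors the performance command below
bash run.sh pa_bf16 full_CEval 5 16 deepseekv2 /path/to/weights/DeepSeek-R1 /path/to/ranktable.json 16 2 0 {Master node IP}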

Performance Testing

The performance test is in the same directory, and it runs successfully.

Run Command

bash run.sh pa_bf16 performance [[256,256]] 16 deepseekv2 /path/to/weights/DeepSeek-R1 /path/to/xxx/ranktable.json 16 2 0 {Master node IP}
# The trailing 0 is rank_id_start: this machine starts from rank 0; subsequent machines start from 8, 16, and 24 respectively.

After running, a csv file will be generated, which saves the metrics of this test, for example:

Model: deepseekv2
Batchsize: 16
In_seq: 256
Out_seq: 256
Total time(s): 18.6202795506
First token time(ms): 478.01
Non-first token time(ms): 71.03
Non-first token Throughput(Tokens/s): 225.25693369
E2E Throughput(Tokens/s): 219.9752151346
Non-first token Throughput Average(Tokens/s): 225.25693369
E2E Throughput Average(Tokens/s): 219.9752151346

Parameter explanation:

  • Batch size
  • Input sequence length (In_seq)
  • Output sequence length (Out_seq)
  • Total time
  • First token time
  • Non-first token time
  • Non-first token throughput rate (Throughput)
  • End-to-end throughput (E2E Throughput)

Inference deployment

The two tests above are optional. You can run the performance test and adjust the batch size to see what kind of throughput you get.

Before starting, you need to configure the container. For each container, execute:

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MIES_CONTAINER_IP=<container IP address>
export RANKTABLEFILE=<path to rank_table_file.json>

export OMP_NUM_THREADS=1
export NPU_MEMORY_FRACTION=0.95

Note that the path above is the path inside the container, and the IP address must be filled in correctly for each machine.

After that, each machine needs to modify the service parameters accordingly, i.e. the deployment configuration file.

Because this file lives inside the container, editing it directly with vim is tedious. Here is a method I recommend.

Copy the file to the host machine using the command:

docker cp <container_id>:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json /local-directory

This way you can edit the JSON file on the host, which is quick and convenient. Since every machine uses the same configuration, after editing you can copy the file to each machine. Once the changes are done, copy it back into the container with:

docker cp /local-directory/config.json <container_id>:/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
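Before copying it back, it is worth checking that the edited file is still valid JSON; a quick check on the host (adjust the local path):

# Fails loudly if the edited config.json is no longer valid JSON
python3 -m json.tool /local-directory/config.json > /dev/null && echo "config.json OK"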

That completes the configuration change. The file will need to be adjusted again later, so this approach saves a lot of trouble.

The following is the official configuration. If you want faster inference or longer inputs and outputs, you need to adjust the parameters of this file accordingly. I have no strong recommendation for this part; at the moment I set the context to 32k, which deploys normally with acceptable inference speed.

For a detailed description of the parameters, refer to:

https://www.hiascend.com/document/detail/zh/mindie/100/mindieservice/servicedev/mindie_service0285.html

{
    "Version"  :  "1.0.0" ,
    "LogConfig"  :
    {
        "logLevel"  :  "Info" ,
        "logFileSize"  :  20 ,
        "logFileNum"  :  20 ,
        "logPath"  :  "logs/mindie-server.log"
    },

    "ServerConfig"  :
    {
        "ipAddress"  :  "Change to the master node IP" ,
        "managementIpAddress"  :  "Change to the master node IP" ,
        "port"  :  1025 ,
        "managementPort"  :  1026 ,
        "metricsPort"  :  1027 ,
        "allowAllZeroIpListening"  :  false ,
        "maxLinkNum"  :  1000//If there are 4 machines, 300 is recommended
        "httpsEnabled"  :  false ,
        "fullTextEnabled"  :  false ,
        "tlsCaPath"  :  "security/ca/" ,
        "tlsCaFile"  : [ "ca.pem" ],
        "tlsCert"  :  "security/certs/server.pem" ,
        "tlsPk"  :  "security/keys/server.key.pem" ,
        "tlsPkPwd"  :  "security/pass/key_pwd.txt" ,
        "tlsCrlPath"  :  "security/certs/" ,
        "tlsCrlFiles"  : [ "server_crl.pem" ],
        "managementTlsCaFile"  : [ "management_ca.pem" ],
        "managementTlsCert"  :  "security/certs/management/server.pem" ,
        "managementTlsPk"  :  "security/keys/management/server.key.pem" ,
        "managementTlsPkPwd"  :  "security/pass/management/key_pwd.txt" ,
        "managementTlsCrlPath"  :  "security/management/certs/" ,
        "managementTlsCrlFiles"  : [ "server_crl.pem" ],
        "kmcKsfMaster"  :  "tools/pmt/master/ksfa" ,
        "kmcKsfStandby"  :  "tools/pmt/standby/ksfb" ,
        "inferMode"  :  "standard" ,
        "interCommTLSEnabled"  :  false ,
        "interCommPort"  :  1121 ,
        "interCommTlsCaPath"  :  "security/grpc/ca/" ,
        "interCommTlsCaFiles"  : [ "ca.pem" ],
        "interCommTlsCert"  :  "security/grpc/certs/server.pem" ,
        "interCommPk"  :  "security/grpc/keys/server.key.pem" ,
        "interCommPkPwd"  :  "security/grpc/pass/key_pwd.txt" ,
        "interCommTlsCrlPath"  :  "security/grpc/certs/" ,
        "interCommTlsCrlFiles"  : [ "server_crl.pem" ],
        "openAiSupport"  :  "vllm"
    },

    "BackendConfig"  : {
        "backendName"  :  "mindieservice_llm_engine" ,
        "modelInstanceNumber"  :  1 ,
        "npuDeviceIds"  : [[ 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 ]],
        "tokenizerProcessNumber"  :  8 ,
        "multiNodesInferEnabled"  :  true ,
        "multiNodesInferPort"  :  1120 ,
        "interNodeTLSEnabled"  :  false ,
        "interNodeTlsCaPath"  :  "security/grpc/ca/" ,
        "interNodeTlsCaFiles"  : [ "ca.pem" ],
        "interNodeTlsCert"  :  "security/grpc/certs/server.pem" ,
        "interNodeTlsPk"  :  "security/grpc/keys/server.key.pem" ,
        "interNodeTlsPkPwd"  :  "security/grpc/pass/mindie_server_key_pwd.txt" ,
        "interNodeTlsCrlPath"  :  "security/grpc/certs/" ,
        "interNodeTlsCrlFiles"  : [ "server_crl.pem" ],
        "interNodeKmcKsfMaster"  :  "tools/pmt/master/ksfa" ,
        "interNodeKmcKsfStandby"  :  "tools/pmt/standby/ksfb" ,
        "ModelDeployConfig"  :
        {
            "maxSeqLen"  :  10000 ,
            "maxInputTokenLen"  :  2048 ,
            "truncation"  :  true ,
            "ModelConfig"  : [
                {
                    "modelInstanceType"  :  "Standard" ,
                    "modelName"  :  "deepseekr1" ,
                    "modelWeightPath"  :  "/home/data/dsR1_base_step178000" ,
                    "worldSize"  :  8 ,
                    "cpuMemSize"  :  5 ,
                    "npuMemSize"  :  -1 ,
                    "backendType"  :  "atb" ,
                    "trustRemoteCode"  :  false
                }
            ]
        },

        "ScheduleConfig"  :
        {
            "templateType"  :  "Standard" ,
            "templateName"  :  "Standard_LLM" ,
            "cacheBlockSize"  :  128 ,

            "maxPrefillBatchSize"  :  8 ,
            "maxPrefillTokens"  :  2048 ,
            "prefillTimeMsPerReq"  :  150 ,
            "prefillPolicyType"  :  0 ,

            "decodeTimeMsPerReq"  :  50 ,
            "decodePolicyType"  :  0 ,

            "maxBatchSize"  :  8 ,
            "maxIterTimes"  :  1024 ,
            "maxPreemptCount"  :  0 ,
            "supportSelectBatch"  :  false ,
            "maxQueueDelayMicroseconds"  :  5000
        }
    }
}
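If you want the longer context mentioned above (I used 32k), the fields that bound sequence length are maxSeqLen and maxInputTokenLen in ModelDeployConfig, plus maxPrefillTokens and maxIterTimes in ScheduleConfig. Below is an illustrative sketch of such an edit on the host copy of config.json; the values are my own assumptions rather than figures from the official guide, and they need to fit your memory budget.

# Illustrative only: raise the context-related limits in the copied config.json on the host
python3 - <<'EOF'
import json

path = "config.json"  # the copy made with docker cp
with open(path) as f:
    cfg = json.load(f)

deploy = cfg["BackendConfig"]["ModelDeployConfig"]
sched = cfg["BackendConfig"]["ScheduleConfig"]

deploy["maxSeqLen"] = 32768          # total tokens per request (input + output), assumed value
deploy["maxInputTokenLen"] = 16384   # illustrative cap on input length
sched["maxPrefillTokens"] = 16384    # prefill budget; keep it >= maxInputTokenLen
sched["maxIterTimes"] = 16384        # upper bound on generated tokens, assumed value

with open(path, "w") as f:
    json.dump(cfg, f, indent=4, ensure_ascii=False)
EOF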

Start the service

The startup command is also relatively simple

cd /usr/local/Ascend/mindie/latest/mindie-service
nohup ./bin/mindieservice_daemon > /workspace/output.log 2>&1 &

It is best to start the service in the background like this so you can still check the log; otherwise, once the terminal is closed the service keeps running but the log is nowhere to be found, which makes debugging inconvenient.

After the command is executed, all the parameters used for this startup will be printed first, and then the following output will appear:

Daemon start success!

The service is considered to have started successfully.
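If you started the service with nohup as above, you can follow the log file from the startup command and wait for that line:

# Follow the service log; wait until "Daemon start success!" appears, then Ctrl-C
tail -f /workspace/output.log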

At this point, the deployment can be considered successful, and there is one last step of testing:

curl -X POST http://{ip}:{port}/v1/chat/completions \
     -H  "Accept: application/json"  \
     -H  "Content-Type: application/json"  \
     -d  '{
       "model": "DeepSeek-R1",
       "messages": [{
         "role": "user",
         "content": "Hello"
       }],
       "max_tokens": 20,
       "presence_penalty": 1.03,
       "frequency_penalty": 1.0,
       "seed": null,
       "temperature": 0.5,
       "top_p": 0.95,
       "stream": true
     }'


Note that HTTPS communication is not enabled in the official tutorial. Use http instead of https for subsequent calls.

Using HTTPS requires configuring the service certificate, private key, and the other certificate files needed to enable HTTPS communication.

If you see output like the above, the deployment is successful.

Finally, we need to adapt the OpenAI-style inference interface; refer to:

https://www.hiascend.com/document/detail/zh/mindie/10RC3/mindieservice/servicedev/mindie_service0076.html

A few closing notes:

The entire deployment process basically follows the official tutorial, but there are pitfalls at every step and all kinds of strange problems, mainly because the logs are hard to find. I searched through the directories in the container and found very little substantive error content; /root/mindie usually has something useful. It is also rare to find the corresponding problems discussed online. It would be great if the deployment posts had a discussion area to help people avoid these pitfalls.

In addition, some parts of the official tutorial are inconsistent (or perhaps I did something wrong), which is confusing. For example, the test directory confused me at first, and I had to check several times before finding it in a different location.

There were still some problems I could not troubleshoot, and in the end Huawei's engineers helped resolve them. I am very grateful for the quick support, and I hope domestic products keep getting better.

One deployment problem: the tokenizer fails to load. Solution: check whether the tokenizer.json file matches the official one, and check the permissions to make sure it can be read.


There was another problem (the error screenshot is omitted here) where the log alone did not make the cause clear. The final solution: upgrade the driver!

Similar problems can be caused by hccn. After upgrading the driver to 24.1.0, many issues simply went away. If you run into problems that are hard to solve, do not immediately doubt yourself.

CANN does not need to be upgraded; just upgrade the Ascend NPU firmware and driver.

Select the openEuler system. Note that you can select the 800I A2 inference server package here, which gives faster inference.

After downloading these two files, install them according to the normal process. 

https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/800alpha003/softwareinst/instg/instg_0004.html?Mode=PmIns&OS=Ubuntu&Software=cannToolKit