Deployment and maintenance of the SRE-specific large model

Written by
Caleb Hayes
Updated on: June 29th, 2025
Recommendation

A new tool for SRE (Site Reliability Engineering) is here: a 7B-parameter domain-specific large model that helps automate operations and maintenance work.

Core content:
1. Introduction to a 7B-parameter SRE-specific large model based on the DeepSeek architecture
2. Guide to setting up the deployment environment on an Alibaba Cloud GPU server
3. Steps for deploying the Docker image and installing dependent components

A while ago, a former colleague told me that he had fine-tuned a large model dedicated to the SRE field (a 7B-parameter model based on the DeepSeek architecture, fine-tuned with LoRA and designed specifically for operations and maintenance tasks; it has three enhanced capabilities: automated script generation, system monitoring and analysis, and troubleshooting with root-cause location). I planned to deploy it and try it out over the two-day holiday.
Preparation
  • We chose an Alibaba Cloud GPU server as the deployment environment, because a local Mac cannot run the model.
  • Recommended configuration: a system disk of at least 100 GB and 60 GB of memory (a quick check is sketched right after this list).
  • Dependent components are packaged and installed through a Docker image; the component versions can be read from the image tag used below (vLLM 0.8.2, PyTorch 2.6, CUDA 12.4).
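Before installing anything, it is worth confirming that the instance actually meets these numbers; a minimal check with standard Linux tools:

# Confirm the system disk has at least 100 GB
df -h /
# Confirm total memory is around 60 GB
free -g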

Installation Steps
1. Create a GPU instance and install the Tesla driver

The following GPU instance families are supported:

  • gn6e, ebmgn6e

  • gn7i, ebmgn7i, ebmgn7ix

  • gn7e, ebmgn7e, ebmgn7ex

  • ebmgn8v, ebmgn8is

Image: this walkthrough uses the Ubuntu 20.04 operating system as an example. To use the vLLM container image on a GPU instance, the Tesla driver (version 535 or higher) must already be installed on the instance; we recommend selecting the Install GPU Driver option when purchasing the GPU instance in the ECS console. A quick way to verify the driver is shown below.
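A minimal driver check with nvidia-smi, which is installed together with the Tesla driver:

# Print GPU model and driver version; the driver column should read 535 or higher
nvidia-smi --query-gpu=name,driver_version --format=csv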


2. Remotely connect to the GPU instance. We recommend Warp, an AI-powered terminal (strongly recommended).
3. Execute the following commands to install the Docker environment.

sudo apt-get update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

4. Execute the following command to check whether Docker was installed successfully.

docker -v

5. Execute the following command to install the nvidia-container-toolkit.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
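Note that on a fresh install, the toolkit usually also has to be registered as a Docker runtime before --gpus all works; a minimal sketch (the CUDA image tag is an assumption, any CUDA base image will do):

# Register the NVIDIA runtime with Docker (writes /etc/docker/daemon.json)
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker so the change takes effect (step 6 below restarts it as well)
sudo systemctl restart docker
# Sanity check: nvidia-smi should list the GPUs from inside a container
sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu20.04 nvidia-smi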


6. Set Docker to start on boot and restart the Docker service.

sudo systemctl enable docker 
sudo systemctl restart docker
7. Execute the following command to check whether Docker is running.
sudo systemctl status docker

8. Execute the following command to pull the vLLM image.

sudo docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm:0.8.2-pytorch2.6-cu124-20250328

9. Execute the following command to run the vLLM container.

sudo docker run -d -t --net=host --gpus all \
  --privileged \
  --ipc=host \
  --name vllm \
  -v /root:/root \
  egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm:0.8.2-pytorch2.6-cu124-20250328

10. Execute the following command to check whether the vLLM container started successfully.

docker ps
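If the vllm container does not show up, its logs usually explain why; a quick check (the container name vllm comes from step 9):

# Show only the vllm container and its status
sudo docker ps --filter name=vllm --format '{{.Names}}: {{.Status}}'
# Inspect the most recent log lines if it exited or keeps restarting
sudo docker logs --tail 50 vllm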
Verification steps
1. Install the git-lfs command and download the large model locally.

apt install git-lfs
cd /root
git lfs clone https://www.modelscope.cn/phpcool/DeepSeek-R1-Distill-SRE-Qwen-7B.git
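Cloning pulls several multi-gigabyte weight files through Git LFS, so it is worth verifying the download completed; a rough sanity check (the exact file layout of this repository is an assumption, typical of safetensors checkpoints):

# Total size on disk; a 7B checkpoint in 16-bit weights is typically ~14-15 GB
du -sh /root/DeepSeek-R1-Distill-SRE-Qwen-7B
# Weight files of only a few hundred bytes are unfetched LFS pointers, not real weights
ls -lh /root/DeepSeek-R1-Distill-SRE-Qwen-7B/*.safetensors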
2. Enter the vLLM container
docker exec -it vllm bash
3. Start the vLLM inference service
vllm serve /root/DeepSeek-R1-Distill-SRE-Qwen-7B --tensor-parallel-size 1 --max-model-len 2048 --enforce-eager
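Here --tensor-parallel-size 1 runs the model on a single GPU, --max-model-len 2048 caps the context window, and --enforce-eager disables CUDA graph capture to save GPU memory at some throughput cost. Loading the weights takes a while; one way to wait for readiness is to poll the OpenAI-compatible /v1/models endpoint (a minimal sketch, assuming the default port 8000):

# Poll until the API server responds; vLLM serves an OpenAI-compatible API on port 8000 by default
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  echo "waiting for vLLM to finish loading the model..."
  sleep 5
done
echo "vLLM inference service is ready"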

Once the startup log shows the API server listening (port 8000 by default), the vLLM inference service has started.



4. Test it with curl on the GPU server.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/DeepSeek-R1-Distill-SRE-Qwen-7B",
    "messages": [
      {"role": "system", "content": "You are an intelligent operation and maintenance assistant."},
      {"role": "user", "content": "How to optimize the storage performance of the server to increase data reading and writing speed?"}
    ]
  }'
The service returns an OpenAI-compatible JSON chat completion containing the model's answer.
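To keep only the assistant's reply instead of the full JSON, the same request can be piped through jq (a sketch; jq may first need apt install jq):

# Same request as above, printing only the generated answer text
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/root/DeepSeek-R1-Distill-SRE-Qwen-7B",
    "messages": [
      {"role": "user", "content": "How to optimize the storage performance of the server to increase data reading and writing speed?"}
    ]
  }' | jq -r '.choices[0].message.content'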
Summary
The large model is now deployed. Next, we will replay real production logs to simulate online failures and test the model's ability to identify their root causes. We will share more in a future post.