【AIOps】Prometheus/Nightingale access to DeepSeek large model

Written by
Clara Bennett
Updated on:July-09th-2025
Recommendation

AIOps practice new breakthrough, deep integration of monitoring and fault analysis, improve operation and maintenance efficiency.

Core content:
1. Prometheus and Nightingale are combined to achieve efficient fault alarm
2. DeepSeek large model access, automatic analysis of fault causes and treatment suggestions
3. Alarm notification template and media settings to achieve automated notification process

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

Preface

I have previously written about using Nightingale as Prometheus Articles on the alarm engine, based on which access DeepSeek Conduct fault analysis, provide possible causes of faults and troubleshooting methods, and greatly reduce the mean troubleshooting time (MTTR).

Workflow:PrometheusPeriodic CollectionExporterThe indicators are stored locally and Nightingale queries them periodically.PrometheusWhether the indicators in the system meet the fault alarm rules, the fault information will be sent to DeepSeek ,DeepSeekBy analyzing the fault causes and handling suggestions, splicing the plaintiff's alarm information andDeepSeekThe analysis results are sent to the user. For those who do not use Nightingale, you can directly usewebhookto execute the script.

"flow chart"


1. Create a new notification template

Alarm Notification --> Notification templates --> Added, Addedaiopstemplate

#### {{if .IsRecovered}}<font color="#008800">? {{.RuleName}}Recovered</font>{{else}}<font color="#FF0000">? {{.RuleName}}Alarm</font>{{end}}

---
**Level Status** : {{if .IsRecovered}} < font color = "#008800" >  S{{.Severity}} </ font > {{else}} < font color = "#FF0000" >  S{{.Severity}} </ font > {{end}}     
{{if eq (index .TagsMap "job") "web_status"}}   
**Owned by company** : {{index .TagsMap "company"}}   
**Project name** : {{index .TagsMap "project_cn"}}   
**System name** : {{index .TagsMap "name"}}   
{{if .IsRecovered}} **Recovered content** : {{index .TagsMap "name"}} Currently recovered to normal!   
{{else}} **Warning content** : {{index .TagsMap "name"}} is currently inaccessible!   
{{end}}
**System address** : [ {{index .TagsMap "instance"}} ]( {{index .TagsMap "instance"}} )   
{{end}}
{{if .IsRecovered}} **Trigger time** : {{timeformat .FirstTriggerTime}}   
**Recovery time** : {{timeformat .LastEvalTime}}{{else}} **Trigger time** : {{timeformat .FirstTriggerTime}}{{end}}     

2. Create a new notification medium

Alarm Notification --> Notification Settings --> Notification Medium --> Add to

  • name:aiops
  • Logo:aiops

3. Configure the notification script

Alarm Notification --> Notification Settings --> Notification Script --> Using Scripts

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import  sys
import  json
import  requests

class Sender (object) : 
    @classmethod
    def send_email (cls, payload) : 
        # already done in go code
        pass

    @classmethod
    def send_wecom (cls, payload) : 
        # already done in go code
        pass
# DingTalk Robot
    DINGTALK_URL =  "https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXX"
# DeepSeek key
    DEEPSEEK_URL =  "https://api.deepseek.com/v1/chat/completions"
    DEEPSEEK_KEY =  "sk-XXXXXXXXXX"

    @classmethod
    def call_deepseek (cls, message) : 
        headers = {
            "Content-Type""application/json" ,
            "Authorization"f"Bearer  {cls.DEEPSEEK_KEY} "
        }
        
        data = {
            "model""deepseek-chat" ,
            "messages" : [{
                "role""user" ,
                "content"f"""
Warning message: {message} 
You are an expert in the field of operation and maintenance. Please analyze the alarm information and give possible causes, handling suggestions and urgency.
Layout requirements: AI fault analysis title should be blue h4 size, with concise language and emphasis on key points
"""

            }]
        }

        try :
            response = requests.post(cls.DEEPSEEK_URL, headers=headers, json=data)
            response.raise_for_status()
            return  response.json()[ 'choices' ][ 0 ][ 'message' ][ 'content' ]
        except  Exception  as  e:
            print( f"Deepseek API error:  {str(e)} " )
            return "Unable to obtain processing suggestions"

    @classmethod
    def send_dingtalk (cls, payload) : 
        original_message = payload.get( 'tpls' ).get( "dingtalk""dingtalk not found" )
        analysis = cls.call_deepseek(original_message)
        
        final_message =  f""" {original_message}

---
{analysis}
"""

        
        headers = {
            "Content-Type""application/json;charset=utf-8"
        }
        
        body = {
            "msgtype""markdown" ,
            "markdown" : {
                "title""Alarm Notification" ,
                "text" : final_message
            }
        }

        response = requests.post(cls.DINGTALK_URL, headers=headers, data=json.dumps(body))
        print( f"notify_dingtalk: status_code= {response.status_code}  response_text= {response.text} " )

    @classmethod
    def send_mm (cls, payload) : 
        # already done in go code
        pass

    @classmethod
    def send_sms (cls, payload) : 
        pass

    @classmethod
    def send_voice (cls, payload) : 
        pass

def main () : 
    payload = json.load(sys.stdin)
    with  open( ".payload"'w'as  f:
        f.write(json.dumps(payload, indent= 4 ))
    for  ch  in  payload.get( 'event' ).get( 'notify_channels' ):
        send_func_name =  "send_{}" .format(ch.strip())
        if not  hasattr(Sender, send_func_name):
            print( "function: {} not found" , send_func_name)
            continue
        send_func = getattr(Sender, send_func_name)
        send_func(payload)

def hello () : 
    print( "hello nightingale" )

if  __name__ ==  "__main__" :
    if  len(sys.argv) ==  1 :
        main()
    elif  sys.argv[ 1 ] ==  "hello" :
        hello()
    else :
        print( "I am confused" )

  • It takes a certain amount of time for DeepSeek to analyze the answer. The recommended timeout is 120s.

4. Alert rules enable AIOps notification method

Alarm Management --> Alarm rules , the original rules are checkedaiopsNotification Medium


5. Alarm test

  • Plaintiff police information

  • Alarm information after accessing the large model

Here we simply send the alarm information to DeepSeek The big model performs the analysis and sends the results to us. The results of the analysis usually cannot fully meet our expectations. The next step is to perform "RAG" (retrieval enhanced generation), which searches the external knowledge base and inputs the relevant content in the knowledge base as prompts to the big model, so as to give analysis results that are more in line with our expectations.AIOpsDuring the construction process, the construction of the operation and maintenance knowledge base will be very important. After the results of "RAG" meet our expectations, we will combine the workflow,AI AgentThe method is to perform alarm convergence, fault self-healing and other operations. Related open source tools--- dify