【AIOps】Integrating Prometheus/Nightingale with the DeepSeek Large Model
Written by
Clara Bennett
Updated on: July 9, 2025
Recommendation
A practical step forward for AIOps: monitoring and fault analysis are tightly integrated to improve operations efficiency.
Core content:
1. Combining Prometheus and Nightingale for efficient fault alerting
2. Integrating the DeepSeek large model to automatically analyze fault causes and suggest remedies
3. Alarm notification templates and media settings for an automated notification workflow
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Preface
I previously wrote about using Nightingale as the alerting engine for Prometheus. Building on that setup, this article integrates DeepSeek for fault analysis: it suggests possible causes and troubleshooting steps, which can greatly reduce the mean time to repair (MTTR).
Workflow: Prometheus periodically scrapes metrics from exporters and stores them locally. Nightingale periodically queries Prometheus to check whether the metrics match any fault alarm rule. When a rule fires, the fault information is sent to DeepSeek, which analyzes the likely causes and suggests how to handle them. The original alarm information is then combined with DeepSeek's analysis and sent to the user. If you do not use Nightingale, you can use a webhook directly to execute the script.
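For the webhook route, the incoming JSON first has to be turned into a text message that can be handed to the model. The following is a minimal sketch, assuming the webhook is sent by Prometheus Alertmanager, whose body carries an `alerts` list with `labels` and `annotations`. The function names `format_alert` and `webhook_to_message` and the sample alert are hypothetical and not part of the original script.

```python
import json


def format_alert(alert):
    """Render one Alertmanager alert as a single text line."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "firing")
    name = labels.get("alertname", "unknown")
    severity = labels.get("severity", "unknown")
    instance = labels.get("instance", "unknown")
    summary = annotations.get("summary") or annotations.get("description") or ""
    line = f"[{status}] {name} severity={severity} instance={instance} {summary}"
    return line.strip()


def webhook_to_message(body):
    """Turn a webhook JSON body (str or bytes) into text for the model."""
    payload = json.loads(body)
    return "\n".join(format_alert(a) for a in payload.get("alerts", []))


# Hypothetical sample body containing only the fields used above.
sample = json.dumps({
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {
            "alertname": "HighCPU",
            "severity": "critical",
            "instance": "10.0.0.1:9100",
        },
        "annotations": {"summary": "CPU usage above 90% for 5m"},
    }],
})

print(webhook_to_message(sample))
```

The resulting text could then be passed to a function such as `call_deepseek` in place of the Nightingale notification template.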
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json
import requests


class Sender(object):
    @classmethod
    def send_email(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_wecom(cls, payload):
        # already done in go code
        pass

    # DingTalk Robot
    DINGTALK_URL = "https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXX"
    # DeepSeek key
    DEEPSEEK_URL = "https://api.deepseek.com/v1/chat/completions"
    DEEPSEEK_KEY = "sk-XXXXXXXXXX"

    @classmethod
    def call_deepseek(cls, message):
        # Send the alarm text to DeepSeek and return its analysis as a string.
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {cls.DEEPSEEK_KEY}"
        }
        data = {
            "model": "deepseek-chat",
            "messages": [{
                "role": "user",
                "content": f"""
                Alarm message: {message}
                You are an expert in the field of operation and maintenance. Please analyze the alarm information and give possible causes, handling suggestions and urgency.
                Layout requirements: the AI fault analysis title should be blue h4 size, with concise language and emphasis on key points
                """
            }]
        }
        try:
            response = requests.post(cls.DEEPSEEK_URL, headers=headers, json=data)
            response.raise_for_status()
            return response.json()['choices'][0]['message']['content']
        except Exception as e:
            # Fall back to a fixed message so the alarm is still delivered.
            print(f"Deepseek API error: {str(e)}")
            return "Unable to obtain processing suggestions"

    @classmethod
    def send_dingtalk(cls, payload):
        # Append the DeepSeek analysis to the original alarm text.
        original_message = payload.get('tpls').get("dingtalk", "dingtalk not found")
        analysis = cls.call_deepseek(original_message)
        final_message = f"""{original_message}

---

{analysis}
"""
        headers = {"Content-Type": "application/json;charset=utf-8"}
        body = {
            "msgtype": "markdown",
            "markdown": {
                "title": "Alarm Notification",
                "text": final_message
            }
        }
        response = requests.post(cls.DINGTALK_URL, headers=headers, data=json.dumps(body))
        print(f"notify_dingtalk: status_code={response.status_code} response_text={response.text}")

    @classmethod
    def send_mm(cls, payload):
        # already done in go code
        pass

    @classmethod
    def send_sms(cls, payload):
        pass

    @classmethod
    def send_voice(cls, payload):
        pass


def main():
    payload = json.load(sys.stdin)
    # Keep a copy of the last payload for debugging.
    with open(".payload", 'w') as f:
        f.write(json.dumps(payload, indent=4))
    # Dispatch to Sender.send_<channel> for every notification channel.
    for ch in payload.get('event').get('notify_channels'):
        send_func_name = "send_{}".format(ch.strip())
        if not hasattr(Sender, send_func_name):
            print("function: {} not found".format(send_func_name))
            continue
        send_func = getattr(Sender, send_func_name)
        send_func(payload)


def hello():
    print("hello nightingale")


if __name__ == "__main__":
    if len(sys.argv) == 1:
        main()
    elif sys.argv[1] == "hello":
        hello()
    else:
        print("I am confused")
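The script reads only two parts of the payload it receives on stdin: `event.notify_channels`, which selects the `send_<channel>` method, and `tpls.dingtalk`, the rendered alarm text. The sketch below reproduces that dispatch logic against a recording stub so it can be tried without calling DingTalk or DeepSeek. `RecordingSender`, `dispatch`, the `pager` channel and the sample payload are hypothetical, and a real Nightingale payload contains many more fields.

```python
import json


class RecordingSender:
    """Hypothetical stand-in for Sender: records calls instead of sending."""
    calls = []

    @classmethod
    def send_dingtalk(cls, payload):
        cls.calls.append(("dingtalk", payload["tpls"]["dingtalk"]))


def dispatch(payload, sender):
    """Call sender.send_<channel> per channel; return channels with no handler."""
    missing = []
    for ch in payload.get("event", {}).get("notify_channels", []):
        name = ch.strip()
        func = getattr(sender, "send_{}".format(name), None)
        if func is None:
            missing.append(name)
            continue
        func(payload)
    return missing


# Minimal payload with only the fields the script actually reads.
payload = json.loads("""
{
  "event": {"notify_channels": ["dingtalk", " pager "]},
  "tpls": {"dingtalk": "#### CPU usage high on host-01"}
}
""")

missing = dispatch(payload, RecordingSender)
print(RecordingSender.calls)
print(missing)
```

Piping a payload of the same shape into the real script is a quick way to check the DingTalk and DeepSeek credentials before wiring it into Nightingale.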
DeepSeek needs some time to produce its analysis, so a timeout of 120s is recommended.
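If the 120s limit applies to the notification script as a whole (an assumption here), one option is to give the HTTP call a slightly shorter timeout of its own, so that the script still has time to send the original alarm with a fallback text when the model is slow. The sketch below illustrates this with an injectable `post` function. `analyze_with_fallback`, the 100s default and the fake `slow_post` are hypothetical. In real use `requests.post` could be passed as `post`, since it accepts the same `json` and `timeout` keyword arguments.

```python
FALLBACK = "Unable to obtain processing suggestions"


def analyze_with_fallback(post, url, payload, timeout=100, fallback=FALLBACK):
    """Call post(url, json=payload, timeout=timeout); return fallback on failure."""
    try:
        response = post(url, json=payload, timeout=timeout)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except Exception as exc:  # covers timeouts raised by the HTTP client
        print(f"DeepSeek API error: {exc}")
        return fallback


def slow_post(url, json=None, timeout=None):
    """Fake HTTP function that simulates a request exceeding its timeout."""
    raise TimeoutError(f"no response within {timeout}s")


result = analyze_with_fallback(
    slow_post,
    "https://api.deepseek.com/v1/chat/completions",
    {"model": "deepseek-chat", "messages": []},
)
print(result)
```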
4. Enable the AIOps notification medium in the alert rules
Go to Alarm Management --> Alarm rules and check the aiops notification medium on the existing rules.
5. Alarm test
Original alarm information
Alarm information after integrating the large model
Here we simply send the alarm information to the DeepSeek large model for analysis and have the result sent back to us. The analysis usually does not fully meet our expectations. The next step is "RAG" (retrieval-augmented generation): search an external knowledge base and feed the relevant content to the large model as part of the prompt, so that the analysis is closer to what we expect. Building an operations and maintenance knowledge base is therefore a very important part of AIOps construction. Once the "RAG" results meet our expectations, we will combine workflows and AI Agents to perform alarm convergence, fault self-healing and other operations. Related open-source tool: dify
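The RAG step described above can be illustrated with a toy example: pick the knowledge base entry that shares the most words with the alarm, then place it in the prompt ahead of the request for analysis. This is only a sketch under stated assumptions. The runbook entries, `retrieve` and `build_prompt` are hypothetical, and a real setup would more likely use embedding search or a tool such as dify than word overlap.

```python
import re

# Hypothetical runbook snippets standing in for a real O&M knowledge base.
KNOWLEDGE_BASE = [
    "Disk usage high: clean old logs and check the log rotation settings.",
    "CPU usage high: find the busiest process with top and check recent deployments.",
    "Memory usage high: look for leaking processes and review container limits.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, top_k=1):
    """Rank docs by the number of words shared with the query (toy retriever)."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(alert, docs):
    """Combine the alarm text and the retrieved entries into one prompt."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        f"Alarm message:\n{alert}\n\n"
        f"Relevant knowledge base entries:\n{context}\n\n"
        "Analyze the alarm using the entries above and give possible causes, "
        "handling suggestions and urgency."
    )


alert = "CPU usage on host-01 is above 90%"
prompt = build_prompt(alert, retrieve(alert, KNOWLEDGE_BASE))
print(prompt)
```

The prompt built this way would replace the plain alarm text that the script currently sends to the model.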