Recommending an AIOps platform that supports the DeepSeek model

Written by
Audrey Miles
Updated on: July 16, 2025
Recommendation

Explore a new kind of AIOps platform and improve the efficiency of operations work.

Core content:
1. Introduction to the Keep platform: an open source, AI-driven monitoring and alerting solution
2. Core AIOps functions: anomaly detection, root cause analysis, alert noise reduction, automated repair
3. Keep's architecture: details of the data collection, processing and storage, AI engine, and alert notification modules

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)


Introduction

Keep is an open source, AI-driven monitoring and alerting platform that aims to simplify operations work through automation and intelligent analysis, helping teams manage and monitor complex infrastructure and applications more efficiently. It combines traditional monitoring tools with modern AI techniques to implement the core functions of AIOps (artificial intelligence for IT operations): anomaly detection, root cause analysis, alert noise reduction, and automated repair.

Keep's core goal is to use AI to reduce the burden on operations teams, improve the reliability and maintainability of systems, and lower the risk of false positives and false negatives. It also supports the DeepSeek model.

Architecture

Keep's architecture is built around AI-driven monitoring and alerting and is divided into the following core modules:

  1. Data Collection Layer:

    • Collects data from multiple monitoring tools and log systems, such as Prometheus, Grafana, Datadog, AWS CloudWatch, and Elasticsearch.

    • Provides a flexible plug-in mechanism that makes it easy to integrate new data sources.

  2. Data Processing & Storage Layer:

    • Cleans, standardizes, and aggregates the collected data.

    • Supports multiple storage backends, such as Elasticsearch, InfluxDB, and PostgreSQL, for storing historical and real-time data.

  3. AI Engine:

    • Anomaly detection: automatically detects unusual behavior in the data using machine learning techniques such as time series analysis, clustering, and deep learning.

    • Root cause analysis: quickly locates the root cause of a problem through causal inference and correlation analysis.

    • Alert noise reduction: classifies and prioritizes alerts to reduce false alerts and duplicate alerts.

    • Predictive analytics: predicts future system behavior from historical data to identify potential problems in advance.

  4. Alerting & Notification Layer:

    • Generates alerts based on the analysis results of the AI engine.

    • Supports multiple notification channels, such as Slack, email, PagerDuty, and webhooks.

  5. Automation Layer:

    • Provides automation scripts and an action framework to support automatic repair of detected issues.

    • For example, it can automatically restart services, scale up resources, or clean up logs.

  6. Visualization and User Interface (UI & Dashboard):

    • Provides intuitive dashboards and charts that display monitoring data and AI analysis results.

    • Supports custom dashboards and reports, so users can adjust views to their needs.

  7. API Gateway:

    • Provides a RESTful API for easy integration with other systems.

    • Supports calls from automation scripts and third-party tools.
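
To make the plug-in idea in the data collection and processing layers more concrete, here is a minimal Python sketch of how events from different sources might be normalized into one common alert record. It is not Keep's actual provider API: the `Alert` class, the `register` decorator, the `normalize` function, and the simplified payload shapes (including the `Host` field) are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Alert:
    """Hypothetical common schema that every source is normalized into."""
    source: str
    name: str
    severity: str
    host: str


# Registry of hypothetical source plug-ins: source name -> parser function.
PARSERS: Dict[str, Callable[[dict], Alert]] = {}


def register(source: str):
    """Decorator that adds a parser to the registry under `source`."""
    def decorator(func: Callable[[dict], Alert]) -> Callable[[dict], Alert]:
        PARSERS[source] = func
        return func
    return decorator


@register("prometheus")
def parse_prometheus(payload: dict) -> Alert:
    # Assumed, simplified payload: a dict with a "labels" mapping.
    labels = payload["labels"]
    return Alert(
        source="prometheus",
        name=labels["alertname"],
        severity=labels.get("severity", "info"),
        host=labels.get("instance", "unknown"),
    )


@register("cloudwatch")
def parse_cloudwatch(payload: dict) -> Alert:
    # Assumed, simplified payload; "Host" is a made-up field for this sketch.
    severity = "critical" if payload["NewStateValue"] == "ALARM" else "info"
    return Alert(
        source="cloudwatch",
        name=payload["AlarmName"],
        severity=severity,
        host=payload.get("Host", "unknown"),
    )


def normalize(source: str, payload: dict) -> Alert:
    """Look up the plug-in for `source` and convert the payload to an Alert."""
    try:
        parser = PARSERS[source]
    except KeyError:
        raise ValueError(f"no plug-in registered for source {source!r}") from None
    return parser(payload)
```

In this sketch, supporting a new data source only requires writing one more decorated parser function; the downstream layers see nothing but `Alert` records.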


Main application scenarios

  1. Anomaly Detection:

    • Uses time series analysis algorithms (such as ARIMA or Prophet) or deep learning models (such as LSTM) to detect unusual behavior in metrics.

    • For example, detecting sudden spikes in CPU usage or abnormal increases in request latency.

  2. Root Cause Analysis:

    • Quickly locates the root cause of a problem through causal inference and correlation analysis.

    • For example, when database response time increases, it automatically analyzes whether the increase is related to network latency, disk I/O, or query load.

  3. Alert Noise Reduction:

    • Uses classification algorithms (such as random forest and SVM) to classify alerts and filter out low-priority ones.

    • For example, repeated alerts or known issues can be marked as "handled" to reduce interruptions for the operations team.

  4. Predictive Analytics:

    • Predicts future system behavior from historical data and identifies potential problems in advance.

    • For example, predicting that disk space will run out within the next 24 hours and issuing an alert in advance.

  5. Automated Repair:

    • Uses a rules engine and scripting framework to automatically fix detected issues.

    • For example, when a service is detected as unavailable, it automatically restarts the service or switches to a standby node.
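
Two of the scenarios above, CPU spike detection and the 24-hour disk space forecast, can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not Keep's implementation: instead of ARIMA, Prophet, or an LSTM, it uses a trailing-window z-score for spikes and a least-squares line for the forecast. The function names, the window size, and the thresholds are assumptions made for this example.

```python
from statistics import mean, stdev
from typing import List, Optional


def find_spikes(values: List[float], window: int = 5,
                threshold: float = 3.0, min_sigma: float = 1.0) -> List[int]:
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` samples.

    `min_sigma` is a floor on the standard deviation so that tiny
    wiggles on an almost flat series are not flagged.
    """
    spikes = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def hours_until_full(used_pct: List[float],
                     capacity: float = 100.0) -> Optional[float]:
    """Fit a least-squares line to hourly usage samples and return the
    estimated hours from the last sample until `capacity` is reached.
    Returns None if there are too few samples or usage is not growing.
    """
    n = len(used_pct)
    if n < 2:
        return None
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(used_pct)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, used_pct))
             / sum((x - x_bar) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    fitted_now = intercept + slope * (n - 1)
    return max(0.0, (capacity - fitted_now) / slope)


cpu = [20, 22, 21, 19, 20, 21, 95, 22, 20]   # percent, one sample per minute
disk = [70, 72, 74, 76]                       # percent used, one sample per hour

spike_indices = find_spikes(cpu)
remaining = hours_until_full(disk)
warn_early = remaining is not None and remaining < 24
```

With the sample data, the CPU reading of 95 is the only flagged point, and the disk series grows by 2 percentage points per hour, so the forecast falls inside the 24-hour warning window. A production system would need seasonality handling and more robust statistics than this sketch provides.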


Keep implements the core capabilities of AIOps through:

  1. Data-driven analysis:

    • Collects and analyzes large volumes of monitoring data, logs, and metrics to provide the basis for training AI models and running inference.

  2. Machine learning and deep learning:

    • Automatically detects anomalies, analyzes root causes, and predicts future behavior using machine learning algorithms and deep learning models.

  3. Automation and orchestration:

    • Provides automation scripts and an action framework to support automatic repair of detected issues.

  4. Intelligent alert management:

    • Uses AI to classify, filter, and prioritize alerts, reducing false positives and duplicate alerts.

  5. Continuous optimization:

    • Continuously refines the AI models through feedback mechanisms to improve the accuracy of anomaly detection and root cause analysis.
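
The alert management and feedback ideas in this list can be illustrated with a small Python sketch. It is rule-based rather than a trained classifier, and it is not Keep's code: duplicates are merged by a fingerprint, alerts that operators have marked as known noise are dropped, and the remainder is ordered by severity. The `fingerprint` and `reduce_noise` functions, the `SEVERITY_RANK` table, and the alert dictionary fields are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Lower rank means higher priority; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}

Fingerprint = Tuple[str, str]


def fingerprint(alert: dict) -> Fingerprint:
    """Identify duplicates by alert name and host (an assumed rule)."""
    return (alert["name"], alert["host"])


def reduce_noise(alerts: Iterable[dict], muted: Set[Fingerprint]) -> List[dict]:
    """Drop muted alerts, merge duplicates, and sort by severity.

    `muted` holds fingerprints that operators marked as handled or as
    known issues, standing in for a feedback mechanism.
    """
    grouped: Dict[Fingerprint, dict] = {}
    for alert in alerts:
        key = fingerprint(alert)
        if key in muted:
            continue
        if key in grouped:
            grouped[key]["count"] += 1
        else:
            grouped[key] = {**alert, "count": 1}
    return sorted(
        grouped.values(),
        key=lambda a: SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)),
    )


incoming = [
    {"name": "HighCPU", "host": "web-1", "severity": "warning"},
    {"name": "DiskFull", "host": "db-1", "severity": "critical"},
    {"name": "HighCPU", "host": "web-1", "severity": "warning"},
    {"name": "CertExpiry", "host": "lb-1", "severity": "info"},
]
muted_by_operators = {("CertExpiry", "lb-1")}

queue = reduce_noise(incoming, muted_by_operators)
```

Here four incoming alerts shrink to two queue entries: the critical `DiskFull` alert first, then a single `HighCPU` entry with a count of 2. In a learning-based system, the muted set would be replaced by a model retrained on operator feedback.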

Summary

Keep is an AIOps platform that uses AI to implement anomaly detection, root cause analysis, alert noise reduction, and automated repair. It applies to a wide range of complex infrastructures and application scenarios, helping operations teams manage and monitor systems more efficiently and improve system reliability and maintainability. Whether the target is an e-commerce platform, a financial system, or a fleet of IoT devices, Keep can provide intelligent monitoring that helps users find and resolve problems quickly and keep the business running stably.


Address
Project address: https://github.com/keephq/keep