Exploring DeepSeek's application scenarios in operation and maintenance

Written by
Audrey Miles
Updated on: July 8, 2025
Recommendation

Good news for operation and maintenance engineers: DeepSeek puts AI to work on practical pain points.

Core content:
1. Log analysis: locate problems with one click and improve fault response speed
2. Fault prediction: early warning to reduce the risk of downtime during promotions
3. Automatic blame assignment: divide responsibility based on evidence and quickly locate the root cause of the problem

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

Applying DeepSeek to operation and maintenance is not about inventing a pile of "high-end" AI concepts. It is about directly solving the pain points that engineers curse every day.

Let’s talk about some practical application scenarios:

1. Log analysis: from "finding a needle in a haystack" to "locating with one click"

  • Pain point: In the middle of the night, the alarm group is flooded with 1,000 log entries, all marked "ERROR", and nobody knows which one is the real culprit.
  • What DeepSeek does:
    • Automatically labels logs by category, such as "database crash", "code error", and "network failure".
    • Real case: After a game company launched a new version, the system crashed frequently. Checking the logs used to take 5 people 3 hours. Now the system directly flags "Redis connection pool exhausted", and the problem can be fixed in 10 minutes.
    • Core technology: NLP model (like having ChatGPT read the logs) + historical fault library matching.
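
The labeling step can be illustrated with a minimal sketch. It assumes simple keyword rules as a stand-in for the NLP model; the patterns and sample log lines below are hypothetical and are not taken from any DeepSeek product.

```python
import re
from collections import Counter

# Hypothetical keyword rules standing in for a trained NLP model.
CATEGORY_PATTERNS = {
    "database crash": re.compile(r"redis|mysql|connection pool|deadlock", re.I),
    "network failure": re.compile(r"connection refused|dns|unreachable|timed? ?out", re.I),
    "code error": re.compile(r"exception|traceback|nullpointer|stack trace", re.I),
}


def classify(line):
    """Return the first category whose pattern matches, else 'unknown'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(line):
            return category
    return "unknown"


def summarize(lines):
    """Count log lines per category, most frequent first."""
    return Counter(classify(line) for line in lines).most_common()


# Illustrative log lines, not real production data.
logs = [
    "ERROR Redis connection pool exhausted",
    "ERROR Redis connection pool exhausted",
    "ERROR NullPointerException in OrderService",
    "ERROR host 10.0.0.5 unreachable",
]
print(summarize(logs))
```

Sorting by frequency puts the dominant category at the top, which is the "locate with one click" idea in miniature: the engineer reads one summary line instead of a thousand raw entries.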

2. Fault prediction: from “firefighters” to “preemptive mine removal”

  • Pain point: Every big promotion brings the system down, and the operation and maintenance staff can only stay up all night keeping watch and hoping for luck, like buying lottery tickets.
  • What DeepSeek does:
    • Analyzes historical monitoring data (CPU, memory, slow queries) and issues a warning 48 hours in advance, such as "the database cannot handle the Double 11 traffic."
    • Real effect: An e-commerce company expanded its MySQL cluster in advance, had zero failures during the promotion period, and hired three fewer temporary operation and maintenance staff.
    • Core technology: time series prediction algorithm (similar to stock K-line analysis) + business traffic correlation analysis.
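
As a rough illustration of early warning from monitoring data, the sketch below fits a straight line to hourly CPU samples and estimates how many hours remain before a threshold is crossed. A linear trend is an assumption chosen for brevity, not the prediction algorithm the article refers to, and the sample values and the 90% threshold are hypothetical.

```python
def fit_line(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def hours_until(values, threshold):
    """Hours after the last sample until the fitted trend reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line has already reached the threshold.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


cpu_percent = [40, 42, 44, 46, 48, 50]  # hypothetical hourly samples
eta = hours_until(cpu_percent, threshold=90)
if eta is not None and eta <= 48:
    print(f"warning: CPU trend reaches 90% in about {eta:.0f} hours")
```

A real forecast would also need to account for seasonality and the promotion traffic itself, which is where the business traffic correlation analysis comes in.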

3. Automatic blame assignment: from “arguing” to “dividing responsibility with evidence”

  • Pain point: The system crashes, the development, operations, and network teams blame each other, and a two-hour meeting ends without a conclusion.
  • What DeepSeek does:
    • Automatically generates a "responsibility report" based on the log timeline and service call relationships:
      • Root cause: The order service code did not handle the Redis timeout.
      • Collateral impact: The payment service was dragged down by the retry mechanism.
    • Real case: A bank reduced its troubleshooting time from 3 days to 20 minutes.
    • Core technology: call chain analysis + root cause location algorithm (similar to solving a criminal case).
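
One simple way to combine a call graph with a log timeline is sketched below. It assumes a heuristic: a failing service is a root cause candidate if none of the services it calls are failing, and every other failing service is collateral, ordered by first error time. The service names, call graph, and timestamps are hypothetical.

```python
def find_root_causes(calls, first_error):
    """Failing services none of whose direct dependencies are failing."""
    return sorted(
        svc for svc in first_error
        if not any(dep in first_error for dep in calls.get(svc, []))
    )


def build_report(calls, first_error):
    """Split failing services into root causes and collateral impact."""
    roots = find_root_causes(calls, first_error)
    collateral = sorted(
        (svc for svc in first_error if svc not in roots),
        key=lambda svc: first_error[svc],  # earliest victim first
    )
    return {"root_cause": roots, "collateral_impact": collateral}


# Hypothetical call graph: caller -> list of services it calls.
calls = {
    "gateway": ["payment-service"],
    "payment-service": ["order-service"],
    "order-service": ["redis"],
}
# Hypothetical first ERROR timestamp per failing service (HH:MM:SS).
first_error = {
    "order-service": "02:13:01",
    "payment-service": "02:13:05",
    "gateway": "02:13:09",
}
report = build_report(calls, first_error)
print(report)
```

Because Redis itself logged no errors in this example, the order service is the deepest failing node in the chain, which matches the kind of conclusion the responsibility report is meant to reach.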

4. Cost optimization: from “buying servers blindly” to “precise cost saving”

  • Pain point: Server resources are either overloaded or sitting idle, and the boss complains every day about wasted money.
  • What DeepSeek does:
    • Analyzes business traffic patterns and automatically adjusts the number of cloud servers:
      • More machines are started during the daytime traffic peak, and the fleet is scaled down to the minimum at night.
    • Real data: A video company saves 20 million in server costs each year.
    • Core technology: elastic scaling algorithm + multi-cloud price comparison (automatically choosing whichever of AWS or Alibaba Cloud is cheaper).
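
The scaling decision can be sketched as a small calculation. This is a minimal example under stated assumptions: each instance handles a fixed number of requests per second, the fleet has a floor and a ceiling, and the hourly prices are made-up placeholders rather than real AWS or Alibaba Cloud pricing.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=2, max_instances=50):
    """Instances needed for the current load, clamped to the allowed range."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


def cheapest_provider(hourly_price):
    """Name of the provider with the lowest hourly price."""
    return min(hourly_price, key=hourly_price.get)


# Hypothetical prices per instance-hour, for illustration only.
prices = {"aws": 0.10, "alibaba-cloud": 0.08}

print(desired_instances(4500, capacity_per_instance=500))  # daytime peak
print(desired_instances(100, capacity_per_instance=500))   # night-time minimum
print(cheapest_provider(prices))
```

The floor keeps a minimum of redundancy at night, and the ceiling caps spending if a traffic spike or a bad metric would otherwise request an unbounded number of machines.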

5. New employee training: from “hands-on teaching” to “AI training”

  • Pain point: New employees do not yet understand the system architecture, so veteran employees end up acting as a help desk every day.
  • What DeepSeek does:
    • Builds an "Operation and Maintenance Knowledge Base Q&A Robot":
      • Q: "What should I do if the order service is down?" → Automatic reply: "1. Check the MySQL connection pool 2. Check the gateway rate-limit configuration..."
    • Real effect: At a large company, the time for new employees to learn to handle faults independently dropped from 3 months to 2 weeks.
    • Core technology: knowledge graph + fault case library retrieval.
  • Effect example:

Newbie: What to do if MySQL connection fails?  
AI:  
1. Check the whitelist: /etc/mysql/allowlist.conf  
2. Check the connection pool configuration: spring.datasource.max-active=50  
3. Similar historical issues: 2023-07-01 due to firewall interception (Ticket #12345)  
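
Retrieval from a fault case library can be approximated with a few lines of code. The sketch below assumes plain word overlap (Jaccard similarity) between the question and each case title, which is far simpler than a knowledge graph; the case entries are hypothetical and only echo the examples above.

```python
import re

# Hypothetical fault case library; entries are illustrative only.
CASES = [
    {
        "title": "MySQL connection fails",
        "steps": ["Check the whitelist",
                  "Check the connection pool configuration"],
    },
    {
        "title": "Order service is down",
        "steps": ["Check the MySQL connection pool",
                  "Check the gateway rate-limit configuration"],
    },
]


def tokens(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(question, cases=CASES):
    """Case whose title shares the most words with the question, or None."""
    q = tokens(question)

    def score(case):
        t = tokens(case["title"])
        return len(q & t) / len(q | t) if q | t else 0.0

    top = max(cases, key=score)
    return top if score(top) > 0 else None


answer = best_match("What to do if MySQL connection fails?")
if answer:
    for number, step in enumerate(answer["steps"], start=1):
        print(f"{number}. {step}")
```

Returning None when nothing overlaps matters in practice: a newcomer is better served by "no matching case, ask a colleague" than by a confidently wrong checklist.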

6. Security operation and maintenance: from “being scolded for fixing vulnerabilities” to “effortless fixes”

  • Pain point: Fixing a vulnerability means shutting the system down first, and the business side is furious with the operation and maintenance staff.
  • What DeepSeek does:
    • Automatically detects vulnerabilities and schedules grayscale (canary) updates during off-peak hours:
      • For example, K8s nodes can be patched automatically at 3 a.m. without the business noticing.
    • Real case: A government cloud fixed a Log4j vulnerability. The traditional approach required a 2-hour service shutdown, but a rolling update now completes it in 10 minutes.
    • Core technology: vulnerability impact analysis + intelligent scheduling algorithm.
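
The scheduling idea can be sketched in two small steps: pick the quietest window from historical hourly traffic, then split the nodes into batches so that only a few are patched at a time. The traffic figures and node names are hypothetical, and the sketch only plans the work; it does not call Kubernetes.

```python
def quietest_window(hourly_requests, window_hours=1):
    """Start hour of the contiguous window with the least total traffic.

    The window may wrap around midnight.
    """
    n = len(hourly_requests)

    def load(start):
        return sum(hourly_requests[(start + i) % n] for i in range(window_hours))

    return min(range(n), key=load)


def rolling_batches(nodes, batch_size):
    """Split nodes into consecutive batches for a rolling update."""
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


# Hypothetical average requests per hour, index 0 = midnight.
traffic = [300, 200, 120, 80, 100, 180, 400, 900,
           1500, 1800, 2000, 2100, 2200, 2100, 2000, 1900,
           1800, 1900, 2200, 2500, 2300, 1500, 800, 500]
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]

start_hour = quietest_window(traffic, window_hours=2)
for batch in rolling_batches(nodes, batch_size=2):
    print(f"patch at {start_hour:02d}:00 ->", batch)
```

Patching in batches is what lets the remaining nodes keep serving traffic, so the update can proceed without a full service shutdown.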

The essence

Whether these solutions can actually be implemented depends on "using AI to solve small problems" rather than "disrupting operation and maintenance":

  • Do not pursue 100% accuracy: if log classification covers 80% of common problems, that already saves a lot of effort and is good enough.
  • Fit the existing tool chain: ELK, Prometheus, and K8s are natively supported, so there is no need to reinvent the wheel.
  • Let engineers lead the design: operation and maintenance staff define their own rules (such as "which operations require manual confirmation"). AI only assists, and people remain in charge.

What is the actual effect?

  • Troubleshooting time drops from an average of 2 hours to 15 minutes
  • Server costs are reduced by 40%
  • Newcomers can work independently in 3 days

What kind of AI do we operation and maintenance engineers need?

  • No bragging, just doing the dirty and hard work.

What are our expectations?

  • No more manually searching logs → AI classifies them
  • No more arguing over who takes the blame → AI directly issues a responsibility report
  • No more wasted server purchases → AI budgets more accurately than accounting
  • No more staying up late to fix bugs → AI quietly fixes them on its own