Is now the right time for enterprises to adopt large-model-driven intelligent operation and maintenance?

Written by
Audrey Miles
Updated on: June 19, 2025
Recommendation

Is an intelligent operation and maintenance transformation right for your enterprise? This article examines how to choose and put into practice intelligent operation and maintenance driven by large models.

Core content:
1. Limitations and challenges of traditional IT operation and maintenance
2. Advantages and application scenarios of big model intelligent operation and maintenance
3. Considerations and practical suggestions for enterprises to introduce intelligent operation and maintenance

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)


Traditional IT operation and maintenance usually relies on manual monitoring and rule-driven methods to identify problems, troubleshoot, and optimize services. As enterprise IT infrastructure grows more complex, these methods increasingly show their shortcomings: inefficiency, delayed response, and difficulty processing massive amounts of data. Operation and maintenance work spans the system environment, the business environment, business programs, and more, and most enterprises now run distributed systems, microservice architectures, and multi-cloud environments, all of which pose major challenges. Traditional methods not only struggle to capture the full operating status of a system, but can also lead to more serious downtime or incidents through human misjudgment.

With the rise of large models, intelligent operation and maintenance has become a solution that many are pursuing. With solutions defined in advance, it is expected to analyze operation and maintenance data in real time, detect anomalies, locate root causes, and predict problems before they occur. This model offers the following advantages: a shift from passive response to active prevention, from single-point problem diagnosis to global system optimization, and from tedious manual operations to automated decision-making. For example, large models can predict potential problems from historical data and real-time logs, generate several candidate solutions for operation and maintenance personnel to consult, greatly shorten problem-solving time, and even execute many preset troubleshooting operations directly through Agents.

However, intelligent operation and maintenance systems are not suitable for every enterprise. Deployment usually requires a certain technical foundation (such as accumulated data, computing power, and a professional team) as well as clear business pain points and application scenarios. When deciding whether to introduce intelligent operation and maintenance, an enterprise must therefore make a comprehensive assessment based on its own IT environment, business needs, and cost-benefit ratio.

This article covers one of the topics discussed in the community event "How to choose enterprise intelligent operation and maintenance scenarios and implement practical empowerment training under the trend of large language models". It may offer useful input for enterprise decision-making.



Topic moderator Xianshou, algorithm engineer at Suning.com:
With the rapid development of artificial intelligence technology, especially large models, enterprise operation and maintenance teams have also turned their attention to large models, and more and more companies are beginning to pay attention to intelligent operation and maintenance. Combining IT operation and maintenance with the capabilities of large models can not only improve the accuracy of system monitoring, but also help companies shift from "passive response" to "active optimization" through predictive analysis, automated decision-making, and other means. This raises a question: is every company suited to introducing large models for an intelligent operation and maintenance transformation? The online empowerment training held in the community this time showed that participants have valuable insights and rich practical experience. I hope readers can find answers to their own questions here, or at least some inspiration.
The peer discussion follows:
● Ye Chuang, application operations engineer at a city commercial bank:
At the current level of O&M standardization and scale in most companies, the input-output ratio of intelligent O&M is not high. Before large models, the only parts of AIOps that could be implemented in practice were dynamic thresholds and fault localization. At the IT scale of a typical financial company, dynamic thresholds bring no significant improvement over thresholds set from expert experience, or over more advanced threshold methods such as year-on-year and month-on-month comparisons. On the one hand, traffic in most systems does not fluctuate much; on the other hand, the scale is small enough for people to handle, and human judgment is more accurate. Fault data is a small sample, so training and tuning algorithms on it yields no obvious gain. Cloud vendors get relatively high returns from intelligent O&M; for others, it is closer to scientific research than to a real production application.
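The three threshold styles mentioned above can be compared with a minimal Python sketch. The function names, the 1.5x ratio, the 3-sigma rule, and the sample numbers are all illustrative assumptions, not values from the discussion; the "dynamic" variant here is a simple mean-plus-k-standard-deviations rule, one of many possible approaches.

```python
from statistics import mean, stdev

def static_alert(value, limit):
    """Fixed threshold chosen from expert experience."""
    return value > limit

def period_over_period_alert(value, same_point_last_period, max_ratio=1.5):
    """Compare against the same point in the previous period
    (year-on-year or month-on-month style). max_ratio is an assumed setting."""
    if same_point_last_period <= 0:
        return False  # no usable baseline
    return value / same_point_last_period > max_ratio

def dynamic_alert(value, recent_window, k=3.0):
    """Simple dynamic threshold: mean of recent values plus k standard deviations."""
    if len(recent_window) < 2:
        return False  # not enough data to estimate spread
    return value > mean(recent_window) + k * stdev(recent_window)

# Hypothetical request counts for a system with stable traffic
history = [100, 104, 98, 101, 97, 103, 99, 102]

for current in (105, 160):
    print(current,
          static_alert(current, limit=150),
          period_over_period_alert(current, same_point_last_period=100),
          dynamic_alert(current, history))
```

With traffic this stable, all three rules stay quiet at 105 and all three fire at 160, which is the situation described above: the dynamic rule adds little over a well-chosen fixed or period-based threshold.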
Applying large models at the O&M level rests on two foundations. The first is that O&M data is good enough: for example, the CMDB and observability data must be reliable. Many companies have both, but the data is either not accurate enough or has low coverage. O&M documentation must also be sufficient and available online (offline documents can also be loaded into a vector database), so that question answering at the O&M and IT service level works and routine O&M tasks can be assisted. The second is that O&M capabilities are exposed as APIs, so that agents built on large models can invoke the various O&M tools. If these two foundations are weak, introducing large models will have only a limited effect, so laying a solid foundation should come first. At the personal level, a general-purpose large model such as ChatGPT is enough to improve one's own O&M efficiency; it is basically sufficient for documentation, script writing, and looking up problems. There is no need to rush into organization-level applications. If the underlying data is poor, a private large model will perform worse than a general-purpose one.
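To make the second foundation concrete, here is a minimal Python sketch of what "API-based O&M capabilities" might look like from an agent's point of view. Everything in it is hypothetical: the registry, the tool names `restart_service` and `query_cmdb`, and the shape of the action dictionary are assumptions for illustration, and the tool bodies are stand-ins for real API calls.

```python
from typing import Callable, Dict

# Hypothetical registry: tool name -> callable wrapping an O&M API
TOOLS: Dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a function as a tool an agent may call."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("restart_service")
def restart_service(service: str) -> str:
    # Stand-in for a call to a real deployment or orchestration API
    return f"restart requested for {service}"

@tool("query_cmdb")
def query_cmdb(host: str) -> str:
    # Stand-in for a real CMDB lookup
    return f"owner lookup for {host}"

def dispatch(action: dict) -> str:
    """Run a structured action of the kind an agent might emit.
    Unknown tools are rejected instead of being executed."""
    name = action.get("tool")
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**action.get("args", {}))

print(dispatch({"tool": "restart_service", "args": {"service": "payment-gateway"}}))
print(dispatch({"tool": "reboot_datacenter", "args": {}}))
```

The point of the sketch is the dependency it exposes: an agent can only act through capabilities that already exist as callable, well-defined interfaces. Where O&M work is still done by logging in and typing commands by hand, there is nothing for the agent to dispatch.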
● Xianshou, algorithm engineer at Suning.com:
Whether an enterprise should introduce intelligent operation and maintenance can be weighed from the following aspects:
1. Talent reserves for intelligent operation and maintenance
Traditional operation and maintenance is staffed mostly by experts in a single, independent field. Introducing large-model-based intelligent operation and maintenance brings additional costs for building up and applying knowledge of large models: for example, the applicable fields and characteristics of each large model on the market, analysis of the enterprise's own intelligent operation and maintenance scenarios, data preparation, and the understanding and scenario-based use of prompts, CoT, Agents, and so on.
2. Overall cost of large models
Investment in large models falls into three parts: model deployment (private deployment, or the cost of calling cloud-hosted large models); recruiting personnel with both large model and operation and maintenance skills; and product, research and development, and operation and maintenance investment for the overall system transformation. By comparison, previously only personnel with the relevant operation and maintenance skills had to be recruited.
3. Are the benefits of large models substantial?
Before large models, operation and maintenance was manual: every problem required logging in remotely for troubleshooting, maintenance, upgrades, and so on. Large models can replace this part. Assuming manual operation and maintenance previously delivered a benefit of 50 points, will large-model intelligent operation and maintenance reach 60 points, or only 40? This needs to be investigated.
● Chen Pingchun, insurance system architect:
Several criteria can serve as a reference:
1. Scenario: The current operation and maintenance model cannot meet current business requirements, and the operation and maintenance requirements can be broken down into tasks that a large model can complete in whole or in part.
2. Response time: As an intermediate link in intelligent operation and maintenance, the large model must respond quickly enough to meet the needs of the overall workflow.
3. Cost: The efficiency or value gained from intelligent operation and maintenance is greater than the cost of using the large model itself.
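The three criteria above can be written down as an explicit checklist. The following Python sketch is only one way to do so; the field names, units, and sample figures are assumptions made for illustration, and real estimates of value and cost would need the kind of research the participants call for.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    unmet_need: bool         # current O&M model cannot meet business requirements
    decomposable: bool       # tasks can be completed, fully or partly, by a large model
    model_latency_s: float   # typical large-model response time, in seconds
    latency_budget_s: float  # response time the overall workflow can tolerate
    annual_value: float      # estimated efficiency or value gained per year
    annual_cost: float       # estimated cost of using the large model per year

def evaluate(a: Assessment) -> dict:
    """Return a pass/fail result for each of the three criteria."""
    return {
        "scenario": a.unmet_need and a.decomposable,
        "response_time": a.model_latency_s <= a.latency_budget_s,
        "cost": a.annual_value > a.annual_cost,
    }

def worth_introducing(a: Assessment) -> bool:
    """All three criteria must hold."""
    return all(evaluate(a).values())

# Hypothetical figures for a single candidate scenario
candidate = Assessment(
    unmet_need=True, decomposable=True,
    model_latency_s=8.0, latency_budget_s=30.0,
    annual_value=500_000, annual_cost=320_000,
)
print(evaluate(candidate))
print(worth_introducing(candidate))  # True
```

Keeping the per-criterion results separate from the final verdict shows which criterion blocks adoption, for example a model that is affordable but too slow for the workflow it would sit in.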
● Gu Huangliang, technical director of a financial enterprise:
Generally speaking, if automated O&M already covers most O&M capabilities, there is no need to implement intelligent O&M in the production environment:
1. The cost is too high.
2. Without massive data and a large corpus, the results will be poor.
3. What the team learns cannot be applied in real projects, so the learning cost yields little value.
If an enterprise has the following characteristics, it should promote intelligent operation and maintenance right away:
1. It has many systems and a very large business scale.
2. It has massive amounts of data.
3. Its personnel have the required capabilities.

As for how to introduce it, building or researching everything in-house is not recommended. Instead, start by trying open-source or free versions, get the process working end to end, and then go deeper.


Peer Exchange Consensus

Against the backdrop of rapidly developing large model technology, intelligent operation and maintenance is a good topic for discussion and a starting point for transforming enterprise operation and maintenance. However, whether to introduce it, and how to implement it successfully, is something each enterprise must evaluate against its own situation. This event gave participating peers an opportunity for in-depth exchange. The principles they proposed for evaluating whether an enterprise should introduce intelligent operation and maintenance can be summarized as follows:

1. Basic conditions for introducing intelligent operation and maintenance. Effective application of intelligent operation and maintenance depends on the maturity of the enterprise's operation and maintenance foundation, which has two main aspects: the quality and coverage of operation and maintenance data, and whether operation and maintenance capabilities are exposed as APIs. If the company's CMDB (configuration management database) and observability data are not accurate enough, or its operation and maintenance documents are not online, the large model will be far less effective. For most small and medium-sized enterprises this groundwork is not yet complete, and blindly introducing intelligent operation and maintenance may result in a low input-output ratio. Enterprises should therefore first lay a solid foundation of basic data and tool capabilities before considering the introduction of large models.
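One simple way to check the data foundation described above is to measure how many hosts that actually appear in monitoring also have a CMDB record. The Python sketch below assumes both sources can be exported as plain lists of host names; the function name and the sample hosts are hypothetical, and a real check would also have to consider accuracy of the records, not just their presence.

```python
def cmdb_coverage(observed_hosts, cmdb_hosts):
    """Return (coverage, missing): the fraction of observed hosts that have
    a CMDB record, and the sorted list of observed hosts that do not."""
    observed = set(observed_hosts)
    recorded = set(cmdb_hosts)
    if not observed:
        return 0.0, []  # nothing observed, so coverage cannot be estimated
    missing = sorted(observed - recorded)
    coverage = (len(observed) - len(missing)) / len(observed)
    return coverage, missing

# Hypothetical exports from a monitoring system and a CMDB
observed = ["app-01", "app-02", "db-01", "cache-01"]
recorded = ["app-01", "db-01", "db-02"]

coverage, missing = cmdb_coverage(observed, recorded)
print(coverage)  # 0.5
print(missing)   # ['app-02', 'cache-01']
```

A low coverage figure, or a long list of missing hosts, is a sign that the data foundation needs work before a large model is layered on top of it.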

2. The input-output ratio of intelligent operation and maintenance. Introducing intelligent operation and maintenance means weighing costs against benefits. On the cost side, enterprises need to invest substantial resources, including private deployment or calls to cloud-hosted large models, personnel recruitment and training, and IT system transformation. On the benefit side, the gains show up as higher operation and maintenance efficiency, lower labor costs, and faster fault response. However, if the operation and maintenance scale is small and the business scenarios are relatively simple, traditional automated operation and maintenance can still meet the need, and a large model may add little value or even leave the enterprise in a situation where "the benefits are not as good as the investment."

3. Applicable scenarios and decision-making basis. First, the complexity of operation and maintenance requirements is key: if the company has a large number of systems, a huge business scale, and massive data to process, intelligent operation and maintenance can provide significant value. Second, intelligent operation and maintenance must meet response speed requirements, that is, it must be more efficient than traditional operation and maintenance in complex scenarios. Finally, the cost-benefit ratio must be assessed as a whole: if the value added by intelligent operation and maintenance is significantly greater than its cost, it is worth introducing. Enterprises can first verify that the process works by trying open-source or free intelligent operation and maintenance tools, and then decide on further investment and deeper application.

In general, intelligent operation and maintenance driven by large models is not suitable for all enterprises. Small and medium-sized enterprises, and enterprises with simple operation and maintenance requirements, can start by using general-purpose large models to improve individual operation and maintenance efficiency, for example in document writing and problem lookup, without moving to organization-level adoption right away. Enterprises with a large business scale, sufficient data volume, and adequate personnel reserves can try large-model intelligent operation and maintenance step by step, upgrading and iterating gradually.