ByteBrain Team at FSE 2025 | Automated Oncall Escalation with LLMs

Written by Jasper Cole
Updated on: June 22, 2025
Recommendation

Explore how ByteDance's ByteBrain team uses large language models (LLMs) to optimize the Oncall escalation process.

Core content:
1. FSE 2025 conference background and the ByteBrain team's accepted paper
2. The role of Oncall in cloud computing services and its traditional handling mode
3. The TickIt system: an LLM-based automated Oncall escalation solution and the challenges it addresses

Recommended by Yang Fangxian, Founder of 53A and Tencent Cloud Most Valuable Expert (TVP)

Preface

FSE 2025 (the ACM International Conference on the Foundations of Software Engineering), one of the top academic conferences in software engineering, will be held in Trondheim, Norway in June 2025. The ByteBrain team's paper "TickIt: Leveraging Large Language Models for Automated Ticket Escalation" has been accepted (https://arxiv.org/abs/2504.08475).

Background

As cloud computing continues to boom, tickets (Oncalls) have become a key communication bridge between Volcano Engine customers and the technical support and SRE teams. As the scale of cloud services grows, thousands of Oncalls are generated every day. These Oncalls are usually written in natural language and cover a wide range of complex issues such as usage consultation, feature requests, and system failures.

In the traditional manual escalation mode, the Oncall operator relies on personal experience to judge whether a ticket is serious and whether it should be escalated further. Because this process depends heavily on individual experience and judgment, it is difficult to establish a unified standard. In past case studies and incident reviews, we found that some serious problems were not escalated in time due to human oversight, introducing stability risks and potentially hurting Volcano Engine's customer satisfaction.

Promptly identifying and escalating urgent Oncalls has therefore become key to improving customer satisfaction and ensuring service quality. To address this, we propose TickIt, which identifies Oncalls that report serious issues and promptly escalates them to the R&D, stability, and incident emergency response teams.

Challenges

Oncall issues are highly diverse, and different types of issues need to be handled by people with different professional backgrounds. For example, system failures require R&D and SRE engineers to quickly locate and fix them to reduce service interruption time, while customer complaints and negative sentiment require account managers to reassure customers and resolve problems promptly, thereby improving customer satisfaction. Oncall issues can also be further subdivided, for example by the size of their impact and whether they cause business losses.

Existing feature-engineering-based analysis methods have limited semantic understanding of Oncall content and struggle to accurately identify key issues in practice, so important Oncalls are not escalated in time. In addition, the severity of an Oncall may only become clear gradually as the conversation unfolds, yet many existing methods perform a one-time classification, ignoring the continuously updated information in the conversation and failing to identify escalation-worthy situations online.

It is equally important to explore the relationships between Oncalls. When one problem affects multiple customers, multiple similar Oncalls are generated. Capturing and analyzing the relationships between them in time helps assess the severity and scope of the problem more comprehensively, and allows R&D to merge and handle these Oncalls together, resolving the underlying problem in a more focused way.

Thanks to the strong natural language understanding of large language models (LLMs), we use them to help interpret the textual information in Oncalls, but naively applying an LLM does not effectively solve the above challenges. In this paper, we propose an LLM-based Oncall analysis method, TickIt, which dynamically tracks the information in an Oncall, uses the LLM to deeply understand the semantics of the conversation, and identifies and escalates serious problems in time. TickIt also explores the semantic associations between the problem phenomena of different Oncalls, identifying potential common problems and enabling more efficient handling.

TickIt Method Design

TickIt is built on ByteDance's Doubao model and aims to achieve efficient and accurate Oncall escalation by leveraging the natural language processing capabilities of the LLM. The framework includes three core modules: multi-class escalation, escalation deduplication, and category-guided fine-tuning.

Multi-class Oncall Escalation

In the escalation module, TickIt treats Oncall escalation as a multi-class classification task. Based on the different responsibilities and focuses of R&D, SRE, and customer relations, several subject categories such as system failure, customer complaint, and asset loss are pre-defined; ordinary Oncalls are uniformly classified as "other" (requiring no escalation). To help the LLM perform this multi-class classification better, TickIt applies several techniques in the System Prompt, such as assigning the model a task role and eliciting chain-of-thought (CoT) reasoning. For example, when judging whether an Oncall is a system failure, the model analyzes the failure phenomenon and impact scope mentioned in the conversation and explains, step by step, the reasons for its classification decision. This enhances the logic and interpretability of the classification results, making it easier for people to understand and trust the model's judgment. In addition, TickIt uses few-shot examples to help the model understand the different Oncall categories; these examples in particular illustrate easily confused scenarios, helping the model distinguish the characteristics of each category more accurately.

The System Prompt format used by TickIt for the escalation task is shown in the following figure.
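As a rough illustration of such a prompt, the following is a minimal sketch of how the classification call could be assembled, assuming a generic chat-completion style client (`llm_chat`); the prompt wording, few-shot examples, and helper names are illustrative assumptions, not TickIt's actual prompt.

```python
# Minimal sketch: multi-class escalation prompt with a task role,
# chain-of-thought instructions, and few-shot examples.
# Category names follow the article; everything else is illustrative.

CATEGORIES = ["system failure", "customer complaint", "asset loss", "other"]
category_list = ", ".join(CATEGORIES)

SYSTEM_PROMPT = f"""You are an Oncall escalation assistant for a cloud provider.
Given the latest Oncall conversation, decide which category it belongs to:
{category_list}.
Think step by step: summarize the reported phenomenon, estimate the impact
scope, then explain why the chosen category fits. End your answer with a line
"Category: <name>".
"""

# Hypothetical few-shot examples that highlight easily confused cases.
FEW_SHOT = [
    ("Customer says the console is slow, but only for one user account.",
     "Reasoning: single-account slowness, no service-wide impact ... Category: other"),
    ("Multiple customers report 5xx errors on object storage uploads.",
     "Reasoning: cross-customer failures of a core API suggest an outage ... Category: system failure"),
]

def classify_oncall(conversation: str, llm_chat) -> str:
    """Return the predicted escalation category for the latest conversation."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user_text, assistant_text in FEW_SHOT:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": conversation})
    reply = llm_chat(messages)  # e.g. a Doubao / OpenAI-compatible chat client
    # Take the last "Category:" line as the decision.
    for line in reversed(reply.splitlines()):
        if line.lower().startswith("category:"):
            return line.split(":", 1)[1].strip()
    return "other"
```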

Duplicate Oncall Analysis

Duplicate Oncall analysis is another feature of TickIt. When an Oncall is determined to need escalation, TickIt checks all Oncalls in the "Pending" state to determine whether a similar problem has already been escalated. To this end, TickIt models the lifecycle of an Oncall as a finite state machine. When a customer submits an Oncall ticket and it is accepted, the Oncall object enters the "Active" state. Whenever new conversation content appears, the latest conversation record triggers a new round of analysis and the Oncall enters the "Analyzing" state. TickIt then applies the multi-class escalation method described above to decide whether the Oncall needs to be escalated. If it is classified as "other", the state returns to "Active" and waits for the next round of conversation; if it is classified as one of the pre-defined serious problem types, it enters the "Pending" state, at which point TickIt checks whether a similar Oncall has already been escalated.
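A minimal sketch of this lifecycle, written as a small state machine; the state names follow the article, while the class layout and the injected `classify` callback are illustrative assumptions.

```python
from enum import Enum, auto

class OncallState(Enum):
    ACTIVE = auto()      # ticket accepted, waiting for new conversation
    ANALYZING = auto()   # a new message triggered a round of analysis
    PENDING = auto()     # classified as serious, duplicate check runs next

class OncallTicket:
    def __init__(self, ticket_id: str):
        self.ticket_id = ticket_id
        self.state = OncallState.ACTIVE

    def on_new_message(self, conversation: str, classify) -> None:
        """A new message arrives: analyze it and move to the next state."""
        self.state = OncallState.ANALYZING
        category = classify(conversation)      # multi-class escalation step
        if category == "other":
            self.state = OncallState.ACTIVE    # wait for the next round
        else:
            self.state = OncallState.PENDING   # serious: check for duplicates
```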

When determining whether an Oncall is a duplicate, TickIt first uses the LLM to extract the problem description from the Oncall, then uses the doubao-embedding model to convert these problem descriptions into vector representations. Similarity between vectors is computed with cosine similarity, and a similarity threshold θ is used to decide whether the current Oncall is similar to an already escalated one (θ is chosen through parameter selection experiments; in this method, θ = 0.88). For an Oncall that TickIt determines needs escalation, if a similar Oncall has already been escalated, the current Oncall is associated with that historical Oncall and no duplicate alert is issued (the link is only recorded on the associated ticket). At the same time, TickIt uses the LLM to rewrite the problem descriptions of the current and duplicate Oncalls, semantically summarizing the common characteristics of this type of problem more comprehensively and avoiding an understanding limited by the wording of any single Oncall ticket.
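A minimal sketch of the deduplication step under these assumptions: `embed` stands in for the doubao-embedding model and the problem descriptions are assumed to have already been extracted by the LLM; the 0.88 threshold comes from the article, everything else is illustrative.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.88  # θ chosen via parameter selection experiments

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_duplicate(current_desc: str, escalated: dict, embed) -> str | None:
    """Return the ticket id of a similar, already-escalated Oncall, if any.

    `escalated` maps ticket_id -> problem description of escalated Oncalls;
    `embed` maps text -> vector (e.g. a doubao-embedding client).
    """
    current_vec = embed(current_desc)
    best_id, best_sim = None, 0.0
    for ticket_id, desc in escalated.items():
        sim = cosine_similarity(current_vec, embed(desc))
        if sim > best_sim:
            best_id, best_sim = ticket_id, sim
    # Associate with the historical Oncall instead of raising a new alert.
    return best_id if best_sim >= SIMILARITY_THRESHOLD else None
```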

Category-guided Fine-tuning

Category-guided fine-tuning is the key mechanism by which TickIt continuously improves its accuracy. When an Oncall is escalated through the above process, TickIt sends a notification card containing a summary of the Oncall problem. The card has three interactive buttons: two are used to like or dislike the escalation, and the third is a jump link that takes the user directly to the relevant chat group. TickIt records the interactions with these notification cards as feedback data for automatic escalation, as summarized in the table below.

| Feedback action | Used as fine-tuning data | Rewrite the chain of thought? | Priority (1 = highest) |
| --- | --- | --- | --- |
| Like | Yes | No | 1 |
| Dislike (false alarm) | Yes | Yes | 2 |
| Jumped via link | Yes | No | 3 |
| Did not jump via link | No | -- | 4 |

When processing this feedback data, TickIt uses supervised fine-tuning (SFT). A typical SFT sample contains both the "conversation content" (the original Oncall information) as input and the "LLM reasoning process and category judgment" as the target output, organized under the same System Prompt that TickIt uses for the escalation task.
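As a rough illustration, an SFT sample could be laid out as below; the field names and chat-style shape are assumptions rather than TickIt's actual data schema, but they follow the structure described above (System Prompt plus conversation as input, reasoning plus category as the target output).

```python
# One hypothetical SFT record in a chat-style fine-tuning format.
sft_record = {
    "messages": [
        {"role": "system", "content": "<escalation System Prompt>"},
        {"role": "user", "content": "<Oncall conversation content>"},
        {
            "role": "assistant",
            # Target output: chain-of-thought reasoning followed by the category.
            "content": "Reasoning: multiple customers report upload failures ...\n"
                       "Category: system failure",
        },
    ]
}
```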

We assign different priorities to the four feedback actions to resolve conflicting feedback on the same Oncall. Direct feedback (likes and dislikes) is included in the SFT dataset. In our observation, people are more inclined to give negative feedback (dislikes) when an escalation is wrong than positive feedback (likes) when it is right, so likes are far fewer than dislikes. For this reason we give likes the highest priority (at least one person considered the alert helpful). A dislike is usually treated as a false alarm, so its target category is set to "other" (i.e., no specific category). In that case only the target category is known and the reasoning steps required for CoT are missing, so TickIt uses the LLM to fill in the chain-of-thought steps leading to the target classification. To reflect the diversity of possible reasoning, TickIt samples three candidate chains of thought for each such Oncall to enrich the dataset.
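A minimal sketch of how feedback could be turned into SFT samples, assuming a `complete_cot` helper that asks the LLM to write reasoning ending in a given target category; the priority values mirror the table above, while the function names and data shapes are illustrative.

```python
# Lower number = higher priority, mirroring the feedback table above.
FEEDBACK_PRIORITY = {"like": 1, "dislike": 2, "jumped": 3, "no_jump": 4}

def build_sft_samples(oncall, feedback_actions, complete_cot):
    """Turn the highest-priority feedback on one Oncall into SFT samples.

    `oncall` carries the conversation, the predicted category, and the model's
    original reasoning; `complete_cot(conversation, category)` asks the LLM to
    write a chain of thought that ends in `category`.
    """
    if not feedback_actions:
        return []
    action = min(feedback_actions, key=FEEDBACK_PRIORITY.get)
    if action == "no_jump":
        return []  # not used as fine-tuning data

    if action == "dislike":
        # False alarm: relabel as "other" and let the LLM fill in the missing
        # reasoning; sample three chains to keep the dataset diverse.
        targets = [("other", complete_cot(oncall["conversation"], "other"))
                   for _ in range(3)]
    else:
        # Like / jumped via link: keep the original category and reasoning.
        targets = [(oncall["category"], oncall["reasoning"])]

    return [
        {
            "conversation": oncall["conversation"],
            "reasoning": reasoning,
            "category": category,
        }
        for category, reasoning in targets
    ]
```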

In this way, TickIt processes user feedback, performs category-guided data augmentation, and builds a high-quality annotated dataset. Once enough annotated data has accumulated, TickIt fine-tunes the model offline with SFT and then updates the online model, continuously improving its performance on the escalation classification task.

Experimental Verification of TickIt

TickIt has been fully deployed in production on Volcano Engine and has achieved remarkable results. Over this period it has handled tens of thousands of Oncalls, and about 81% of the feedback received indicated that TickIt's escalation decision was accurate, demonstrating its effectiveness and reliability in practice.

For a more detailed evaluation of escalation performance, we also compared multiple methods based on small language models (SLMs) and large language models (LLMs). Because of their limited parameter scale, small language models lag far behind LLMs in language understanding, and some non-end-to-end designs are prone to information loss as information passes between stages. The LLM-based methods showed good accuracy. We verified the impact of different framework designs on performance through ablation experiments. With the CoT prompt alone, precision and recall both reach about 82%. Adding a Reflection prompt, which lets the model self-correct its reasoning and output, improves precision slightly to 82.8%. However, since the CoT prompt already makes the model reason fully before reaching a conclusion, the reflection stage rarely surfaces new key information, so Reflection does not significantly improve the results. After introducing in-context learning (ICL) prompts, recall rises sharply to 89.2%; although precision drops slightly, this result demonstrates the strong generalization ability of the LLM-based method.

Comparison of Oncall escalation performance under different designs

After further supervised fine-tuning (SFT) of the LLM, performance improved significantly. Taking the CoT prompt as an example, recall after fine-tuning jumped from 82.1% to 91.2% while precision remained high at 81.8%, and the F1 score reached 86.2%, the best among all compared methods. This result strongly demonstrates the effectiveness of SFT in harnessing LLM capabilities for the escalation task. However, when SFT was combined with other prompt-based methods such as Reflection and ICL, performance dropped slightly. This may be because the ICL examples, or data with a distribution similar to the training data, were already used during SFT, so the model had already learned the corresponding content during offline fine-tuning, leading to some conflict when the techniques were combined.
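As a quick check, the reported F1 score follows directly from the reported precision and recall:

\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.818 \times 0.912}{0.818 + 0.912} \approx 0.862
\]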

In the duplicate Oncall analysis experiment, we explored the most suitable threshold by sweeping the similarity threshold. With the threshold between 0.86 and 0.95, the F1 score first increases and then decreases as the threshold grows. When the threshold is set too high, the strict similarity constraint can split similar problems into different groups, causing the number of escalated Oncalls in the evaluation to deviate from the actual situation, and this deviation is not monotonic in the threshold. In addition, deduplication based on the problem phenomenon alone has inherent limitations: Oncalls with the same symptoms but different root causes may be deduplicated incorrectly. We also ran an ablation on the problem-rewriting design. The results show that enabling rewriting increases TickIt's F1 score by 1.7% compared with the setting without it. When the dataset is further restricted to escalations that contain multiple related Oncalls, the rewriting design raises the F1 score from 0.706 to 0.749, an increase of 6.1%.
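A minimal sketch of the kind of threshold sweep described above, assuming ground-truth duplicate labels and an `embed` function like the one in the deduplication sketch; the candidate range follows the article, everything else is illustrative.

```python
import numpy as np

def sweep_threshold(pairs, labels, embed, thresholds=np.arange(0.86, 0.96, 0.01)):
    """Pick the similarity threshold with the best F1 on labeled Oncall pairs.

    `pairs` is a list of (description_a, description_b); `labels` marks whether
    each pair reports the same underlying problem.
    """
    best_threshold, best_f1 = None, 0.0
    for t in thresholds:
        tp = fp = fn = 0
        for (a, b), is_dup in zip(pairs, labels):
            va, vb = embed(a), embed(b)
            sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            pred = sim >= t
            tp += int(pred and is_dup)
            fp += int(pred and not is_dup)
            fn += int((not pred) and is_dup)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = round(float(t), 2), f1
    return best_threshold, best_f1  # e.g. 0.88 in the article's setting
```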

Parameter selection experiment for duplicate Oncall identification


Ticket problem rewriting ablation experiment in duplicate Oncall identification

Summary and Limitations Analysis

TickIt uses a large language model to achieve efficient, automated Oncall escalation, bringing significant efficiency gains to Volcano Engine. By helping people intervene in serious Oncalls in a timely manner, it shortens the response time for serious problems, reducing overall MTTR by about 26% and saving labor costs. TickIt has also been widely recognized by its users.

However, TickIt has also exposed some limitations in practice. Individual styles of expression in conversations may affect the LLM's judgment: some users exaggerate the impact of a problem, leading to unnecessary escalations, while others describe serious problems too blandly, so TickIt fails to identify an escalation-worthy situation in time. In addition, if the cloud product associated with an Oncall is not identified specifically enough, similar problem descriptions may carry very different severity depending on which product is involved, which can easily mislead the LLM into an incorrect escalation. In future work, we will continue to optimize TickIt's real-world effectiveness to support the stability of Volcano Engine.