A New Framework for Evaluating Large Language Model Systems: How to Construct Micro-Metrics

Key topics:
1. The challenges large language models face in production environments
2. Building a metric system from a systems perspective
3. A practical case: diagnosing problems caused by a prompt change
Every AI application scenario has its own unique challenges. Once a system starts carrying production traffic, developers need to begin monitoring edge cases and special scenarios.
Systems perspective: treat large language models as system components rather than standalone entities. Their performance and reliability require a complete observability stack and guardrails, and must stay dynamically aligned with user needs and business goals.
Build metric alerting that can surface user issues promptly, and establish a metric cleanup process to retire obsolete monitoring items.
Build the metric system around the direction of the business, matching the goals of the current stage and incorporating lessons learned along the way.
Don't overcomplicate things. Adopt an incremental approach: first build a basic metric framework, then improve the monitoring infrastructure, and finally raise system maturity step by step.
Denys Linkov gave a keynote at QCon San Francisco entitled "A Framework for Designing Micro-Metrics for LLM System Evaluation". This article is compiled from that talk. It focuses on the unique challenges of evaluating the accuracy of large language models (LLMs) and explains, step by step, how to continuously improve LLM performance by creating, tracking, and dynamically revising a system of micro-metrics.
Have you ever changed a system prompt and caused problems in production? You ran all the test cases and evaluated thoroughly before updating the model. Everything seemed fine until someone @-mentioned you on the Discord server complaining that the system was completely broken.
The idea of micro-metrics grew out of the author's own experience modifying system prompts on the AI agent platform Voiceflow. A seemingly minor adjustment to the prompt template used to interact with the model triggered a classic failure: a user was conversing with the model in German, and the model answered correctly in German for the first four turns, but on the fifth turn it suddenly switched to English. The customer was understandably unhappy and asked why the model switched to English in a conversation held entirely in German. The user was puzzled, and so was the author.
Developing an LLM platform, or any kind of platform, is challenging.
When building LLM applications, what counts as a good model answer? The question is almost philosophical, because it is hard for people to reach consensus on what a "good" answer is.
LLMs are fascinating and confusing. Their answers always sound convincing, even when they are wrong. Not only do people disagree on what constitutes a "good answer"; sometimes they don't even read the model's output carefully. To assess answer quality, developers typically use regular expressions or exact matches, compare cosine similarity against a reference dataset, use LLMs as judges, or fall back on traditional data science evaluation metrics.
The story begins with a lesson learned: the limitations of a single metric. Consider the semantic similarity that underpins retrieval in RAG (Retrieval Augmented Generation). I compared how closely "I like to eat potatoes" matched three short sentences, using OpenAI's latest embedding model and two top-ranked open source models. Guess which sentence scored highest?
The contrasting sentences are as follows:
I am a potato
I am a human
I am hungry
Figure 1: The challenge of semantic similarity
All three models ranked "I am a potato" highest. The result is striking. Matching "I like to eat potatoes" most closely with "I am a potato" exposes the flaws of relying on cosine or semantic similarity alone. Semantically, "I am hungry" or even "I am a human" is a more reasonable match than "I am a potato". The case shows that no evaluation metric is reliable in every scenario.
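For illustration, here is a minimal sketch of this kind of comparison using the sentence-transformers library and the open source all-MiniLM-L6-v2 embedding model; the talk does not name the exact models it tested, so treat this as an assumption rather than a reproduction of the experiment.

```python
# A minimal sketch of the similarity comparison described above.
# Assumes the sentence-transformers library and the all-MiniLM-L6-v2 model;
# the talk does not specify which embedding models were actually used.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "I like to eat potatoes"
candidates = ["I am a potato", "I am a human", "I am hungry"]

query_emb = model.encode(query, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate sentence.
scores = util.cos_sim(query_emb, candidate_embs)[0]
for sentence, score in zip(candidates, scores):
    print(f"{float(score):.3f}  {sentence}")
```

Whichever sentence scores highest, the point stands: a single similarity score can rank a surface-level word overlap above the semantically sensible match.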
The industry commonly uses large language models (such as GPT-4) as automatic judges. The practice is widespread when model answers need to be scored in bulk and manual review is impractical. But these judges have biases: a 2023 study found that GPT-4 agreed poorly with human judgments when evaluating short prompts, while its evaluations of long prompts were better. Multiple studies have confirmed this bias, which stems from preferences and thinking patterns the model picked up from human data during training.
So would it be more reliable to let humans do the evaluation? Standardized testing offers an answer. A study on SAT essay scoring from more than 20 years ago showed that essay length alone could accurately predict the grader's score. That exposes a similar tendency in human judges: we gravitate toward superficial signals such as length rather than the actual quality of the content.
How do we define "quality"? Would users rather watch cat videos or LLM videos on YouTube? A popular cat video on YouTube has 36 million views, while Karpathy's technical talks have only 4 million, so we could conclude that "cats are better than LLMs, and we should obviously only push cat content to users." Social media might agree with that conclusion, but it also exposes the limits of relying solely on metrics such as views or accuracy. These evaluation criteria are inherently imperfect, which is not hard to see if you think about it.
Observing how humans instruct other humans reveals significant differences in how precise instructions are for different tasks. For example, when I worked at McDonald's (an experience that shaped my character), the instructions for making chicken nuggets were extremely detailed: cooking times were specified in the operating manual, and a timer beeped if the nuggets were not removed in time. For a task like "mop the floor", the instructions were much vaguer. If you don't know what mopping is and don't ask about the specific steps, you will probably make a mess. Real-life task instructions usually fall somewhere between these two extremes.
Figure 2: McDonald’s Operation Guide
These examples all reflect the ambiguity of human instructions: some are general, some are precise and detailed, and many are somewhere in between.
When evaluating model performance, specific feedback matters. This point is emphasized in many talks about team management. Take McDonald's: performance reviews often involved detailed questions such as "how many twists did you put on the ice cream cone?" I was often scolded for too many twists, which may also be why the ice cream machine was so often broken. Who knows.
Sometimes this feedback is very specific, and sometimes it is very vague. The same applies to LLM evaluation metrics, not because LLMs are human-like, but because feedback works the same way. In a performance review, vague feedback like "good job" does nothing to help you improve, and the same is true for LLM evaluations. If a metric only tells me "the model is hallucinating", my reaction is "well, what am I supposed to do with that?"
Now let's look at the model from a systems perspective. If you have worked on observability (writing metrics, traces, and logs), you understand the importance of monitoring. The same applies to large language model systems. You cannot deploy a model and then leave it alone, or the alert notifications will bury you. Observability mainly involves three types of signals: logs, metrics, and traces.
Logs: what happened?
Metrics: to what extent?
Traces: why did it happen?
These signals run from coarse to fine granularity: metrics, logs, then traces. For LLMs, metrics can cover dimensions such as model performance degradation and content moderation. On the performance side, metrics such as latency can quickly pinpoint problems with a provider or an inference node, whereas scoring model answers takes longer (seconds or even minutes). In an enterprise setting, offline tasks such as selecting the best model can take weeks or even months.
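To make the three signals concrete, here is a minimal sketch of instrumenting a single LLM call with a log line, a latency metric, and a trace ID. `llm_client.generate` and `record_metric` are hypothetical placeholders, not any specific library's API.

```python
# A minimal sketch: one LLM call instrumented with a log, a metric, and a trace ID.
# `llm_client.generate` and `record_metric` are hypothetical placeholders.
import logging
import time
import uuid

logger = logging.getLogger("llm.observability")

def record_metric(name: str, value: float, tags: dict) -> None:
    # Placeholder for a real metrics backend (StatsD, Prometheus, etc.).
    logger.info("metric name=%s value=%.2f tags=%s", name, value, tags)

def generate_with_telemetry(llm_client, prompt: str) -> str:
    trace_id = uuid.uuid4().hex              # trace: why did it happen?
    start = time.perf_counter()
    response = llm_client.generate(prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    # Metric: to what extent? Latency is a fast, real-time signal.
    record_metric("llm.latency_ms", latency_ms, tags={"trace_id": trace_id})

    # Log: what happened?
    logger.info("llm_call trace_id=%s latency_ms=%.1f prompt_chars=%d",
                trace_id, latency_ms, len(prompt))
    return response
```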
Content moderation metrics, by contrast, require a real-time response. When you are facing a spam attack, it makes no sense to rely on a batch job that runs next week. You need to be clear about the purpose of each metric, the acceptable delay, and what action follows.
Applications fall into two categories, real-time and asynchronous, and metric design can be divided accordingly:
Real-time metrics: detect issues that need immediate attention, such as model performance degradation, request timeouts, or the model returning invalid output
Asynchronous metrics: suited to tasks such as model selection, which may involve running evaluations or technical discussions
Guardrails: depending on the scenario, these can run in either real-time or asynchronous mode
Figure 3: Real-time metrics, asynchronous metrics, and guardrails
There is no limit to the metrics you can define, but at the end of the day they exist to support the business or technical decisions of the next three months. Mathematically speaking, a good metric should provide both magnitude and direction, just as a vector has both.
Building metrics that can warn users of problems, whether immediate risks or long-term concerns, is critical. For any successful product (whether an internal tool or an external service), system failure means losing users.
Back to the example of the LLM answering in the wrong language: it was the user who proactively reported the problem, so we were able to verify it before it affected enterprise customers. The issue was difficult to reproduce, but by analyzing the logs we located a record of an abnormal response. The fix was to add a guardrail that detects the language of the answer within a millisecond-level window and immediately triggers a retry if it does not match the language of the question. This real-time approach works far better than the traditional asynchronous mode.
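As an illustration of such a guardrail, here is a minimal sketch using the langdetect library; `generate_answer` is a hypothetical function, and this is not the actual Voiceflow implementation.

```python
# A minimal sketch of a language-consistency guardrail with a retry.
# Assumes the langdetect library; `generate_answer` is a hypothetical callable,
# not the actual Voiceflow implementation.
from langdetect import detect

MAX_RETRIES = 2

def answer_with_language_guardrail(generate_answer, question: str) -> str:
    expected_lang = detect(question)
    answer = generate_answer(question)
    for _ in range(MAX_RETRIES):
        if detect(answer) == expected_lang:
            return answer
        # The answer came back in the wrong language: retry immediately.
        answer = generate_answer(question)
    return answer
```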
When deciding between real-time and asynchronous metrics, judge by the specific scenario. Take content moderation: tagging or filtering inappropriate content in real time is a good strategy. When specifying metrics, always connect them to the business scenario and think through the possible consequences.
Whether you are building internal or external products, the core is winning users' trust. The most basic requirement is to keep the product available; the next step is to create an excellent user experience. Customer trust is like an island: as long as the product runs stably and keeps creating value, the company can grow steadily on that land.
Figure 4: Customer trust is like an island
System failures (such as answering in the wrong language) erode customer trust, and customers will be angry about complaints from their own customers. At that point you can take several remedial measures: offer compensation to reduce dissatisfaction, deploy repair mechanisms such as automatic retries, and write a root cause analysis (RCA) report explaining the causes and fixes. Whether trust can be rebuilt ultimately depends on the customer, but the core goal is always to repair the relationship by making sure the product works.
The more complex the system architecture, the harder observability becomes. A complex LLM pipeline built with techniques such as RAG significantly increases the difficulty of debugging and monitoring, but splitting RAG into its two components, retrieval and generation, simplifies monitoring considerably. The retrieval stage focuses on context optimization: supplying relevant information while eliminating redundant content that may interfere with answer generation, and balancing precision and recall when a ranking is available. Metrics for the generation stage focus on format correctness, answer accuracy, and control of redundant content; they can be refined further into dimensions such as accuracy, answer length, and persona consistency, or even hard rules such as "never use specific words like 'delve'". The multi-component nature of RAG means different stages require different evaluation metrics.
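For the retrieval stage, here is a minimal sketch of precision and recall at a cutoff k; the document IDs and the cutoff are illustrative placeholders, not values from the talk.

```python
# Minimal precision@k and recall@k for the retrieval stage of a RAG pipeline.
# The document IDs and cutoff below are illustrative, not from the talk.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k if k else 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

retrieved = ["doc3", "doc7", "doc1", "doc9"]   # ranked retriever output
relevant = {"doc1", "doc3", "doc5"}            # ground-truth relevant documents
print(precision_at_k(retrieved, relevant, k=3))  # 2 of the top 3 are relevant
print(recall_at_k(retrieved, relevant, k=3))     # 2 of the 3 relevant docs found
```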
At this point you may have sketched out several metrics for your own scenario, but ultimately these metrics must create business value. For example, if the LLM generates sensitive content, how much does that cost the business? Risk tolerance varies significantly between companies; depending on the target customers and the application, the business team needs to estimate the cost of each failure mode. Likewise, if an LLM used for legal advice gives wrong guidance (say, encouraging users to "sue their neighbors"), the consequences are serious. We need to translate these errors into financial losses, then decide how much to invest in the metric system, how much extra to spend on guardrails, and what online detection latency is tolerable. In the wrong-language case above, for instance, how should its potential losses be assessed?
The fundamental purpose of building a metric system and deploying LLMs is to save labor and time. The core value of all automation and cutting-edge applications lies in efficiency. Social media applications are the exception, since their goal is to extend users' time online as much as possible. Perhaps you feel business logic is entirely outside your responsibility and developers should only write good code. That view has two problems: first, understanding the business context is a prerequisite for technical decisions, and developers must know what problem the system they build is meant to solve; second, although the business team does most of the value assessment (which is their job), the technical team also needs to proactively stay strategically aligned with them.
Business teams need to clarify usage scenarios, understand how features integrate with the product, evaluate return on investment (ROI), and choose the right model. Developers should not stay out of those discussions, but building the metric system is not the responsibility of the technical team alone. Now that LLMs are embedded in so many products, business teams must also invest time in defining evaluation metrics that fit their own product.
Make sure the metric system stays aligned with current goals and incorporates lessons learned in practice. As an LLM application goes live, the team inevitably accumulates a lot of new knowledge, so establish a metric cleanup process to retire outdated evaluation criteria in time.
Finally, some practical advice. Don't chase perfection from the start; adopt a gradual "small steps, fast iterations" strategy. This applies to building the metric system as well. Start by deeply understanding the use case and making sure the business and technical teams agree. That is my basic framework for assessing the maturity of an LLM system and of its metric system.
Figure 5: The "small steps, fast iterations" strategy for an LLM metric system
In the initial stage, some preparation is needed before you can implement the metric system. First, clarify the goals of the effort and its value. Prepare an evaluation dataset; if you don't have one, invest the time to build it. At the same time, establish basic scoring criteria and logging so you can track how the system runs, understand its behavior, and tell normal from abnormal. You can start with content moderation or accuracy metrics based on the evaluation dataset. These initial metrics will not be perfect, but they provide a baseline for later optimization.
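As one possible starting point, here is a minimal sketch of an accuracy metric over a small golden dataset; the example questions and the `generate_answer` function are hypothetical placeholders.

```python
# A minimal accuracy check against a golden evaluation set.
# The examples and `generate_answer` are hypothetical placeholders.
golden_set = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "2 + 2 = ?", "expected": "4"},
]

def accuracy(generate_answer, dataset) -> float:
    # Count an answer as correct if it contains the expected string.
    correct = sum(
        1 for item in dataset
        if item["expected"].lower() in generate_answer(item["question"]).lower()
    )
    return correct / len(dataset)
```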
By the mid-stage, we understand the challenges the system faces: where the weaknesses, the effective mechanisms, and the problem areas are. We have formed clear hypotheses about how to solve these problems, or at least how to investigate them further. We need feedback loops to test those hypotheses, collecting feedback through logs or user data. Having already exercised the basic metrics, it is time to introduce more refined ones: for example, retrieval ranking metrics such as normalized discounted cumulative gain (NDCG), answer consistency metrics for tuning temperature settings and weighing evaluation trade-offs, or language detection. These more specific metrics require more infrastructure support to be implemented effectively.
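For reference, a minimal sketch of NDCG over graded relevance scores; the relevance values below are illustrative, not from the talk.

```python
# A minimal NDCG sketch over graded relevance scores.
# The example relevance values are illustrative, not from the talk.
import math

def dcg(relevances: list[float]) -> float:
    # Discounted cumulative gain: higher-ranked documents are discounted less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: list[float]) -> float:
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of each retrieved document in ranked order (3 = highly relevant).
print(round(ndcg([3, 0, 2, 1]), 3))
```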
In the "fast run" stage, we can confidently show the results. Not only do we have a lot of high-quality automated tools built internally (such as automatic prompt word tuning), but the indicator system is also highly aligned with specific goals; at this point, we may have accumulated high-quality data for fine-tuning (although this is also a business decision). At this stage, the indicator system can be customized according to needs because we have fully understood the system and product and can accurately identify the required micro indicators.
This article explored five key takeaways:
1. A single metric can be flawed; the "I am a potato" example illustrates this clearly.
2. Models are not standalone LLMs; they are part of a wider system, especially as complexity grows with RAG, tool use, or other integrations.
3. It is critical to build metrics that alert users to issues, focusing on the dimensions that affect product performance and align with business goals.
4. When using LLMs to improve products, keep it simple and follow a step-by-step methodology. The worst outcome is a dashboard filled with two dozen metrics that no one follows up on, which eventually overwhelms you.
5. Don't overcomplicate: stick to an incremental, step-by-step strategy.