A new paradigm for AI evaluation: unlocking explanatory and predictive power in AI performance assessment

An in-depth look at a new paradigm for AI performance evaluation and the advances it brings in explanatory and predictive power.
Core content:
1. The explainability and predictability challenges facing AI performance evaluation
2. The new evaluation paradigm proposed by an interdisciplinary research team and the scientific theory behind it
3. How the new paradigm is applied in practice to explain failures on specific tasks and to predict the performance of large models
With the rapid development of artificial intelligence, general-purpose AI systems such as large language models perform well in many domains, including solving complex mathematical problems. Yet the same systems can still fail on simple tasks such as basic arithmetic, and these failures are hard to explain or predict. This poses a major challenge for AI evaluation: explainable and predictable evaluation methods are urgently needed to clarify why systems fail and to guide reliable deployment. So far, however, no evaluation paradigm has met both requirements.
Traditional performance-oriented evaluation lacks explanatory and predictive power at the level of individual task instances. For example, a model that achieves an average score of 79.8% on a popular math benchmark such as AIME (American Invitational Mathematics Examination) tells us little about how it will perform on any single task instance, or how capable it will be on other tests. Beyond simple score aggregation, the research community has also explored psychometric methods to characterize AI capabilities, but these approaches still fail to deliver explainability and predictability at the same time.
An interdisciplinary research team from the University of Cambridge, Microsoft Research Asia, the Polytechnic University of Valencia, Educational Testing Service, Carnegie Mellon University, Princeton University and other institutions recently proposed an innovative AI evaluation paradigm: developing general ability scales that characterize both benchmarks and large models in detail, so that performance can be explained and predicted. This work overcomes the limitations of traditional evaluation methods and lays a solid foundation for the reliable deployment of AI.
General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Paper link:
https://arxiv.org/abs/2503.06378
Figure 1: Process for explaining and predicting the performance of new systems and benchmarks. “System process” (top), the steps for evaluating each new AI system: (1) run the new system on the ADeLe test set; (2) plot subject characteristic curves for all ability dimensions and extract the system’s capability profile (optional); (3) train a simple assessor that takes the annotated demand levels as input to predict the system’s performance on new task instances. “Task process” (bottom), the steps for each new task or benchmark: (A) apply the DeLeAn rubrics to the new task using standard LLMs; (B) obtain demand histograms and demand profiles that explain what the task requires (optional); (C) predict performance on the new task for any system for which an assessor has been built following the “system process”.
The researchers first constructed 18 human-interpretable general ability scales, covering 11 basic cognitive abilities, 5 knowledge areas, and 2 external interference factors (see Table 1 for details). Each scale defines progressive requirement criteria from level 0 to level 5: the higher the level, the more demanding the task is on that ability. For example, on the "formal science knowledge (KNf)" scale, level 0 means the task can be solved without any formal-science knowledge, while level 5 requires specialist knowledge at graduate level or above.
Table 1: Descriptions of the 18 general ability scales in the rubric set (levels 0 to 5)
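To make the idea of a leveled scale concrete, the sketch below shows one way such a rubric could be stored as data for annotation tooling. This is an illustrative representation, not the paper's artifact: only the level-0 and level-5 descriptions of KNf are paraphrased from the text above, and the intermediate levels and the helper function are placeholders.

```python
# Illustrative sketch: one of the 18 scales stored as a rubric dict.
# Only the KNf level-0 and level-5 descriptions are paraphrased from the
# article; intermediate levels are placeholders for the real rubric text.
KNF_RUBRIC = {
    "code": "KNf",
    "name": "Formal science knowledge",
    "levels": {
        0: "The task can be solved without any formal-science knowledge.",
        # levels 1-4 rise progressively in the real rubric
        5: "Requires specialist knowledge at graduate level or above.",
    },
}

def describe(rubric: dict, level: int) -> str:
    """Return the requirement description for a given demand level."""
    text = rubric["levels"].get(level, "(intermediate level)")
    return f"{rubric['code']} level {level}: {text}"

print(describe(KNF_RUBRIC, 5))
```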
Building on this framework, the researchers used GPT-4o to annotate the demand level on every dimension for 16,000 instances drawn from 63 downstream tasks across 20 benchmarks, producing the ADeLe (Annotated-Demand-Levels) v1.0 dataset. By placing task instances from different benchmarks in a single comparable space, ADeLe lets researchers unlock explanatory and predictive power when assessing the capabilities and limitations of any large language model. Figure 2 shows five instances from ADeLe together with their annotations.
Figure 2: Demand-level annotations of five examples using the DeLeAn rubrics
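As a rough picture of what one annotated ADeLe-style instance might look like in code, the sketch below pairs a task instance with one 0-5 level per scale. The scale codes other than KNf, the identifier, and all values are hypothetical placeholders chosen for illustration, not taken from the actual dataset.

```python
# Hypothetical shape of a single ADeLe-style record: the task instance plus
# one 0-5 demand level per scale. Codes other than KNf and all values are
# illustrative only.
record = {
    "benchmark": "TimeQA",
    "instance_id": "timeqa-000123",            # hypothetical identifier
    "prompt": "In which year did event X happen?",
    "demands": {
        "KNf": 0,    # formal science knowledge
        "LO": 2,     # hypothetical code for a logical-reasoning scale
        "MC": 1,     # hypothetical code for a metacognition scale
        # ... one entry for each of the 18 scales
    },
}

# Flattening the demands into a fixed-order vector gives every instance,
# regardless of its benchmark, the same comparable representation.
scale_order = sorted(record["demands"])
feature_vector = [record["demands"][code] for code in scale_order]
print(scale_order, feature_vector)
```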
Based on the ADeLe test set, the research team conducted three core analyses and revealed several important findings:
1. Exposing the inherent flaws of AI benchmarks through task demand profiles
By analyzing the demand levels of the 20 benchmarks, the study found that the benchmarks fall short on construct validity: they fail to measure only the target ability they claim to measure (lack of specificity), or fail to cover a sufficient range of difficulty on the target dimension (lack of sensitivity), or both. For example, the "Civil Service Examination" benchmark claims to measure logical reasoning, yet its task demand profile (Figure 3) shows that success also depends heavily on other abilities such as knowledge and metacognition. The temporal reasoning benchmark TimeQA, in turn, has a reasoning-demand distribution that is too concentrated to distinguish different demand or difficulty levels. Applying the ADeLe methodology to benchmark design, by building an accurate task demand profile, clearly defining the measurement target, and assessing its range of applicability, helps ensure a benchmark's construct validity.
Figure 3: Demand-level distributions for the 20 benchmarks included in the ADeLe test suite v1.0
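A rough sketch of how such demand profiles can be computed from ADeLe-style annotations follows: it checks specificity (which dimensions a benchmark actually loads on) and sensitivity (how spread out the levels are on the claimed target dimension). The column names other than KNf and all values are illustrative placeholders, not the real annotations.

```python
# Sketch: demand profiles from ADeLe-style annotations (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "benchmark": ["TimeQA", "TimeQA", "TimeQA", "CivilService", "CivilService"],
    "KNf": [0, 1, 0, 2, 3],   # formal science knowledge demand
    "LO":  [2, 2, 2, 4, 3],   # hypothetical logical-reasoning demand
    "MC":  [0, 0, 1, 3, 2],   # hypothetical metacognition demand
})

# Specificity: the mean demand per dimension shows which abilities a benchmark
# actually draws on, beyond the one it claims to measure.
profile = df.groupby("benchmark")[["KNf", "LO", "MC"]].mean()

# Sensitivity: the spread of levels on the claimed target dimension; a very
# narrow spread means the benchmark cannot separate demand levels.
spread = df.groupby("benchmark")["LO"].std()

print(profile, spread, sep="\n\n")
```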
2. In-depth analysis of the capabilities of large language models
The researchers plotted subject characteristic curves (SCCs) for 15 mainstream large language models on the 18 ability dimensions. Each curve describes a model's accuracy at different demand levels for one ability and is fitted with a logistic function. Together, these curves give a full picture of the strengths and weaknesses of the 15 LLMs (Figure 4).
Figure 4: Subject characteristic curves of 15 LLMs across the 18 demand dimensions
In addition, summarizing each SCC into an ability score per dimension, defined in the psychometric tradition as the demand level (x-value) at which the probability of success is 0.5 (the point of maximum slope and information), yields several insights; a minimal code sketch of this step follows the list below. The main insights are:
1. Newer LLMs are generally more capable than older ones, but this does not hold for every ability.
2. Knowledge abilities are mainly determined, and limited, by model size and the distillation process.
3. Reasoning, learning and abstraction, and social abilities improve in "reasoning" models.
4. For non-reasoning models, the marginal benefit of scaling is diminishing.
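As referenced above, here is a minimal sketch of the fitting step on synthetic data: fit a logistic curve of success probability against demand level and read off the demand level where the probability crosses 0.5. The exact parameterization and fitting procedure used in the paper may differ.

```python
# Sketch: fit a subject characteristic curve (SCC) for one ability dimension
# and extract the ability score where P(success) = 0.5. Data is synthetic.
import numpy as np
from scipy.optimize import curve_fit

def scc(d, d0, k):
    """Logistic success probability as a function of demand level d.
    d0 is the demand level at which P(success) = 0.5 (the ability score);
    k controls how sharply performance drops as demand increases."""
    return 1.0 / (1.0 + np.exp(k * (d - d0)))

# Per-instance results for one model on one dimension: demand level and
# whether the model answered correctly (synthetic values).
demand  = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
correct = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

(d0, k), _ = curve_fit(scc, demand, correct, p0=[2.5, 1.0])
print(f"Ability score (demand level at P(success) = 0.5): {d0:.2f}")
```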
3. Instance-level performance prediction that outperforms black-box baselines
The researchers used the demand-level vector of each instance as input features and trained a random forest classifier as an assessor to predict LLM performance on new task instances. Experiments show that this assessor predicts well both in distribution and out of distribution: for frontier models it reaches an AUROC (area under the receiver operating characteristic curve) of 0.88 with near-perfect calibration, clearly outperforming black-box baselines based on GloVe embeddings and a fine-tuned LLaMA-3.1-8B, especially on out-of-distribution data. This further supports the soundness of the new paradigm.
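The sketch below illustrates the assessor idea on synthetic data: train a random forest on 18-dimensional demand-level vectors to predict per-instance success and score it with AUROC. The feature generation, labels, and hyperparameters are placeholders, not the paper's setup.

```python
# Sketch: an instance-level assessor trained on synthetic demand-level features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 6, size=(n, 18))              # demand levels 0-5 on 18 scales
# Synthetic labels: instances with higher total demand fail more often.
p_success = 1.0 / (1.0 + np.exp(0.15 * (X.sum(axis=1) - 45)))
y = (rng.random(n) < p_success).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
assessor = RandomForestClassifier(n_estimators=300, random_state=0)
assessor.fit(X_tr, y_tr)

scores = assessor.predict_proba(X_te)[:, 1]       # predicted P(success)
print("AUROC:", round(roc_auc_score(y_te, scores), 3))
```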
The method has already been applied to evaluate 15 mainstream LLMs. The research team plans to extend it to settings such as multimodal and embodied AI, providing a scientific, standardized evaluation infrastructure for AI research and development, policy making, and safety auditing.
This work is the first to combine explanatory and predictive power in a single evaluation approach, marking important progress toward a science of AI evaluation. By building a scalable, collaborative community, the method is expected to keep improving the explainability and predictability of AI system performance and safety, offering key methodological support for the evaluation challenges posed by rapidly advancing general-purpose AI.