Top AI researchers reveal why 99% of model evaluations are lying to you

AI researcher Ofir Press takes a close look at benchmark design and the common misconceptions in current model evaluation.
Core content:
1. Why benchmarks matter and how they guide model improvement
2. The adoption and impact of benchmarks created by Press, such as SWE-bench
3. Principles for building high-quality benchmarks, and an analysis of common pitfalls
Introduction
With large language models developing so rapidly, how do we measure how good they actually are? The key lies in building good benchmarks. As Ofir Press points out, a good benchmark exposes the weaknesses of existing models and guides the community toward improving them. Press has invested a great deal of energy in benchmark development throughout his career and has personally led teams that set new records on several benchmarks. He believes that designing high-quality benchmarks is as important as developing new models.
Press is an authority on language model evaluation and is currently a postdoctoral researcher at Princeton Language and Intelligence (PLI) at Princeton University. Benchmarks he created, such as the SWE-bench software engineering benchmark, have been adopted by companies including OpenAI, Meta, Google, and Anthropic, and have been downloaded more than two million times. His views on how to build excellent benchmarks are therefore well worth studying. This article is compiled from Press's 2024 blog post "How to Build Good Language Modeling Benchmarks" and lays out its core ideas, including dataset construction principles, evaluation methods, and common pitfalls such as data leakage. These considerations matter greatly for large-model evaluation, fair comparison, and real-world deployment. Original address: https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/
What are the characteristics of an excellent benchmark?
A good benchmark needs to have the following three characteristics:
1. The task should be natural and real
"Natural" means that the questions in the benchmark should come from real life, and are questions that people actually ask and often encounter. For example, all the questions in the SWE-bench benchmark built by Press and his team are program vulnerabilities reported by real users on GitHub, and the task requires the model to try to fix these bugs based on the current state of the code repository . The reason why such benchmarks are natural is that "fixing bugs" itself is a real job that developers do on a daily basis, and if it is solved well, it can save a lot of time for human developers. For example, the AssistantBench benchmark collects assistant-type questions that users will ask in real life (such as "Which yoga studio nearby has vinyasa classes before 8 am from Monday to Friday?"), and the CiteME benchmark focuses on academic citation queries (such as "Which paper first proved that the Transformer model cannot extrapolate long sequences?"). These questions come from real needs, so if the model can perform well on such benchmarks, it often means that it is useful in reality.
In contrast, "unnatural" benchmarks that are disconnected from reality are unlikely to attract community interest. Press points out that IQ-test-style questions (such as graphic pattern recognition) or overly simple common-sense questions (such as "Bob threw an egg at Alice's face. Is Alice happy, sad, or indifferent?") are no longer compelling today. Such benchmarks may have made sense when early models could not even handle basic common sense, but now that model capabilities have improved, we need to challenge them with more realistic and more difficult tasks. A simple way to judge whether a benchmark is "natural" is to check whether it is practical. Concretely, ask yourself: if a system beat the baseline on this benchmark, would it actually be useful to humans? A system that can automatically fix even a fraction of software bugs clearly saves developers a lot of time; a system that can quickly help people find a suitable yoga class also has direct value.
Press summarizes two telltale signs that a benchmark is "unnatural" and recommends avoiding them as much as possible:
Unrealistic question format: a typical example is forcing the question into an artificial multiple-choice format. When we go to the doctor, we never say "Doctor, my elbow hurts, and the cause must be one of the following four options..." If a question format almost never appears in reality, consider reworking it into something more natural.
Fabricated questions: if the questions do not come from real users but are merely contrived puzzles invented behind closed doors, they often lack naturalness. Instead of inventing questions, it is better to mine the logs of large search engines for questions that users actually asked but never found satisfactory answers to. Questions selected this way better represent real needs and are more meaningful.
2. Results can be automatically evaluated
A good benchmark should also be automatically evaluable: we should be able to use objective criteria to immediately judge whether the model's answer is right or wrong. For some tasks this is easy. Code generation benchmarks, for example, can verify whether a program is correct by running unit tests, and many popular benchmarks (such as OpenAI's HumanEval and Press's own SWE-bench) use exactly this approach. Likewise, math problems or factual queries often have a single correct answer, so whether the model's output matches the reference answer can be checked automatically.
However, many valuable tasks are difficult to evaluate automatically. Text summarization, for example, is extremely useful to humans but hard to score. A request such as "Summarize this patient's medical records in 500 words" would be very useful to a doctor, yet there are few relevant benchmarks so far, precisely because summaries produced by different models each have their own strengths and there is no widely accepted automatic criterion for deciding which is better. Some people try to use another language model to score the generated summaries, but Press argues that it is inappropriate for the same kind of AI to be both contestant and referee. If the evaluator is itself an AI, bias is unavoidable, and models may even learn to "game" the preferences of the judging model. Ideally, a model system should either solve the task or evaluate the output, but the same kind of system should not be responsible for both at once.
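To make the unit-test approach above concrete, here is a minimal Python sketch of test-based scoring in the spirit of HumanEval-style harnesses. The record format and the passes_tests helper are illustrative assumptions, not the actual HumanEval or SWE-bench evaluation code, and a real harness would also sandbox execution.

```python
# Minimal sketch of unit-test-based scoring. The record format and helper name
# are illustrative, not the real HumanEval/SWE-bench harness; real harnesses
# also isolate execution in a sandbox.

def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run the benchmark's tests against model-generated code."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate solution
        exec(test_code, namespace)        # asserts raise on a wrong answer
        return True
    except Exception:
        return False

# One benchmark instance: a task prompt plus tests that decide pass/fail
# automatically, with no human or LLM judge in the loop.
instance = {
    "prompt": "Write a function add(a, b) that returns the sum of a and b.",
    "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
}

model_output = "def add(a, b):\n    return a + b"      # stand-in for an LLM completion
print(passes_tests(model_output, instance["tests"]))   # True
```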
3. The task should be challenging
The third key element is challenge. If a benchmark is too easy, its value drops sharply: when the top model already reaches, say, 80% accuracy at release time, everyone will consider the problem "solved" and have little motivation to work on it. Press therefore recommends launching benchmarks on tasks where current models have a very low success rate; ideally the best model's initial accuracy is in the single digits, and the lower the better. When he first published the blog in 2024, he suggested that the top model should score between 1% and 35% at release. As models improved rapidly that year, he revised the blog in January 2025 and lowered the bar to no more than about 10%. Interestingly, in May he revised the standard again: rather than only targeting tests where AI models score 0 at release, he now advocates designing tests so hard that the AI would, figuratively, score "-200 points". Researchers should look for problems so difficult that even a threefold improvement in AI performance would not crack them. In other words, we should choose as benchmarks hard problems that current models cannot solve.
At the same time, difficulty must be balanced against enthusiasm to participate. If a benchmark seems almost impossible, researchers may simply lose interest. Press shared his own experience: when his team launched SWE-bench, the strongest model's initial accuracy was only 1.96%, and many people said it was too hard and backed away. Press had prepared for this: as soon as the benchmark was released, his team began building SWE-agent, an automated agent system for the benchmark, and pushed the score up to about 13%. Once the community saw someone break through to double-digit accuracy, the task suddenly no longer seemed impossible, many teams joined the effort to improve on it, and the results have kept climbing ever since. The lesson is that benchmark designers should dare to pose extremely challenging tasks to unlock research potential, while also giving the community a first glimpse of success so that no one is scared off by the difficulty.
Bonus: Avoid data leakage
Press also proposes an ideal bonus feature: benchmarks should avoid data leakage as much as possible. "Leakage" means that the benchmark's questions and answers have already been seen by the model during training. When that happens, the model may answer correctly from memory rather than genuine ability, and the evaluation results are distorted. This is a real problem in the era of large models: because mainstream LLMs crawl massive amounts of text from the Internet as training data, a newly released public benchmark can easily end up in the training data of future models.
Preventing this entirely is very difficult, but there are countermeasures. In his SciCode benchmark, Press tried a clever approach: publish the questions but not the answers. SciCode consists of a series of very hard programming challenges designed by PhDs in science and engineering. Each data point contains only a description of the required function and test cases for verifying correctness, while the reference solution code is deliberately withheld from the public release. Even if these questions later leak into a model's training set, the model only sees the problem description and the test requirements, not the correct solution code, so it still has to genuinely reason to answer. In this way the benchmark tests the model's generalization ability as much as possible. Press admits, however, that building a completely leak-proof evaluation is very hard and not every benchmark can achieve it, which is why he calls it a bonus point: a goal worth striving for, but not a hard requirement.
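The sketch below illustrates what a "questions public, solutions private" release might look like in practice. The field names and the sample problem are hypothetical, not the real SciCode schema; they only show the idea of shipping the description and tests while withholding the reference solution.

```python
# Hypothetical sketch of a leak-resistant release: the problem description and
# verification tests are public, the reference solution never ships.
# Field names and the sample problem are illustrative, not the real SciCode schema.

released_instance = {
    "problem_id": "heat_step_01",
    "description": (
        "Implement heat_step(u, dt, dx, alpha): one explicit Euler step of the "
        "1-D heat equation on the interior points of the list u."
    ),
    "signature": "def heat_step(u, dt, dx, alpha): ...",
    "tests": [
        # Tests check behaviour without revealing how the reference solution is written.
        "out = heat_step([0.0, 1.0, 0.0], 0.1, 1.0, 1.0); assert abs(out[1] - 0.8) < 1e-9",
    ],
}

# The gold solution stays on the maintainers' side, so a model that memorised
# the public release still has to derive working code from the description
# and the tests.
PRIVATE_REFERENCE_SOLUTION = None  # withheld from the public release
```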
Other guidelines and considerations
Beyond the three principles above and the bonus, Press also offers several practical suggestions on how to design metrics and publish evaluation results so that model evaluations are fair, clear, and influential:
(1) Use a single metric and present results clearly. Give the benchmark one core score so that everyone can compare at a glance. Do not use a pile of different metrics or split the results across many subcategories; this confuses people and reduces community adoption. People would rather say "our model scored 87 on a certain benchmark" than report three or four numbers such as accuracy, precision, and recall at once. If detailed analysis is needed, it can go in the analysis section of the benchmark paper, but for external promotion and comparison the focus should be a single overall score (a minimal sketch of this kind of aggregation follows this list).
(2) Provide strong baselines. When publishing a benchmark, report the performance of some of the strongest available models on the dataset, including large proprietary models (such as GPT-4.1 or o3) and leading open-source models. Do not compare only against old or weak models; doing so makes the benchmark look hard on the surface but misleads everyone. The right approach is to honestly report the results of strong models, so that the benchmark's difficulty is reflected fairly and newcomers know exactly how far they are from the state of the art.
(3) Do not demand permanent difficulty. Benchmarks have limited lifespans and are often "broken" or saturated within a year or two of release. Deep learning moves so fast that no one can predict model capabilities five years out, so there is no need to avoid time-sensitive questions for the sake of durability. Even if the answers to some questions may change in two years, it does not matter; it is enough that the benchmark drives progress for current models. Well-known general NLP benchmarks such as GLUE and SuperGLUE were matched or surpassed at the human level within a year or two of release. Rather than trying to design "timeless" questions, focus on problems that are genuinely hard and meaningful right now and let the benchmark lead near-term research.
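As a concrete illustration of point (1), here is a minimal sketch of reporting one headline number while keeping per-category breakdowns for the analysis section only. The field names and example results are made up for illustration.

```python
# Minimal sketch: one headline score for comparisons, breakdowns kept for analysis.
# The field names and results below are made up for illustration.
from collections import defaultdict

results = [
    {"instance_id": "astropy-123", "repo": "astropy", "resolved": True},
    {"instance_id": "django-456", "repo": "django", "resolved": False},
    {"instance_id": "django-789", "repo": "django", "resolved": True},
]

# The single number used on the leaderboard and in external comparisons.
resolved_rate = 100 * sum(r["resolved"] for r in results) / len(results)
print(f"Overall resolved rate: {resolved_rate:.1f}%")  # 66.7%

# Per-repository breakdown, reserved for the paper's analysis section rather
# than the headline comparison.
by_repo = defaultdict(list)
for r in results:
    by_repo[r["repo"]].append(r["resolved"])
for repo, flags in sorted(by_repo.items()):
    print(f"{repo}: {100 * sum(flags) / len(flags):.1f}%")
```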
GLUE illustrates how quickly newly proposed tasks can be "cracked". When GLUE was released in 2018, it was considered beyond the reach of existing methods, yet within roughly a year models matched or exceeded the human baseline, and researchers launched the harder SuperGLUE benchmark to reopen the gap between humans and models.
Conclusion
As Press says, good benchmarks "provide a wide space for creativity and can have a huge impact on guiding the future of the community." Building an excellent benchmark for large models is not easy: it requires balancing naturalness, evaluability, and challenge, while attending to details such as preventing leakage and presenting results well. But with the right benchmarks we can compare models more fairly, pinpoint their weaknesses, and improve their capabilities effectively, which is crucial for ensuring that AI systems develop in a useful and reliable direction.
Of course, these guidelines are not rigid dogma. As Press emphasizes, "rules are meant to be broken": a benchmark that does not fully meet every criterion is not necessarily a bad one. The standards are best treated as a reference for checking whether a design is on the right track, and specific situations may call for flexible trade-offs. Even if a benchmark cannot tick every box, it is still worth building and attacking as long as the overall direction is right. Hopefully these tips help the community create the next outstanding evaluation benchmark for large models, and remind us not to forget benchmarking, the cornerstone of progress, while we focus on breakthroughs in model capability.
Final thoughts
This article has summarized some basic principles for building benchmarks. I hope you can refer to them when constructing your own and design meaningful benchmarks that measure model capabilities correctly and effectively.