Sequoia China just launched a new AI benchmark, xbench! Defining "good questions" in the second half of AI, and the three stages of agent tech-market fit

Written by
Audrey Miles
Updated on: June 19, 2025
Recommendation

Sequoia China has launched the AI benchmark xbench, pointing to a new direction for AI technology breakthroughs and product iteration.

Core content:
1. Sequoia China jointly launched the xbench AI benchmark system with universities and institutions
2. xbench evaluates the theoretical upper limit and actual utility of AI systems, using a dual-track evaluation system
3. The first release includes two core evaluation sets, covering scientific question answering and deep search on the Chinese internet



Today, Sequoia China released a new AI benchmark evaluation system, xbench (xbench.org), together with the paper "xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-world Evaluations".

This is the first AI benchmark initiated by an investment institution. It involves dozens of doctoral students from more than a dozen top universities and research institutions in China and abroad, and it uses a dual-track evaluation system together with an evergreen evaluation mechanism.

It is also another landmark move by Sequoia China following its heavy investment in the foundation model space. While probing the capability ceiling and technical boundaries of AI systems, xbench will focus on quantifying the utility of AI systems in real-world scenarios and on capturing key breakthroughs in agent products over the long term.

With the rapid development of foundation models and the large-scale deployment of AI agents, the benchmarks widely used to evaluate AI capabilities face an increasingly acute problem: they struggle to reflect the true capabilities of AI systems. The most direct symptom is that foundation models have "blown up" the public question banks, scoring high or even perfect marks on the major leaderboards.

Building a more scientific, longer-lived evaluation system that reflects the objective capabilities of AI is therefore becoming an important requirement for guiding AI technology breakthroughs and product iteration.

Features of the xbench benchmark

xbench adopts a dual-track evaluation system and builds a multi-dimensional evaluation dataset, aiming to track both the theoretical capability ceiling of models and the real-world deployment value of agents. The system divides evaluation tasks into two complementary tracks:

  • Assess the capabilities and technical boundaries of AI systems;

  • Quantify the utility of AI systems in real-world scenarios. The latter requires dynamic alignment with real application needs, building evaluation standards with clear business value for each vertical field based on actual workflows and concrete professional roles.


xbench adopts an Evergreen Evaluation mechanism: the test content is continuously maintained and dynamically updated to keep it timely and relevant. xbench will regularly evaluate the mainstream agent products on the market, track the evolution of model capabilities, capture key breakthroughs during agent product iteration, and on that basis forecast the tech-market fit (TMF) of the next agent application. As an independent third party, xbench is committed to designing a fair evaluation environment for each type of product and providing objective, reproducible results.
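To make the mechanism concrete, here is a minimal sketch (in Python) of how an evergreen, dual-track harness could be organized. It is purely illustrative rather than xbench's published interface: the task source, the agent callables, and the scoring function are hypothetical placeholders.

```python
# Illustrative sketch only; xbench has not published this interface.
# The idea: refresh the question pool on every run, keep two task tracks
# (capability vs. profession-aligned), and report per-track averages.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str
    reference: str      # gold answer or grading rubric
    track: str          # "agi_tracking" or "profession_aligned"
    collected_at: str   # date the item entered the live pool


def evergreen_eval(
    agents: Dict[str, Callable[[str], str]],     # product name -> inference function (hypothetical)
    fetch_live_tasks: Callable[[], List[Task]],  # pulls the freshest question pool (hypothetical)
    score: Callable[[str, Task], float],         # grades one answer against its reference
) -> Dict[str, Dict[str, float]]:
    """Run every agent on the current question pool and report per-track averages."""
    tasks = fetch_live_tasks()  # re-fetched each run, so leaked or stale items age out
    results: Dict[str, Dict[str, float]] = {}
    for name, agent in agents.items():
        per_track: Dict[str, List[float]] = {}
        for task in tasks:
            answer = agent(task.prompt)
            per_track.setdefault(task.track, []).append(score(answer, task))
        results[name] = {t: sum(s) / len(s) for t, s in per_track.items()}
    return results
```

The point of the structure is simply that the question pool is re-fetched on every run, so individual items can be retired or replaced without changing how results are reported.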


The first release includes two core evaluation sets, the Science Question Answering set (xbench-ScienceQA) and the Chinese Internet Deep Search set (xbench-DeepSearch), together with a comprehensive ranking of the main products in each area. At the same time, xbench proposes an evaluation methodology for vertical-domain agents and has built vertical agent evaluation frameworks for the recruitment and marketing domains.


Over the past two years, xbench has been a tool used internally by Sequoia China to track and evaluate the capabilities of foundation models. Today, Sequoia has made it public and contributed it to the entire AI community. Whether you are a developer of foundation models and agents, an expert or enterprise in a related field, or a researcher with a strong interest in AI evaluation, xbench welcomes you to join in using and improving it, and to help create a new paradigm for evaluating AI capabilities.


xbench began as Sequoia China's internal monthly review of AGI progress and mainstream models, started after the launch of ChatGPT in 2022. In building and continuously upgrading this "private question bank", Sequoia China found that mainstream models were "cracking" the questions faster and faster, and the useful lifespan of a benchmark was shrinking dramatically. It was precisely this shift that led Sequoia China to question existing evaluation methods:


When everyone gets full marks, is it because the students have become smarter, or is there something wrong with the test paper?

▍Problems and ideas that Sequoia China hopes to solve

Sequoia China therefore set out to address two core problems:

1) What is the relationship between model capability and the actual utility of AI? What is the point of ever harder benchmark questions? Are we falling into inertial thinking? Is the real economic value of deployed AI actually positively correlated with the difficulty of the problems AI can solve?


2) Comparing capabilities across time: every time xbench swaps its question bank, it loses the ability to track AI capabilities before and after the change. Because model versions keep iterating under the new question bank, it becomes impossible to compare how a single model's capabilities evolve over time.

When judging a startup, the founder's growth slope is an important signal; but when evaluating AI capabilities, constant updates to the question bank invalidate that kind of judgment.

To solve these two problems, xbench offers a new approach:


1) Break out of inertial thinking and design novel task settings and evaluation methods oriented toward real-world utility.


As AI enters the "second half", we will need not only increasingly difficult capability benchmarks (AI Capabilities Evals) but also a set of practical task systems (Utility Tasks) aligned with real-world professionals. The former probes the capability boundary and is reported as a score; the latter examines practical tasks and environmental diversity, business KPIs (conversion rate, closing rate), and direct economic output.


xbench therefore introduces the concept of a Profession-Aligned benchmark. Subsequent evaluations will follow a "dual-track system", split into AGI Tracking and Profession-Aligned. AI will face more tests of its effectiveness in complex environments, with dynamic question sets collected from real business, rather than just ever harder IQ questions.

2) Establish an evergreen evaluation system. Once a static evaluation set is released, question leakage becomes a problem, leading to overfitting and rapid obsolescence. Evaluations such as LiveBench and LiveCodeBench, which expand their evaluation sets with dynamically updated questions, emerged to alleviate this leakage problem.

For AI Capabilities Evals: the academic community has proposed many excellent methodologies, but limited resources and time have made it impossible to maintain dynamically expanding, continuous evaluation. xbench hopes to carry forward the methods of this line of public evaluation sets and provide third-party, black-box and white-box, live evaluations.


For Profession-Aligned Evals: xbench hopes to establish a live collection mechanism drawing on real business, and to invite professional experts from various industries to jointly build and maintain dynamic evaluation sets for their industries.


At the same time, on top of these dynamic updates, xbench designs horizontally comparable capability indicators so that, over time and beyond the rankings, it can observe development speed and signals of key breakthroughs. This helps determine whether a model has reached the threshold for market deployment, and at what point an agent can take over existing business processes and provide services at scale.
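As a toy illustration of what a "horizontally comparable" indicator could look like, the sketch below normalizes each system's score against a fixed reference system evaluated on the same question-pool snapshot, so trends stay visible even as the pool is refreshed. This is only one possible approach assumed for illustration, not xbench's stated methodology; the system names, dates, and scores are invented.

```python
# Illustrative only: anchor every snapshot to a fixed reference system so that
# scores remain roughly comparable even after the question bank is refreshed.
from typing import Dict, List, Tuple

Snapshot = Dict[str, float]  # system name -> raw score on that snapshot's question pool


def capability_index(
    history: List[Tuple[str, Snapshot]], reference: str
) -> Dict[str, List[Tuple[str, float]]]:
    """For each system, return its score relative to `reference` at each snapshot date."""
    series: Dict[str, List[Tuple[str, float]]] = {}
    for date, scores in history:
        anchor = scores[reference]
        for system, raw in scores.items():
            series.setdefault(system, []).append((date, raw / anchor))
    return series


history = [
    ("2025-03", {"ref-model": 0.40, "agent-A": 0.44, "agent-B": 0.36}),
    ("2025-06", {"ref-model": 0.42, "agent-A": 0.55, "agent-B": 0.41}),  # refreshed, harder pool
]
print(capability_index(history, "ref-model")["agent-A"])
# roughly [('2025-03', 1.10), ('2025-06', 1.31)] -> agent-A is pulling away from the fixed anchor
```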

▍Evaluate the Agent’s Tech-Market Fit

Evaluating agent applications still presents new challenges; these, too, can be alleviated by expanding the evaluation set with dynamically updated questions.

First, each version of an agent product has a life cycle. Agent products iterate very quickly: new features are continuously integrated and shipped, while older versions may be taken offline. Although we can test different agent products against one another at the same point in time, we cannot compare how a single product's capabilities improve across versions over time.

At the same time, the external environment the agent operates in also changes dynamically. Even for the same question, if solving it requires tools such as internet applications whose content updates rapidly, the test results will differ at different times.

Cost is also a decisive factor in whether agent applications get deployed. Inference scaling lets models and agents achieve better results by spending more inference compute. That extra spend can come from the longer chains of thought produced by reinforcement learning, or from running and aggregating more inference passes on top of those chains to further improve results.

In real-world tasks, however, one must weigh the return on the extra compute that inference scaling consumes and find a balance among cost, latency, and quality.
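This trade-off can be made concrete with a toy model: assume accuracy saturates as more reasoning samples are aggregated while cost grows linearly with the number of samples, and pick the operating point that maximizes delivered value minus compute cost. The saturation curve and all the numbers below are invented purely for illustration.

```python
# Toy illustration of the cost/quality trade-off described above, not an xbench formula.
def expected_accuracy(n_samples: int, base: float = 0.62, ceiling: float = 0.90) -> float:
    """Hypothetical saturating curve: each extra sample closes 35% of the remaining gap."""
    acc = base
    for _ in range(n_samples - 1):
        acc += 0.35 * (ceiling - acc)
    return acc


def net_value(n_samples: int, task_value: float = 1.00, cost_per_sample: float = 0.03) -> float:
    """Expected value delivered per task minus the compute spent on the samples."""
    return task_value * expected_accuracy(n_samples) - cost_per_sample * n_samples


best = max(range(1, 16), key=net_value)
print(best, round(expected_accuracy(best), 3), round(net_value(best), 3))
# With these made-up numbers, about 4 samples is the sweet spot: beyond that,
# each extra sample costs more than the marginal accuracy it buys.
```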

Similar to ARC-AGI, for each evaluation set we will seek to report, on a quality-versus-cost graph, the demand curve, the human capability curve, and the optimal supply curve of existing products. On this benchmark score-versus-cost graph, the upper-left area can be marked out as the market acceptance region and the lower-right area as the technical feasibility region.

Labor cost should form part of the boundary of the market acceptance region. In the accompanying figures, the left panel shows the state before the technology lands, the middle panel shows the state after TMF, and the intersection of the two regions is the incremental value brought by AI.
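Read literally, this picture suggests a simple check: an operating point (cost, score) lies in the market acceptance region when it beats the human cost and quality bar, and on the technical feasibility frontier when no other product is both cheaper and better; TMF means the two sets overlap. The sketch below encodes that reading; the human cost and quality thresholds and the example products are hypothetical numbers, not xbench data.

```python
# Toy sketch of the score-cost picture, with made-up thresholds and products.
from typing import List, Tuple

HUMAN_COST = 20.0   # hypothetical cost per task for a human expert
HUMAN_SCORE = 0.80  # hypothetical human quality bar on the benchmark


def market_accepts(cost: float, score: float) -> bool:
    """Market acceptance region: cheaper than human labor and at least human quality."""
    return cost < HUMAN_COST and score >= HUMAN_SCORE


def tech_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep Pareto-optimal (cost, score) points: no other point is both cheaper and better."""
    return [
        (c, s) for c, s in points
        if not any(c2 <= c and s2 >= s and (c2, s2) != (c, s) for c2, s2 in points)
    ]


products = [(2.0, 0.55), (6.0, 0.78), (12.0, 0.86), (40.0, 0.93)]  # (cost, score), invented
tmf_region = [p for p in tech_frontier(products) if market_accepts(*p)]
print(tmf_region)  # [(12.0, 0.86)] -> the regions overlap, i.e. TMF has been reached
```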

For AI scenarios that have reached TMF, human effort should shift toward the frontier of the field and toward tasks that cannot yet be evaluated, and the market will re-price the value of human contributions according to the differing scarcities of human labor and AI compute.

Each professional field is expected to go through three stages:

1. TMF not reached: the technical feasibility and market acceptance regions do not intersect. At this stage, the agent application is only a tool or a concept, unable to deliver results or generate value at scale, and the agent has little impact on people's work.

2. Agents and humans working together: the technical feasibility and market acceptance regions intersect, and the overlap is the incremental value brought by AI, which includes (1) providing viable services at a cost below the lowest human cost, and (2) helping with repetitive work that has moderate quality requirements. High-end work, because of data scarcity and greater difficulty, still requires human execution; and because of that scarcity, the AI profit captured by enterprises may be used to pay for high-end human output.

3. Professional agents: domain experts build evaluation systems and guide agent iteration. Experts' work shifts from delivering results to building professional evaluations, training vertical agents, and providing services at scale.

The transition from stage 1 to stage 2 is driven by breakthroughs in AI technology and the scaling of compute and data, while the progression from stage 2 to stage 3 depends on experts who are familiar with vertical requirements, standards, and historical experience.

In addition, in some areas AI may create new ways of meeting needs, changing existing business processes and the way production relations are organized.


On the day of the launch, the official website xbench.org published the first batch of evaluation results for mainstream foundation models and agents.

 

Sequoia China says xbench welcomes community co-construction. Foundation model and agent developers can use the latest xbench evaluation sets to verify their products' performance right away and obtain scores on the internal black-box evaluation set. Vertical agent developers, as well as professionals and enterprises in related fields, are welcome to co-build and publish Profession-Aligned xbench benchmarks with vertical standards for specific industries. And for researchers working on AI evaluation who have clear research ideas and want professional annotation and long-term maintenance of evaluation updates, xbench can help those ideas land and have a lasting impact.