Current evaluation sets are too easy: OpenAI launches BrowseComp, a new benchmark for deep-search evaluation

Written by Iris Vance
Updated on: July 1, 2025

OpenAI releases BrowseComp, a new deep-search benchmark that challenges AI's ability to dig deep for information on the Internet.

Core content:
1. The design concept and challenges of BrowseComp
2. Comparison of AI and human performance in benchmark tests
3. OpenAI open-sources BrowseComp to promote the development of AI agents

In addition to yesterday's release of the personal memory feature (see: The mystery is revealed! ChatGPT's memory function has been fully upgraded, and your exclusive ChatGPT is online), OpenAI also announced the launch and open-sourcing of BrowseComp, a new and extremely challenging benchmark. It is designed to precisely measure a core capability of intelligent agents, locating extremely hard-to-find information on the Internet, and to address significant shortcomings of current evaluation methods.

As AI agents increasingly rely on web browsing to acquire knowledge, evaluating their ability to deeply mine and synthesize information becomes critical. Existing benchmarks (such as SimpleQA) have already been effectively saturated by advanced models with browsing tools (such as GPT-4o with browsing enabled), and they cannot measure whether an AI can solve complex, real-world challenges that require persistent exploration and information integration across many websites.

BrowseComp is designed to fill this gap. It contains 1,266 carefully constructed, difficult questions whose core feature is being "difficult to find, easy to verify": the questions require short, unambiguous, well-documented answers, yet are deliberately designed to resist simple searches. Unlike plain information retrieval, it forces AI agents to exercise strong factual reasoning, retrieval, browsing, and analysis capabilities. For example:

Please identify the title of a research publication published before June 2023 that mentions cultural traditions, scientific processes, and culinary innovations. The publication is co-authored by three authors, one of whom is an assistant professor in West Bengal and another holds a Ph.D.

Answer: The Basics of Breadmaking: The Science of Bread

In the 1990s, a new school was formed by merging a girls' school with a boys' school to form a coeducational school in a historic town dating back to the second half of the 19th century. The new school was given a Latin name. What was the name of this girls' school?

Answer: Convent of the Sisters of Charity
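The "easy to verify" half of the design means every question comes with a short reference answer that can be checked almost mechanically. The sketch below is a minimal, illustrative check assuming grading by normalized exact match; the released harness in simple-evals does its own grading, so the function names and normalization rules here are assumptions, not the official grader.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_correct(predicted: str, reference: str) -> bool:
    """Check whether a short predicted answer matches the reference answer exactly."""
    return normalize(predicted) == normalize(reference)


if __name__ == "__main__":
    reference = "Convent of the Sisters of Charity"
    print(is_correct("  Convent of the Sisters of Charity. ", reference))  # True
    print(is_correct("The Sisters of Charity school", reference))          # False
```

The second example shows why naive exact matching is brittle: a paraphrased but essentially correct answer is rejected, which is one reason a more tolerant grader may be preferable in practice.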

The distribution of question topics in the benchmark is shown below:

[Figure: distribution of BrowseComp question topics]

This benchmark is extremely challenging:

  • It is a severe test for top AI models: even GPT-4o (with browsing enabled) achieves an accuracy of only 1.9%.
  • It is equally difficult for humans: in dedicated trials, experienced human researchers solved only 29.2% of the problems within 2 hours.

The test results clearly reveal the capability gap: while standard models performed poorly, OpenAI's Deep Research agent, trained specifically for deep research and persistent web browsing, performed well, reaching an accuracy of 51.5%. This strongly demonstrates that BrowseComp can distinguish genuine deep information-retrieval capability. The study also shows that increasing test-time (inference) compute significantly improves performance.
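One simple way to spend more test-time compute, in the spirit of the scaling trend mentioned above, is to sample several independent attempts per question and aggregate them, for example by majority vote over the sampled answers. This is only an illustrative aggregation strategy, not OpenAI's specific method; ask_agent below is a hypothetical stand-in for a browsing-capable model call.

```python
import random
from collections import Counter
from typing import Callable


def best_of_n(question: str, ask_agent: Callable[[str], str], n: int = 8) -> str:
    """Sample n independent answers and return the most frequent one."""
    answers = [ask_agent(question).strip() for _ in range(n)]
    counts = Counter(a.lower() for a in answers)
    winner, _ = counts.most_common(1)[0]
    # Return the original casing of the first sample that matches the winning answer.
    return next(a for a in answers if a.lower() == winner)


if __name__ == "__main__":
    # Toy stand-in for a browsing agent: right about 3 times out of 4, just to show the idea.
    def noisy_agent(_question: str) -> str:
        return random.choice(["Convent of the Sisters of Charity"] * 3 + ["Unknown"])

    print(best_of_n("Which girls' school was merged into the new school?", noisy_agent))
```

Larger n costs proportionally more inference compute, and the vote only helps when the agent produces the correct answer more often than any single wrong one.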

OpenAI emphasizes that by open-sourcing BrowseComp it hopes to spur the research community to develop more powerful, reliable, and trustworthy AI agents. Although BrowseComp focuses on a specific core capability, it provides an indispensable and easy-to-score tool for measuring the key skills an AI needs to navigate the information maze: persistence and creativity.

BrowseComp is now publicly available through OpenAI's simple-evals GitHub repository. OpenAI invites researchers around the world to evaluate and innovate with the benchmark and looks forward to their feedback. To preserve the benchmark's long-term validity, OpenAI strongly recommends not publicly posting specific examples from the dataset on the Internet. The launch of this benchmark will not only align evaluation standards but, more importantly, spur major vendors to invest in this area. Deep search is about to have its "super moment"!

Project address: https://github.com/openai/simple-evals
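The actual dataset and grading harness live in the repository above. As a rough, self-contained illustration of what "running the benchmark" amounts to (loop over question/answer pairs, query an agent, report accuracy), a sketch might look like the following; the demo question, the lambda agent, and the evaluate helper are all placeholders, not the simple-evals API.

```python
from typing import Callable, Iterable, Tuple


def evaluate(pairs: Iterable[Tuple[str, str]], ask_agent: Callable[[str], str]) -> float:
    """Return the fraction of (question, reference answer) pairs answered correctly."""
    pairs = list(pairs)
    correct = sum(
        ask_agent(question).strip().lower() == answer.strip().lower()
        for question, answer in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Placeholder data; the real BrowseComp questions ship with simple-evals.
    demo_pairs = [
        ("Which girls' school was merged into the new coeducational school?",
         "Convent of the Sisters of Charity"),
    ]
    accuracy = evaluate(demo_pairs, lambda q: "Convent of the Sisters of Charity")
    print(f"accuracy: {accuracy:.1%}")
```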