Alibaba Open-Sources WebAgent: Enabling AI to Search, Reason and Act Autonomously Like Humans

Written by

Iris Vance

Updated on: June 15, 2025

01 The AI revolution that breaks through traditional search

In an era of information overload, it is easy to drown in data. Alibaba's newly open-sourced WebAgent aims to address this problem. It can autonomously perceive the web environment, make multi-step decisions, and carry out complex tasks, from deep mining of academic literature to cross-platform information integration. In doing so, it shows web interaction capabilities close to those of a human.

Traditional search engines require users to describe their needs precisely and filter results by hand, whereas WebAgent can interpret vague instructions. When a user asks to "learn about the latest breakthroughs in quantum computing", it automatically searches platforms such as arXiv and IEEE Xplore, filters out irrelevant literature, compares the research paths of different teams, and generates an integrated report. This capability stems from its dual-module design:

WebDancer
An end-to-end training framework that improves multi-step search capabilities
WebWalker
A benchmark for evaluating large language models on web page traversal

Case test
Given the input "Compare the advantages and disadvantages of GPT-4 and Claude 3 in code generation", the system automatically traverses Stack Overflow, technical blogs, and paper repositories, extracts key indicators such as test cases and error rates, and generates a comparison matrix.

02 Data Construction: Breaking the Invisible Cage of AI Training

The core innovation of the WebDancer framework begins with the construction of high-quality training data. Existing datasets such as Mind2Web cover a variety of websites, but they lack task diversity and the quality of their operation trajectories is uneven. The Alibaba team addressed this bottleneck with two methods:

▍Short-trajectory reasoning
A large language model directly generates a concise operation path. For example, in the "book an economy hotel" task, the system generates a standardized process of "select date → filter room types → compare prices". Trajectory coherence scores 85.7 points on the HumanEval score (according to the WebDancer paper), well above the 72.3 points of the traditional method.

▍Long-trajectory reasoning
Complex decision chains are built through iterative prompting. In the task of "writing a research review in a certain field", the system simulates the whole process of "searching databases → screening highly cited papers → extracting theoretical frameworks → comparing experimental methods → integrating controversial points", covering deep reasoning scenarios that are missing from traditional datasets.
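The iterative prompting loop described above can be sketched roughly as follows. This is a minimal illustration, not the WebDancer implementation: the `Step` and `Trajectory` types, the `policy` and `env` callables, and the `"finish"` action are all hypothetical names chosen for this example, and a scripted function stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str
    action: str
    observation: str


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)

    def as_prompt(self) -> str:
        """Serialize the task and all steps so far as context for the next call."""
        lines = [f"Task: {self.task}"]
        for i, s in enumerate(self.steps, 1):
            lines += [
                f"Thought {i}: {s.thought}",
                f"Action {i}: {s.action}",
                f"Observation {i}: {s.observation}",
            ]
        return "\n".join(lines)


# Hypothetical interfaces: the policy maps a prompt to (thought, action);
# the environment maps an action to an observation string.
Policy = Callable[[str], Tuple[str, str]]
Environment = Callable[[str], str]


def build_long_trajectory(task: str, policy: Policy, env: Environment,
                          max_steps: int = 8) -> Trajectory:
    """Grow a trajectory one step at a time, re-prompting with the full history."""
    traj = Trajectory(task)
    for _ in range(max_steps):
        thought, action = policy(traj.as_prompt())
        if action == "finish":  # assumed stop signal
            traj.steps.append(Step(thought, action, ""))
            break
        traj.steps.append(Step(thought, action, env(action)))
    return traj


# Scripted stand-in for an LLM, following the review-writing stages in the text.
PLAN = [
    "search databases",
    "screen highly cited papers",
    "extract theoretical frameworks",
    "compare experimental methods",
    "integrate controversial points",
]


def scripted_policy(prompt: str) -> Tuple[str, str]:
    done = sum(line.startswith("Action ") for line in prompt.splitlines())
    if done < len(PLAN):
        return (f"Next stage: {PLAN[done]}", PLAN[done])
    return ("All stages covered", "finish")


trajectory = build_long_trajectory(
    "write a research review", scripted_policy, lambda a: f"done: {a}"
)
```

Each call sees the full history, so later steps can depend on earlier observations. In this sketch, that dependence is what separates a long trajectory from a path generated in a single pass.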

Key breakthrough
While the volume of synthetic data increased threefold, trajectory validity reached 92% in manual evaluation (WebDancer paper data).

03 Supervised fine-tuning: Let AI learn to "think independently"

Once the data is ready, WebDancer gives the agent its initial capabilities through supervised fine-tuning (SFT):

Trajectory deconstruction
Break each operation down into three elements: thought (Why click?), action (Click button), and observation (Page loaded)
Feedback masking
The loss function evaluates only the rationality of the action and ignores environmental feedback (such as whether the page jump succeeded)
Strengthened decision-making logic
Force the model to develop internal judgment mechanisms rather than relying on external signals
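One common way to make a loss ignore environmental feedback is to mask observation tokens out of the token-level objective. The sketch below illustrates that idea in plain Python. It assumes that both thought and action tokens stay in the loss and that only observation tokens are masked; the article does not specify this split, and `build_loss_mask` and `masked_nll` are hypothetical helper names.

```python
from typing import List, Tuple


def build_loss_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Return 1 for tokens the model is trained on, 0 for masked tokens.

    `segments` lists (role, token_count) pairs in sequence order.
    Assumption: only "observation" segments are excluded from the loss.
    """
    mask: List[int] = []
    for role, n_tokens in segments:
        mask += [0 if role == "observation" else 1] * n_tokens
    return mask


def masked_nll(token_log_probs: List[float], mask: List[int]) -> float:
    """Average negative log-likelihood over unmasked tokens only."""
    if len(token_log_probs) != len(mask):
        raise ValueError("log-probs and mask must have the same length")
    kept = [-lp for lp, m in zip(token_log_probs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy step: 2 thought tokens, 1 action token, 3 observation tokens.
segments = [("thought", 2), ("action", 1), ("observation", 3)]
mask = build_loss_mask(segments)

# The observation tokens have very poor log-probs, but they do not affect the loss.
log_probs = [-0.5, -1.0, -1.5, -9.0, -9.0, -9.0]
loss = masked_nll(log_probs, mask)
```

Because the page content returned by the environment contributes no gradient, the model is not rewarded for predicting what a page will say, only for what it decides to do.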

▶ Training results:
The SFT-trained model achieved a 45.6% success rate on the WebShop task (data source: WebDancer paper), laying the foundation for subsequent reinforcement learning. In a simulated air ticket booking test, the agent autonomously handled the chain reaction of "no tickets for the flight → automatically adjust the date → match alternative routes".

04 Reinforcement Learning: A Smart Engine for Dynamically Optimizing Decisions

The application of the DAPO algorithm enables WebAgent to achieve a leap in capabilities. The algorithm uses a dynamic sampling mechanism to efficiently utilize neglected high-quality training samples:

▍Trial and error evolution case

Initial failure
When booking a five-star hotel in Shanghai, the agent directly chose a high-priced room type and exceeded the budget
Strategy adjustment
It learned to "set price filters first → compare user ratings → exclude hidden costs"
Success
It selected hotels rated 4.8 with breakfast included in the 300-500 yuan range

▍Performance Leap
After millions of interaction iterations, WebAgent achieved a 73.2% task completion rate on the WebArena benchmark, 28 percentage points higher than the SFT-only model (ablation experiment in the paper). In cross-site tasks in particular (such as "collect travel guides from Zhihu → compare prices on Ctrip"), the success rate reaches 68.5%.
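The dynamic sampling mechanism mentioned at the start of this section can be illustrated with a small sketch. It follows the general idea of dynamic sampling in DAPO: task groups whose rollouts all receive the same reward give zero group-relative advantage, so they are skipped and sampling continues until the batch is full. How WebAgent applies this is not detailed in the article, so the function names, the toy tasks, and the `rollout` callable here are assumptions for illustration.

```python
from itertools import cycle
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards within one group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # uniform group carries no learning signal
    return [(r - mu) / sigma for r in rewards]


def dynamic_sample(
    tasks: Sequence[str],
    rollout: Callable[[str], float],  # hypothetical: runs the agent, returns a reward
    group_size: int,
    batch_size: int,
    max_attempts: int = 1000,
) -> List[Tuple[str, List[float]]]:
    """Keep sampling task groups until `batch_size` informative groups are found."""
    batch: List[Tuple[str, List[float]]] = []
    attempts = 0
    while len(batch) < batch_size and attempts < max_attempts:
        task = tasks[attempts % len(tasks)]
        attempts += 1
        rewards = [rollout(task) for _ in range(group_size)]
        if len(set(rewards)) == 1:
            continue  # all successes or all failures: skip and resample
        batch.append((task, group_advantages(rewards)))
    return batch


# Toy rollout: "easy" always succeeds, "hard" always fails,
# and any other task alternates between success and failure.
_alternating = cycle([1.0, 0.0])


def toy_rollout(task: str) -> float:
    if task == "easy":
        return 1.0
    if task == "hard":
        return 0.0
    return next(_alternating)


batch = dynamic_sample(["easy", "medium", "hard"], toy_rollout,
                       group_size=4, batch_size=2)
```

In this toy run only the "medium" task enters the batch, because it is the only one where some rollouts succeed and others fail.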

05 Real scenario: Redefining the way to obtain information

▶ Enterprise market analysis
Given the command "Collect pricing strategies of competing new energy vehicle products", the system automatically completes the following:

Crawl Tesla's official website and Autohome configuration tables
Identify the Xpeng G9 strategy of "reducing lidar and cutting the price by 15%"
Integrate the timetable of NIO's battery swap subsidy policy

The time required is only 1/10 of that of traditional manual research.

▶ Academic research acceleration
Given a researcher's instruction, "Analyze the latest clinical trials of targeted drugs for Alzheimer's disease", WebAgent:

Traverses the ClinicalTrials.gov and PubMed databases
Extracts the control group efficacy of 6 Phase III trials (64.2%-81.7%)
Marks the risk of Eli Lilly's Donanemab causing cerebral edema (incidence rate 12.8%)
Generates a final comparative report with data sources marked