Alibaba open-sources WebDancer to solve complex information retrieval problems in DeepResearch

Written by
Silas Grey
Updated on: June 18, 2025
Recommendation

Explore how Alibaba revolutionizes deep research information retrieval through WebDancer.
Core content:
1. The challenges of high-quality datasets and reliable trajectory construction faced by Deep Research
2. WebDancer end-to-end autonomous information retrieval agent construction paradigm
3. WebDancer's outstanding performance in benchmarks and its core advantages



Problems facing Deep Research

  • High-quality datasets:
    • Most existing QA datasets are shallow and cannot support multi-step reasoning; high-quality, fine-grained browsing data that reflects diverse user intents and rich interaction contexts must be constructed.
  • Reliable trajectory construction:
    • Building reliable trajectories that support long-horizon reasoning and task decomposition.
  • Scalable and generalizable training strategies:
    • Designing scalable, generalizable training strategies so that agents remain robust in out-of-distribution web environments, under complex interaction patterns, and over long-horizon goals.

WebDancer

From the data and training perspective, the authors propose an end-to-end paradigm for building an autonomous information retrieval agent through four key stages:

  • Browsing data construction
    • CRAWLQA crawls knowledge-centric websites, imitating human browsing behavior by recursively visiting sub-pages, and uses GPT-4o to generate QA pairs from the collected content.
    • This captures rich background knowledge and provides a basis for constructing complex questions.
    • E2HQA (easy-to-hard QA) starts from simple QA pairs and progressively increases question complexity through iterative searching and rewriting (a sketch of this loop follows the list).
    • This lets the agent transition gradually from simple to complex tasks and improves its reasoning ability.
  • Trajectory sampling
    • Based on the ReAct framework, the agent interacts through Thought-Action-Observation rounds (see the ReAct loop sketch after this list).
    • Through rejection sampling, short-chain-of-thought (Short-CoT) and long-chain-of-thought (Long-CoT) strategies are combined to generate high-quality trajectories.
    • A three-stage filtering framework is applied: validity control, correctness verification, and quality assessment, ensuring the high quality of the trajectories.
  • Supervised Fine-tuning (SFT) for Efficient Cold Start
    • Using the synthesized trajectory data, the agent is fine-tuned for multi-step reasoning tasks.
    • The loss contribution of externally generated observation tokens is masked out, so environment feedback does not interfere with learning, improving performance and robustness (see the loss-masking sketch after this list).
  • Optimizing the agent’s decision-making and generalization capabilities through reinforcement learning (RL)
    • The DAPO algorithm optimizes the agent's decision-making process through a dynamic sampling mechanism, improving its generalization in real-world web environments (a sketch of the dynamic-sampling filter follows the list).
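
The easy-to-hard idea behind E2HQA can be pictured as an iterative rewrite loop: retrieve a fact related to the current question and fold it into the question while keeping the answer fixed. The sketch below is only an illustration of that idea; the llm and search helpers are hypothetical, not the paper's actual tooling.

```python
# Illustrative easy-to-hard QA construction in the spirit of E2HQA.
# `llm` and `search` are hypothetical helpers, not the paper's actual tooling.

def complexify_question(seed_qa: dict, llm, search, n_steps: int = 3) -> dict:
    """Iteratively rewrite a simple QA pair into a harder, multi-hop one."""
    question, answer = seed_qa["question"], seed_qa["answer"]
    for _ in range(n_steps):
        # Retrieve a fact related to the current question ...
        evidence = search(question)
        # ... and ask the model to fold that fact into a harder question
        # whose answer stays the same.
        question = llm(
            "Rewrite the question so that answering it additionally requires "
            f"the following fact, keeping the answer '{answer}' unchanged.\n"
            f"Fact: {evidence}\nQuestion: {question}"
        )
    return {"question": question, "answer": answer}
```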
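
A ReAct-style rollout alternates the model's own Thought and Action with an Observation returned by a tool. The following minimal sketch assumes hypothetical search/visit-style tools and a simple text protocol; it is not WebDancer's exact interface.

```python
# Minimal ReAct-style rollout: the model alternates Thought and Action,
# and the environment returns an Observation after each tool call.
# Tool names, prompt format, and the stop condition are illustrative assumptions.

def parse_action(step: str) -> tuple[str, str]:
    """Illustrative parser: expects a line like 'Action: search[some query]'."""
    line = [l for l in step.splitlines() if l.startswith("Action:")][-1]
    name, _, arg = line.removeprefix("Action:").strip().partition("[")
    return name.strip(), arg.rstrip("]")

def react_rollout(question: str, llm, tools: dict, max_turns: int = 10) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm(context)                  # model emits a Thought and an Action
        context += step + "\n"
        if "Final Answer:" in step:          # agent decides it has enough evidence
            return step.split("Final Answer:")[-1].strip()
        tool_name, arg = parse_action(step)  # e.g. ("search", "some query")
        observation = tools[tool_name](arg)  # execute the tool call
        context += f"Observation: {observation}\n"
    return ""
```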
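
"Shielding the loss contribution of external feedback" during SFT amounts to masking observation tokens in the label sequence, so only the agent's own thoughts and actions are supervised. A minimal sketch, assuming the common convention that a label of -100 is ignored by the loss:

```python
# Sketch of observation masking for agent SFT. Assumes the common convention
# that label -100 is ignored by the cross-entropy loss (as in Hugging Face
# Transformers); how observation spans are located is simplified here.

IGNORE_INDEX = -100

def mask_observation_labels(input_ids: list[int],
                            observation_spans: list[tuple[int, int]]) -> list[int]:
    """Copy input_ids into labels, but ignore tokens that came from the
    environment (tool observations) so they contribute no loss."""
    labels = list(input_ids)
    for start, end in observation_spans:   # [start, end) token ranges of observations
        for i in range(start, end):
            labels[i] = IGNORE_INDEX
    return labels
```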
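
DAPO's dynamic sampling keeps only prompt groups whose rollouts receive differing rewards, since groups that are all-correct or all-wrong produce zero advantage and thus no gradient signal. The sketch below shows that filtering step only, under an assumed 0/1 outcome reward; it is not a full DAPO implementation.

```python
# Sketch of DAPO-style dynamic sampling: groups of rollouts whose rewards are
# all identical (all correct or all wrong) carry no advantage signal and are
# dropped; sampling continues upstream until the batch is filled with useful groups.
# The policy interface and the 0/1 reward are assumptions for illustration.

def dynamic_sample(prompts, policy, reward_fn, group_size: int = 8):
    kept = []
    for prompt in prompts:
        rollouts = [policy.generate(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, r) for r in rollouts]   # e.g. 0/1 correctness
        if len(set(rewards)) > 1:                            # non-degenerate group only
            kept.append((prompt, rollouts, rewards))
    return kept
```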

Experimental Results

In the experiments, WebDancer performs well on two benchmarks: GAIA and WebWalkerQA.

  • On the Level 1, Level 2, and Level 3 splits of GAIA, WebDancer achieved pass rates of 41.0%, 30.7%, and 0%, respectively, significantly outperforming other open-source frameworks and indicating a clear advantage in handling complex information retrieval tasks.

At its core, WebDancer shows that high-quality data and effective training methods can enable agents to perform well in dynamic, changing web environments.