WebDancer: Building an End-to-End Agentic Information Search Agent

Written by
Clara Bennett
Updated on: June 9, 2025

Alibaba's latest research explores the construction of intelligent agents that can search the Internet autonomously.

Core content:
1. Challenges of complex network information search problems
2. Four pillars of the WebDancer framework
3. End-to-end process from data construction to reinforcement learning

 

1. Why do we need AI that can access the Internet?

Imagine looking for information online. Some questions are simple: a single search surfaces the answer. More often, though, questions are complex and multi-step. We may need to search for a concept first, click a link in the results, dig out the relevant details from that page, and even jump between and compare multiple pages before finally piecing together an answer. This is a process that requires deep information seeking and multi-step reasoning.

Traditional AI models may be good at answering from an existing knowledge base or performing a single simple search, but they often cannot cope with a real web environment that demands active exploration and interaction. Recent work, such as OpenAI's Deep Research and x.ai's Grok DeepSearch, has demonstrated the potential of building agents with powerful information-seeking capabilities through end-to-end reinforcement learning. However, how to build such a web agent from scratch, one that can perceive the web environment, make decisions, and take actions to complete complex tasks the way a human would, remains full of challenges.

The challenges are mainly reflected in the following aspects:

  • How to obtain high-quality, fine-grained browsing data that reflects diverse user intent and rich interaction context.

  • How to build reliable agent trajectories that support long-range reasoning and task decomposition.

  • How to design scalable and generalizable training strategies so that agents perform robustly across unfamiliar web environments, complex interaction patterns, and long-horizon goals.

It is in this context that this paper proposes the WebDancer framework, which aims to provide a systematic guide for building end-to-end autonomous information search agents.

2. Core content: Four pillars of WebDancer

The core idea of WebDancer is to build an intelligent agent that can autonomously perform multi-step information seeking on the Internet. The paper abstracts the end-to-end process of building such an agent and proposes solutions along two dimensions: data construction and the training pipeline.

The main contributions of the paper can be summarized into the following four key stages:

1. Browsing data construction: Solve the problem of high-quality and diverse training data.

2. Trajectory sampling: Generate high-quality Thought-Action-Observation sequences of the agent performing tasks on the constructed data.

3. Supervised fine-tuning (SFT): Use the sampled trajectory data to fine-tune the basic model to achieve an effective "cold start" and allow the model to initially learn to imitate the behavior patterns of the intelligent agent.

4. Reinforcement learning (RL): Building on the SFT model, reinforcement learning further optimizes the agent's decision-making and generalization capabilities so that it performs better in real web environments.

This process provides a systematic, end-to-end pipeline for building long-horizon information-seeking web agents. The WebDancer framework is based on the ReAct paradigm, which tightly couples reasoning with action and is well suited to effective learning and generalization in interactive environments.

3. Method Analysis: Data, Trajectory and Two-stage Training

Let's analyze the specific methods of WebDancer in detail.

3.1 Deep Information Search Dataset Synthesis

Constructing complex and diverse QA pairs is key to building web agents, whether training uses SFT or RL. Most existing QA datasets tend to be "shallow" and can usually be solved with only one or two search steps. In order to generate complex QA pairs that can elicit multi-step reasoning, goal decomposition, and rich interaction sequences, WebDancer uses two methods to automatically synthesize high-quality datasets:

  • CRAWLQA: This approach simulates human browsing behavior and collects information by systematically crawling and clicking on sublinks on web pages. It starts with the root URLs of authoritative and knowledge-based websites, and then uses a powerful LLM like GPT-4o to generate QA pairs based on the collected web page content. To ensure the diversity and relevance of the questions, they generate specific types of questions (such as COUNT, MULTI-HOP, INTERSECTION) through instruction learning.

  • E2HQA (Easy-to-Hard QA): This method works by "reverse construction". It starts with simple QA pairs (SimpleQA-style, where the answer is a concise factual entity) and then iteratively complicates the question. Specifically, it selects an entity En in the current question Qn and uses an LLM to build a query that searches for information Cn related to En. The LLM then rewrites Cn into a new clause Rn that replaces the entity in the original question, forming a new question Qn+1. The new question can only be answered by first solving the newly constructed sub-problem, while the final answer stays unchanged. Controlling the number of rewrites controls the complexity of the question and the number of steps needed to solve it (a minimal sketch of this loop follows below). Figure 1 shows these two data generation pipelines.
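
To make the E2HQA loop concrete, here is a minimal sketch of the iterative rewriting procedure, assuming generic `llm` and `search` callables. The helper prompts and the entity-selection step are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the E2HQA "easy-to-hard" rewriting loop.
# Prompts, entity selection, and search tooling are assumptions for illustration.

def e2h_complicate(question: str, n_rewrites: int, llm, search) -> str:
    """Iteratively replace an entity in the question with a sub-question about it,
    increasing difficulty while keeping the final answer unchanged."""
    q_n = question
    for _ in range(n_rewrites):
        # 1. Pick an entity E_n mentioned in the current question Q_n.
        entity = llm(f"Name one factual entity mentioned in this question: {q_n}")
        # 2. Retrieve information C_n about E_n from the web.
        c_n = search(entity)
        # 3. Turn C_n into a clause R_n that identifies the entity without naming it,
        #    then substitute it for the entity in Q_n to obtain Q_{n+1}.
        r_n = llm(
            "Rewrite these facts as a descriptive phrase that uniquely identifies "
            f"'{entity}' without naming it directly:\n{c_n}"
        )
        q_n = q_n.replace(entity, r_n)  # answering Q_{n+1} now requires resolving R_n first
    return q_n  # the answer to the original question still holds
```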

     

3.2 Agent Trajectory Rejection Sampling

The ReAct framework is the foundation of WebDancer. A ReAct trajectory consists of multiple Thought-Action-Observation cycles: the agent generates a Thought (free-form reasoning) and an Action (a structured call to an environment tool), and receives an Observation (feedback from the environment). The cycle repeats until the task is complete, with the final action producing the answer. The available actions are search, visit, and answer.
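
As a rough illustration, such a rollout can be written as a simple loop. The parsing format, prompt layout, and tool callables (`llm`, `do_search`, `do_visit`) below are assumptions for the sketch, not WebDancer's actual interface.

```python
import json
import re

def parse_step(text: str):
    """Split a model completion of the assumed form
    '<thought> Action: <name> Action Input: <json args>'."""
    m = re.search(r"Action:\s*(\w+)\s*Action Input:\s*(\{.*\})", text, re.S)
    thought = text[: m.start()].strip() if m else text.strip()
    name = m.group(1) if m else "answer"
    args = json.loads(m.group(2)) if m else {"text": thought}
    return thought, name, args

def react_rollout(question, llm, do_search, do_visit, max_steps=20):
    """One Thought-Action-Observation rollout that ends with an 'answer' action."""
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, name, args = parse_step(llm(history + "Thought:"))
        if name == "answer":
            return args.get("text"), history                 # terminal action
        if name == "search":
            observation = do_search(args.get("query", ""))   # search-engine results
        elif name == "visit":
            observation = do_visit(args.get("url", ""))      # fetched page content
        else:
            observation = f"Unknown action: {name}"
        history += (f"Thought: {thought}\nAction: {name}({json.dumps(args)})\n"
                    f"Observation: {observation}\n")
    return None, history  # step budget exhausted without an answer
```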

High-quality trajectory data is crucial for SFT. WebDancer generates trajectories by applying Trajectory Rejection Sampling on QA data and further performs filtering to improve data quality.

Chain-of-Thought (CoT) is essential for agent execution; it enables high-level workflow planning, self-reflection, information extraction, and action planning. The paper explores how to construct both short and long CoT. For short CoT, a powerful model (such as GPT-4o) directly generates trajectories under the ReAct framework. For long CoT, the historical actions and observations are fed step by step to a reasoning model (LRM), which autonomously decides the next action while its intermediate reasoning trace is recorded as the current Thought. The generated trajectories are then rejection-sampled to ensure quality and coherence.
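
The long-CoT construction can be sketched as follows; the main difference from an ordinary rollout is that the reasoning model's trace is captured and stored as the Thought of each step. The `lrm` interface (returning a reasoning trace plus an action dict) and the `tools` mapping are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of long-CoT trajectory construction with a reasoning model (LRM).
# The (reasoning_trace, action) return signature of `lrm` is an assumption.

def build_long_cot_trajectory(question: str, lrm, tools: dict, max_steps: int = 20):
    """At each step, feed the accumulated history to the LRM, record its reasoning
    trace as the Thought, execute the chosen action, and append the Observation."""
    history, steps = f"Question: {question}\n", []
    for _ in range(max_steps):
        reasoning, action = lrm(history)        # e.g. action = {"name": "search", "args": {...}}
        step = {"thought": reasoning, "action": action}
        steps.append(step)
        if action["name"] == "answer":
            break                               # trajectory ends with the answer action
        step["observation"] = tools[action["name"]](**action["args"])
        history += (f"Thought: {reasoning}\nAction: {action}\n"
                    f"Observation: {step['observation']}\n")
    return steps
```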

The sampled trajectories are then filtered through a three-stage funnel (a minimal sketch follows the list below):

  • Validity control: discards trajectories that do not conform to the ReAct format or instructions.

  • Correctness verification: Only trajectories whose final answers are judged correct by GPT-4o are retained.

  • Quality assessment: Rules filter out trajectories with too many actions, hallucinations, or severe duplication, and instruction-based checks retain trajectories that meet the criteria on information redundancy, goal consistency, logical reasoning, and accuracy.
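
A compact way to picture rejection sampling plus the funnel: draw several candidate trajectories per QA pair and keep only those that survive all three filters. The data shapes, thresholds, and judge prompt below are illustrative assumptions (trajectories are assumed to be objects with `steps`, `question`, and `final_answer`).

```python
# Hypothetical sketch of trajectory rejection sampling with the three-stage funnel.
# Format checks, the judge prompt, and the quality rules are assumptions.

def is_valid_react_format(traj) -> bool:
    """Stage 1: every step must be a complete Thought/Action/Observation triple."""
    return all({"thought", "action", "observation"} <= step.keys() for step in traj.steps)

def is_correct(traj, gold_answer: str, judge_llm) -> bool:
    """Stage 2: keep only trajectories whose final answer the judge accepts."""
    verdict = judge_llm(f"Question: {traj.question}\nGold: {gold_answer}\n"
                        f"Predicted: {traj.final_answer}\nAnswer yes or no.")
    return verdict.strip().lower().startswith("yes")

def passes_quality_rules(traj, max_actions: int = 15) -> bool:
    """Stage 3: rule-based checks, e.g. length and duplication (thresholds assumed;
    actions are assumed to be serialized strings)."""
    actions = [step["action"] for step in traj.steps]
    return len(actions) <= max_actions and len(set(actions)) > len(actions) // 2

def funnel_filter(qa_pair: dict, sample_trajectory, judge_llm, k: int = 8):
    """Rejection sampling: draw k trajectories and keep those surviving all stages."""
    kept = []
    for _ in range(k):
        traj = sample_trajectory(qa_pair)
        if (is_valid_react_format(traj)
                and is_correct(traj, qa_pair["answer"], judge_llm)
                and passes_quality_rules(traj)):
            kept.append(traj)
    return kept  # QA pairs with no surviving trajectory can still be reused for RL
```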

     

Those QA pairs that fail to pass the filtering (i.e., do not contain valid trajectories) can be effectively utilized in the reinforcement learning stage.

3.3 Multi-stage and multi-tool agent learning

After obtaining high-quality ReAct format trajectories, WebDancer divides the training into two stages:

Stage 1: Agent Supervised Fine Tuning (SFT)

  • The SFT stage uses the obtained decision trajectories to train the policy model with the goal of achieving a “cold start”.

  • This helps the model learn to couple multi-step reasoning and actions and internalize the behavioral paradigm of alternating reasoning and action.

  • To avoid interference from external feedback, the loss function masks the contribution of the Observation tokens and only computes the loss on the agent's autonomous decision steps (Thought and Action). This has been shown to improve performance and robustness (a minimal sketch of the masking follows this list).

  • The SFT stage provides a strong initialization for the subsequent RL stage.
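
The masking can be implemented by setting the label of every Observation token to the loss's ignore index, so the cross-entropy covers only Thought and Action tokens. The sketch below assumes a Hugging Face-style causal LM where label -100 is ignored; how the observation spans are located is tokenizer-specific and assumed.

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss and HF causal-LM losses

def mask_observation_labels(input_ids: torch.Tensor, observation_spans) -> torch.Tensor:
    """Build SFT labels that exclude environment feedback from the loss.

    input_ids: 1-D token ids of one serialized ReAct trajectory.
    observation_spans: list of (start, end) token index pairs covering the
    Observation segments (locating them is assumed to be done elsewhere).
    """
    labels = input_ids.clone()
    for start, end in observation_spans:
        labels[start:end] = IGNORE_INDEX   # no gradient from Observation tokens
    return labels                          # Thought and Action tokens keep their ids
```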

     

Stage 2: Agent Reinforcement Learning (RL)

  • The goal of the RL phase is to internalize the agent’s capabilities into the reasoning model and enhance its multi-round, multi-tool usage capabilities through outcome-based reward signals.

  • Based on SFT, WebDancer uses the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)  algorithm to optimize the performance of the policy model on the Thought-Action-Observation sequence.

  • DAPO is an RL algorithm that optimizes policies to produce outputs with higher rewards. The key lies in the Dynamic Sampling Mechanism, which can effectively utilize QA pairs that are not fully utilized in the SFT stage.

  • The dynamic sampling mechanism over-samples rollouts and filters out prompts whose accuracy equals 1 or 0. This is crucial for handling the invalid or noisy instances that may exist in synthetic data, ensuring the agent learns from informative signals.

  • DAPO updates its policy by maximizing an objective function that considers the rewards and advantages of candidate trajectories. The advantages are normalized based on the raw reward values ​​within the batch.

  • In the RL phase, the agent performs rollouts: under the ReAct framework, each step begins by generating a Thought, then an Action (action name plus parameters), which interacts with the web environment and returns an Observation as feedback. The rollout ends when an answer is generated.

  • Reward design plays a key role in RL training. WebDancer's reward system combines a format score and an answer score. Since formatting is largely solved during SFT, the format score carries a small weight (0.1) and is binary (1 only if the output fully follows the format and every tool call is valid). The answer score carries a larger weight (0.9) and is also binary: it is 1 only when the response is judged correct by an LLM-as-Judge (the judge model Mj, built on Qwen-72B-Instruct). The final reward is the weighted sum of the two (a minimal sketch combining the reward, dynamic sampling, and advantage computation follows this list).
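
Putting these pieces together, one plausible sketch is shown below. The 0.1/0.9 weights and the binary format/answer scores follow the description above; the judge prompt, the correctness proxy used for group accuracy, and the group-relative advantage normalization are assumptions consistent with group-based methods such as DAPO, not WebDancer's exact code.

```python
import statistics

FORMAT_WEIGHT, ANSWER_WEIGHT = 0.1, 0.9   # weights reported above

def trajectory_reward(traj, gold_answer: str, judge_llm) -> float:
    """Weighted sum of a binary format score and a binary answer score."""
    format_ok = float(traj.follows_react_format and traj.all_tool_calls_valid)  # assumed flags
    verdict = judge_llm(f"Gold: {gold_answer}\nPredicted: {traj.final_answer}\n"
                        "Is the prediction correct? Answer yes or no.")
    answer_ok = float(verdict.strip().lower().startswith("yes"))
    return FORMAT_WEIGHT * format_ok + ANSWER_WEIGHT * answer_ok

def keep_for_update(group_rewards) -> bool:
    """Dynamic sampling: drop prompts whose rollout group is all-correct or all-wrong
    (accuracy 1 or 0), since such groups carry no learning signal."""
    accuracy = sum(r >= ANSWER_WEIGHT for r in group_rewards) / len(group_rewards)
    return 0.0 < accuracy < 1.0

def normalized_advantages(group_rewards):
    """Group-relative advantages: (R_i - mean) / std over the rewards in the group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0   # guard against zero variance
    return [(r - mean) / std for r in group_rewards]
```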

4. Experimental Results and Analysis: How does WebDancer perform?

The paper evaluates WebDancer on two challenging web information-seeking benchmarks, GAIA and WebWalkerQA. The evaluation metric is Pass@1, a commonly used measure of whether the agent completes the task correctly on its first attempt.
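
For reference, Pass@k (also reported as Pass@3 later in the paper) counts a task as solved if any of the first k attempts is correct, while Cons@3 measures answer consistency across attempts. The small sketch below assumes per-task lists of attempt results; the majority-vote reading of Cons@k is an assumption, not the paper's definition.

```python
from collections import Counter

def pass_at_k(attempts_per_task, k: int) -> float:
    """Fraction of tasks with at least one correct answer among the first k attempts.
    attempts_per_task: list of per-task lists of booleans (correct / incorrect)."""
    return sum(any(attempts[:k]) for attempts in attempts_per_task) / len(attempts_per_task)

def cons_at_k(answers_per_task, gold_answers, k: int) -> float:
    """One plausible reading of Cons@k: the majority answer over the first k attempts
    must match the gold answer."""
    correct = 0
    for answers, gold in zip(answers_per_task, gold_answers):
        majority, _ = Counter(answers[:k]).most_common(1)[0]
        correct += (majority == gold)
    return correct / len(gold_answers)
```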

Main results (Table 1):

  • Frameworks without agent capabilities (No Agency), such as directly using the base model or RAG, perform poorly on both benchmarks, which once again emphasizes the importance of active information search and agent decision-making.

  • The closed-source agent system OpenAI DR (Deep Research) achieved the highest score through end-to-end RL training.

  • In open source frameworks, agent-based approaches built on powerful reasoning models such as QwQ-32B generally outperform their non-agent counterparts.

  • Within the ReAct framework, WebDancer shows significant performance improvements over the vanilla ReAct baseline. For example, with Qwen-2.5-32B as the backbone model, WebDancer raises the average GAIA score from 31.0 (vanilla ReAct) to 40.7.

  • In some cases, WebDancer even surpasses the performance of GPT-4o. This shows that even in a lightweight framework, WebDancer's approach can significantly enhance the agent capabilities of the underlying base model.

     

Results on more challenging benchmarks (Table 2):

  • WebDancer is also evaluated on BrowseComp (En.) and BrowseComp-zh (Zh.), two benchmarks designed to reflect complex information search scenarios.

  • The results show that WebDancer continues to perform strongly on both datasets, highlighting its robustness and effectiveness in handling difficult reasoning and information search tasks. For example, on BrowseComp (En.), WebDancer (based on QwQ-32B) has Pass@1/Pass@3 of 2.8/5.0, which is significantly higher than GPT-4o's 1.9/-.

     

In-depth analysis (Section 5):

  • Data efficiency: High-quality trajectory data is crucial for SFT. The paper demonstrates the effectiveness of the constructed CRAWLQA and E2HQA datasets and the importance of trajectory filtering through ablation studies. When the amount of data is low, the strictly filtered "Final" dataset performs best.

     

  • The role of SFT and RL: SFT is crucial for "cold starts" and gives the agent powerful multi-step, multi-tool instruction following capabilities. Experiments show that the performance of using RL alone is significantly limited.

  • Impact of RL: For non-reasoning models, RL brings significant improvements in Pass@3 and Cons@3, and it samples correct responses more efficiently. For LRMs (such as QwQ-32B), RL shows no significant gain on Pass@1, Pass@3, or Cons@3, but it does make answers more consistent: the proportion of correct answers across three attempts increases.

     

  • CoT knowledge transfer: The thinking patterns of a reasoning model are difficult to transfer directly to a small instruction model. Although long CoT also benefits non-reasoning models, it can introduce problems such as inefficiency. When the reasoning model is instead trained on long-CoT trajectories synthesized by a reasoning model, its performance improves significantly. This is consistent with prior studies and indicates that cross-model reasoning and knowledge transfer remain challenging.

     

  • Emergence of Agents: RL forces models to engage in longer reasoning processes and supports more complex agent actions. Compared to SFT, the RL framework promotes the emergence of more complex reasoning strategies by optimizing decision sequences rather than single-step outputs. This enables the model to learn from delayed rewards and explore the action space more deeply, resulting in more coherent and longer reasoning trajectories. RL encourages agents to autonomously decide on intermediate steps, sub-goals, or tools to achieve the final goal.

     

  • Environment dynamics: The network environment is dynamic and constantly changing. Adjusting the decoding temperature has little effect on the final performance, indicating that the instability of the agent is not mainly caused by decoding variability. The performance fluctuations are largely attributed to changes in the network environment itself, which highlights the non-static and open nature of real-world agent deployments.

     

Through detailed experiments and analysis, the paper not only verifies the effectiveness of the WebDancer pipeline but also provides valuable insights and feasible paths for future intelligent agent training.

5. Implications: Future Direction

Although WebDancer has achieved encouraging results, the paper also frankly points out some limitations of the current framework and directions for future research:

  • The number and types of tools are limited: Currently, only two basic tools, search and page visiting, are integrated. More complex tools could be added in the future, such as richer simulation of browser behavior by abstracting browser functions, or a Python sandbox environment for interacting with external APIs. This would allow the agent to perform more human-like, efficient interactions and handle more challenging tasks.

  • Task generalization and benchmarks: Current experiments focus on short-answer information search tasks. A comprehensive network agent should also be able to perform document-level research and long-form generation. How to design reliable and informative reward signals in such open-domain, long-form generation tasks is a problem that needs further research.

  • Data utilization efficiency: Although a large number of QA pairs and trajectories have been collected, the amount of data that can be effectively utilized in the RL stage is relatively small (for example, only about 5,000 pairs of data can be used due to computational and stability limitations). More efficient data utilization strategies are needed in the future to fully tap the value of the dataset.

  • Rollout is expensive: Rollout in the RL phase involves multiple rounds of tool invocation and LLM completion, which is computationally and time-intensive. This limits scalability and slows down iterative development. Developing mechanisms to more efficiently integrate tool invocation and model completion is a promising direction.

  • Hybrid thinking mode: Currently, model training is based on a single type of CoT data. In the future, a hybrid reasoning agent model that can dynamically control the length of reasoning can be developed.

  • Hallucination and over-action in thinking patterns: Hallucinations (e.g., calling a tool that does not exist) or over-action (e.g., performing redundant actions after the answer has already been found) can occur in tool calls. This is an issue that needs to be addressed in future work.

     

Overall, WebDancer provides a solid framework for building end-to-end, multi-step information-seeking web agents and experimentally verifies the effectiveness of its two-stage training strategy. It offers valuable experience and a clear path for the community to further develop more capable agentic models that can handle complex information-seeking tasks in the real world.