ReSearch framework: Let AI think and search like humans

Written by Audrey Miles
Updated on: July 2, 2025

How can AI think and search at the same time like humans? The ReSearch framework provides a revolutionary answer.

Core content:
1. Background and necessity of the ReSearch framework
2. The core idea of ReSearch: combining reasoning and search operations
3. Specific applications and practices of the ReSearch framework



" How can large language models (LLMs) combine search and reasoning in complex problems? The ReSearch framework uses reinforcement learning to provide the answer - allowing the model to 'think while searching for information' like humans, and to self-reflect and correct errors. " 


Hello everyone, I am Si07. Recently there have been more and more Deep Research (DR) products and frameworks. A few days ago I came across one of them: the ReSearch framework. It uses reinforcement learning (RL) to let large language models (LLMs) weave search operations into the reasoning process, much like a human who "thinks while looking things up". It can also self-reflect and correct errors during reasoning. Sounds great, doesn't it? Let's take a closer look at this framework. At the end of this article I also share some of my own thoughts on DR frameworks and products.

1. Why do we need ReSearch?

In recent years, LLMs have been performing better and better on all kinds of tasks, such as answering questions and generating text. However, when a problem becomes complex and requires multi-step reasoning plus external information retrieval, traditional LLMs struggle. For example, to answer a question like "Who was the president of the United States when Citibank was founded?", the model must first find the year Citibank was founded, and then find who was president that year. This combination of multi-step reasoning and retrieval is exactly the problem ReSearch tries to solve.

Most existing methods rely on manually designed prompts or heuristic rules, which are not only time-consuming and labor-intensive but also hard to scale to more complex problems. ReSearch instead uses reinforcement learning to let the model learn, without supervised reasoning data, how to interleave search with reasoning, which is a very promising direction. Readers familiar with LangChain may recall that it offered a similar "research"-style agent, self-ask-with-search, as early as before version 0.1.

2. ReSearch core idea: combining reasoning and search

The core of the ReSearch framework is to treat the search operation as part of the reasoning chain. That is, during reasoning the model generates a text chain containing both "thinking" and "search queries", and the search results are fed back to the model to influence subsequent reasoning. The whole process works as follows:

  1. Interaction between thinking and searching: the model first generates a thinking step (wrapped in <think> tags) and then decides whether to search (the query is wrapped in <search> tags); the search results (wrapped in <result> tags) are fed back to the model so it can continue reasoning (a code sketch of this loop follows the list below). For example:


   <think> I need to find the year Citibank was founded first. </think>
   <search> year Citibank was founded </search>
   <result> Citibank was founded in 1812. </result>
   <think> Now I need to find the president of the United States in 1812. </think>
   <search> president of the United States in 1812 </search>
   <result> The president of the United States in 1812 was James Madison. </result>
   <answer> The answer is \boxed{James Madison}. </answer>

  2. Reinforcement learning training: ReSearch optimizes the model with reward signals (such as the correctness of the final answer) so that it learns when to search, how to phrase the query, and how to use the search results in its reasoning. During training the model keeps sampling different reasoning paths and gradually converges on the better ones.
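
To make the think/search/result loop in step 1 concrete, here is a minimal Python sketch of a single rollout. This is my own simplified illustration, not the authors' code: generate_until (an LLM decoding call that stops after one of the given strings and returns the text including it) and retrieve (a call to the Wikipedia retrieval service) are hypothetical helpers.

    # Sketch of a ReSearch-style rollout (simplified; generate_until and retrieve are hypothetical helpers).
    def rollout(question: str, prompt_template: str, max_turns: int = 8) -> str:
        trajectory = prompt_template.format(prompt=question)
        for _ in range(max_turns):
            # Let the model think until it either issues a search query or gives the final answer.
            chunk = generate_until(trajectory, stop=["</search>", "</answer>"])
            trajectory += chunk
            if chunk.rstrip().endswith("</answer>"):
                break  # final answer produced, rollout ends
            # The model asked for a search: extract the query between the last <search> tags.
            query = trajectory.rsplit("<search>", 1)[-1].split("</search>")[0].strip()
            docs = retrieve(query)  # external retrieval; these tokens are masked during training
            trajectory += f"<result>{docs}</result>"
        return trajectory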

To more intuitively demonstrate the performance improvement of ReSearch, we can refer to the following figure

As can be seen from the figure, ReSearch significantly outperforms the baseline methods in all benchmarks, especially when dealing with complex multi-hop problems, its performance improvement is particularly obvious.

3. Technical Implementation: ReSearch Training Details

ReSearch training is based on a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). Its main idea is to sample multiple reasoning chains (rollouts) for each question and optimize the model toward the chains that earn higher rewards. Here are a few key points:

  •  Integration of search operations: search results are explicitly embedded into the reasoning chain, and the model controls the search process with special tags (e.g. <search> and <result>). For example, when the model emits a closing </search> tag, the system automatically runs the search and inserts the result into the reasoning chain.
  •  Reward modeling: the reward function has two parts (a code sketch follows the formula below):
  1. Answer reward: the correctness of the final answer, measured by the F1 score against the ground-truth answer.
  2. Format reward: a check that the output follows the predefined format (the tags are well formed and the answer is wrapped in \boxed{}).
  r = f1(a_pred, a_gt)   if f1(a_pred, a_gt) > 0
      0.1                if f1(a_pred, a_gt) = 0 and the format is correct
      0                  if f1(a_pred, a_gt) = 0 and the format is incorrect
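
A minimal Python sketch of this reward, assuming a token-level F1 and a rough regex-based format check (both helpers are simplified illustrations, not the authors' exact implementation):

    import re
    from collections import Counter

    def token_f1(pred: str, gold: str) -> float:
        """Token-level F1 between the predicted and ground-truth answers."""
        p, g = pred.lower().split(), gold.lower().split()
        common = sum((Counter(p) & Counter(g)).values())
        if common == 0:
            return 0.0
        precision, recall = common / len(p), common / len(g)
        return 2 * precision * recall / (precision + recall)

    def format_ok(response: str) -> bool:
        """Rough format check: an <answer> block containing \\boxed{...}."""
        return bool(re.search(r"<answer>.*\\boxed\{.+?\}.*</answer>", response, re.S))

    def reward(response: str, pred_answer: str, gold_answer: str) -> float:
        f1 = token_f1(pred_answer, gold_answer)
        if f1 > 0:
            return f1
        return 0.1 if format_ok(response) else 0.0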

To help you better understand the training process of ReSearch, we can refer to the following figure

Through this training method, the model can gradually learn how to effectively utilize search operations during reasoning.
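
As for the GRPO update itself, the group-relative part boils down to normalizing each sampled chain's reward against the other chains sampled for the same question. A minimal sketch of that advantage computation (my own simplified illustration, not the verl implementation):

    import numpy as np

    def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
        """Group-relative advantages: normalize each rollout's reward by the
        mean and standard deviation of all rollouts for the same question."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + eps)

    # Example: 4 rollouts for one question; the correct, well-formatted chain stands out.
    print(grpo_advantages([0.0, 0.1, 0.1, 0.9]))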

  •  Search result masking: to prevent the model from over-relying on retrieved text, the content inside <result> tags is masked out during training, so that only the thinking steps and search queries generated by the model itself are optimized.
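
A sketch of how such a loss mask over the rollout text could be built, assuming a Hugging Face fast tokenizer with offset mapping (a simplified illustration; the actual verl-based implementation may differ):

    import re

    def result_loss_mask(text: str, tokenizer) -> list[int]:
        """Per-token mask: 0 for tokens inside <result>...</result> (retrieved
        content, excluded from the loss), 1 for model-generated tokens."""
        spans = [m.span() for m in re.finditer(r"<result>.*?</result>", text, re.S)]
        enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
        mask = []
        for start, end in enc["offset_mapping"]:
            in_result = any(s <= start and end <= e for s, e in spans)
            mask.append(0 if in_result else 1)
        return mask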

In addition, ReSearch provides two prompt templates, one for the base model and one for the instruction-tuned model, to ensure the model generates reasoning chains in the required format. The two templates are shown below:

  • Prompt template for the base model
A conversation between User and Assistant. The user asks a question, and the assistant solves
it. The assistant first thinks about the reasoning process in the mind and then provides the
user with the answer. During thinking, the assistant can invoke the wikipedia search tool
to search for fact information about specific topics if needed. The reasoning process and
answer are enclosed within <think> </think> and <answer> </answer> tags respectively,
and the search query and result are enclosed within <search> </search> and <result>
</result> tags respectively. For example, <think> This is the reasoning process. </think>
<search> search query here </search> <result> search result here </result> <think>
This is the reasoning process. </think> <answer> The final answer is \boxed{answer here}
</answer>. In the last part of the answer, the final exact answer is enclosed within \boxed{}
with latex format. User: prompt. Assistant:
  • System prompt template for the instruction-tuned model
You are a helpful assistant that can solve the given question step by step with the help of the
wikipedia search tool. Given a question, you need to first think about the reasoning process in
the mind and then provide the answer. During thinking, you can invoke the wikipedia search
tool to search for fact information about specific topics if needed. The reasoning process and
answer are enclosed within <think> </think> and <answer> </answer> tags respectively,
and the search query and result are enclosed within <search> </search> and <result>
</result> tags respectively. For example, <think> This is the reasoning process. </think>
<search> search query here </search> <result> search result here </result> <think>
This is the reasoning process. </think> <answer> The final answer is \boxed{answer here}
</answer>. In the last part of the answer, the final exact answer is enclosed within \boxed{}
with latex format.
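
For the instruction-tuned model, this system prompt is combined with the user question through the model's chat template, roughly as in the sketch below (using the Hugging Face transformers API; the model name is a placeholder and SYSTEM_PROMPT stands for the full text above):

    from transformers import AutoTokenizer

    SYSTEM_PROMPT = "You are a helpful assistant that can solve the given question step by step ..."  # full template above

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model name
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Who was the president of the United States when Citibank was founded?"},
    ]
    # Render the chat-formatted prompt as text, ready for generation.
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)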

4. Experiments and Results: How does ReSearch perform?

ReSearch performs well on multiple multi-hop question answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. The following is a detailed comparison of the experimental results:

  •  Experimental setup: ReSearch is trained on the MuSiQue dataset and evaluated on the development/test sets of HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle. All experiments use an open retrieval environment based on Wikipedia.
  •  Baseline methods: ReSearch was compared against the following baselines:
  1. No RAG: generate the answer directly, without retrieval.
  2. Naive RAG: simply concatenate the retrieved documents with the question and generate the answer.
  3. Iter-RetGen: an iterative retrieval-and-generation method.
  4. IRCoT: a method that interleaves retrieval with chain-of-thought reasoning.
  •  Experimental results: ReSearch significantly outperforms the baseline methods on all benchmarks; the detailed numbers are in the table below. As the table shows, the gains on multi-hop question answering are substantial, especially on the Bamboogle dataset.

5. Self-reflection during training

One of the highlights of ReSearch is that it exhibits self-reflection and error correction during training. This ability is not explicitly designed in; it emerges naturally through reinforcement learning. For example, in one case the model initially searched with the wrong keywords, realized the problem in a subsequent thinking step, adjusted its search query, and finally arrived at the correct answer. The full reasoning chain can be seen in the case study below.

The table above shows a case study of ReSearch during training. To show more intuitively how ReSearch's behavior changes over the course of training, we can refer to the following two figures.

The first figure shows how the response length and the number of search operations change during training, while the second shows the training and validation rewards. As the charts show, as training progresses the model gradually learns to use search operations more effectively, and the rewards keep increasing.

6. Open Source Implementation of ReSearch

If you are interested in the implementation of ReSearch, you can refer to its GitHub project (see the references at the end of the article). The project provides detailed installation, training, and evaluation steps. Here is a brief walkthrough of the key steps:

1. Environment setup: use conda to create an environment and install the dependencies (PyTorch, flash-attn, etc.).
   conda create -n re-search python==3.10
   conda activate re-search
   pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
   pip3 install flash-attn --no-build-isolation
   git clone https://github.com/Agent-RL/ReSearch.git
   cd ReSearch
   pip3 install -e .
   conda install -c pytorch -c nvidia faiss-gpu=1.8.0
2. Retrieval service: start the retrieval service via FlashRAG and FastAPI to standardize the search operation (a sketch of a client call is shown after this list).
   cd scripts/serving
   python retriever_serving.py --config retriever_config.yaml --num_retriever 1 --port 8000
3. Data preparation: download datasets such as MuSiQue and convert them into the format required for training.
   cd data
   bash download_dataset.sh
   python prepare_musique.py
4. Training and evaluation: train with the verl framework and evaluate with FlashRAG.
   cd scripts/train
   bash train.sh --train_batch_size 8 --ppo_mini_batch_size 8 --apply_chat True --prompt_template_name re_search_template_sys --actor_model_path {model/path} --search_url {retriever-url}
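
As a quick sanity check of the retrieval service from step 2, you can hit it with a small HTTP client, for example as below. Note that the endpoint path and payload fields here are assumptions made for illustration; check scripts/serving/retriever_serving.py for the actual API:

    import requests

    # Hypothetical request: the endpoint name ("/search") and fields ("query", "top_n")
    # are placeholders; consult retriever_serving.py for the real schema.
    resp = requests.post(
        "http://localhost:8000/search",
        json={"query": "year Citibank was founded", "top_n": 5},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())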

Thoughts on DeepResearch products

The significance of the ReSearch framework is that it not only improves LLM performance on complex multi-hop tasks, but also demonstrates the potential of reinforcement learning for combining reasoning with search. The approach can be extended to more fields, such as medical diagnosis and legal analysis, and can even be combined with other tools (such as calculators or databases) to further enhance the model's capabilities, which carries real commercial value.

Recently, almost all the AI giants have entered the DR product space, and frameworks such as ReSearch keep evolving with the help of the community.




This made me think:

1. Retrieval is reasoning

The reasoning of a traditional LLM is closed, like a monologue trapped in an information cocoon. ReSearch embeds the search operation into the reasoning chain, which is equivalent to opening a window onto dynamic knowledge for the model. Beyond the question of AI self-learning, this dynamic retrieval lets the model align with real-world knowledge in real time, RAG-style, which is a genuinely clever piece of design. It reminds me of how humans naturally solve problems: we never think in isolation, but constantly calibrate our cognition through interaction with the outside world. When the model learns to use <search> tags to actively pull in knowledge, it begins to imitate the scientific thinking path of humans: "propose a hypothesis, verify the hypothesis".

What's more interesting is that this kind of retrieval is not mechanical splicing, but a strategy choice that is dynamically optimized through reinforcement learning. The GRPO algorithm lets the model learn to weigh the trade-offs through countless trials and errors: when is a search necessary? How should the keywords be phrased? How can value be extracted from redundant information? The emergence of this ability marks a paradigm shift for AI from "passive answering" to "active exploration".

2. Emergence of self-reflection ability

The self-correction ability ReSearch shows during training reminds me of the human metacognitive process. When the model realizes that its initial search direction is wrong, it actively adjusts the query strategy. This self-examination of the thinking chain is not explicitly programmed; it is a natural result of reinforcement learning. It makes me rethink the nature of reinforcement learning: it is not only a tool for parameter optimization, but also a meta-mechanism for shaping an agent that "learns how to learn".

The significance of this capability goes far beyond multi-hop question answering itself. In medical diagnosis, AI can proactively identify reasoning gaps and supplement key evidence; in legal analysis, it can dynamically track changes in precedents; even in education, AI tutors can adjust teaching strategies in real time to accommodate students' cognitive biases. This self-correcting intelligence is blurring the boundary between machine and human thinking.

3. The ecosystem behind open source technology

The open-source implementation of ReSearch gives me hope for the democratization of AI. With detailed training scripts and environment configuration, the researchers have lowered the technical threshold, allowing more developers to explore on the shoulders of giants. This open ecosystem reminds me of the evolution of the Linux kernel: once the core algorithm becomes public knowledge, the innovation speed of the entire industry increases exponentially.

But open source also brings new ethical challenges. If the search capability is abused, AI may become a spreader of false information; if the reward mechanism of reinforcement learning is maliciously manipulated, the model may fall into an "illusion of optimization". This reminds us that as the technology spreads, we must build a corresponding security framework. Just as ReSearch masks search content to avoid overfitting to it, AI governance also needs a similar "cognitive sandbox". We have already seen reports of "poisoning" AI recently; driven by huge interests, "optimization" aimed at search could become part of a gray ecosystem, which urgently calls for rules and regulations.

4. What is the essence of AI?

ReSearch made me rethink the definition of "intelligence". When the model builds a knowledge graph through iterative search, does it already have some kind of primitive awareness? When it uses <think> tags to reflect on itself, does this behavior hint at some kind of "machine intuition"? More importantly, this technology is reshaping the relationship between humans and machines. ReSearch is not simply a replacement for humans, but an extension of cognition. I forget who said it or where I read it: "When AI becomes the scaffolding of human thinking, the boundaries of our problem-solving ability will be continuously expanded."

5. Looking ahead

ReSearch's potential is far from exhausted. I imagine combining it with multimodal perception, letting AI actively retrieve key clues from visual and auditory information; or fusing it with time-series data to give the model the ability to anticipate the future. A more radical idea is to replace the reward signal of reinforcement learning with an ethics framework based on human feedback, so that AI learns to weigh choices in complex moral dilemmas. It would not only process information but understand its meaning; not only answer questions but know how to ask the right ones. When AI has this ability, the relationship between humans and machines will no longer be one of controlling and being controlled, but perhaps one of symbiosis and co-creation.