Design and implementation of DeepSearch and DeepResearch

Written by
Iris Vance
Updated on: July 15, 2025

Explore DeepSearch, the new search standard in 2025, and how giants can seize the technological advantage.

Core content:
1. Technical characteristics and industry trends of DeepSearch and DeepResearch
2. How DeepSearch relates to and differs from traditional RAG, multi-hop question answering, and related techniques
3. How test-time compute is changing industry thinking and driving new developments in search technology


It’s only February, and Deep Search is already becoming the new search standard for 2025. Giants like Google and OpenAI have unveiled their own “Deep Research” products in an effort to seize the opportunity of this wave of technology. (We are also proud to have released an open-source version of it: node-DeepResearch.)

Perplexity followed suit and launched their Deep Research. Musk's xAI went a step further and integrated deep search directly into their Grok 3 model, which is essentially a variant of Deep Research.

Frankly speaking, the concept of deep search is not new. It is essentially what we called RAG (Retrieval-Augmented Generation) or multi-hop question answering last year. But at the end of January this year, with the release of DeepSeek-R1, it received unprecedented attention and momentum.

Just last weekend, Baidu Search and Tencent WeChat Search integrated DeepSeek-R1 into their search engines. AI engineers have realized that by incorporating long-form thinking and reasoning into search systems, they can achieve more accurate and in-depth retrieval than ever before.

But why did this change happen now? Throughout 2024, "Deep(Re)Search" did not seem to attract much attention. Keep in mind that as early as the beginning of 2024, the Stanford NLP Lab had released the STORM project for web-grounded long-report generation. Is it just because "Deep Search" sounds more fashionable than multi-hop QA, RAG, or STORM? To be honest, sometimes a successful rebranding is all it takes for the industry to suddenly embrace something that has existed all along.

We believe the real turning point was OpenAI’s release of o1-preview, which introduced the concept of "test-time compute" and subtly changed the industry's perception. Test-time compute means investing more computing resources at inference time (that is, the stage where the large language model generates the final result) rather than in pre-training or post-training. Classic examples include Chain-of-Thought (CoT) reasoning and techniques such as "Wait" injection (also known as budget forcing), which give the model more room for internal thinking: evaluating multiple candidate answers, planning more deeply, and reflecting on itself before giving a final answer.

This "computation during reasoning" concept, as well as models that focus on reasoning, are guiding users to accept a concept of "delayed gratification": using longer waiting time in exchange for higher quality and more practical results. Just like the famous Stanford marshmallow experiment, those children who can resist the temptation to eat a marshmallow immediately and get two marshmallows later tend to achieve better long-term results. Deepseek-r1 further consolidates this user experience, and whether you like it or not, most users have silently accepted this.

This marks a significant departure from traditional search requirements. In the past, if your solution couldn’t respond within 200 milliseconds, it was almost a failure. But in 2025, experienced search developers and RAG engineers prioritize top-1 precision and recall over latency. Users have grown accustomed to longer processing times: as long as they can see that the system is <thinking>, they are happy to wait.

In 2025, showing the reasoning process has become standard practice, with many chat interfaces rendering the <think> content in a dedicated UI area.

In this article, we discuss the principles of DeepSearch and DeepResearch by examining our open source implementations. We present key design decisions and point out potential caveats.

What is Deep Search?

The core idea of DeepSearch is to iterate through three stages, search, reading, and reasoning, until the best answer is found. The search stage uses search engines to explore the Internet, while the reading stage focuses on detailed analysis of specific web pages (for example, using Jina Reader). The reasoning stage evaluates the current state and decides whether to break the original problem into smaller sub-problems or try other search strategies.

Unlike typical 2024-era RAG systems, which usually run the search-generate pass only once, DeepSearch performs multiple iterations and requires explicit stopping conditions. These conditions can be based on a token-usage budget or the number of failed attempts.

Try DeepSearch at search.jina.ai, watch the <thinking> output, and see if you can spot where the loop occurs.

From another perspective, DeepSearch can be viewed as an LLM agent equipped with various network tools (such as search engines and web page readers). This agent analyzes the current observations and past operation records to decide the next course of action: whether to give an answer directly or continue to explore the network. This constructs a state machine architecture in which the LLM is responsible for controlling the transition between states.
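To make this concrete, here is a minimal TypeScript sketch of the state-machine view (the action names mirror those used later in this article; the exact types in our repository differ):

// Hypothetical sketch of the agent's action and state types; names are illustrative.
type Action =
  | { action: 'search'; searchRequests: string[] }
  | { action: 'visit'; URLTargets: string[] }
  | { action: 'reflect'; questionsToAnswer: string[] }
  | { action: 'answer'; answer: string; isFinal?: boolean }
  | { action: 'coding'; codingIssue: string };

interface AgentState {
  gaps: string[];          // open (sub-)questions, including the original question
  diaryContext: string[];  // log of past actions and their outcomes
  allKnowledge: object[];  // knowledge gathered from search and read steps
  tokenUsage: number;      // budget consumed so far
}

// The LLM inspects the current state and picks the next transition.
declare function decideNextAction(state: AgentState): Promise<Action>;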

At each decision point, you have two options: you can carefully design prompts to let the standard generative model generate specific instructions; or you can use a specialized reasoning model like Deepseek-r1 to naturally infer the next action to take. However, even with r1, you need to periodically interrupt its generation process, inject the tool's output (such as search results, web page content) into the context, and prompt it to continue the reasoning process.
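A rough sketch of that interleaving, assuming hypothetical llm.generate(), parseToolCall(), and runTool() helpers (this is not the actual code in our repository):

// Interleave free-form reasoning with tool execution: whenever the model requests
// a tool, pause generation, run the tool, and inject its output back into the context.
declare const llm: { generate(messages: { role: string; content: string }[]): Promise<string> };
declare function parseToolCall(output: string): { name: string; args: unknown } | null;
declare function runTool(call: { name: string; args: unknown }): Promise<string>;

async function reasonWithTools(question: string, maxRounds = 8): Promise<string> {
  const messages = [{ role: 'user', content: question }];

  for (let round = 0; round < maxRounds; round++) {
    const output = await llm.generate(messages);   // let the reasoning model think
    const toolCall = parseToolCall(output);
    if (!toolCall) return output;                  // no tool requested => treat as the answer

    const toolResult = await runTool(toolCall);    // e.g. search results, web page content
    messages.push({ role: 'assistant', content: output });
    messages.push({ role: 'user', content: `<tool-result>${toolResult}</tool-result> Continue your reasoning.` });
  }
  return 'No final answer within the allowed rounds.';
}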

Ultimately, these are just implementation details. Whether you carefully craft prompts for a standard model or use a reasoning model directly, both follow the core design principle of DeepSearch: a continuous loop of searching, reading, and reasoning.

So what is DeepResearch?

DeepResearch builds on DeepSearch by adding a structured framework for generating long research reports. Its workflow generally starts by creating a table of contents and then systematically applies DeepSearch to each required part of the report: from the introduction to related work, to the methodology, and finally to the conclusion. Each chapter is generated by feeding a specific research question into DeepSearch. Finally, all chapters are combined into one final prompt to improve the overall narrative coherence of the report.
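In pseudocode, the overall flow looks roughly like this (a hedged sketch; the function names are illustrative, not the actual API of our implementation):

// Hypothetical outline of the DeepResearch pipeline described above.
declare function generateTOC(topic: string): Promise<string[]>;           // plan the report sections
declare function deepSearch(question: string): Promise<string>;           // the DeepSearch loop
declare function coherenceRevision(sections: string[]): Promise<string>;  // one long-context pass

async function deepResearch(topic: string): Promise<string> {
  const toc = await generateTOC(topic);

  // Run DeepSearch once per section, feeding it a section-specific research question
  const sections: string[] = [];
  for (const sectionTitle of toc) {
    sections.push(await deepSearch(`For a report on "${topic}", write the section: ${sectionTitle}`));
  }

  // Merge all sections in a single final prompt to improve overall narrative coherence
  return coherenceRevision(sections);
}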

In 2024, we also ran a "Research" project internally. At the time, to keep the report coherent, we took a rather brute-force approach: every iteration took all chapters into account and made multiple coherence passes. In hindsight this was overkill, because today's large language models have such long context windows that the coherence revision can be done in a single pass, with better results.

But we didn’t release the “Research” project in the end for the following reasons:

The main problem was that the report quality never met our internal bar. We tested it with two well-known internal queries, "Jina AI's competitor analysis" and "Jina AI's product strategy", and the results were disappointing: the reports were mediocre, lacked highlights, and gave us no "aha" moments. Second, the reliability of the search results was poor and hallucination was a serious problem. Finally, overall readability was terrible, with heavy duplication and redundancy across sections. In short, the reports were long, time-consuming to read, and offered little in return.

However, this project also accumulated valuable experience for us and spawned some sub-products:

For example, we came to deeply appreciate the importance of search-result reliability and of fact-checking at the paragraph and even sentence level, which directly led to the development of the g.jina.ai endpoint. We also realized the value of query expansion and began investing in training a small language model (SLM) for it. Finally, we really liked the name ReSearch, which cleverly expressed the idea of reinventing search and doubled as a pun; it would have been a pity not to use it, so we ended up using it in our 2024 yearbook.

In the summer of 2024, our "Research" project adopted an "incremental" approach focused on generating longer reports. It first generates the report's table of contents (TOC) synchronously, then generates the content of all chapters, and finally revises each chapter incrementally and asynchronously, with each revision taking the overall content of the report into account. In the demo, the query we used was "Jina AI's competitive product analysis".

DeepSearch vs DeepResearch

Many people tend to confuse DeepSearch with DeepResearch. But in our opinion, they solve completely different problems. DeepSearch is the building block of DeepResearch and the core engine on which the latter operates.

The focus of DeepResearch is to write high-quality, readable long research reports. This is not just about searching for information; it is a systematic effort that requires integrating effective visual elements (such as charts and tables), using a sensible chapter structure, ensuring that the logic flows smoothly between sub-chapters, keeping terminology consistent throughout, avoiding information redundancy, and using smooth transitions to connect the sections. These elements are not directly related to the underlying search capability, which is why we regard DeepSearch as the company's development focus.

To summarize the differences between DeepSearch and DeepResearch: it is worth noting that both rely on long-context and reasoning models, but for slightly different reasons.

It is easy to understand that DeepResearch needs a long context to generate a long report. Although DeepSearch looks like a search tool, it also needs to remember previous search attempts and web page content in order to plan subsequent operations, so a long context is also indispensable.

Understanding DeepSearch Implementation

Open source link: https://github.com/jina-ai/node-DeepResearch

The core of DeepSearch lies in its loop reasoning mechanism. Unlike most RAG systems, which try to answer a question in a single pass, we use an iterative loop: it keeps searching for information, reading relevant sources, and reasoning until it finds the answer or exhausts its token budget. Here is a simplified skeleton of the main while loop:

// Main reasoning loop
while (tokenUsage < tokenBudget && badAttempts <= maxBadAttempts) {
  // Track progress
  step++; totalStep++;

  // Get the current question from the gaps queue; if empty, use the original question
  const currentQuestion = gaps.length > 0 ? gaps.shift() : question;

  // Generate the prompt based on the current context and the allowed actions
  system = getPrompt(diaryContext, allQuestions, allKeywords,
                    allowReflect, allowAnswer, allowRead, allowSearch, allowCoding,
                    badContext, allKnowledge, unvisitedURLs);

  // Let the LLM decide the next action
  const result = await LLM.generateStructuredResponse(system, messages, schema);
  thisStep = result.object;

  // Execute the selected action (answer, reflect, search, visit, coding)
  if (thisStep.action === 'answer') {
    // Handle the answer action...
  } else if (thisStep.action === 'reflect') {
    // Handle the reflect action...
  } // ... and so on for the other actions
}

To ensure the stability and structure of the output, we took a key measure: at each step, we selectively disabled certain operations. 

For example, when there is no URL in memory, we will prohibit the "visit" operation; if the last answer was rejected, we will prevent the agent from immediately repeating the "answer" operation. This constraint mechanism can guide the agent to move in the right direction and avoid going in circles.
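A minimal sketch of how these allow-flags might be derived from the current state (illustrative only; the actual rules in the repository are more involved):

// Hypothetical derivation of the per-step action permissions passed to getPrompt().
function deriveActionFlags(state: {
  unvisitedURLs: string[];
  lastAnswerRejected: boolean;
  gaps: string[];
}) {
  return {
    allowSearch: true,
    allowRead: state.unvisitedURLs.length > 0,   // no URLs in memory => no "visit" action
    allowAnswer: !state.lastAnswerRejected,      // a rejected answer blocks an immediate retry
    allowReflect: state.gaps.length < 10,        // keep the sub-question queue bounded (assumed limit)
    allowCoding: true,
  };
}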

System prompt

In designing the system prompt, we use XML tags to delimit each section, which produces more robust prompts and generated content. We also found that putting field constraints directly into the description fields of the JSON Schema works better than stating them in the prompt text. Admittedly, a reasoning model like DeepSeek-R1 could in theory generate most of the prompt on its own, but given context-length limits and our need for fine-grained control over the agent's behavior, explicitly writing the prompt is more reliable in practice.

function getPrompt(params) {
  const sections = [];

  // Add a header containing system instructions
  sections.push("You are a senior AI research assistant, skilled at multi-step reasoning...");

  // Add accumulated knowledge snippets (if any)
  if (knowledge?.length) {
    sections.push("<knowledge>[knowledge entries]</knowledge>");
  }

  // Add context information from previous actions
  if (context?.length) {
    sections.push("<context>[action history]</context>");
  }

  // Add failed attempts and learned strategies
  if (badContext?.length) {
    sections.push("<bad-attempts>[failed attempts]</bad-attempts>");
    sections.push("<learned-strategy>[improvement strategies]</learned-strategy>");
  }

  // Define the available actions based on the current state
  sections.push("<actions>[available action definitions]</actions>");

  // Add response-format instructions
  sections.push("Respond in valid JSON format, strictly matching the JSON schema.");

  return sections.join("\n\n");
}
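For the structured response itself, the constraints live directly in the schema's description fields, along these lines (a simplified, illustrative fragment; the real schema covers every action):

// Illustrative response-schema fragment: the description fields carry the constraints.
const responseSchema = {
  type: 'object',
  properties: {
    action: {
      type: 'string',
      enum: ['search', 'visit', 'reflect', 'answer', 'coding'],
      description: 'Choose exactly one action from the currently allowed set.'
    },
    searchRequests: {
      type: 'array',
      items: { type: 'string' },
      description: 'At most 3 short, keyword-style queries. Only required when action is "search".'
    },
    answer: {
      type: 'string',
      description: 'The final answer. Must be grounded in the <knowledge> entries. Only required when action is "answer".'
    }
  },
  required: ['action']
};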

Traversing Knowledge Gap Problems

In DeepSearch, “knowledge gap questions” refer to the knowledge gaps that the agent needs to fill before answering the core question. Instead of trying to answer the original question directly, the agent will identify and solve sub-questions that can build the necessary knowledge foundation.

This is a very elegant approach.

// After identifying knowledge gaps in the reflect action
if (newGapQuestions.length > 0) {
  // Queue the new gap questions so they are processed before the original question
  gaps.push(...newGapQuestions);

  // Always re-queue the original question last
  gaps.push(originalQuestion);
}

This creates a FIFO (first-in, first-out) queue with a rotation mechanism that follows these rules:

  1. New knowledge-gap questions are queued ahead of the re-queued original question.

  2. The original question always sits at the end of the queue.

  3. At each step, the system takes the question at the head of the queue to work on.

The beauty of this design is that it maintains a shared context for all problems. That is, when a knowledge gap problem is solved, the knowledge gained can be immediately applied to all subsequent problems, and ultimately help us solve the original problem.

FIFO queue vs recursion

In addition to the FIFO queue, we can also use recursion, which actually corresponds to the depth-first search strategy. For each "knowledge gap" problem, recursion will create a new call stack with independent context. The system must completely solve each knowledge gap problem (and all its potential sub-problems) before returning to the parent problem.

Consider, for example, solving a knowledge-gap problem that is nested three levels deep with recursion: each sub-problem, along with any sub-problems it spawns, must be fully resolved in depth-first order before control returns to its parent.

In recursive mode, the system must completely solve the first gap question (and everything it spawns) before it can touch anything else. This is in stark contrast to the queue approach, which works through the gap questions once and then comes back to the original question.
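To make the contrast concrete, here is a hypothetical trace for an original question Q with three gap questions G1, G2, G3, where G1 spawns further sub-questions of its own:

// FIFO queue (our approach): breadth-style traversal over a shared context
//   start:  [Q]             -> reflect on Q, discover gaps G1, G2, G3
//   queue:  [G1, G2, G3, Q] -> answer G1, G2, G3 in turn; each answer enriches the shared knowledge
//   queue:  [Q]             -> return to the original question with everything learned so far
//
// Recursion: depth-first traversal with isolated contexts
//   Q -> G1 -> (G1's own sub-questions ...) -> G2 -> G3 -> finally back to Q
//   Each sub-problem must be fully resolved before its parent resumes,
//   which makes the token budget much harder to allocate across levels.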

In practice, we found it hard to control the budget with the recursive approach. Because sub-problems can keep spawning new sub-problems, it is difficult to decide, without clear guidelines, how much of the token budget each one deserves. Weighed against the complexity of budget control and the risk of delayed returns, the clean context isolation that recursion offers gains us little. In contrast, the FIFO queue strikes a good balance between depth and breadth: it ensures that the system keeps accumulating knowledge, improves gradually, and eventually returns to the original question instead of sinking into potentially unbounded recursion.

Query Rewrite

One interesting challenge we encountered was how to effectively rewrite a user’s search query:

// In the search action handler
if (thisStep.action === 'search') {
  // Deduplicate search requests
  const uniqueRequests = await dedupQueries(thisStep.searchRequests, existingQueries);

  // Rewrite natural-language queries into more effective search expressions
  const optimizedQueries = await rewriteQuery(uniqueRequests);

  // Make sure we do not repeat previous searches
  const newQueries = await dedupQueries(optimizedQueries, allKeywords);

  // Execute the searches and store the results
  for (const query of newQueries) {
    const results = await searchEngine(query);
    if (results.length > 0) {
      storeResults(results);
      allKeywords.push(query);
    }
  }
}

We found that query rewriting is far more important than expected, and it can even be said to be one of the most critical factors in determining the quality of search results. A good query rewriter can not only convert the user's natural language into a keyword form that is more suitable for BM25 algorithm processing, but also expand the query to cover more potential answers in different languages, tones, and content formats.
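A hedged sketch of what rewriteQuery might look like when backed by an LLM and a structured schema (illustrative; the actual prompt and schema in the repository differ):

// Hypothetical query rewriter: expand each natural-language request into several
// keyword-style queries, across languages and phrasings, to widen the answer space.
declare const LLM: { generateStructuredResponse(system: string, messages: unknown[], schema: unknown): Promise<{ object: { queries: string[] } }> };

async function rewriteQuery(requests: string[]): Promise<string[]> {
  const system = `You rewrite search requests into effective search-engine queries.
For each request, produce keyword-style variants (BM25-friendly), including variants
in other relevant languages and alternative phrasings that may surface more answers.`;

  const schema = {
    type: 'object',
    properties: {
      queries: { type: 'array', items: { type: 'string' }, description: 'Up to 5 rewritten queries per request.' }
    },
    required: ['queries']
  };

  const result = await LLM.generateStructuredResponse(system, [{ role: 'user', content: requests.join('\n') }], schema);
  return result.object.queries;
}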

For query deduplication, we initially tried an LLM-based approach but found it hard to control the similarity threshold precisely, and the results were not ideal. We eventually turned to jina-embeddings-v3: its strong performance on semantic textual similarity tasks lets us deduplicate across languages without worrying about non-English queries being wrongly filtered out. Coincidentally, it was the embedding model that ended up playing the key role. We had not planned to use it for memory retrieval, but it turned out to be remarkably effective for deduplication.
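Since the number of queries is small, deduplication boils down to computing cosine similarities over embeddings in memory. A minimal sketch, assuming an embed() helper that calls an embedding model such as jina-embeddings-v3 (the threshold value is illustrative):

// Hypothetical embedding-based dedup: drop any new query that is too similar to an
// existing or already-kept one. embed() is assumed to return one vector per input string.
declare function embed(texts: string[]): Promise<number[][]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function dedupQueries(newQueries: string[], existing: string[], threshold = 0.86): Promise<string[]> {
  if (newQueries.length === 0) return [];
  const [newVecs, oldVecs] = await Promise.all([embed(newQueries), embed(existing)]);

  const kept: string[] = [];
  const keptVecs: number[][] = [];
  newQueries.forEach((query, i) => {
    const candidates = [...oldVecs, ...keptVecs];
    if (candidates.every(v => cosine(newVecs[i], v) < threshold)) {
      kept.push(query);
      keptVecs.push(newVecs[i]);
    }
  });
  return kept;
}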

Crawling web content

Web crawling and content processing are another crucial part of the pipeline. Here we use the Jina Reader API. In addition to the full web-page content, we also collect the summary snippets returned by the search engine as auxiliary information for later reasoning; these snippets serve as concise summaries of the page content.

// Visit action handler
async function handleVisitAction(URLs) {
  // Normalize URLs and filter out already-visited ones
  const uniqueURLs = normalizeAndFilterURLs(URLs);

  // Process each URL in parallel
  const results = await Promise.all(uniqueURLs.map(async url => {
    try {
      // Fetch and extract the content
      const content = await readUrl(url);

      // Store it as knowledge
      addToKnowledge(`What is in ${url}?`, content, [url], 'url');

      return {url, success: true};
    } catch (error) {
      return {url, success: false};
    } finally {
      visitedURLs.push(url);
    }
  }));

  // Update the diary based on the results
  updateDiaryWithVisitResults(results);
}

To facilitate tracing, we normalize the URLs and limit the number of URLs visited at each step to keep the agent's memory usage under control.
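The readUrl call above is backed by the Jina Reader API, which is invoked by prefixing the target URL with r.jina.ai. A minimal sketch (the JSON response shape is simplified here; an API key is optional and mainly affects rate limits):

// Minimal sketch of fetching clean page content via the Jina Reader endpoint.
async function readUrl(url: string): Promise<string> {
  const response = await fetch(`https://r.jina.ai/${url}`, {
    headers: {
      Accept: 'application/json',
      // Authorization: `Bearer ${process.env.JINA_API_KEY}`,  // optional, for higher rate limits
    },
  });
  if (!response.ok) throw new Error(`Reader failed for ${url}: ${response.status}`);
  const json = await response.json();
  return json.data?.content ?? '';  // the extracted page content (markdown/text)
}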

Memory Management

A key challenge of multi-step reasoning is how to manage the agent's memory effectively. The memory system we designed distinguishes between what counts as "memory" and what counts as "knowledge". Either way, both live in the LLM's prompt context, separated by different XML tags:

// Add a knowledge entry
function addToKnowledge(question, answer, references, type) {
  allKnowledge.push({
    question: question,
    answer: answer,
    references: references,
    type: type,  // 'qa', 'url', 'coding', 'side-info'
    updated: new Date().toISOString()
  });
}

// Record a step in the diary
function addToDiary(step, action, question, result, evaluation) {
  diaryContext.push(`
At step ${step}, you took the **${action}** action on question: "${question}".
[Details and results]
[Evaluation (if any)]
`);
}

Considering the ultra-long context trend of LLM in 2025, we chose to abandon the vector database and adopt the context memory approach. The agent's memory consists of three parts within the context window: the acquired knowledge, the visited websites, and the log of failed attempts. This approach allows the agent to directly access the complete history and knowledge state during reasoning without the need for an additional retrieval step.

Answer evaluation

We also found that answer generation and evaluation work better when separated into different prompts. In our implementation, when a new question arrives, we first determine the evaluation criteria and then evaluate against them one by one. The evaluator uses few-shot examples for consistent assessment, which is more reliable than self-evaluation.

// Independent evaluation phase
async function evaluateAnswer(question, answer, metrics, context) {
  // Determine the evaluation criteria based on the question type
  const evaluationCriteria = await determineEvaluationCriteria(question);

  // Evaluate each criterion one by one
  const results = [];
  for (const criterion of evaluationCriteria) {
    const result = await evaluateSingleCriterion(criterion, question, answer, context);
    results.push(result);
  }

  // Determine whether the answer passes the overall evaluation
  return {
    pass: results.every(r => r.pass),
    think: results.map(r => r.reasoning).join('\n')
  };
}

Budget Control

Budget control is not just about saving costs, but about ensuring that the system fully processes the problem before the budget is exhausted and avoids returning answers too early. Since the release of DeepSeek-R1, our thinking on budget control has shifted from simply saving budget to encouraging deeper thinking and striving for high-quality answers.

In our implementation, we explicitly require the system to identify knowledge gaps before attempting an answer.

if (thisStep.action === 'reflect' && thisStep.questionsToAnswer) {
  // Force deeper reasoning by adding sub-questions
  gaps.push(...newGapQuestions);
  gaps.push(question);  // don't forget the original question
}

By flexibly enabling and disabling certain operations, we can guide the system to use tools that can deepen reasoning.

// After a failed answer attempt
allowAnswer = false;  // force the agent to search or reflect

To avoid wasting tokens on invalid paths, we limit the number of failed attempts. When we get close to the budget limit, we activate "beast mode" to ensure we give an answer no matter what.

// Enter beast mode
if (!thisStep.isFinal && badAttempts >= maxBadAttempts) {
  console.log('Enter Beast mode!!!');

  // Configure the prompt to force a decisive answer
  system = getPrompt(
    diaryContext, allQuestions, allKeywords,
    false, false, false, false, false,  // disable all other actions
    badContext, allKnowledge, unvisitedURLs,
    true  // enable beast mode
  );

  // Force an answer to be generated
  const result = await LLM.generateStructuredResponse(system, messages, answerOnlySchema);
  thisStep = result.object;
  thisStep.isFinal = true;
}

The Beast Mode prompt is deliberately written in an exaggerated tone, telling the LLM in no uncertain terms: you must make a decision now and give an answer based on the information you already have!

<action-answer>
Engage maximum force! Absolute priority!

Primary directives:
- Eliminate all hesitation! Any answer is better than silence!
- Partial strategies are allowed: use all the information you have!
- Reusing previous failed attempts is allowed!
- When in doubt: act decisively on the intelligence at hand!

Failure is not an option! You must reach the goal! ⚡️
</action-answer>

This ensures that even when faced with difficult or ambiguous questions, we can come up with a usable answer rather than coming up empty handed.

Conclusion

DeepSearch can be said to be an important breakthrough in search technology in dealing with complex queries. It breaks down the entire process into independent search, reading, and reasoning steps, overcoming many limitations of traditional single-round RAG or multi-hop question-answering systems.

During development we kept asking ourselves: standing at this point in 2025, with the entire search industry reshaped by the release of DeepSeek-R1, what should the foundation of future search technology look like? Which new needs are emerging? Which needs are obsolete? And which are merely imagined needs?

Looking back at the whole DeepSearch implementation, we carefully took stock: which elements were expected and indispensable, which we took for granted but turned out to be unnecessary, and which were completely unexpected yet ended up being crucial.

First, a long-context LLM that can produce output in a well-defined structured format (for example, conforming to a JSON Schema) is necessary. A reasoning model is probably also needed to improve action reasoning and query expansion.

Query expansion is also an absolute necessity, whether it is implemented with an SLM, an LLM, or a dedicated reasoning model; there is no way around it. But after this project, we found that an SLM is probably not suited to the task: query expansion must be inherently multilingual and cannot be limited to simple synonym replacement or keyword extraction. It has to be comprehensive, with a vocabulary covering many languages (which easily pushes the model to around 300 million parameters), and smart enough to think outside the box. So relying solely on an SLM for query expansion probably won't work.

Web search and web reading capabilities are unquestionably the top priority. Fortunately, our Reader (r.jina.ai) performs very well: it is both powerful and scalable. It also gave me ideas for how our search endpoint (s.jina.ai) can be improved in the next iteration.

The vector model is useful, but in completely unexpected places. We thought it would be used for in-memory retrieval, or in conjunction with a vector database to compress context, but it turns out that neither is needed. In the end, we found that the vector model works best for deduplication, which is essentially an STS (semantic text similarity) task. Since the number of queries and knowledge gaps is usually in the hundreds, it is completely sufficient to calculate cosine similarity directly in memory without using a vector database.

We did not use a Reranker model, but in theory it could help decide which URLs to visit first based on the query, URL title, and summary snippet. Multilingual capability is a basic requirement for both embedding and reranker models, since queries and questions are multilingual. Long-context handling helps embedding and reranker models but is not decisive. We did not run into any issues caused by embedding use, probably thanks to jina-embeddings-v3's excellent context length of 8192 tokens. Overall, jina-embeddings-v3 and jina-reranker-v2-base-multilingual remain my top choices: they offer multilingual support, SOTA performance, and good long-context handling.

The Agent framework ultimately proved unnecessary. In terms of system design, we prefer to stay close to the native capabilities of the LLM and avoid introducing needless abstraction layers. The Vercel AI SDK is very convenient for adapting to different LLM vendors and greatly reduces development effort: you can switch between Gemini Studio, OpenAI, and Google Vertex AI by changing a single line of code. Agent memory management makes sense, but whether it warrants a dedicated framework is debatable. I personally believe that over-reliance on frameworks can build a wall between the LLM and the developer, and that the syntactic sugar they provide can become a burden; many LLM/RAG frameworks have already demonstrated this. Embracing the LLM's native capabilities and staying unbound by frameworks is the wiser choice.
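For example, with the Vercel AI SDK, switching vendors really is a one-line change of the model object (a hedged sketch; model IDs and package versions change over time):

// Illustrative provider switching with the Vercel AI SDK.
import { generateObject } from 'ai';
import { z } from 'zod';
import { createOpenAI } from '@ai-sdk/openai';
import { createGoogleGenerativeAI } from '@ai-sdk/google';

const openai = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });
const google = createGoogleGenerativeAI({ apiKey: process.env.GEMINI_API_KEY });

const model = openai('gpt-4o');  // swap this single line, e.g. google('gemini-1.5-pro')

const { object } = await generateObject({
  model,
  schema: z.object({ action: z.enum(['search', 'visit', 'reflect', 'answer']) }),
  prompt: 'Decide the next action for the agent.',
});
console.log(object.action);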