Optimal text segment selection and URL reordering in DeepSearch/DeepResearch


In-depth exploration of how DeepSearch/DeepResearch can effectively improve search quality.

Core content:
1. Using the late-chunking algorithm to extract the best text segments from a web page
2. Using a reranker to intelligently re-rank URLs
3. Pragmatic search-system design based on the 80/20 principle



  1. Extracting optimal text segments from long web pages: how to use the late-chunking algorithm to select the most relevant snippets from long web page content.

  2. Reranking the collected URLs: how to use a reranker to help the LLM agent intelligently choose which of hundreds of URLs to crawl next.

Some readers may still remember the conclusion of our previous article: "In DeepSearch, the embeddings model is only suitable for STS (semantic textual similarity) tasks such as query deduplication, and the reranker did not even make it into our initial DeepSearch implementation."

Now it turns out that both types of recall models still have value, just not in the way they are conventionally used. We have always followed the 80/20 principle in search: we will not force a model into the pipeline for sentimental reasons, or to prove our market presence as an embeddings and reranker provider. We are very 80/20, very pragmatic, so pragmatic that we only care about the most essential needs of the search system.

So, after weeks of trial and iteration, we found some unconventional but very effective uses of embeddings and rerankers in DeepSearch/DeepResearch systems.

Selecting the best text segments from long text

The problem is this: after reading the web page content with Jina Reader, we need to put it into the agent's context as a piece of knowledge to reason over. Cramming the entire content into the LLM context is the most convenient approach, but considering token cost and generation speed it is definitely not the best one. In practice, we need to find the parts of the content most relevant to the question and add only those parts as knowledge to the agent's context.

Note: this refers to cases where the content is still too long even after Jina Reader has cleaned it up into Markdown, for example long GitHub Issues, Reddit posts, forum discussions, and blog articles.

LLM-based filtering has the same cost and latency problems, so we looked for small-model solutions: we need models that are smaller and cheaper but still multilingual. This is a key factor, because we cannot guarantee that the questions or documents will always be in Chinese.

On one side we have the question (the original query or an "information gap" question); on the other side is a large amount of Markdown content, most of which is irrelevant. We need to select the fragments most relevant to the question. This is very similar to the chunking problem the RAG community has been working on since 2023: use a retriever model to retrieve only the relevant chunks and place them in the context window for summarization.

However, there are two key differences in our case:

  1. A finite number of text chunks across a finite number of documents.

Assuming each chunk is about 500 tokens, a typical long web document runs from roughly 200,000 tokens (median) to 1 million tokens (99th percentile). We use Jina Reader to crawl 4-5 URLs at each step, which yields on the order of a few hundred text chunks, i.e. a few hundred vectors and a few hundred cosine similarities. This is easily handled in memory with JavaScript; no vector database is needed.

  2. We need consecutive chunks of text to form effective knowledge summaries.

We cannot accept a summary made of scattered sentences like [1-2, 6-7, 9, 14, 17, ...]. A more useful knowledge summary looks like [3-15, 17-24, ...], which better preserves the coherence of the text. This makes it easier for the LLM to copy and quote from the knowledge source, and also reduces hallucinations.

The rest are the usual considerations developers complain about: each chunk cannot be too long, because the embedding model cannot handle too much context; chunking causes context loss and makes the chunk vectors effectively independent and identically distributed; and how do you find the best boundaries that preserve both readability and semantics? If you know what we are talking about, you have probably been troubled by the same problems in your own RAG implementation.

But long story short: Late Chunking with jina-embeddings-v3 solves all three problems nicely. Late Chunking preserves the context information of each chunk and is insensitive to boundaries, and jina-embeddings-v3 is also state of the art on asymmetric multilingual retrieval tasks. Interested readers can read our blog post or paper for details. The overall implementation is as follows.

☞ "Late Chunking" in long-context embedding models

https://arxiv.org/pdf/2409.04701

The snippet-selection algorithm works much like a one-dimensional convolution (Conv1D). The process first splits a long document into fixed-length chunks, then vectorizes these chunks with jina-embeddings-v3 using Late Chunking. After computing the similarity score between each chunk and the question, a sliding window moves over the similarity scores to find the window with the highest average.

Here is the schematic code: use Late Chunking and Conv1D-style mean pooling to pick out the passages most relevant to the question.

function cherryPick(question, longContext, options) {
  // If the content is already short enough, return it as-is.
  if (longContext.length < options.snippetLength * options.numSnippets)
    return longContext;

  // Split into fixed-length chunks and embed them with late chunking.
  const chunks = splitIntoChunks(longContext, options.chunkSize);

  const chunkEmbeddings = getEmbeddings(chunks, "retrieval.passage");
  const questionEmbedding = getEmbeddings([question], "retrieval.query")[0];

  // Cosine similarity between the question and every chunk.
  const similarities = chunkEmbeddings.map(embed =>
    cosineSimilarity(questionEmbedding, embed));

  const chunksPerSnippet = Math.ceil(options.snippetLength / options.chunkSize);
  const snippets = [];
  const similaritiesCopy = [...similarities];

  for (let i = 0; i < options.numSnippets; i++) {
    let bestStartIndex = 0;
    let bestScore = -Infinity;

    // Slide a window over the similarity scores (Conv1D-style mean pooling)
    // and keep the window with the highest average score.
    for (let j = 0; j <= similarities.length - chunksPerSnippet; j++) {
      const windowScores = similaritiesCopy.slice(j, j + chunksPerSnippet);
      const windowScore = average(windowScores);

      if (windowScore > bestScore) {
        bestScore = windowScore;
        bestStartIndex = j;
      }
    }

    // Map chunk indices back to character offsets in the original text.
    const startIndex = bestStartIndex * options.chunkSize;
    const endIndex = Math.min(startIndex + options.snippetLength, longContext.length);
    snippets.push(longContext.substring(startIndex, endIndex));

    // Mask the chosen window so the next iteration picks a different region.
    for (let k = bestStartIndex; k < bestStartIndex + chunksPerSnippet; k++)
      similaritiesCopy[k] = -Infinity;
  }

  return snippets.join("\n\n");
}
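
The snippet above leans on a few helpers (splitIntoChunks, cosineSimilarity, average) that are not shown; getEmbeddings wraps the Jina Embeddings API call shown below. A minimal sketch of the pure helpers, assuming simple fixed-size character chunks rather than the exact logic in node-DeepResearch, might look like this:

// Minimal helper sketch -- illustrative, not the exact node-DeepResearch code.
// Fixed-size character chunks, plain cosine similarity, arithmetic mean.
function splitIntoChunks(text, chunkSize) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize)
    chunks.push(text.substring(i, i + chunkSize));
  return chunks;
}

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}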

When calling the Jina Embeddings API, remember to set task to retrieval.passage, enable late_chunking, and also set truncate to true:

await axios.post(
  'https://api.jina.ai/v1/embeddings' ,
  {
    model:  "jina-embeddings-v3" ,
    task:  "retrieval.passage" ,
    late_chunking:  true ,
    input: chunks,
    truncate:  true
  },
  { headers });

To vectorize the question, set task to retrieval.query and turn off late_chunking.
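
For reference, here is a sketch of the corresponding query-side call, mirroring the passage-side request above (the headers object is assumed to carry your API key):

await axios.post(
  'https://api.jina.ai/v1/embeddings',
  {
    model: "jina-embeddings-v3",
    task: "retrieval.query",   // query side of the asymmetric retrieval task
    late_chunking: false,      // late chunking only matters on the passage side
    input: [question],
    truncate: true
  },
  { headers });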

The complete implementation code can be found on GitHub: https://github.com/jina-ai/node-DeepResearch/blob/main/src/tools/jina-latechunk.ts

Sorting URLs for "Read Next"

The problem is this: during each long DeepSearch session, you collect a pile of URLs from search engine results pages (SERPs), and every web page you open reveals more new links. Even after deduplication, there are easily hundreds of URLs. Feeding them all to the LLM is, again, not a good idea: it wastes precious context length and, worse, we found that the LLM then basically chooses at random. So we need a way to guide the LLM toward the URLs most likely to contain the answer.

curl https://r.jina.ai/https://example.com \
  -H  "Accept: application/json"  \
  -H  "Content-Type: application/json"  \
  -H  "X-Retain-Images: none"  \
  -H  "X-Md-Link-Style: discarded"  \
  -H  "X-Timeout: 20"  \
  -H  "X-With-Links-Summary: all"

This is the configuration we recommend for Jina Reader when crawling web pages in DeepSearch. It collects all the links on the page into the links field while removing them from the content field.
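
As a rough sketch of how this fits into code, the same request can be issued programmatically and the returned links collected as crawl candidates. The exact JSON field names (here assumed to be data.content and data.links) should be checked against the actual Reader response:

// Hedged sketch: fetch a page through Jina Reader and gather its links for
// later ranking. Response field names are assumptions -- verify against the
// real JSON returned by r.jina.ai.
const candidateUrls: { url: string; anchorText: string }[] = [];

const resp = await axios.get('https://r.jina.ai/https://example.com', {
  headers: {
    'Accept': 'application/json',
    'X-Retain-Images': 'none',
    'X-Md-Link-Style': 'discarded',
    'X-Timeout': '20',
    'X-With-Links-Summary': 'all'
  }
});

const pageContent = resp.data.data.content;                       // cleaned Markdown, links stripped out
const links: Record<string, string> = resp.data.data.links ?? {}; // assumed map of anchor text -> URL
for (const [anchorText, url] of Object.entries(links)) {
  candidateUrls.push({ url, anchorText });
}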

You can think of this problem as "PageRank in context," except that we're scoring hundreds of URLs in a single session.

To compute a composite score for each URL, we take several factors into account: last update time, domain frequency, the URL's path structure, and most importantly, semantic relevance to the question. Crucially, we can only use information available before the URL is actually clicked; a sketch of how these signals might be combined appears after the list below.

1. Frequency signal: if a URL appears multiple times across different sources, it gets more weight. In addition, if a domain shows up frequently in the search results, URLs from that domain receive extra points; popular domains generally tend to host more authoritative content.

2. Path structure: we analyze the URL's path structure to determine which content is clustered together. URLs that share the same path prefix score higher, but the bonus decays as the path gets deeper.

3. Semantic relevance: we use jina-reranker-v2-base-multilingual to evaluate the semantic relevance between the question and the textual information of each URL (such as its title and snippet); this is a classic reranking problem. The textual information for each URL comes from the following sources:

  • The title and snippet returned by the Search Engine Results Page (SERP) API (the https://s.jina.ai/ API with 'X-Respond-With': 'no-content' returns only titles and snippets, without the full content).

  • The anchor text of the URL on the pages where it appears (the https://r.jina.ai interface with 'X-With-Links-Summary': 'all' returns the summary information, i.e. the anchor text, of every link on the page).

4. Last update time: some DeepSearch queries are highly time-sensitive, so newer URLs are generally more valuable. However, without Google-scale indexing, it is hard to determine a page's last update time accurately. We combine the following signals to produce a timestamp with a confidence score, so the freshest content can be surfaced first when needed:

  • The filtering functions provided by the SERP API (such as the tbs parameter of s.jina.ai, which can filter by time).

  • HTTP Header information analysis (such as Last-Modified and ETag fields).

  • Metadata extraction (such as meta tags and Schema.org timestamps).

  • Content pattern recognition (recognizing dates visible in HTML).

  • CMS platform-specific metrics (such as WordPress, Drupal, Ghost, etc.)

5. Restricted content: some social media platforms restrict content or put it behind a paywall, and without logging in there is no legal way to access it. We therefore actively maintain a blacklist of such problematic URLs and domains, lower their rankings, and avoid wasting compute on content we cannot access.

6. Domain diversity: sometimes the top-scoring URLs all come from the same domain, which can trap DeepSearch in a "local optimum" and hurt the quality of the final result, for example when every top-ranked URL comes from StackOverflow. To improve result diversity, we apply an explore-exploit strategy: select only the top-K URLs from each domain.
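
As promised above, here is a hedged sketch of how such signals could be folded into a single weight per URL, with a per-domain cap to keep the list diverse. The interface fields, weights, and normalization are illustrative assumptions, not the actual logic in url-tools.ts:

// Illustrative scoring sketch only -- field names and weights are made up.
interface UrlCandidate {
  url: string;
  frequency: number;        // how many sources mentioned this URL
  domainFrequency: number;  // how often its domain appears across SERPs
  rerankScore: number;      // semantic relevance from the reranker, 0..1
  freshnessScore: number;   // 0..1, derived from headers/metadata with some confidence
}

function scoreUrl(c: UrlCandidate): number {
  const pathDepth = new URL(c.url).pathname.split('/').filter(Boolean).length;
  const pathBonus = 1 / (1 + pathDepth);               // deeper paths earn a smaller bonus
  return 0.4 * c.rerankScore
       + 0.2 * Math.log1p(c.frequency)
       + 0.1 * Math.log1p(c.domainFrequency)
       + 0.2 * c.freshnessScore
       + 0.1 * pathBonus;
}

function rankWithDomainDiversity(candidates: UrlCandidate[], topKPerDomain = 2): UrlCandidate[] {
  const perDomain = new Map<string, number>();
  return candidates
    .map(c => ({ c, score: scoreUrl(c) }))
    .sort((a, b) => b.score - a.score)
    .filter(({ c }) => {
      const host = new URL(c.url).hostname;
      const used = perDomain.get(host) ?? 0;
      if (used >= topKPerDomain) return false;          // explore: cap picks per domain
      perDomain.set(host, used + 1);
      return true;
    })
    .map(({ c }) => c);
}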

The complete code implementation of URL sorting can be found on our Github: https://github.com/jina-ai/node-DeepResearch/blob/main/src/utils/url-tools.ts#L192
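
For the semantic-relevance signal (factor 3 above), the reranker itself is a single API call. Below is a hedged sketch against the Jina Reranker API, reusing the hypothetical candidateUrls list from earlier (here assumed to also carry the title and snippet from the SERP API); the request and response field names should be verified against the current API documentation:

// Score each URL's pre-click text (title, snippet, anchor text) against the
// question with jina-reranker-v2-base-multilingual. Field names follow our
// understanding of the Jina Rerank API and may need adjusting.
const rerankResp = await axios.post(
  'https://api.jina.ai/v1/rerank',
  {
    model: 'jina-reranker-v2-base-multilingual',
    query: question,
    documents: candidateUrls.map(c => `${c.title ?? ''} ${c.snippet ?? ''} ${c.anchorText ?? ''}`.trim()),
    top_n: candidateUrls.length
  },
  { headers });

// Attach each relevance score back onto its candidate for the composite scoring above.
for (const r of rerankResp.data.results) {
  candidateUrls[r.index].rerankScore = r.relevance_score;
}

The ranked URLs, together with their final weights, are then surfaced to the agent inside its prompt, as in the example below: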

<action-visit>
- Crawl and read full content from URLs, you can get the fulltext, last updated datetime etc of any URL.
- Must check URLs mentioned in <question> if any
- Choose and visit relevant URLs below for more knowledge. higher weight suggests more relevant:
<url-list>
  + weight: 0.20 "https://huggingface.co/docs/datasets/en/loading": "Load - Hugging FaceThis saves time because instead of waiting for the Dataset builder download to time out, Datasets will look directly in the cache. Set the environment ...Some datasets may have more than one version based on Git tags, branches, or commits. Use the revision parameter to specify the dataset version you want to load ..."
  + weight: 0.20 "https://huggingface.co/docs/datasets/en/index": "Datasets - Hugging Face? Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks. Load a dataset in a ..."
  + weight: 0.17 "https://github.com/huggingface/datasets/issues/7175": "[FSTimeoutError] load_dataset · Issue #7175 · huggingface/datasetsWhen using load_dataset to load HuggingFaceM4/VQAv2, I am getting FSTimeoutError. Error TimeoutError: The above exception was the direct cause of the following ..."
  + weight: 0.15 "https://github.com/huggingface/datasets/issues/6465": "`load_dataset` uses out-of-date cache instead of re-downloading a ...When a dataset is updated on the hub, using load_dataset will load the locally cached dataset instead of re-downloading the updated dataset."
  + weight: 0.12 "https://stackoverflow.com/questions/76923802/hugging-face-http-request-on-data-from-parquet-format-when-the-only-way-to-get-i": "Hugging face HTTP request on data from parquet format when the ...I've had to get the data from their data viewer using the parquet option. But when I try to run it, there is some sort of HTTP error. I've tried downloading ..."
</url-list>
</action-visit>

Summary

Since we open-sourced DeepSearch on February 2, 2025, we have found two engineering details that significantly improve quality. Interestingly, both use multilingual embedding and reranker models "in the context window", operating at a tiny scale compared with the large pre-computed indexes these models typically require.

This may indicate that future search technology will develop in a polarized direction. We can use Kahneman’s dual process theory to understand this trend:

  •  Fast-think (grep, BM25, SQL): fast, rule-based pattern matching with very low computational cost.

  •  Slow-think (LLM): full reasoning with deep contextual understanding, but computationally expensive.

  •  Mid-think (embeddings, rerankers, and other recall models): some semantic understanding, better than naive pattern matching, but with reasoning ability far below an LLM.

One possibility is that two-tier search architectures will become increasingly popular: lightweight, efficient SQL/BM25 handles retrieval on the input side, and the results are fed directly to the LLM on the output side. The remaining value of the mid-tier models then shifts to tasks inside a specific context window, such as filtering, deduplication, and ranking, where running full LLM reasoning would be inefficient.

Even so, selecting key snippets and ranking URLs remain fundamental steps that directly affect the quality of DeepSearch/DeepResearch systems. We hope our findings help you improve your own systems.

Query expansion is another key quality factor. We are actively evaluating approaches ranging from simple prompt-based rewriting to small models and reasoning-based methods. Stay tuned for our follow-up findings in this direction!