How does RAG choose the best chunk size?

Explore how to optimize large language model applications through effective chunking strategies that improve the efficiency and accuracy of semantic search and chatbots.
Core content:
1. The role and importance of chunking in large language model applications
2. How chunking strategies affect semantic search and chatbots
3. The trade-offs in choosing an appropriate chunk size and method
Today I want to talk about the importance of chunking when building LLM (large language model) applications.
If you work on LLM-related projects, the "chunking" discussed here matters a great deal. Simply put, chunking means splitting a large text into smaller parts. This process is critical to optimizing the relevance of the content retrieved from a vector database, especially when we use an LLM to embed content.
Before any content can be indexed, it first needs to be embedded. The main purpose of chunking is to ensure that what we embed stays semantically coherent while carrying as little noise as possible.
For example, in semantic search, we need to index a large number of documents, each of which contains important information about a specific topic. Through an effective chunking strategy, we can ensure that the search results can accurately capture the user's query intent. If the chunks are too small or too large, the search results will be inaccurate and may even miss some important related content.
A general rule of thumb is: if a piece of text makes sense to a human without context, it will also make sense to a language model. Therefore, finding the best chunk size for your collection of documents is key to ensuring accurate and relevant search results.
Of course, chunking is not only useful for semantic search; it is also widely used in conversational AI (chatbots). In a chatbot, the embedded chunks are used to build context from a knowledge base, giving the chatbot a reliable basis for generating answers. In this scenario, choosing the right chunking strategy matters for two main reasons:
First, it determines whether the retrieved context is actually relevant to the question;
Second, it determines whether we can fit the retrieved text into the limited token window before sending it to an external model provider (such as OpenAI).
In some cases, such as when using a model like GPT-4 with a 32k context window, fitting large chunks of text may not be a problem. But we should still be cautious about overly large chunks, as they can hurt the relevance of the retrieved results.
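As a rough illustration of that constraint (not from the original article), here is a minimal sketch using the tiktoken library; the window size and the number of tokens reserved for the answer are illustrative assumptions:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # tokenizer for the target model

def fits_in_window(chunks, max_tokens=8192, reserved_for_answer=1024):
    # Sum the tokens of all retrieved chunks and leave room for the model's answer.
    total = sum(len(enc.encode(chunk)) for chunk in chunks)
    return total + reserved_for_answer <= max_tokens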
In this article, we will explore several different chunking methods and discuss the trade-offs to consider when choosing a chunk size and method. Finally, we will provide some recommendations to help you choose the best chunk size and method for your application.
Embeddings behave differently for short and long content
When we embed content, the resulting embedding behaves differently depending on the length of the content: short content (such as a sentence) and long content (such as a paragraph or an entire article) are handled quite differently.
When a sentence is embedded, the resulting vector focuses on the specific meaning of the sentence. Comparisons are naturally done at this granularity. However, this approach may ignore broader contextual information in the paragraph or document.
In contrast, when entire paragraphs or documents are embedded, the embedding process takes into account both the overall context and the relationships between sentences and phrases. The resulting vector representations are often more comprehensive, capturing the macro meaning and theme of the text. On the other hand, larger input texts may introduce noise, diluting the importance of individual sentences or phrases, making it difficult to find exact matches when querying the index.
In addition, the length of the query also affects the relevance between embeddings. Shorter queries (such as a single sentence or phrase) focus more on details and are therefore better suited to matching sentence-level embeddings. Longer queries (spanning multiple sentences or paragraphs) usually look for broader context or themes, and may therefore match paragraph- or document-level embeddings better.
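To make this concrete, here is a small illustration (not from the original article); it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, and simply compares a short query against a sentence-level and a paragraph-level embedding:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is chunking?"
sentence = "Chunking splits a large text into smaller parts before embedding."
paragraph = (
    "Chunking splits a large text into smaller parts before embedding. "
    "Vector databases index these embeddings so that semantic search can "
    "retrieve the pieces most relevant to a query. Overly large chunks can "
    "dilute the meaning of any single sentence inside them."
)

# Embed all three texts and compare the query against each granularity.
q, s, p = model.encode([query, sentence, paragraph])
print("query vs sentence:  ", util.cos_sim(q, s).item())
print("query vs paragraph: ", util.cos_sim(q, p).item())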
The index may also be non-homogeneous, i.e. contain embeddings of chunks of different sizes. This situation poses both challenges and potential positive effects. On the one hand, the relevance of query results may fluctuate due to differences in the semantic representation of long and short content. On the other hand, non-homogeneous indexes have the potential to capture a wider range of context and information, as chunks of different sizes represent different granularities of the text, which can make the system more flexible in responding to various types of queries.
Factors to consider in chunking strategy
There are several variables that influence the selection of the best chunking strategy, and these variables vary depending on the application scenario. Here are some key points to note:
1. What is the nature of the content you want to index? Are you dealing with long documents like articles or books, or short content like term definitions or chat messages? The answer determines not only which model is best suited to your goals, but also which chunking strategy to use.
2. Which embedding model are you using, and on which chunk sizes does it perform best? For example, sentence-transformer models work well on individual sentences, while models like text-embedding-ada-002 perform better on chunks of 256 or 512 tokens.
3. What is the expected length and complexity of user queries? Are they short and specific, or long and complex? This also influences the chunking strategy, to ensure a better match between embedded queries and embedded chunks.
4. How will the results be used in your application? For example, will they feed semantic search, question answering, summarization, or something else? If the results need to be passed to another LLM with a token limit, you must take that into account and cap the chunk size based on how many chunks you want to fit into the request.
Answering these questions will help you develop a chunking strategy that balances performance with accuracy, resulting in more relevant query results.
Chunking methods
There are many ways to chunk text, and each has its own advantages and disadvantages in different situations. By weighing the pros and cons of each method, we can find the one that best suits our scenario.
Fixed-size chunking
This is the most common and straightforward way to chunk: we just need to decide how many tokens each chunk will contain, and we can choose whether to have some overlap between them. Generally, we keep some overlap so that the semantic context is not lost between chunks. Fixed-size chunking is the best path in most common cases. Compared to other forms of chunking, this method is computationally cheap and easy to use because it does not require any NLP library.
Here is an example of fixed-size chunking using LangChain:
from langchain.text_splitter import CharacterTextSplitter

text = "..."  # your text
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=256,
    chunk_overlap=20
)
docs = text_splitter.create_documents([text])
Content-Aware Chunking
These strategies take advantage of the nature of the content being chunked and apply more sophisticated splitting methods. Here are a few examples:
Sentence segmentation
As mentioned earlier, many models excel at embedding sentence-level content, so it is natural to chunk by sentence. There are many methods and tools for this, including:
Naive segmentation: The simplest approach is to segment sentences by period (".") and line break. While this method is quick and simple, it does not take into account all possible edge cases. Here is a very simple example:
text = "..."  # your text
docs = text.split(".")
NLTK: The Natural Language Toolkit (NLTK) is a popular Python library for processing human language data. It provides a sentence tokenizer that helps split text into sentences, creating more meaningful chunks. For example, to use NLTK with LangChain, you can do this:
from langchain.text_splitter import NLTKTextSplitter

text = "..."  # your text
text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)
spaCy: spaCy is another powerful Python library for NLP tasks. It provides sophisticated sentence segmentation that can efficiently divide text into independent sentences while better preserving context. For example, using spaCy with LangChain, you can do this:
from langchain.text_splitter import SpacyTextSplitter

text = "..."  # your text
text_splitter = SpacyTextSplitter()
docs = text_splitter.split_text(text)
Recursive chunking
Recursive chunking divides the input text into smaller chunks using a set of delimiters in a hierarchical, iterative manner. If the first attempt fails to produce chunks of the desired size or structure, the method recursively calls itself on the resulting pieces until the desired chunk size or structure is achieved. This means that while the chunks will not all be exactly the same size, they will still "tend" toward similar sizes.
Here is an example of how to use LangChain for recursive chunking:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # your text
text_splitter = RecursiveCharacterTextSplitter(
    # Set a very small chunk size, just for demonstration purposes.
    chunk_size=256,
    chunk_overlap=20
)
docs = text_splitter.create_documents([text])
Specialized chunking
Markdown and LaTeX are two common examples of structured, formatted content. In these cases, specialized chunking methods can preserve the original structure during the chunking process.
Markdown: Markdown is a lightweight markup language commonly used for formatting text. By recognizing Markdown syntax (such as headings, lists, and code blocks), content can be split intelligently according to its structure and hierarchy, producing more semantically consistent chunks. For example:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = "..."
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
LaTeX: LaTeX is a document preparation system and markup language commonly used in academic papers and technical documentation. By parsing LaTeX commands and environments, you can create chunks that respect the logical organization of the content (such as chapters, subsections, and equations), resulting in more accurate, contextual results. For example:
from langchain.text_splitter import LatexTextSplitter

latex_text = "..."
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])
Semantic Chunking
This is a newer chunking method first proposed by Greg Kamradt. In his code example, Kamradt points out that a global chunk size is too crude a mechanism to account for the meaning of the individual passages within a document: with a fixed size we never know whether we are lumping together paragraphs that have nothing to do with each other.
The code example is here: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
Semantic analysis can instead be used to create chunks made of sentences that share the same theme or topic.
Here are the steps required to make semantic chunking work:
1. Break the document into sentences.
2. Create sentence groups: for each sentence, create a group containing the sentences before and after it. The group is "anchored" by the sentence that created it. You can decide how many neighbouring sentences to include, but every sentence in the group is associated with a single "anchor" sentence.
3. Generate an embedding for each sentence group and associate it with its anchor sentence.
4. Compare distances between consecutive groups: walking through the document sentence by sentence, the semantic distance between the current sentence group and the previous one stays low as long as the topic remains the same. A higher semantic distance indicates that the topic has changed, which effectively marks the boundary between one chunk and the next.
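Here is a minimal sketch of these steps (not Kamradt's original code); it assumes the sentence-transformers and numpy packages, the all-MiniLM-L6-v2 model, and an illustrative distance threshold:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences, window=1, threshold=0.25):
    # 1-2. Build a group around each "anchor" sentence from its neighbours.
    groups = [" ".join(sentences[max(0, i - window): i + window + 1])
              for i in range(len(sentences))]
    # 3. Embed each group (normalized so dot product equals cosine similarity).
    embs = model.encode(groups, normalize_embeddings=True)
    # 4. Compare consecutive groups; a large distance suggests a topic change.
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        distance = 1 - float(np.dot(embs[i - 1], embs[i]))
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks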
LangChain provides a semantic chunker based on Kamradt's work.
The documentation is here: https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/
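Based on that documentation page, usage looks roughly like this (it assumes the langchain_experimental and langchain_openai packages and an OpenAI API key):
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text = "..."  # your text
# The chunker embeds sentences and splits where semantic distance spikes.
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([text])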
Find the best chunk size for your application
If common chunking approaches (such as fixed-size chunking) do not readily fit your use case, the following pointers can help you work out an optimal chunk size.
1. Preprocess the data: before determining the best chunk size, preprocess the data to ensure quality. For example, if the data was scraped from the web, you may want to remove HTML tags and other elements that only add noise.
2. Select a range of chunk sizes: once preprocessing is complete, choose a range of candidate chunk sizes to test. As mentioned earlier, the choice should take into account the nature of the content (e.g. short messages or long documents) and the embedding model you will use and its capabilities (e.g. token limits). The goal is to maintain accuracy while preserving context. Start by exploring a variety of sizes, including smaller chunks (e.g. 128 or 256 tokens) to capture more fine-grained semantic information and larger chunks (e.g. 512 or 1024 tokens) to retain more context.
3. Evaluate the performance of each chunk size: using a representative dataset, create embeddings for each candidate chunk size and save them in separate indexes. Then run a series of queries, evaluate the quality of the results, and compare the chunk sizes against each other. This is an iterative process: keep testing different sizes against different queries until you find the chunk size that best suits your content and expected queries (a sketch of such an evaluation loop follows this list).
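As a rough illustration of step 3 (not from the original article), here is one way such an evaluation loop could look; it assumes the langchain, langchain_community, langchain_openai, and faiss-cpu packages, an OpenAI API key, and that you judge the printed results by hand or with your own scoring function:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

text = "..."            # a representative sample of your corpus
test_queries = ["..."]  # queries you expect users to ask

embeddings = OpenAIEmbeddings()
for chunk_size in (128, 256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 10,
    )
    docs = splitter.create_documents([text])
    index = FAISS.from_documents(docs, embeddings)
    for query in test_queries:
        results = index.similarity_search(query, k=3)
        # Inspect (or score) the retrieved chunks for each chunk size.
        print(chunk_size, query, [d.page_content[:80] for d in results])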
Conclusion
There is no one-size-fits-all solution for chunking, so what works for one use case may not work for another. Hopefully this post will help you better understand how to choose the right chunking strategy for your application.