Text Splitting: An Important Part of RAG

Careful text chunking in a RAG pipeline improves how accurately the model understands questions and answers them.
Key points:
1. The importance of text chunking in RAG
2. The implementation idea behind RecursiveCharacterTextSplitter
3. The advantages of chunking based on text structure, with a demonstration
In RAG, text splitting is one of the key steps that determine how accurately the model understands and answers.
Simple length-based chunking is easy to implement but tends to cut through the middle of a thought. LangChain provides RecursiveCharacterTextSplitter, which prioritizes paragraph and sentence boundaries and uses a recursive strategy to keep chunk sizes under control while preserving semantic coherence.
Contents:
1. Chunking based on text structure
2. Implementation ideas of RecursiveCharacterTextSplitter
2.1 Choosing a delimiter
2.2 Split text by delimiter
2.3 Arrange the segmented blocks
3. Code Implementation
4. Split results
5. Graphical display of blocks
Summary
One of the core steps in RAG is text splitting.
Its main job is to divide a large text into smaller, more reasonable segments so that the model can better understand, process, or store the content.
If an entire article is embedded without being broken down, the embedding is too coarse-grained and retrieval tends to return imprecise context, which leads to wrong answers. The quality of the splitting therefore directly affects the relevance and accuracy of the final answer.
The most basic chunking method splits the document by length. This simple approach guarantees that no chunk exceeds the specified size limit.
The main benefits of length-based splitting are a simple and clear implementation, consistent chunk sizes, and easy adaptation to the requirements of different models. The disadvantage is that it is rigid and ignores the structure of the text.
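To make the trade-off of length-based splitting concrete, here is a minimal sketch in plain Python. The function name split_by_length and the sample sentence are made up for illustration and are not part of LangChain; sizes are counted in characters.

```python
def split_by_length(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in fixed strides, ignoring words and sentences.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "RAG retrieves relevant chunks. Splitting decides what a chunk is."
for piece in split_by_length(sample, 20):
    print(repr(piece))
```

Every piece respects the limit, but the first one already ends in the middle of the word "relevant", which is exactly the rigidity described above.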
1. Chunking based on text structure
Typical texts are naturally organized into hierarchical units such as paragraphs, sentences, and words.
We can leverage this inherent structure to guide our splitting strategies, creating splits that preserve natural language fluency, maintain semantic coherence within splits, and accommodate different levels of text granularity.
LangChain's RecursiveCharacterTextSplitter implements this concept:
RecursiveCharacterTextSplitter tries to keep larger units such as paragraphs intact. If a unit exceeds the chunk size, it moves down to the next level (e.g., sentences). If necessary, this process continues down to the word level.
2. Implementation ideas of RecursiveCharacterTextSplitter
2.1 Choosing a delimiter
The splitter looks for the first delimiter in the provided list that occurs in the text. If one is found, all delimiters after it in the list are saved for possible recursive splitting later. If none is found, the last delimiter in the list (usually the empty string) is used.
For example:
Assume the separator list is ["\n\n", "\n", " ", ""] and the text is "Hello\nWorld":
First check "\n\n", which does not occur in the text.
Then check "\n", which does occur, so select "\n" as the delimiter.
Save [" ", ""] as new_separators for later use.
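# Excerpt from _split_text: default to the last separator, then look for the first one that matches
separator = separators[-1]
new_separators = []
for i, _s in enumerate(separators):
    # Escape the candidate unless separators are meant to be regular expressions
    _separator = _s if self._is_separator_regex else re.escape(_s)
    if _s == "":
        # The empty string always matches, so stop here
        separator = _s
        break
    if re.search(_separator, text):
        separator = _s
        # Keep the finer-grained separators for recursive calls
        new_separators = separators[i + 1:]
        break
# Regex form of the chosen separator, used for the actual split
_separator = separator if self._is_separator_regex else re.escape(separator)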
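The selection logic can be tried outside the class. The following is a standalone sketch that assumes literal (non-regex) separators; choose_separator is a hypothetical helper name, not a LangChain function.

```python
import re


def choose_separator(text: str, separators: list[str]) -> tuple[str, list[str]]:
    """Return the first separator found in text and the ones left for recursion."""
    separator = separators[-1]  # fallback when nothing matches
    remaining: list[str] = []
    for i, candidate in enumerate(separators):
        if candidate == "":
            # The empty string always "matches"; nothing finer remains.
            separator = candidate
            break
        if re.search(re.escape(candidate), text):
            separator = candidate
            remaining = separators[i + 1:]
            break
    return separator, remaining


# Same input as the worked example: "\n" is chosen, [" ", ""] is kept for later.
print(choose_separator("Hello\nWorld", ["\n\n", "\n", " ", ""]))
```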
2.2 Split text by delimiter
splits = _split_text_with_regex(text, _separator, self._keep_separator)
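What a split helper of this kind has to do can be sketched with re.split. This is a simplified illustration rather than LangChain's own code: split_with_separator is a hypothetical name, the separator is assumed to be an already escaped literal, and a kept separator is attached to the start of the piece that follows it.

```python
import re


def split_with_separator(text: str, separator: str, keep_separator: bool = True) -> list[str]:
    """Split text on an escaped separator pattern and drop empty pieces."""
    if not separator:
        # An empty separator means "split into single characters".
        pieces = list(text)
    elif keep_separator:
        # The capturing group makes re.split return the separators too:
        # [text0, sep, text1, sep, text2, ...]
        parts = re.split(f"({separator})", text)
        # Glue each separator onto the text that follows it.
        pieces = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    else:
        pieces = re.split(separator, text)
    return [p for p in pieces if p != ""]


print(split_with_separator("a\nb\nc", re.escape("\n")))
print(split_with_separator("a\nb\nc", re.escape("\n"), keep_separator=False))
```

Keeping the separator means the pieces can later be joined without losing the original line breaks.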
2.3 Arrange the segmented blocks
Each piece produced by the split is then processed:
If the piece is smaller than the specified chunk size, add it to a temporary list.
If the piece is at least the chunk size and finer delimiters remain, split it recursively.
If the piece is at least the chunk size but no delimiters remain, add it to the result as is.
Merge the pieces in the temporary list into chunks that respect the size limit.
Return the final list of chunks.
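# Excerpt from _split_text; final_chunks = [] is created at the top of the method
_good_splits = []
# The separator is re-inserted on merge only if it was dropped during the split
_separator = "" if self._keep_separator else separator
for s in splits:
    if self._length_function(s) < self._chunk_size:
        # Small enough: hold it for merging
        _good_splits.append(s)
    else:
        if _good_splits:
            # Flush the pending small pieces before handling the large one
            merged_text = self._merge_splits(_good_splits, _separator)
            final_chunks.extend(merged_text)
            _good_splits = []
        if not new_separators:
            # No finer separator left: keep the oversized piece as is
            final_chunks.append(s)
        else:
            # Recurse with the remaining, finer-grained separators
            other_info = self._split_text(s, new_separators)
            final_chunks.extend(other_info)
if _good_splits:
    merged_text = self._merge_splits(_good_splits, _separator)
    final_chunks.extend(merged_text)
return final_chunks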
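Putting the three steps together, the whole strategy fits in a short standalone sketch. It is a deliberate simplification and not LangChain's implementation: the names recursive_split and merge_pieces are hypothetical, separators are dropped on split and re-inserted on merge, there is no chunk overlap, and length is measured with len().

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join small pieces into chunks no longer than chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator present, recursing only into oversized pieces."""
    separator, remaining = separators[-1], []
    for i, candidate in enumerate(separators):
        if candidate == "":
            separator = candidate
            break
        if candidate in text:
            separator, remaining = candidate, separators[i + 1:]
            break
    raw = text.split(separator) if separator else list(text)
    pieces = [p for p in raw if p]

    final_chunks: list[str] = []
    good: list[str] = []  # pieces small enough to be merged
    for piece in pieces:
        if len(piece) < chunk_size:
            good.append(piece)
            continue
        if good:
            final_chunks.extend(merge_pieces(good, separator, chunk_size))
            good = []
        if remaining:
            final_chunks.extend(recursive_split(piece, remaining, chunk_size))
        else:
            final_chunks.append(piece)  # nothing finer to split on; keep as is
    if good:
        final_chunks.extend(merge_pieces(good, separator, chunk_size))
    return final_chunks


demo = "First paragraph.\n\nSecond paragraph is quite a bit longer than the first one.\n\nThird."
for chunk in recursive_split(demo, ["\n\n", "\n", " ", ""], chunk_size=30):
    print(repr(chunk))
```

In the demo only the second paragraph is too long, so it alone is split further at word boundaries, while the short paragraphs survive intact.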
3. Code Implementation
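# In newer LangChain releases the same class is also importable from the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter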
text = """
Zhu Xian
Author: Xiao Ding
Episode 1
Prologue
Time: Unknown, probably a long time ago.
Location: The vast land of China.
Since ancient times, human beings have witnessed all kinds of strange things in the world around them, such as lightning, thunder, strong winds and rainstorms, as well as natural and man-made disasters, countless casualties, and widespread grief, which are absolutely beyond human power to deal with or resist. Therefore, they believe that there are all kinds of gods above the nine heavens, and that below the nine netherworlds is also the place where the souls return, the palace of Yama.
Thus, the idea of gods and immortals spread throughout the world. Countless people sincerely bowed down and worshipped the various gods they had imagined and created, praying for blessings and complaining about their grievances, with incense burning profusely.
Since ancient times, all mortals have to die. But people hate death and love life, and the existence of Yama in the underworld adds a bit of fear and suffering. Under this background, the idea of immortality came into being.
Compared with other living species, humans may be at a disadvantage in terms of physique, but it is true that humans are the most intelligent creatures. Driven by the pursuit of immortality, generations of intelligent people have devoted their entire lives to studying and researching.
Up to now, although the true immortality has not been found, there are some people who practice Taoism and have comprehended some of the creation of heaven and earth. With their mortal bodies, they have mastered powerful forces and with the help of various secret treasures and magical instruments, they can actually shake the heaven and earth and have the power of thunder.
Some of the predecessors who had attained great enlightenment were said to have lived for thousands of years without dying. People in the world believed that attaining enlightenment would lead to becoming immortals, so more and more people devoted themselves to the path of cultivating the Tao.
China is a vast land, boundless and boundless. Only the Central Plains is the most fertile and rich, and eight out of ten people in the world live here. The wild lands in the southeast, northwest and northeast are dangerous and dangerous, with many ferocious beasts and birds, many poisonous miasma and many barbarians who eat raw meat and drink blood, so few people go there. However, it has been said since ancient times that there are descendants of the ancients who survived in the world, hiding in the deep mountains and valleys, and living for more than ten thousand years, but no one has seen them.
Today, there are countless people practicing Taoism. Due to the vastness of China and the abundance of extraordinary people, there are many different ways of practicing Taoism. The way to immortality has not yet been found, but there are gradually different schools and the difference between good and evil. As a result, there are many sectarianism, intrigues, and even fighting and killing.
When immortality seems so far away and unattainable, the power gained through cultivation has gradually become the goal of many people.
In today's world, the righteous are flourishing and the evil are retreating. The Central Plains is a land of beautiful mountains and rivers, with a large number of people and abundant resources. It is firmly occupied by the righteous families. Among them, the "Qingyun Sect", "Tianyin Temple" and "Fenxiang Valley" are the three pillars and the leaders.
This story begins with the "Qingyun Gate".
"""
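# chunk_size is measured with len() by default, i.e. in characters.
# chunk_overlap defaults to 200, which is larger than chunk_size here and is rejected, so set it explicitly.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=0)
docs = text_splitter.create_documents([text])
for doc in docs:
    print('-' * 50)
    print(doc)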
4. Split results
In this run, with chunk_size=150, RecursiveCharacterTextSplitter divided the text into 7 chunks.
When splitting, it first tries the natural break between paragraphs (\n\n) and only falls back to finer separators when a piece is still too large, so each chunk keeps a relatively self-contained topic.
This approach preserves the semantic coherence of each chunk while keeping its length within a reasonable range, which gives later processing and analysis a good foundation.
--------------------------------------------------
page_content='Zhu Xian
Author: Xiao Ding
Episode 1
Prologue
Time: Unknown, probably a long time ago.
Location: The vast land of China.
Since ancient times, human beings have witnessed all kinds of strange things in the world around them, such as lightning, thunder, strong winds and rainstorms, as well as natural and man-made disasters, countless casualties, and widespread grief, which are absolutely beyond human power to deal with or resist. Therefore, they believe that there are all kinds of gods above the nine heavens, and that below the nine netherworlds is also the place where the souls return, the palace of Yama.'
--------------------------------------------------
page_content='Thus, the idea of gods and immortals spread throughout the world. Countless people sincerely bowed down and worshipped the various gods they had imagined and created, praying for blessings and complaining about their grievances, with incense burning profusely.
Since ancient times, all mortals have to die. But people hate death and love life, and the existence of Yama in the underworld adds a bit of fear and suffering. Under this background, the idea of immortality came into being.'
--------------------------------------------------
page_content='Compared with other living species, humans may be at a disadvantage in terms of physique, but it is true that humans are the most intelligent creatures. Driven by the pursuit of immortality, generations of intelligent people have devoted their entire lives to studying and researching.'
--------------------------------------------------
page_content='Up to now, although the true immortality has not been found, there are some people who practice Taoism and have comprehended some of the creation of heaven and earth. With their mortal bodies, they have mastered powerful forces and with the help of various secret treasures and magical instruments, they can actually shake the heaven and earth and have the power of thunder.
Some of the predecessors who had attained great enlightenment were said to have lived for thousands of years without dying. People in the world believed that attaining enlightenment would lead to becoming immortals, so more and more people devoted themselves to the path of cultivating the Tao.'
--------------------------------------------------
page_content='China is a vast land, boundless and boundless. Only the Central Plains is the most fertile and rich, and eight out of ten people in the world live here. The wild lands in the southeast, northwest and northeast are dangerous and dangerous, with many ferocious beasts and birds, many poisonous miasma and many barbarians who eat raw meat and drink blood, so few people go there. However, it has been said since ancient times that there are descendants of the ancients who survived in the world, hiding in the deep mountains and valleys, and living for more than ten thousand years, but no one has seen them.'
--------------------------------------------------
page_content='Today, there are countless people practicing Taoism. Due to the vastness of China and the abundance of extraordinary people, there are many different ways of practicing Taoism. The way to immortality has not yet been found, but there are gradually different schools and the difference between good and evil. As a result, there are many sectarianism, intrigues, and even fighting and killing.'
--------------------------------------------------
page_content='When immortality seems so far away and unattainable, the power gained through cultivation has gradually become the goal of many people.
In today's world, the righteous are flourishing and the evil are retreating. The Central Plains is a land of beautiful mountains and rivers, with a large number of people and abundant resources. It is firmly occupied by the righteous families. Among them, the "Qingyun Sect", "Tianyin Temple" and "Fenxiang Valley" are the three pillars and the leaders.
This story begins with the "Qingyun Gate".'
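A quick way to check a split result against the configured size is to measure each chunk. The helper below is a small sketch: chunk_report is a hypothetical name, not part of LangChain, and it assumes the chunks are plain strings measured with len().

```python
def chunk_report(chunks: list[str], chunk_size: int) -> dict[str, int]:
    """Summarize a list of chunk strings against the configured chunk_size."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "longest": max(lengths, default=0),
        "shortest": min(lengths, default=0),
        # Pieces that could not be split any further may exceed the limit.
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


# With LangChain documents, pass [doc.page_content for doc in docs] instead.
print(chunk_report(["alpha beta", "gamma", "a much longer chunk of text"], chunk_size=12))
```

A non-zero "oversized" count usually means a piece contained none of the remaining separators, so it was kept whole.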
5. Graphical display of blocks
The chunking results can also be inspected visually at www.chunkviz.com
Summary
Text splitting is not just an implementation detail; it is a core design decision that affects the final quality of a RAG system.
Simple length-based chunking is easy to use but limited in effect. Structure-aware recursive chunking does better at preserving semantics and improving relevance.
To build a high-quality question-answering system, do not pick a chunking method arbitrarily; design it around the characteristics of the text and the application scenario.