Recommendation
An in-depth look at efficient large-scale document chunking techniques for improving the processing efficiency and accuracy of an AI knowledge base.
Core content:
1. The technical challenges of feeding large-scale documents into an AI knowledge base
2. The core idea and advantages of document chunking
3. A technical comparison and code implementations of five mainstream chunking methods
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Hello everyone! Recently, while developing an AI testing assistant based on a large language model, I ran into a tricky problem: how do you effectively feed a large volume of requirement documents into an AI knowledge base?

Why is this a problem? In AI technology communities, developers often ask: "How do I handle hundreds of pages of documents when building an AI knowledge base application?"

Frankly, I struggled with this at first too. Feeding an entire document directly into a large model causes two obvious problems:

1. The model reports that the context window capacity is insufficient (i.e., the input exceeds the model's processing limit).
2. Even with a model that has a very long context window, too much irrelevant information lowers the model's comprehension accuracy and generation quality.

It is as inefficient as reading a colleague an entire reference manual before asking them one specific question.

Document chunking: a key technology for improving AI processing efficiency

What's the solution? Document chunking!

The core idea of this technique is to decompose large documents into semantically coherent small units. This is not a crude split at a fixed number of words; it must account for semantic integrity and ensure that each chunk contains meaningful, complete information.

Next, I will introduce five mainstream chunking methods in detail, from basic applications to advanced techniques, covering the needs of different scenarios. The code implementations are based on the LlamaIndex framework.

Technical comparison of five chunking methods

1️⃣ Fixed-length chunking: a basic and practical solution

Split the text into chunks of a fixed number of tokens or characters, keeping an overlap region to preserve contextual coherence.

✅ Simple to implement, suitable for rapid deployment
✅ Fast processing and low computing resource requirements
❌ May break semantic integrity (e.g., split a complete concept across chunks)
❌ Low accuracy on documents with complex structure

Implementation code example:

```python
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    chunk_size=512,    # for Chinese documents, around 384 tokens is recommended
    chunk_overlap=64,  # overlap region to preserve contextual coherence
)
nodes = parser.get_nodes_from_documents(documents)
```
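For context, here is a minimal end-to-end sketch of where `documents` comes from and how the overlap shows up in adjacent chunks; the `docs/` directory is illustrative, not part of the original example:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser

# Load raw files into Document objects (directory path is illustrative)
documents = SimpleDirectoryReader("docs/").load_data()

parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)

# Because of chunk_overlap=64, the tail of one chunk reappears
# at the head of the next, preserving context across the boundary
print(nodes[0].text[-200:])
print(nodes[1].text[:200])
```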
2️⃣ Structure-aware chunking: respect the document's original structure

Identify the internal structure of the document (such as Markdown heading levels or HTML tags) and split it along the document's logical organization.

✅ Preserves the integrity of the document's original logical structure
✅ Avoids separating semantically related paragraphs or sections
❌ Only applies to documents with explicit structural markup
❌ Limited effectiveness on unstructured plain text

Implementation code example (Markdown documents):

```python
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
```
HTML document processing:

```python
from llama_index.core.node_parser import HTMLNodeParser

parser = HTMLNodeParser(tags=["p", "h1"])  # specify which tags to extract
nodes = parser.get_nodes_from_documents(html_docs)
```
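Whichever parser produced the `nodes`, the next step toward a queryable knowledge base is the same. A minimal sketch, assuming default OpenAI credentials for the embedding model and LLM; the query string is illustrative:

```python
from llama_index.core import VectorStoreIndex

# Build a vector index over the parsed nodes and query it
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("Which section describes the login requirements?"))
```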
3️⃣ Sliding-window chunking: keeping context coherent

Move a fixed-size window over the text, capturing a group of adjacent sentences each time to form snippets that carry their surrounding context.

✅ Effectively maintains contextual coherence and reduces information gaps
✅ Particularly suitable for processing continuous data streams
❌ Generates a lot of overlapping content, so storage efficiency is low
❌ The window parameters need careful tuning, which affects processing quality

Implementation code example:

```python
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of sentences to include on each side
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
```
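The window metadata only pays off at query time. A sketch of the retrieval side, using LlamaIndex's `MetadataReplacementPostProcessor` to swap each retrieved sentence for the surrounding window configured above (this assumes the `documents` list from earlier):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# Retrieval matches single sentences, but the postprocessor replaces each
# hit with the 3-sentences-per-side window stored under "window"
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
```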
4️⃣ Semantic embedding chunking: segmentation based on semantic similarity

The semantic splitter uses an embedding model to judge the semantic similarity between sentences and adaptively chooses breakpoints (splitting where similarity falls below a set threshold), which ensures that each chunk contains semantically related sentences.

✅ Maintains semantic integrity and avoids fragmenting concepts
✅ Automatically identifies topic transitions, e.g. from "Technical Description" to "Application Case"
❌ Heavy computing resource consumption, with noticeable overhead on large documents
❌ Parameter tuning is complex and requires multiple attempts

Implementation code example:

```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # split where semantic distance exceeds the 95th percentile
    embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)
```
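Since the percentile threshold is the main tuning knob, one cheap way to get a feel for it is to sweep a few values and compare chunk counts; a sketch assuming the `documents` list from earlier:

```python
# Lower thresholds create breakpoints more aggressively, producing more,
# smaller chunks; higher thresholds keep longer chunks together
for threshold in (90, 95, 99):
    trial_splitter = SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=threshold,
        embed_model=embed_model,
    )
    trial_nodes = trial_splitter.get_nodes_from_documents(documents)
    print(f"threshold={threshold}: {len(trial_nodes)} chunks")
```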
5️⃣ LLM dynamic segmentation: advanced intelligent chunking based on large models

Directly leverage the deep understanding capabilities of large language models, letting the model itself determine the optimal chunking strategy and boundaries.

✅ Strongest semantic understanding, able to capture complex conceptual associations
✅ Supports multi-dimensional, multi-level chunking strategies
❌ High API call costs, unsuitable for budget-constrained projects
❌ Relatively slow, unsuitable for scenarios with strict real-time requirements

Implementation code example:

```python
import json

from llama_index.llms.openai import OpenAI

def llm_chunking(text):
    llm = OpenAI(model="gpt-4-turbo")
    prompt = f"""Divide the following technical document into logical units, \
each containing one complete technical concept:
{text}
Return JSON: [{{"title": "Unit title", "content": "Text content"}}]"""
    response = llm.complete(prompt)
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        raise ValueError("LLM response format error")
```
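To plug the result back into a LlamaIndex pipeline, the JSON units can be wrapped as nodes. This is illustrative glue code, not part of the original example; `long_text` stands in for your document text:

```python
from llama_index.core.schema import TextNode

# Wrap each LLM-produced unit as a node, keeping the title as metadata
units = llm_chunking(long_text)
nodes = [
    TextNode(text=unit["content"], metadata={"title": unit["title"]})
    for unit in units
]
```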
Comparison of technical indicators across the five methods

| Chunking method | Processing speed | Semantic preservation | Implementation difficulty | Applicable scenarios |
| --- | --- | --- | --- | --- |
| Fixed-length chunking | ⭐⭐⭐⭐ | ⭐ | ⭐ | Rapid prototyping systems |
| Sliding-window chunking | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ | Conversation records, interview transcripts |
| Structure-aware chunking | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Documents in explicit formats such as Markdown/HTML/JSON |
| Semantic embedding chunking | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Long narrative text |
| LLM dynamic segmentation | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Various complex documents |
Practical case: the chunking strategy in my AI testing assistant

In the AI testing assistant project I developed, I adopted a semantic chunking strategy using the embedding model from the Bailian platform. I chose this approach because test requirement documents usually contain complex technical details and logical relationships, and semantic integrity must be preserved for the tests to be understood accurately. The implementation code is as follows:

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser

# dashscope_embed_model() is the project's own helper that returns
# the Bailian (DashScope) embedding model
Settings.embed_model = dashscope_embed_model()

# Semantic segmentation configuration
Settings.node_parser = SemanticSplitterNodeParser(
    buffer_size=128,                     # sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # the 95% threshold finds split points automatically
    embed_model=dashscope_embed_model(),
)
```
Technical reflection: although semantic chunking performs well at preserving conceptual integrity, it consumes substantial computing resources and fails to exploit the structural characteristics of documents. I am therefore designing a hybrid chunking strategy as an optimization (a sketch follows the list):

- Use structure-aware chunking for structured documents such as Markdown/HTML
- Use semantic chunking for unstructured plain text
- Use fixed-length chunking in simple scenarios or resource-constrained situations, to improve processing efficiency
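As a first cut of that hybrid strategy, here is a minimal, hypothetical dispatcher; the metadata key and routing rules are illustrative assumptions, not my final implementation:

```python
from llama_index.core.node_parser import (
    MarkdownNodeParser,
    SemanticSplitterNodeParser,
    SentenceSplitter,
)

def pick_parser(doc, embed_model, resource_constrained=False):
    """Route a document to a chunking strategy based on its characteristics."""
    if resource_constrained:
        # Simple scenarios: fixed-length chunks are cheapest
        return SentenceSplitter(chunk_size=512, chunk_overlap=64)
    # "file_type" is an illustrative metadata key; adapt to your loader
    if "markdown" in doc.metadata.get("file_type", ""):
        # Structured documents: follow the heading hierarchy
        return MarkdownNodeParser()
    # Default: semantic chunking for unstructured plain text
    return SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=95,
        embed_model=embed_model,
    )

# Usage: nodes = pick_parser(doc, embed_model).get_nodes_from_documents([doc])
```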
This hybrid strategy should significantly improve overall system performance while maintaining processing quality; I will cover the detailed implementation in a follow-up post.

Document chunking may look like a minor implementation detail when building a RAG system, but it is in fact a key factor in the system's performance and effectiveness. Choosing an appropriate chunking strategy requires weighing the specific application scenario, the characteristics of the documents, and the available resources.