Efficient segmentation of large knowledge base documents: from basics to advanced

Written by
Iris Vance
Updated on: July 8, 2025

An in-depth look at efficient segmentation techniques for large documents, aimed at improving the processing efficiency and accuracy of an AI knowledge base.

Core content:
1. The technical challenges of feeding large documents into an AI knowledge base
2. The core concept and advantages of document chunking
3. A technical comparison and code implementations of five mainstream chunking methods

Hello everyone! Recently, while developing an AI testing assistant based on a large model, I ran into a difficult technical problem: how do you efficiently feed a large volume of requirement documents into the AI knowledge base?

Why is this a technical problem?
In various AI technology communities, developers often ask: "How to deal with hundreds of pages of documents when developing AI knowledge base applications?"
Frankly, I also struggled when I first tried it. Feeding an entire document directly into a large model causes two obvious problems:
1. The model reports that the context window is exceeded (the input is larger than the model's processing limit)
2. Even with a model that has a very long context window, too much irrelevant information lowers comprehension accuracy and generation quality
It is as inefficient as reading a colleague an entire reference manual before asking them one specific question.

Document segmentation: a key technology to improve AI processing efficiency
What's the solution? Document chunking!
The core idea of this technique is to decompose a large document into semantically coherent small units. This is not crude division by a fixed word count: each chunk should preserve semantic integrity and contain complete, meaningful information.
Next, I will introduce five mainstream chunking methods in detail, from basic applications to advanced techniques, covering the needs of different scenarios. The code implementations are based on the LlamaIndex framework.
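All of the snippets below assume a `documents` list has already been loaded. A minimal loading sketch (the `./docs` path is a placeholder for your own files):

from llama_index.core import SimpleDirectoryReader

# Load every file in the directory into LlamaIndex Document objects;
# "./docs" is a placeholder path for your own requirement documents
documents = SimpleDirectoryReader("./docs").load_data()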

Technical comparison of the five chunking methods

1️⃣ Fixed-length chunking: a basic and practical solution
Technical principle  
Split the text into chunks of a fixed number of tokens or characters, keeping some overlap between adjacent chunks to maintain contextual coherence.
Technology Assessment  
✅ Simple to implement, suitable for rapid deployment  
✅ Fast processing speed and low computing resource requirements
❌ May break semantic integrity (e.g. splitting a complete concept across chunks)
❌ Lower accuracy on complex, structured documents
Implementation code example
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    chunk_size=512,    # for Chinese documents, around 384 tokens is recommended
    chunk_overlap=64,  # overlap between chunks to preserve contextual coherence
)
nodes = parser.get_nodes_from_documents(documents)
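To sanity-check where the boundaries fall, it helps to print a few chunks before indexing them. A quick inspection sketch (not from the original project):

# Print the first few chunks to verify that size and overlap look reasonable
for i, node in enumerate(nodes[:3]):
    print(f"chunk {i}: {len(node.text)} chars")
    print(node.text[:100])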

2️⃣ Structure-aware chunking: respecting the document's original structure
Technical principle  
Identify the document's internal structure (such as Markdown heading levels or HTML tags) and split it into chunks along the document's logical organization.
Technology Assessment  
✅ Maintain the integrity of the original logical structure of the document  
✅ Avoid separating semantically related paragraphs or sections
❌ Only applies to documents with explicit structural markup  
❌ Limited effectiveness on unstructured plain text
Implementation code example

Markdown document processing
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
HTML document processing
from llama_index.core.node_parser import HTMLNodeParser

parser = HTMLNodeParser(tags=["p", "h1"])  # specify the tags to extract
nodes = parser.get_nodes_from_documents(html_docs)
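Structure-aware parsing can still produce oversized sections when a single heading covers many pages. One common pattern, sketched here as an assumption rather than part of the original article, is to chain a size-based splitter behind the structural parser:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

# First split along Markdown headings, then cap any oversized section
# at 512 tokens while keeping a small overlap between pieces
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
    ]
)
nodes = pipeline.run(documents=markdown_docs)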

3️⃣ Sliding window chunking: keeping context coherent
Technical principle  
A fixed-size window slides over the text, capturing a group of adjacent sentences at each step to form snippets that retain contextual associations.
Technology Assessment  
✅ Effectively maintain contextual coherence and reduce information gaps  
✅ Particularly suitable for processing continuous data streams
❌ Generates a lot of overlapping content, lowering storage efficiency
❌ Window parameters need careful tuning, which affects processing quality
Implementation code example
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of sentences captured on each side
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
nodes = node_parser.get_nodes_from_documents(documents)
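The window stored in each node's metadata is meant to be swapped back in at query time. A sketch of the usual pairing with LlamaIndex's metadata replacement post-processor (here `index` is assumed to be a vector index built from the window nodes):

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# At query time, replace each retrieved sentence with its surrounding
# window so the LLM sees the full context rather than a single sentence
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)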

4️⃣ Semantic embedding chunking: splitting by semantic similarity
Technical principle  
A semantic splitter uses an embedding model to judge the similarity between adjacent sentences and adaptively chooses breakpoints (splitting where similarity falls below a set threshold), so that each chunk contains semantically related sentences.
Technology Assessment  
✅ Able to maintain semantic integrity and avoid conceptual fragmentation  
✅ Automatically identify topic transition points, such as from "Technical Description" to "Application Case"
❌ High computational cost; the performance overhead is noticeable on large documents
❌ Parameter tuning is complex and requires repeated experimentation
Implementation code example
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # number of sentences grouped when comparing similarity
    breakpoint_percentile_threshold=95,  # split at the top 5% of dissimilarity
    embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)

5️⃣ LLM dynamic chunking: advanced intelligent segmentation based on large models
Technical principle  
Directly leverage the deep understanding capabilities of large language models to allow the models to autonomously determine the optimal chunking strategy and boundaries.
Technology Assessment  
✅ Strongest semantic understanding ability, able to capture complex conceptual associations  
✅ Support multi-dimensional and multi-level block strategies
❌ High API call costs, unsuitable for budget-constrained projects
❌ Relatively slow processing, unsuitable for scenarios with strict real-time requirements
Implementation code example
from llama_index.llms.openai import OpenAI
import json

def llm_chunking(text):
    llm = OpenAI(model="gpt-4-turbo")
    prompt = f"""Divide the following technical document into logical units, each containing one complete technical concept:
{text}
Return JSON in this format: [{{"title": "Unit title", "content": "Text content"}}]"""
    response = llm.complete(prompt)
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        raise ValueError("LLM response is not valid JSON")
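The returned JSON units can then be wrapped into LlamaIndex nodes for downstream indexing. A minimal sketch; keeping the unit title as metadata is my own assumption, and `document_text` is a placeholder:

from llama_index.core.schema import TextNode

# Wrap each LLM-produced unit in a TextNode, keeping the unit title
# as metadata (field names follow the prompt format above)
chunks = llm_chunking(document_text)
nodes = [
    TextNode(text=chunk["content"], metadata={"title": chunk["title"]})
    for chunk in chunks
]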

Comparison of technical indicators across the five methods

| Chunking method | Processing speed | Semantic preservation | Difficulty | Applicable scenarios |
| --- | --- | --- | --- | --- |
| Fixed-length | ⭐⭐⭐⭐ | - | - | Rapid prototyping systems |
| Sliding window | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ | Conversation records, interview transcripts |
| Structure-aware | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Documents in specific formats such as Markdown/HTML/JSON |
| Semantic embedding | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Long narrative text |
| LLM dynamic | - | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Various complex documents |



Practical case: the chunking strategy of my AI test assistant
In the AI test assistant project I developed, I adopted a semantic chunking strategy using the embedding model from the Bailian platform. I chose this approach because test requirement documents usually contain complex technical details and logical relationships, and semantic integrity must be preserved to keep test understanding accurate. The implementation is as follows:
from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser

Settings.embed_model = dashscope_embed_model()

# Semantic segmentation configuration
Settings.node_parser = SemanticSplitterNodeParser(
    buffer_size=128,  # number of sentences grouped when evaluating similarity
    breakpoint_percentile_threshold=95,  # the 95th-percentile threshold finds split points automatically
    embed_model=dashscope_embed_model(),
)
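With Settings configured this way, index construction picks up the semantic parser automatically. A sketch of the downstream wiring, assuming `documents` has already been loaded:

from llama_index.core import VectorStoreIndex

# from_documents applies Settings.node_parser when chunking the documents
# and Settings.embed_model when embedding the resulting nodes
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()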
Technical reflection: although semantic chunking performs well at preserving conceptual integrity, it consumes significant computing resources and does not exploit the structural characteristics of documents. I am therefore designing a hybrid chunking strategy as an optimization (a sketch of the dispatch logic follows the list):
  • Use structure-aware chunking for structured documents such as Markdown/HTML
  • Use semantic chunking for unstructured plain text
  • Use fixed-length chunking for simple scenarios or resource-constrained situations, to maximize processing efficiency
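A minimal sketch of such a dispatcher, routing each file to a parser by its extension (the extension-based routing is my own illustrative assumption, not the final design):

from llama_index.core.node_parser import (
    MarkdownNodeParser,
    SemanticSplitterNodeParser,
    SentenceSplitter,
)

def pick_parser(file_path: str, embed_model):
    """Choose a chunking strategy based on the document type."""
    if file_path.endswith((".md", ".markdown")):
        # Structured documents: split along the heading hierarchy
        return MarkdownNodeParser()
    if file_path.endswith(".txt"):
        # Unstructured prose: split at semantic breakpoints
        return SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        )
    # Fallback: fast fixed-size chunking with overlap
    return SentenceSplitter(chunk_size=512, chunk_overlap=64)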

This hybrid strategy should significantly improve overall system performance while maintaining processing quality. The detailed implementation will be covered in a follow-up post.

Conclusion
Although document segmentation may look like a minor detail when building a RAG system, it is in fact a key factor in determining system performance and effectiveness. Choosing an appropriate chunking strategy requires weighing the specific application scenario, the characteristics of the documents, and resource constraints.