Recommendation
An in-depth look at efficient large-scale document chunking techniques for improving the processing efficiency and accuracy of an AI knowledge base.
Core content:
1. The technical challenges of feeding large-scale documents into an AI knowledge base
2. The core idea and advantages of document chunking
3. A technical comparison and code implementations of five mainstream chunking methods
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
Hello everyone! Recently, while developing an AI testing assistant based on a large language model, I ran into a tricky problem: how do you effectively feed a large volume of requirement documents into an AI knowledge base?

Why is this a problem? In AI technology communities, developers often ask: "How do I handle hundreds of pages of documents when building an AI knowledge base application?"

Frankly, I struggled with this at first too. Feeding an entire document directly into a large model causes two obvious problems:

1. The model reports that the context window capacity is insufficient (i.e., the input exceeds the model's processing limit).
2. Even with a model that has a very long context window, too much irrelevant information lowers the model's comprehension accuracy and generation quality.

It is as inefficient as reading a colleague an entire reference manual before asking them one specific question.

Document chunking: a key technology for improving AI processing efficiency

What's the solution? Document chunking!

The core idea of this technique is to decompose large documents into semantically coherent small units. This is not a crude split at a fixed number of words; it must account for semantic integrity and ensure that each chunk contains meaningful, complete information.

Next, I will introduce five mainstream chunking methods in detail, from basic applications to advanced techniques, covering the needs of different scenarios. The code implementations are based on the LlamaIndex framework.

Technical comparison of five chunking methods

1️⃣ Fixed-length chunking: a basic and practical solution

Split the text into chunks of a fixed number of tokens or characters, keeping an overlap region to preserve contextual coherence.

✅ Simple to implement, suitable for rapid deployment
✅ Fast processing and low computing resource requirements
❌ May break semantic integrity (e.g., split a complete concept across chunks)
❌ Low accuracy on documents with complex structure

Implementation code example:

```python
from llama_index.core.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(
    chunk_size=512,    # for Chinese documents, around 384 tokens is recommended
    chunk_overlap=64,  # overlap region to preserve contextual coherence
)
nodes = parser.get_nodes_from_documents(documents)
```
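For context, here is a minimal end-to-end sketch of where `documents` comes from and how the overlap shows up in adjacent chunks; the `docs/` directory is illustrative, not part of the original example:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser

# Load raw files into Document objects (directory path is illustrative)
documents = SimpleDirectoryReader("docs/").load_data()

parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)

# Because of chunk_overlap=64, the tail of one chunk reappears
# at the head of the next, preserving context across the boundary
print(nodes[0].text[-200:])
print(nodes[1].text[:200])
```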
2️⃣ Structure-aware chunking: respect the document's original structure

Identify the internal structure of the document (such as Markdown heading levels or HTML tags) and split it along the document's logical organization.

✅ Preserves the integrity of the document's original logical structure
✅ Avoids separating semantically related paragraphs or sections
❌ Only applies to documents with explicit structural markup
❌ Limited effectiveness on unstructured plain text

Implementation code example (Markdown documents):

```python
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
```
HTML document processing:

```python
from llama_index.core.node_parser import HTMLNodeParser

parser = HTMLNodeParser(tags=["p", "h1"])  # specify which tags to extract
nodes = parser.get_nodes_from_documents(html_docs)
```
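Whichever parser produced the `nodes`, the next step toward a queryable knowledge base is the same. A minimal sketch, assuming default OpenAI credentials for the embedding model and LLM; the query string is illustrative:

```python
from llama_index.core import VectorStoreIndex

# Build a vector index over the parsed nodes and query it
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("Which section describes the login requirements?"))
```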
3️⃣ Sliding-window chunking: keeping context coherent

Move a fixed-size window over the text, capturing a group of adjacent sentences each time to form snippets that carry their surrounding context.

✅ Effectively maintains contextual coherence and reduces information gaps
✅ Particularly suitable for processing continuous data streams
❌ Generates a lot of overlapping content, so storage efficiency is low
❌ The window parameters need careful tuning, which affects processing quality

Implementation code example:

```python
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of sentences to include on each side
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
```
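The window metadata only pays off at query time. A sketch of the retrieval side, using LlamaIndex's `MetadataReplacementPostProcessor` to swap each retrieved sentence for the surrounding window configured above (this assumes the `documents` list from earlier):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# Retrieval matches single sentences, but the postprocessor replaces each
# hit with the 3-sentences-per-side window stored under "window"
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
```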
4️⃣ Semantic embedding chunking: segmentation based on semantic similarity

The semantic splitter uses an embedding model to judge the semantic similarity between sentences and adaptively chooses breakpoints (splitting where similarity falls below a set threshold), which ensures that each chunk contains semantically related sentences.

✅ Maintains semantic integrity and avoids fragmenting concepts
✅ Automatically identifies topic transitions, e.g. from "Technical Description" to "Application Case"
❌ Heavy computing resource consumption, with noticeable overhead on large documents
❌ Parameter tuning is complex and requires multiple attempts

Implementation code example:

```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # split where semantic distance exceeds the 95th percentile
    embed_model=embed_model,
)
nodes = splitter.get_nodes_from_documents(documents)
```
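Since the percentile threshold is the main tuning knob, one cheap way to get a feel for it is to sweep a few values and compare chunk counts; a sketch assuming the `documents` list from earlier:

```python
# Lower thresholds create breakpoints more aggressively, producing more,
# smaller chunks; higher thresholds keep longer chunks together
for threshold in (90, 95, 99):
    trial_splitter = SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=threshold,
        embed_model=embed_model,
    )
    trial_nodes = trial_splitter.get_nodes_from_documents(documents)
    print(f"threshold={threshold}: {len(trial_nodes)} chunks")
```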
5️⃣ LLM dynamic segmentation: advanced intelligent chunking based on large models

Directly leverage the deep understanding capabilities of large language models, letting the model itself determine the optimal chunking strategy and boundaries.

✅ Strongest semantic understanding, able to capture complex conceptual associations
✅ Supports multi-dimensional, multi-level chunking strategies
❌ High API call costs, unsuitable for budget-constrained projects
❌ Relatively slow, unsuitable for scenarios with strict real-time requirements

Implementation code example:

```python
import json

from llama_index.llms.openai import OpenAI

def llm_chunking(text):
    llm = OpenAI(model="gpt-4-turbo")
    prompt = f"""Divide the following technical document into logical units, \
each containing one complete technical concept:
{text}
Return JSON: [{{"title": "Unit title", "content": "Text content"}}]"""
    response = llm.complete(prompt)
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        raise ValueError("LLM response format error")
```
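To plug the result back into a LlamaIndex pipeline, the JSON units can be wrapped as nodes. This is illustrative glue code, not part of the original example; `long_text` stands in for your document text:

```python
from llama_index.core.schema import TextNode

# Wrap each LLM-produced unit as a node, keeping the title as metadata
units = llm_chunking(long_text)
nodes = [
    TextNode(text=unit["content"], metadata={"title": unit["title"]})
    for unit in units
]
```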
Comparison of technical indicators across the five methods

| Chunking method | Processing speed | Semantic preservation | Implementation difficulty | Applicable scenarios |
| --- | --- | --- | --- | --- |
| Fixed-length chunking | ⭐⭐⭐⭐ | ⭐ | ⭐ | Rapid prototyping systems |
| Sliding-window chunking | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ | Conversation records, interview transcripts |
| Structure-aware chunking | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | Documents in explicit formats such as Markdown/HTML/JSON |
| Semantic embedding chunking | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Long narrative text |
| LLM dynamic segmentation | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Various complex documents |
Practical case: the chunking strategy in my AI testing assistant

In the AI testing assistant project I developed, I adopted a semantic chunking strategy using the embedding model from the Bailian platform. I chose this approach because test requirement documents usually contain complex technical details and logical relationships, and semantic integrity must be preserved for the tests to be understood accurately. The implementation code is as follows:

```python
from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser

# dashscope_embed_model() is the project's own helper that returns
# the Bailian (DashScope) embedding model
Settings.embed_model = dashscope_embed_model()

# Semantic segmentation configuration
Settings.node_parser = SemanticSplitterNodeParser(
    buffer_size=128,                     # sentences grouped together when comparing similarity
    breakpoint_percentile_threshold=95,  # the 95% threshold finds split points automatically
    embed_model=dashscope_embed_model(),
)
```
Technical reflection: although semantic chunking performs well at preserving conceptual integrity, it consumes substantial computing resources and fails to exploit the structural characteristics of documents. I am therefore designing a hybrid chunking strategy as an optimization (a sketch follows the list):

- Use structure-aware chunking for structured documents such as Markdown/HTML
- Use semantic chunking for unstructured plain text
- Use fixed-length chunking in simple scenarios or resource-constrained situations, to improve processing efficiency
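As a first cut of that hybrid strategy, here is a minimal, hypothetical dispatcher; the metadata key and routing rules are illustrative assumptions, not my final implementation:

```python
from llama_index.core.node_parser import (
    MarkdownNodeParser,
    SemanticSplitterNodeParser,
    SentenceSplitter,
)

def pick_parser(doc, embed_model, resource_constrained=False):
    """Route a document to a chunking strategy based on its characteristics."""
    if resource_constrained:
        # Simple scenarios: fixed-length chunks are cheapest
        return SentenceSplitter(chunk_size=512, chunk_overlap=64)
    # "file_type" is an illustrative metadata key; adapt to your loader
    if "markdown" in doc.metadata.get("file_type", ""):
        # Structured documents: follow the heading hierarchy
        return MarkdownNodeParser()
    # Default: semantic chunking for unstructured plain text
    return SemanticSplitterNodeParser(
        buffer_size=1,
        breakpoint_percentile_threshold=95,
        embed_model=embed_model,
    )

# Usage: nodes = pick_parser(doc, embed_model).get_nodes_from_documents([doc])
```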
This hybrid strategy should significantly improve overall system performance while maintaining processing quality; I will cover the detailed implementation in a follow-up post.

Document chunking may look like a minor implementation detail when building a RAG system, but it is in fact a key factor in the system's performance and effectiveness. Choosing an appropriate chunking strategy requires weighing the specific application scenario, the characteristics of the documents, and the available resources.