How to improve the accuracy of a large-model private knowledge base? Slicing is the key

Written by
Audrey Miles
Updated on: July 12, 2025
Recommendation

Use large models to build private knowledge bases and improve information retrieval accuracy.
Core content:
1. How RAG technology enhances answering capabilities through retrieval
2. Key steps for document segmentation and vectorized storage
3. Challenges and solutions in long text processing


Preface

RAG is like an open-book exam: before answering a question, we first look through the materials we have prepared (augmenting our ability to answer by retrieving relevant documents). The answers are then not only more accurate but also carry more contextual information, making them more in-depth and targeted.

In RAG, we prepare the knowledge material in advance, vectorize it, and store it in a vector database. When a question is asked, we vectorize the question and search the database for similar content. If there is a match, we recall that content and pass it to a large model (such as DeepSeek), which then reasons over and summarizes it to produce the answer.

The workflow looks roughly like this:

Step 1: Prepare data   First, we "cut" the prepared knowledge documents into small chunks and store them in a database. If each chunk also gets a corresponding "embedding" (which can be understood as a mathematical representation of its meaning), the database can be upgraded to a vector store, which makes subsequent searches fast.

Step 2: Retrieval   When a question is asked, the system first vectorizes it and searches the database for the most relevant chunks using vector search, full-text search, or multi-modal methods. Once the fragments that best match the user's question are located, they are fed to the large model as context. This way, the large model can not only find the answer faster, but also give answers that are more accurate and targeted, while reducing hallucinations.
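To make the two steps concrete, here is a minimal sketch in Python. The embed() function is a toy stand-in (a hashed bag-of-words), not a real embedding model, and the in-memory list stands in for a proper vector database:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real sentence-embedding model: hash words into a
    fixed-size bag-of-words vector. In practice this would call an
    embedding model or an embedding API."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Step 1: prepare data -- cut documents into chunks and store (text, vector) pairs.
documents = ["...a long policy document...", "...another document..."]
chunks = [c for doc in documents for c in doc.split("\n\n")]  # naive paragraph split
vector_store = [(chunk, embed(chunk)) for chunk in chunks]

# Step 2: retrieval -- vectorize the question, recall the most similar chunks,
# and feed them to the large model as context.
def retrieve(question: str, top_k: int = 3) -> list[str]:
    q = embed(question)
    scored = sorted(vector_store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def build_prompt(question: str) -> str:
    context = "\n---\n".join(retrieve(question))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```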

Challenges 

Enterprises and individuals often have many policy articles or other documents, and these are usually very long. Existing large models cannot process such long articles in one pass, so the text has to be cut into pieces; however, bigger pieces are not necessarily better. When vectorizing, long texts face the following core challenges:

1. Semantic Information Dilution

  • Multi-topic coverage: long texts often cover multiple topics or viewpoints, and a single overall vector struggles to capture the semantic details, so the core content gets diluted.
  • Loss of contextual continuity: simply averaging word vectors or using the [CLS] token cannot capture the coherent semantics of a long text and may break its contextual logic.

2. Computational Complexity and Resource Consumption

  • Attention mechanism bottleneck: the self-attention layers of the Transformer architecture have O(n²) complexity, so for texts of more than a thousand tokens the computation grows quadratically.
  • Memory pressure: the resulting high-dimensional vectors require a lot of storage space, which can reduce matching efficiency during retrieval.

3. Limitations of Text Chunking Techniques

  • Optimal chunk size selection: chunks that are too large blur the semantics, while chunks that are too small break context coherence; 50% overlapping chunks can alleviate this but increase storage redundancy.
  • Special content handling: splitting non-continuous text such as tables and code snippets can destroy their structure and hurt downstream QA quality.

4. Model Architecture Constraints

  • Input length limit: most pre-trained models accept at most 512-2048 tokens, so anything longer must be truncated or segmented.
  • Position encoding distortion: chunking resets the position encoding, which weakens the model's ability to capture long-range dependencies.

5. Balance Between Quality and Efficiency

  • Dimension trade-off: high-dimensional vectors improve accuracy but raise computational cost, while low-dimensional vectors simplify computation but lose semantic information.
  • Noise sensitivity: in long texts, the impact of noise such as spelling errors and colloquial expressions on vector quality is amplified.

Slicing Methodology 

As we saw above, chunking plays a key role in RAG and comes with its own challenges: it directly determines the accuracy of the knowledge we recall, so choosing an appropriate chunking method is particularly important. The key to splitting files effectively is to balance the integrity of the information against the ease of managing it. Several strategies can be adopted: fixed-size chunking, semantic chunking, recursive chunking, document-structure-based chunking, and LLM-based chunking.

  • 1. Fixed-size chunking: split the text into chunks of a predefined number of characters, words, or tokens, keeping some overlap between adjacent chunks. This method is simple to implement, but it may cut through sentences and scatter information across chunks (see the first sketch after this list).
  • 2. Semantic chunking: segment the document into meaningful units and keep adding units to the current chunk until the cosine similarity drops significantly. This approach preserves the natural flow of language (see the second sketch after this list).
  • 3. Recursive chunking: split on intrinsic delimiters such as paragraphs or sections; if a chunk still exceeds the size limit, split it further into smaller chunks. This approach also preserves the natural flow of language.
  • 4. Document-structure-based chunking: split along the document's intrinsic structure (such as titles, chapters, or paragraphs). This keeps the document's natural structure, but it assumes the document has a clear structure in the first place.
  • 5. LLM-based chunking: use prompt engineering to guide an LLM to produce meaningful chunks. This approach draws on the intelligence of large models but may require more computational resources and time.
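To make the first strategy concrete, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words as "tokens" purely for illustration; a real system would use the embedding model's own tokenizer:

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of roughly `chunk_size` tokens, repeating `overlap`
    tokens between adjacent chunks to reduce information loss at the boundaries.
    Whitespace splitting stands in for a real tokenizer."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```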
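And a rough sketch of semantic chunking, reusing the toy embed() and cosine() helpers from the earlier sketch: sentences keep being appended to the current chunk until the similarity to the next sentence drops below a threshold. The 0.6 threshold is an arbitrary illustrative value, not a recommendation:

```python
def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    """Greedy semantic chunking: start a new chunk whenever the next sentence's
    embedding is no longer similar enough to the current chunk's embedding.
    `embed` and `cosine` are the toy helpers defined in the earlier sketch."""
    chunks, current = [], [sentences[0]]
    for sentence in sentences[1:]:
        if cosine(embed(" ".join(current)), embed(sentence)) >= threshold:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks
```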

The summary table is as follows:

| Chunking strategy | Advantages | Disadvantages | Typical application scenarios |
|---|---|---|---|
| 1. Fixed-size chunking | ✅ Simple and fast to implement; ✅ very low computational cost; ✅ stable memory usage | ❌ May sever semantic associations; ❌ chunk size needs repeated tuning; ❌ unfriendly to long texts | Basic question-answering systems; uniformly formatted documents |
| 2. Semantic chunking | ✅ Preserves complete semantic units; ✅ improves contextual relevance; ✅ adapts dynamically to the content | ❌ Depends on NLP model quality; ❌ high compute consumption; ❌ slow processing | Professional-domain analysis; logical reasoning tasks |
| 3. Recursive chunking | ✅ Multi-granularity coverage of content; ✅ strong ability to capture redundant information; ✅ flexible hierarchy depth | ❌ High implementation complexity; ❌ information may be duplicated; ❌ requires multi-level index management | Academic literature processing; legal contract analysis |
| 4. Document-structure-based chunking | ✅ Accurately locates section information; ✅ preserves the original logical structure; ✅ supports cross-chunk references | ❌ Depends on well-formed document formats; ❌ struggles with unstructured data; ❌ requires parsing rules up front | Technical manual processing; paper analysis systems |
| 5. LLM-based chunking | ✅ Understands deep semantic intent; ✅ dynamically generates the optimal chunk structure; ✅ suited to complex tasks | ❌ Significant inference latency; ❌ high API call cost; ❌ risk of model hallucination | High-precision question-answering systems; cross-text tasks |

Slicing methods supported by RAGFlow 

A good approach is to choose a suitable chunking strategy based on the business scenario and the characteristics of the text. RAGFlow supports multiple chunking methods; the following table describes each method and the document formats it supports.

Chunking method:

| Template | Description | File format |
|---|---|---|
| General | Supports many file formats; you set the chunking parameters yourself. It is harder to control and needs to be combined with a natural language processing model to achieve good results. | DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML, JPEG, JPG, PNG, TIF, GIF |
| Resume | | DOCX, PDF, TXT |
| Q&A | Question-and-answer pairs; well suited to customer-service Q&A. | EXCEL, CSV/TXT |
| Manual | Uses OCR to segment documents. | PDF |
| Table | Table-format files. | EXCEL, CSV/TXT |
| Paper | | PDF |
| Book | | DOCX, PDF, TXT |
| Laws | | DOCX, PDF, TXT |
| Presentation | | PDF, PPTX |
| Picture | | JPEG, JPG, PNG, TIF, GIF |
| One (full file) | The file is not cut; the whole file is given to the large model directly, which preserves context well, but the file length is limited by the context length of the configured large model. | DOCX, EXCEL, PDF, TXT |
| Tag | You set tag descriptions and tags in advance, similar to Q&A. Once the tag library is configured, matching chunks are automatically tagged during parsing. | EXCEL, CSV/TXT |

The recall method is also critical

Even once we know how to segment documents, recalling the segmented data is itself a critical step, and improving recall accuracy is a problem that urgently needs solving.

Many factors affect recall accuracy, and it is difficult or even impossible to solve the problem from a single angle; improving recall accuracy therefore requires working on several fronts at once.

The following are some of the current mainstream methods:

1. Hybrid retrieval: perform vector retrieval (semantic matching) and full-text retrieval (keyword matching) simultaneously, and merge the results through linear weighting or reciprocal rank fusion (RRF). Then introduce a re-ranking model (such as BGE-Reranker) to re-rank the multi-way recall results and keep the most relevant fragments first.
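As an illustration of the fusion step, here is a minimal reciprocal rank fusion over two ranked lists of chunk IDs. The lists are placeholders rather than output from any particular retriever, and k=60 is the commonly used default constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs: each chunk scores
    sum(1 / (k + rank)) over the lists it appears in, then chunks are
    sorted by the fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder results from a vector search and a full-text (keyword) search.
vector_hits = ["chunk_12", "chunk_03", "chunk_44"]
fulltext_hits = ["chunk_03", "chunk_44", "chunk_07"]
print(reciprocal_rank_fusion([vector_hits, fulltext_hits]))
```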

2. Multi-way recall strategy: use multiple retrievers in parallel (such as BM25, DPR, and embedding models) to cover matching needs at different granularities. For complex queries, break the question into several sub-queries, retrieve each separately, and merge the results.

3. Dynamic parameter adjustment: set a similarity threshold (e.g., 0.5-0.7) to filter out low-relevance fragments, and dynamically adjust the chunk size and the number of recalled chunks based on business feedback.
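A tiny sketch of the thresholding idea, assuming the retriever returns (chunk, similarity) pairs; the 0.6 cutoff and the top_k cap are illustrative values, not tuned recommendations:

```python
def filter_hits(hits: list[tuple[str, float]], threshold: float = 0.6, top_k: int = 5) -> list[str]:
    """Keep only chunks whose similarity score clears the threshold,
    then cap the number of recalled chunks at top_k."""
    kept = [(chunk, score) for chunk, score in hits if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in kept[:top_k]]
```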

The methods above are already used in RAGFlow to enhance recall; when using it, we can tune and validate the relevant parameters to achieve better results.