A splash of cold water: CherryStudio + a local knowledge base is not as simple as you think

An in-depth look at how the CherryStudio knowledge base works and where people go wrong with it, to help you use AI assistants efficiently.
Core content:
1. Common misconceptions when building a CherryStudio knowledge base
2. The basic principles and workflow of a knowledge base
3. The key steps of raw data processing and user query handling
Knowledge base, not that simple
Recently, many friends have started using CherryStudio, the all-purpose AI assistant, after following my (so-called) tutorial series introducing it. If you haven't read the series yet, check out the earlier posts on this account; I'm sure you'll get something out of them.
I believe many friends use CherryStudio not just for AI conversations, but also to build their own knowledge base, so that the AI can generate more targeted answers based on the material they provide.
This is a natural and appealing idea. However, after trying it, many people found that the results were nothing like what they had imagined.
This is the question this article will explore.
If you are struggling with this problem, or if you are planning to build your own knowledge base, the following content will definitely help you.
Note: although this article uses CherryStudio as the example, the ideas are not limited to CherryStudio and should apply to other similar tools as well.
Correcting a misconception
Many people imagine "AI + knowledge base" like this: throw all the information you have collected into CherryStudio's knowledge base, and when you ask a question, the AI carefully reads through everything, gathers the relevant parts, analyzes them comprehensively, and outputs a perfect answer.
No, that's not how it works at all!
Some people think that if they throw a pile of data tables into the knowledge base, the AI will perform professional statistical analysis, and when they ask about a particular figure, it will answer fluently and accurately.
No, that's not how it works!
Please remember one thing: the AI (DeepSeek or any other model) never has access to all of the raw data you put into the knowledge base!
The AI only sees a very small number of text fragments that may be relevant to your question.
Why is this happening?
Basic principles of knowledge base
To make good use of a knowledge base, you must understand its basic principles and workflow, which means understanding the following diagram.
Although the flowchart looks a bit complicated, the logic is very clear. Below I will try to explain it in a simple, easy-to-understand way.
The diagram is divided by dotted lines into three parts, from top to bottom:
Processing of raw data
In the first row, when a user adds raw material of various kinds to the knowledge base, a program first preprocesses it: it extracts the useful text, strips out the useless noise, and then splits the text into a large number of chunks.
You can compare this to breaking an entire book down into paragraphs (or even individual sentences).
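To make the splitting step concrete, here is a minimal sketch of a fixed-length splitter in Python (the chunk size and overlap are illustrative assumptions; CherryStudio's actual splitter and its defaults may differ):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw text into fixed-length chunks with a small overlap.

    chunk_size and overlap are illustrative guesses, not CherryStudio's defaults.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```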
When these text chunks are added to the vector database, they are vectorized by the embedding model. That is, each original text fragment is converted by an algorithm into a very long sequence of numbers, like this:
[-0.023 0.145 -0.067 0.098 0.032 0.124 -0.012 ...]
If the embedding model is 1024-dimensional, then each segment will be converted into a vector containing 1024 values.
What finally ends up in the vector database is not just countless vectors like these, but also the text chunks each vector corresponds to.
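Here is a minimal sketch of this vectorize-and-store step, assuming the open-source sentence-transformers library and a bge model (CherryStudio does this internally through whichever embedding provider you configure, so treat the details as illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-large-zh-v1.5 produces 1024-dimensional vectors,
# matching the example above.
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

raw_text = open("my_notes.txt", encoding="utf-8").read()  # hypothetical file
chunks = split_into_chunks(raw_text)
vectors = model.encode(chunks, normalize_embeddings=True)  # shape: (num_chunks, 1024)

# A "vector database", at its simplest, stores two things side by side:
# the vectors for matching, and the original chunk texts they came from.
vector_store = {"vectors": np.asarray(vectors), "texts": chunks}
```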
In this step, please think carefully: What kind of data is suitable for splitting? What kind of data is not suitable?
User problem handling process
As the second row of the flowchart shows, the question the user asks does not go directly to the large model. It must first pass through the same embedding step, becoming a vector of 1024 numerical values.
This vector is then taken to the vector database for similarity matching.
Please note: what gets matched here is not text content, but pairs of vectors made entirely of numbers, compared by an algorithm (typically cosine similarity).
Through a large number of fast vector comparisons, a handful (a very small handful) of vectors with relatively high similarity scores are selected from the vector store. The knowledge base then pulls out the original texts corresponding to those vectors: the text fragments most likely to be relevant to the user's question.
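Continuing the sketch above: retrieval amounts to embedding the question with the same model and ranking the stored chunks by similarity (top_k is an illustrative parameter; real systems usually also apply a score threshold):

```python
def retrieve(question: str, store: dict, top_k: int = 5) -> list[tuple[float, str]]:
    """Return the top_k chunks most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    # With normalized vectors, a dot product equals cosine similarity.
    scores = store["vectors"] @ q_vec
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), store["texts"][i]) for i in best]
```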
The process of generating reply content
The third line of the flowchart is where the big model actually begins to answer questions.
The text fragments retrieved from the vector database are merged with the user's original question and submitted together to the large model (e.g. DeepSeek). The model combines this material with the knowledge from its own training to analyze and reason, and finally generates a reply for the user.
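As a sketch of this final step, here is how the retrieved fragments and the question might be merged into one prompt and sent to DeepSeek over its OpenAI-compatible API (the prompt wording and API details are my assumptions, not CherryStudio's actual implementation):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

def answer(question: str, store: dict) -> str:
    fragments = retrieve(question, store)
    context = "\n---\n".join(text for _, text in fragments)
    # Note: the model sees only these few fragments, never the whole knowledge base.
    prompt = (
        "Answer the question using the reference material below.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```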
In this step, pay attention to two questions: in the diagram, how far away is the raw data from the large model? And how much of the knowledge base's data does the large model ultimately get to see?
Mystery solved
I believe that the knowledge base workflow described above is not particularly difficult to understand.
If you understand it, many of your confusions should no longer exist.
So, from now on, don't ask the large model to tell you how many documents about such-and-such are in your knowledge base; it can't see that at all!
And don't cram a pile of data tables into the knowledge base and then ask the large model for the total of some column; it cannot see all of the data!
This is not how big models are used, and this is not how knowledge bases are used. Of course, this does not mean that local knowledge bases are useless. To use them well, certain methods and skills are required.
As for how to use them well, space is limited here, so I will cover that in detail in a later post.
A look at the CherryStudio knowledge base
Now that you have a basic understanding of how a knowledge base works, open CherryStudio's knowledge base and take a look; you will see it with fresh eyes.
Embedding Model
When creating a new knowledge base, the first thing to choose is the embedding model.
Now you should understand that embedding models and large language models do completely different things, so there will be no DeepSeek for you to choose here.
The most useful Chinese embedding models at the moment are the bge series. You can also try running the same content through different embedding models and comparing the results, as sketched below.
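For instance, a rough sketch of such a comparison (the model names are examples from the bge family, and the sentences are made up; swap in whatever you want to test):

```python
from sentence_transformers import SentenceTransformer, util

query = "How do I back up my knowledge base?"
passage = "Knowledge base files can be exported from the settings panel."

# Compare how strongly two different bge models relate the same query/passage pair.
for name in ["BAAI/bge-small-zh-v1.5", "BAAI/bge-large-zh-v1.5"]:
    model = SentenceTransformer(name)
    q_vec, p_vec = model.encode([query, passage], normalize_embeddings=True)
    print(name, float(util.cos_sim(q_vec, p_vec)))
```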
Model Information
At the bottom of the Knowledge Base page, you will also see model information.
The dimension count of the embedding model indicates how many numbers each fragment will be converted into. A vector like that is hard for a human to make sense of at a glance, but it is exactly what a computer needs for fast algorithmic computation.
Search
If you enter a keyword and search within the knowledge base, you can see what it returns: a series of split-up fragments.
You may notice that the segments are all about the same length. Count the characters in each one and you'll find they are essentially identical!
Yes, this is the fixed length each segment was cut to when the original material was split.
Look again and you will easily spot sentences at the beginnings and ends of segments that have been crudely chopped in half.
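You can reproduce this effect with the fixed-length splitter sketched earlier; a cut made purely by character count pays no attention to sentence boundaries (the sample text and the 60-character chunk size are made up for illustration):

```python
text = ("Embedding models turn each text fragment into a vector. "
        "The vector database stores both the vectors and the original fragments. "
        "Retrieval compares the question vector against every stored vector.")

for chunk in split_into_chunks(text, chunk_size=60, overlap=0):
    print(repr(chunk))
# Every chunk except the last is exactly 60 characters long,
# so most of them begin or end in the middle of a sentence.
```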
The percentage in the upper-right corner of each fragment is the similarity score calculated by the algorithm.
Take a closer look yourself: is the content the search returns really related to what you searched for? If not, you now understand why the AI's answers are still poor even after you added a knowledge base: the information it receives is a pile of garbage!