Knowledge base too messy to find anything? 5 tools to improve your RAG search

Still struggling with a messy knowledge base? Five practical tools can help you manage a RAG retrieval system efficiently and help the large model understand you better.
Core content:
1. A practical look at three major tools: metadata, tags, and the knowledge catalog
2. How the design of a file directory differs from that of a knowledge catalog
3. How knowledge maps go beyond the traditional tree structure to enable smarter retrieval
This article first briefly reviews three tools: metadata, tags, and knowledge catalogs. Then, in response to readers' questions, it focuses on the relationship between knowledge catalogs, file directories, and knowledge maps, on when and how to build them, and on how they improve the effectiveness of RAG.
Dense, but very useful.
Tool 1: Metadata (system-wide unified annotation standards)
Metadata is data about data. It is mainly used by administrators to describe the objective attributes of files or to set access permissions.
Some metadata types, such as file name or intended audience, can also be exposed to users so that they can precisely specify the scope of a Q&A session.
Metadata needs unified standards. For example, the type for a file's name is always called "file name", the type for its author is always called "author", the type for its publication time is always called "publication time", and so on.
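One way to enforce such a standard is to map every variant spelling to a single canonical type name at ingestion time. The sketch below is only an illustration: the alias table, the canonical names, and the function `normalize_metadata` are all assumptions made for this example, not part of any particular product.

```python
# Hypothetical alias table: each variant spelling maps to one canonical type.
CANONICAL_KEYS = {
    "file name": "file_name",
    "filename": "file_name",
    "author": "author",
    "written by": "author",
    "publication time": "publication_time",
    "publish date": "publication_time",
}


def normalize_metadata(raw: dict[str, str]) -> dict[str, str]:
    """Rename raw metadata keys to their canonical type names.

    Unknown keys raise an error, so non-standard types are noticed
    instead of silently accumulating.
    """
    normalized = {}
    for key, value in raw.items():
        canonical = CANONICAL_KEYS.get(key.strip().lower())
        if canonical is None:
            raise KeyError(f"unknown metadata type: {key!r}")
        normalized[canonical] = value
    return normalized


doc_a = normalize_metadata({"Filename": "annual sales report.docx", "Written by": "Li"})
doc_b = normalize_metadata({"file name": "handbook.pdf", "Author": "Wang"})
print(sorted(doc_a) == sorted(doc_b))  # both documents now share the same keys
```

Rejecting unknown keys is a deliberate choice here: it keeps metadata strict, while free-form annotation is left to tags.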
Tool 2: Tags (users can tag freely)
Tags are a special kind of metadata: they can be understood as metadata whose type is "tag" and whose value can hold multiple entries.
Both administrators and users can apply tags, and there is no restriction on how they are applied. As a result, tags tend to be more numerous and messier than other metadata.
Tool 3: Knowledge Catalog (organizes where knowledge belongs)
A knowledge catalog is a folder structure for knowledge.
For how to create a knowledge catalog, see the Tencent ima knowledge base workbench.
When uploading documents, users can upload them directly to the root directory of their personal knowledge base, or create a folder in the root directory and upload them there.
Over time, the structure of the knowledge catalog will emerge.
Tool 4: File Directory (organizes where files belong)
Sometimes the people who collect documents and the people who organize knowledge are not the same group.
In that case, a separate file library can be designed that supports its own file directories (distinct from knowledge catalogs).
Documents in the knowledge base are then selectively added from the file library.
Why do it this way?
File directory: usually organized into a hierarchy by "management needs" such as department, time, author, and document type.
Knowledge catalog: usually organized into a hierarchy by "cognitive needs" such as concepts, topics, business areas, and knowledge domains.
For example, there is an "Annual Sales Report" document:
- The file directory tree may be stored as:
/department/sales department/2024/annual sales report.docx
- Under the knowledge directory tree, it may be stored as:
xx Company Knowledge Base/Business Management/Sales/Annual Report
Although both are trees, they are organized from different starting points and carry different meanings.
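The split between the two trees can be sketched as two records that share one file identifier. This is a minimal illustration under my own assumptions: the class names, the `file_id` field, and its value are hypothetical, and only the two paths come from the example above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class LibraryFile:
    """A file in the file library, placed according to management needs."""
    file_id: str
    file_path: PurePosixPath


@dataclass(frozen=True)
class KnowledgeEntry:
    """A reference from the knowledge base to a library file, placed by topic."""
    file_id: str
    knowledge_path: PurePosixPath


report = LibraryFile(
    file_id="f-001",  # hypothetical identifier
    file_path=PurePosixPath("/department/sales department/2024/annual sales report.docx"),
)

# The knowledge base selectively adds the file and files it under a topic.
entry = KnowledgeEntry(
    file_id=report.file_id,
    knowledge_path=PurePosixPath(
        "xx Company Knowledge Base/Business Management/Sales/Annual Report"
    ),
)

# Same file, two hierarchies: one by department and year, one by subject.
print(report.file_path.parts[1:-1])
print(entry.knowledge_path.parts[1:])
```

Because the knowledge entry only references the file, reorganizing the knowledge catalog never moves anything in the file library, and vice versa.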
For now, the file directory is used only to manage files and plays no part in RAG. If I come up with other uses for it later, I will write about them.
Tool 5: Knowledge Map (upgrading from a tree to a network)
In addition to the tree-shaped "knowledge catalog", there is also the graph-shaped "knowledge map", also called a "tag system".
A tag system is a network formed by establishing connections between tags.
So what is the difference between a knowledge map and a knowledge catalog?
Knowledge Catalog (tree relationship)
It organizes the "belongs-to" relationships between pieces of knowledge and emphasizes parent and child levels.
For example:
Primary School Mathematics
├─ Numbers and Algebra
│ ├─ Natural Numbers
│ ├─ Fractions
│ └─ Decimals
└─ Shapes and Geometry
  ├─ Lines and Angles
  └─ Area and Perimeter
"Natural Numbers" belongs to "Numbers and Algebra" and "Area and Perimeter" belongs to "Shapes and Geometry".
Each knowledge point has a unique "parent topic", like a branch of a tree.
Knowledge map (network relationship)
It organizes the "association" relationships between pieces of knowledge. There is no primary or secondary side, and not necessarily any parent or child.
"Fractions" and "decimals" can be converted into each other.
"Area" and "perimeter" can be expressed as decimals or fractions.
[Fractions] ---- [Decimals]
       \          /
   [Area and Perimeter]
In summary:
- Knowledge catalog (tree): who is the parent, who is the child, who belongs to whom; the hierarchy is clear and forms a tree
- Knowledge map (network): who is related to whom, who works with whom, who interacts with whom; the connections form a network
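In data-structure terms, the difference can be sketched as a parent map versus an adjacency set. The node names follow the mathematics example; the representation itself is just one possible choice, assumed here for illustration.

```python
from collections import defaultdict

# Knowledge catalog: every node stores its single parent (None for the root).
parent = {
    "Primary School Mathematics": None,
    "Numbers and Algebra": "Primary School Mathematics",
    "Shapes and Geometry": "Primary School Mathematics",
    "Fractions": "Numbers and Algebra",
    "Decimals": "Numbers and Algebra",
    "Area and Perimeter": "Shapes and Geometry",
}


def path_to_root(node: str) -> list[str]:
    """Follow the unique parent links upward to recover the catalog path."""
    path = []
    current = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return list(reversed(path))


# Knowledge map: undirected edges, no hierarchy, any number of neighbors.
edges = [
    ("Fractions", "Decimals"),
    ("Area and Perimeter", "Fractions"),
    ("Area and Perimeter", "Decimals"),
]
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

print(path_to_root("Fractions"))     # exactly one path in the tree
print(sorted(neighbors["Fractions"]))  # several peers in the network
```

Note that "Fractions" and "Area and Perimeter" sit in different branches of the tree, yet they are direct neighbors in the network.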
Those are the concepts of metadata, tags, file directories, knowledge catalogs, and knowledge maps.
Next, let's look at how knowledge catalogs and knowledge maps can improve RAG results.
Knowledge catalog: feed the whole thing to the large model
Because a catalog is a summarizing structure, it is not very large. The entire catalog structure can be given to the large model, which picks the most suitable nodes; RAG then searches only the files mounted under those node paths.
Knowledge catalog folders are created manually by users and can be created at any time. Users simply move files into them.
Users can also select a folder in the knowledge catalog so that Q&A is limited to the documents under that folder.
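The catalog-scoped flow can be sketched roughly as follows. Everything here is an assumption for illustration: the nested-dict catalog, the sample file names other than the annual sales report, the prompt wording, and the helper names. The model call itself is left out, and `selected_paths` stands in for whatever the model or the user chose.

```python
# Nested dicts stand in for the catalog; lists at the leaves hold document names.
catalog = {
    "Business Management": {
        "Sales": {"Annual Report": ["annual sales report.docx"]},
        "Finance": {"Budget": ["2024 budget.xlsx"]},
    },
    "Product": {"Manuals": ["user manual.pdf"]},
}


def render_catalog(tree: dict, depth: int = 0) -> list[str]:
    """Render folder names only (no documents) as indented lines for a prompt."""
    lines = []
    for name, child in tree.items():
        lines.append("  " * depth + name)
        if isinstance(child, dict):
            lines.extend(render_catalog(child, depth + 1))
    return lines


def documents_under(tree: dict, path: str) -> list[str]:
    """Collect every document mounted on a node path or anywhere below it."""
    node = tree
    for part in path.split("/"):
        node = node[part]
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            stack.extend(current.values())
        else:
            found.extend(current)
    return sorted(found)


def build_prompt(question: str) -> str:
    """Put the whole catalog outline and the question into one prompt."""
    outline = "\n".join(render_catalog(catalog))
    return (
        "Knowledge catalog:\n" + outline
        + "\n\nQuestion: " + question
        + "\nReturn the catalog paths most likely to contain the answer."
    )


prompt = build_prompt("How did sales perform last year?")
# Assumed result of the model's choice (or a manual selection by the user).
selected_paths = ["Business Management/Sales"]
scope = [doc for p in selected_paths for doc in documents_under(catalog, p)]
print(scope)
```

Only folder names go into the prompt, which is what keeps it small; the documents themselves are touched only after the scope has been narrowed.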
Knowledge map: suitable for graph retrieval
The knowledge map is a tag system for knowledge.
Tags have no standard constraints and can be applied freely, so there may be a great many of them. The relationships between tags are organized in a graph data structure.
As a result, the whole tag network may be very large, and unlike a knowledge catalog it is not suitable for feeding to a large model in full. It is better suited to knowledge-graph-style retrieval.
Specifically, there are two ways to form a tag system:
A general approach to retrieval based on a graph tag system (with the retrieval logic of the other stages simplified) looks like this, for example:
1. Identify one or more tag words in the user's question.
2. Use semantic matching to find the best-matching tags that already exist in the system.
3. Retrieve the n-th-order neighbors of those system tags through the graph (n is configurable; the larger n is, the slower the query and the more noise it brings in).
4. Use the matched system tags and their neighbors up to order n as the candidate tag set.
5. Retrieve the knowledge content most relevant to the user's question from the documents annotated with the candidate tags.
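The graph part of these steps (expanding the matched tags to their neighbors and collecting candidate documents) can be sketched as a breadth-first search. The tag graph, the document index, and the function names below are hypothetical; tag recognition, semantic matching, and the final relevance ranking are out of scope for this sketch.

```python
from collections import deque

# Hypothetical tag graph (undirected adjacency) and tag-to-document index.
tag_graph = {
    "fractions": {"decimals", "area and perimeter"},
    "decimals": {"fractions", "area and perimeter"},
    "area and perimeter": {"fractions", "decimals", "units of measurement"},
    "units of measurement": {"area and perimeter"},
}
docs_by_tag = {
    "fractions": {"lesson-fractions"},
    "decimals": {"lesson-decimals"},
    "area and perimeter": {"lesson-area"},
    "units of measurement": {"lesson-units"},
}


def expand_tags(seeds: set[str], n: int) -> set[str]:
    """Breadth-first expansion: the seed tags plus every tag within n hops."""
    seen = set(seeds)
    queue = deque((tag, 0) for tag in seeds)
    while queue:
        tag, dist = queue.popleft()
        if dist == n:
            continue  # do not expand beyond the configured order
        for nxt in tag_graph.get(tag, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


def candidate_documents(seeds: set[str], n: int) -> set[str]:
    """Documents annotated with any candidate tag; ranking happens afterwards."""
    docs = set()
    for tag in expand_tags(seeds, n):
        docs |= docs_by_tag.get(tag, set())
    return docs


print(sorted(expand_tags({"fractions"}, 1)))
print(sorted(candidate_documents({"fractions"}, 2)))
```

Raising n from 1 to 2 pulls in "units of measurement", which shows how each extra order widens the scope and can add noise.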
Compared with using a knowledge catalog to define the search scope, using a tag system takes the connections between pieces of knowledge into account and may surface unexpected but reasonable results.
For example:
In a medical institution's knowledge management system, the volume of medical literature and medical records is so large that it is hard to manually and accurately assign every appropriate tag to each document or knowledge fragment.
The system therefore first built a schema layer for the medical knowledge graph. This layer explicitly defines core entity types such as "disease", "symptom", "drug", and "examination item", along with the typical relationships between them (such as "treatment", "complications", and "may manifest as").
Under this schema constraint, entity extraction is used to automatically identify entities such as "hypertension", "headache", "aspirin", and "blood routine test" in the text. These entities are then connected according to the relationships defined in the schema layer, forming a tag network that matches the business.
For example, a case description is automatically tagged with "hypertension", "heart disease", "aspirin (drug)", "dizziness (symptom)", and so on, and a "may manifest as" relationship may be established between "hypertension" and "dizziness". This not only makes search and association easier, but also makes the tag system fit actual medical practice.
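The idea of a schema layer constraining the tag network can be sketched as a whitelist of allowed (head type, relation, tail type) triples. This is a toy illustration with several assumptions: the direction of each relation, the dictionary lookup that stands in for a real entity extraction model, the sample sentence, and all function names. How candidate relations are proposed is not covered; the sketch only validates them.

```python
# Schema layer: which relation types are allowed between which entity types.
# The direction chosen for each relation is an assumption for this sketch.
ALLOWED_RELATIONS = {
    ("drug", "treatment", "disease"),
    ("disease", "complications", "disease"),
    ("disease", "may manifest as", "symptom"),
}

# Toy lexicon standing in for an entity extraction model.
ENTITY_LEXICON = {
    "hypertension": "disease",
    "heart disease": "disease",
    "aspirin": "drug",
    "dizziness": "symptom",
    "headache": "symptom",
    "blood routine test": "examination item",
}


def extract_entities(text: str) -> dict[str, str]:
    """Return every lexicon entity mentioned in the text with its type."""
    lowered = text.lower()
    return {name: etype for name, etype in ENTITY_LEXICON.items() if name in lowered}


def add_relation(graph: set, head: str, relation: str, tail: str) -> bool:
    """Add an edge only if the schema allows it between these entity types."""
    key = (ENTITY_LEXICON[head], relation, ENTITY_LEXICON[tail])
    if key not in ALLOWED_RELATIONS:
        return False
    graph.add((head, relation, tail))
    return True


case_text = "Patient with hypertension reports dizziness; aspirin prescribed."
tags = extract_entities(case_text)

graph = set()
accepted = add_relation(graph, "hypertension", "may manifest as", "dizziness")
rejected = add_relation(graph, "dizziness", "treatment", "aspirin")
print(tags)
print(accepted, rejected)
```

The second call is rejected because the schema has no "treatment" relation from a symptom to a drug, which is how the schema layer keeps automatically built tag networks from drifting away from the business model.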
At this point you may wonder: since the tag system can establish relationships between tags, what is the catalog for?
There are a few considerations:
To summarize:
- Metadata gives files attributes
- Tags give files their characteristics
- The file directory organizes the management level of files
- The knowledge catalog organizes the subject hierarchy of files
- The knowledge map captures the associations between files' characteristics
The key is to choose and combine them according to actual needs.
With the right tools, the retrieval accuracy of RAG can improve substantially.
Answers to a few specific questions
Q:
"How is this implemented in practice? For example, is the directory tree initialized and built in advance, or is it built dynamically while documents are preprocessed? During preprocessing, how do we decide where in the directory tree a file should be placed?"
A:
A knowledge catalog is a folder structure that organizes files (such as documents) or data (such as data tables) from a knowledge perspective, and it is mainly built by users themselves.
For example, Tencent ima lets users create multi-level folders in their personal knowledge base to classify the files they upload, so a knowledge catalog structure gradually takes shape over time.
Q:
"Is the knowledge catalog built at the same time as the metadata? If it is hierarchical, how is the hierarchy expressed? Is the hierarchy used during retrieval? Does the knowledge catalog exist in the same form as tags?"
A:
Tencent ima's knowledge base directory is a good example. The directory is created by users themselves and can be created at any time; the relevant documents are then moved into the corresponding folders.
The hierarchy of the knowledge catalog reflects the subject each file belongs to. It depends on how the user builds the multi-level directory from the perspective of subject ownership.
Knowledge catalogs and tags are indeed different.
Tags are flat, and a document can have several of them. A knowledge catalog is a tree, and a document can sit in only one position in it, with a clear hierarchical relationship. Retrieval does use the catalog hierarchy: for example, the search can be restricted to a specific branch (chosen by the large model or selected manually by the user) to improve relevance.
Q:
"This is very complicated. There are many methods, such as tags, metadata, keywords, and knowledge graphs, yet high accuracy still cannot be guaranteed."
A:
That is true. For this reason, we can think about RAG the way we think about autonomous driving: divide the problems to be solved into levels, and start by answering simple factual questions reliably. Even that level already brings users some efficiency gains (as with Tencent's ima knowledge base). Then, as knowledge governance tools, methodologies, and RAG's own retrieval strategies keep improving, the level of problems RAG can solve rises step by step. Some professional fields place very high demands on RAG, and governance there needs more exploration. The more dimensions governance covers, the more retrieval methods RAG has available. Beyond that, we can consider adding technologies such as Agent and MCP, which I will write about later.