Is your knowledge base too messy to find anything? 5 tools to improve RAG retrieval

Written by

Caleb Hayes

Updated on: June 13, 2025

This article first briefly reviews three tools: metadata, tags, and knowledge catalogs. Then, in response to readers' questions, it focuses on the relationship between knowledge catalogs, file directories, and knowledge maps, on when and how to build them, and on how they improve RAG results.

Dense, but very useful.

Tool 1: Metadata (system-wide unified annotation standards)

Metadata is data about data. Administrators mainly use it to describe the objective attributes of files or to set access permissions.

Some metadata types, such as file name or applicable audience, can also be exposed to users so that they can precisely specify the scope of a question.

Metadata needs unified standards. For example, the type for a file's name is always called "file name", the type for its author is always called "author", the type for its publication time is always called "publication time", and so on.
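One possible way to enforce such a standard is to normalize incoming metadata keys against an alias table. The sketch below is only an illustration: the alias table and the `normalize_metadata` function are hypothetical and do not describe any product mentioned in this article.

```python
# Hypothetical alias table: each variant spelling maps to one canonical metadata type.
CANONICAL_TYPES = {
    "file name": "file name",
    "filename": "file name",
    "author": "author",
    "writer": "author",
    "publication time": "publication time",
    "publish date": "publication time",
}


def normalize_metadata(raw: dict) -> dict:
    """Rename known metadata types to their canonical name; reject unknown ones."""
    normalized = {}
    for key, value in raw.items():
        canonical = CANONICAL_TYPES.get(key.strip().lower())
        if canonical is None:
            raise KeyError(f"unknown metadata type: {key!r}")
        normalized[canonical] = value
    return normalized


print(normalize_metadata({"Filename": "annual sales report.docx", "Writer": "Li"}))
```

Rejecting unknown types, rather than storing them as-is, is what keeps the set of metadata types from drifting; free-form annotation is the job of tags.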

Tool 2: Tags (users can tag freely)

Tags are a special kind of metadata: metadata whose type is "tag" and whose value can hold multiple entries.

Both administrators and users can add tags, and there are no restrictions on how. As a result, tags tend to be numerous and messy.

Tool 3: Knowledge Catalog (organizing where knowledge belongs)

A knowledge catalog is a folder structure for knowledge.

For how to create one, see the Tencent IMA Knowledge Base Workbench.

When uploading documents, users can upload directly to the root of their personal knowledge base, or create a folder under the root and upload into it.

Over time, the structure of the knowledge catalog emerges.

Tool 4: File Directory (organizing where files belong)

Sometimes the people who collect documents and the people who organize knowledge are not the same group.

In that case, a separate file library can be set up that supports its own file directories (distinct from the knowledge catalog).

Documents in the knowledge base are then selectively added from the file library.

Why do this?

1. A directory structure built from the "file perspective" is likely to differ from one built from the "knowledge perspective":

File directory: usually a hierarchy based on "management needs" such as department, time, author, and document type.

Knowledge catalog: usually a hierarchy based on "cognitive needs" such as concepts, topics, business areas, and knowledge domains.

For example, take an "Annual Sales Report" document:

In the file directory tree, it may be stored as: /department/sales department/2024/annual sales report.docx

In the knowledge catalog tree, it may be stored as: xx Company Knowledge Base/Business Management/Sales/Annual Report

Both are trees, but the starting point and meaning of their organization differ.

2. Knowledge base files are all added from the file library, so each file's knowledge ingestion status can be tracked.

For example, suppose your boss asks: "Have the 1,000 documents we collected last time been entered into the knowledge base? How many are in, and which ones are not?" If you do not track the ingestion status of each file, this question will leave you stumped.
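As a rough sketch of what tracking ingestion status could look like, each file library record can carry a flag and, once ingested, its position in the knowledge catalog. The `LibraryFile` class, its field names, and the second file (`q4 forecast.xlsx`) are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LibraryFile:
    path: str                  # location in the file directory
    ingested: bool = False     # has it been added to the knowledge base?
    knowledge_path: str = ""   # location in the knowledge catalog, once ingested


def ingestion_report(files: list[LibraryFile]) -> dict:
    """Answer: how many files are in the knowledge base, and which are not?"""
    pending = [f.path for f in files if not f.ingested]
    return {
        "total": len(files),
        "ingested": len(files) - len(pending),
        "pending": pending,
    }


# Hypothetical library contents; the second file is invented for the example.
library = [
    LibraryFile(
        "/department/sales department/2024/annual sales report.docx",
        ingested=True,
        knowledge_path="xx Company Knowledge Base/Business Management/Sales/Annual Report",
    ),
    LibraryFile("/department/sales department/2024/q4 forecast.xlsx"),
]
print(ingestion_report(library))
```

Note that one record holds both paths, which is the point of the previous item: the same file lives in two differently organized trees.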

The file directory is currently used only to manage files and has not been used in RAG. If I come up with other uses, I will write about them.

Tool 5: Knowledge Map (upgrading from tree to network)

Besides the tree-shaped "knowledge catalog", there is also the graph-shaped "knowledge map", also called a "tag system".

A tag system is a network formed by establishing connections between tags.

So what is the difference between a knowledge map and a knowledge catalog?

Knowledge catalog (tree relationship)

It organizes the "belongs-to" relationships between pieces of knowledge and emphasizes parent and child.

For example:

Primary School Mathematics
├─ Numbers and Algebra
│  ├─ Natural Numbers
│  ├─ Fractions
│  └─ Decimals
└─ Shapes and Geometry
   ├─ Lines and Angles
   └─ Area and Perimeter

"Natural Numbers" belongs to "Numbers and Algebra", and "Area and Perimeter" belongs to "Shapes and Geometry".

Each knowledge point has exactly one "parent topic", like a branch on a tree.
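The "exactly one parent" property can be captured with a plain parent map. This is a minimal sketch using the primary school mathematics topics as sample data; the `PARENT` table and `catalog_path` helper are illustrative names, not part of any real system.

```python
# Each node stores exactly one parent; the root's parent is None.
PARENT = {
    "Primary School Mathematics": None,
    "Numbers and Algebra": "Primary School Mathematics",
    "Natural Numbers": "Numbers and Algebra",
    "Fractions": "Numbers and Algebra",
    "Decimals": "Numbers and Algebra",
    "Shapes and Geometry": "Primary School Mathematics",
    "Lines and Angles": "Shapes and Geometry",
    "Area and Perimeter": "Shapes and Geometry",
}


def catalog_path(node: str) -> str:
    """Walk up the single-parent chain and return the path from the root."""
    parts = []
    current = node
    while current is not None:
        parts.append(current)
        current = PARENT[current]
    return "/".join(reversed(parts))


print(catalog_path("Fractions"))
```

Because every node has one parent, every knowledge point has one unambiguous path, which is what makes a catalog easy to use as a search scope.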

Knowledge map (network relationship)

It organizes the "association relationships" between pieces of knowledge. There is no primary or secondary, and not necessarily any parent or child.

For example, the knowledge points "fractions", "decimals", and "area and perimeter" are related as follows:

"Fractions" and "decimals" can be converted into each other

"Area" and "perimeter" can be expressed as decimals or fractions

[Fractions] ---- [Decimals]
      \          /
   [Area and Perimeter]
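In code, such a network is naturally an undirected graph. The following is a small sketch with adjacency sets; the `relate` helper and the variable names are assumptions for illustration only.

```python
from collections import defaultdict

# Undirected tag graph: an edge means "related to", with no parent or child.
graph = defaultdict(set)


def relate(a: str, b: str) -> None:
    """Record a symmetric association between two tags."""
    graph[a].add(b)
    graph[b].add(a)


relate("Fractions", "Decimals")            # can be converted into each other
relate("Area and Perimeter", "Fractions")  # values may be written as fractions
relate("Area and Perimeter", "Decimals")   # or as decimals

print(sorted(graph["Fractions"]))
```

Unlike the parent map of a tree, a node here can have any number of neighbors, and no neighbor is "above" another.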

In summary:

Knowledge catalog (tree): who is the parent, who is the child, who belongs to whom; a clear hierarchy that forms a tree

Knowledge map (network): who is related to whom, who works with whom, who affects whom; together they form a network

Those are the concepts of metadata, tags, file directories, knowledge catalogs, and knowledge maps.

Next, let's look at how knowledge catalogs and knowledge maps improve RAG results.

Knowledge Catalog: Feed It All to the Large Model

Because a catalog is a summarizing structure, it is not very large. The entire catalog structure can be fed to the large model, which identifies the most suitable nodes; RAG then searches only the files mounted under those node paths.

Knowledge catalogs (folders) are created manually by users, at any time; files just need to be moved into them.

Users can also select a catalog node themselves, so that the question-and-answer session covers only the documents under it.
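A minimal sketch of this catalog-scoping idea follows. Everything in it is an assumption for illustration: the `CATALOG` contents, the prompt wording, and especially `ask_model`, which stands in for whatever large model call you use and is passed in as a plain function.

```python
from typing import Callable

# Hypothetical catalog: each node path maps to the documents mounted on it.
CATALOG = {
    "Business Management/Sales/Annual Report": ["annual sales report.docx"],
    "Business Management/Sales/Contracts": ["contract template.docx"],
    "Business Management/HR/Policies": ["leave policy.pdf"],
}


def build_prompt(question: str) -> str:
    """Put the whole catalog in the prompt; it is small enough to fit."""
    nodes = "\n".join(f"- {path}" for path in sorted(CATALOG))
    return (
        "Pick the catalog nodes most relevant to the question.\n"
        f"Catalog:\n{nodes}\n"
        f"Question: {question}\n"
        "Answer with one node path per line."
    )


def scoped_documents(question: str, ask_model: Callable[[str], str]) -> list[str]:
    """Ask the model for nodes, then return only documents under those nodes."""
    reply = ask_model(build_prompt(question))
    chosen = [line.strip().lstrip("-").strip() for line in reply.splitlines() if line.strip()]
    docs = []
    for path, files in CATALOG.items():
        # A document is in scope if its node is a chosen node or lies beneath one.
        if any(path == c or path.startswith(c + "/") for c in chosen):
            docs.extend(files)
    return docs


# Stub model for demonstration; a real system would call a large model here.
print(scoped_documents("How did sales do last year?", lambda prompt: "Business Management/Sales"))
```

Manual selection by the user fits the same shape: skip the model call and use the user's chosen node path directly as the scope.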

Knowledge Map: Suited to Graph Retrieval

The knowledge map is a tag system for knowledge.

Tags have no standard constraints and can be applied however you like, so there may be a great many of them, and the relationships between tags are organized in a graph data structure.

As a result, the whole tag network can be very large. Unlike a knowledge catalog, it is not suited to being fed to a large model in full; it is better suited to retrieval through a knowledge graph.

Specifically, there are two ways to form a tag system:

1. Users who upload files freely attach one or more keywords (tags) to them.

The system backend then builds relationships between the tags on the same file (using a large model). All the tags on a file form a subgraph, which is stored in the graph knowledge base.

2. Dedicated knowledge managers preset a knowledge graph schema layer in the system backend.

Then, when a file is uploaded, the relevant entities and relationships in its content are automatically extracted according to the schema layer, forming a subgraph that is stored in the graph knowledge base.

A general approach to retrieval based on the graph tag system (with the retrieval logic of other stages simplified), for example:

1. Identify one or more tag words in the user's question.
2. Use semantic matching to find the best-matching tag words that already exist in the system.
3. Use the graph to retrieve the tags associated with these system tags up to n hops away (n is configurable; the larger n is, the slower the query and the more noise).
4. Use the matched system tags and their associated tags within n hops as the candidate tag set.
5. Within the documents annotated with candidate tags, retrieve the knowledge content most relevant to the user's question.
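The steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the tag graph and the tag-to-document index are made-up sample data (borrowing tag names from the medical example in this article), a plain substring match stands in for the semantic matching of the first two steps, and the final relevance ranking inside the scoped documents is left out.

```python
from collections import deque

# Hypothetical tag graph and tag-to-document index.
TAG_GRAPH = {
    "hypertension": {"dizziness", "aspirin"},
    "dizziness": {"hypertension"},
    "aspirin": {"hypertension", "heart disease"},
    "heart disease": {"aspirin"},
}
DOCS_BY_TAG = {
    "hypertension": {"case_001"},
    "dizziness": {"case_001", "case_002"},
    "heart disease": {"case_003"},
}


def match_system_tags(question: str) -> set[str]:
    """Simplified tag matching: substring match stands in for semantic matching."""
    text = question.lower()
    return {tag for tag in TAG_GRAPH if tag in text}


def expand_tags(seeds: set[str], n: int) -> set[str]:
    """Breadth-first search: collect every tag within n hops of the seed tags."""
    seen = set(seeds)
    queue = deque((tag, 0) for tag in seeds)
    while queue:
        tag, depth = queue.popleft()
        if depth == n:
            continue  # do not expand beyond n hops
        for neighbor in TAG_GRAPH.get(tag, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


def candidate_documents(question: str, n: int = 1) -> set[str]:
    """Return the documents annotated with any candidate tag."""
    docs = set()
    for tag in expand_tags(match_system_tags(question), n):
        docs |= DOCS_BY_TAG.get(tag, set())
    return docs


print(sorted(candidate_documents("What helps with hypertension?", n=1)))
```

Raising n from 1 to 2 in this sample pulls in "heart disease" and its document, which shows both sides of the trade-off: more unexpected connections, and more potential noise.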

Compared with using a knowledge catalog to define the search scope, using a tag system takes the connections between pieces of knowledge into account and may retrieve knowledge that is unexpected but reasonable.

For example:

In a medical institution's knowledge management system, faced with massive amounts of medical literature and medical records, it is hard to manually and accurately attach every appropriate tag to each document or knowledge fragment.

So the system first builds the schema layer of a medical knowledge graph. This layer explicitly defines core entity types such as "disease", "symptom", "drug", and "examination item", along with typical relationships between them (such as "treatment", "complications", and "may manifest as").

Under these schema constraints, entity extraction automatically identifies entities such as "hypertension", "headache", "aspirin", and "routine blood test" in the text. These entities are then connected according to the relationships in the schema layer, forming a tag network that matches the business.

For example, a case description is automatically extracted and tagged with "hypertension", "heart disease", "aspirin (drug)", "dizziness (symptom)", and so on, and a "may manifest as" relationship may be established between "hypertension" and "dizziness". This makes search and association easier, and it also makes the tag system a real fit for the medical business.
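To make the role of the schema layer concrete, here is a small sketch in which the schema is a set of allowed (head type, relation, tail type) combinations and extracted triples are kept only if they fit. The direction of each relation, the data layout, and the `validate_triples` function are assumptions for illustration; the extractor itself is not shown, and its output is hard-coded.

```python
# Hypothetical schema layer: entity types plus the relations allowed between them.
ENTITY_TYPES = {"disease", "symptom", "drug", "examination item"}
ALLOWED_RELATIONS = {
    ("drug", "treatment", "disease"),
    ("disease", "complications", "disease"),
    ("disease", "may manifest as", "symptom"),
}


def validate_triples(entities: dict, triples: list) -> list:
    """Keep only triples whose entity types and relation fit the schema."""
    valid = []
    for head, relation, tail in triples:
        head_type = entities.get(head)
        tail_type = entities.get(tail)
        if head_type not in ENTITY_TYPES or tail_type not in ENTITY_TYPES:
            continue  # entity missing or of a type the schema does not define
        if (head_type, relation, tail_type) in ALLOWED_RELATIONS:
            valid.append((head, relation, tail))
    return valid


# Pretend output of an entity and relation extractor for one case description.
entities = {
    "hypertension": "disease",
    "dizziness": "symptom",
    "aspirin": "drug",
}
triples = [
    ("hypertension", "may manifest as", "dizziness"),
    ("aspirin", "may manifest as", "hypertension"),  # violates the schema
]
print(validate_triples(entities, triples))
```

The valid triples are the subgraph that would be stored in the graph knowledge base for this document.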

At this point you may wonder: since the tag system can establish relationships between tags, what is the catalog for?

There are a few considerations:

1. The knowledge catalog emphasizes where knowledge belongs, so it is not very large. It can be fed to a large model in full to identify the catalog nodes most relevant to the user's question. This is much more accurate than the multi-step processing of tag system retrieval.

2. A knowledge catalog is just folders, which users already know how to operate.

3. A tag system must consider both the tags and the complex relationships between them, which raises the cognitive load on users. It is not suited to being built entirely by hand, so it is implemented with entity extraction plus a knowledge graph, and it usually requires a schema layer.

4. Under the constraints of the schema layer, the extracted knowledge tags fit the needs of the business domain better. Even so, building a tag system involves less manual intervention than building a knowledge catalog, so its quality will be lower than that of the knowledge catalog.

To summarize:

Metadata gives files attributes
Tags give files characteristics
The file directory organizes the management hierarchy of files
The knowledge catalog organizes the subject hierarchy of files
The knowledge map organizes the associations between file characteristics

The key is to choose and combine them according to actual needs.

Used well, these tools can bring a qualitative improvement in RAG retrieval accuracy.

Answers to a few specific questions

Question:

"How exactly is this implemented? For example, is the directory tree initialized in advance, or built dynamically while preprocessing documents? During preprocessing, how do we decide which directory tree a file should go into?"

Answer:

A knowledge catalog is a set of folders that organizes files (such as documents) or data (such as data tables) from a knowledge perspective, and it is mainly built by users themselves.

For example, Tencent ima lets users create multi-level folders in their personal knowledge base to classify uploaded files, gradually forming a knowledge catalog structure over time.

Question:

"Is the knowledge catalog built at the same time as the metadata? If it is hierarchical, how is the hierarchy represented? Is the hierarchy used during retrieval? Does the knowledge catalog exist in the same form as tags?"

Answer:

Tencent ima's knowledge base directory is a good example. The user creates the directory, at any time, and then moves the relevant documents into the corresponding folders.

The hierarchy of the knowledge catalog reflects the subject each file belongs to. It depends on how the user builds the multi-level directory from the perspective of subject attribution.

Knowledge catalogs and tags are indeed different.

Tags are flat, and a document can have multiple tags. The knowledge catalog is a tree, a document can sit in only one position in it, and the hierarchy is explicit. Retrieval does use the catalog's hierarchy: for example, the search can be limited to a specific branch (identified by the large model or selected manually by the user) to improve relevance.

Question:

"This is very complicated. There are many methods, such as tags, metadata, keywords, and knowledge graphs, yet there is still no guarantee of high accuracy."

Answer:

That is true. For that reason, we can think about RAG the way we think about driverless cars: divide the problems to be solved into levels, and first make sure simple factual questions are answered effectively. Even that level already improves users' work (Tencent's ima knowledge base is an example). Then, by continuously improving knowledge governance tools, methodologies, and RAG's own retrieval strategies, we can gradually raise the level of problems RAG can solve. Some professional fields place very high demands on RAG and need more exploration in governance. The more dimensions governance covers, the more retrieval methods RAG has available. Beyond that, we can consider adding technologies such as Agent and MCP, which I will write about later.