Is your RAG getting less and less accurate? Lessons from Dify and the ima knowledge base on how metadata and tags help large models understand you better

Written by Silas Grey

Updated on: June 18, 2025

Have you ever had this experience? "The knowledge base keeps growing, yet its answers keep getting less reliable, and RAG retrieves a pile of irrelevant content."

In this era of information explosion, we are not short of information. What we lack is the ability to find the "right information".

Metadata and tags may seem ordinary, but they can greatly enhance RAG capabilities.

This article discusses how they can assist the RAG system in truly understanding user intent and accurately finding the required information.

Metadata can be simply understood as "data that describes data".

Imagine you are holding a book in your hand. The title, author, publication date, number of pages - these are all metadata. Although they are not the main content of the book, they can help you quickly understand the basic information of the book and decide whether it is worth reading.

According to Understanding Metadata: What is Metadata, and What is it For? (2017), metadata can be broken down into the following categories:

Metadata type	Definition	Plain-language explanation	Examples	Main application
Descriptive metadata	Describes the content of a resource so that it can be found or understood	Labels that tell you "what this is"	Title, author, subject, genre, publication date	Resource discovery
Administrative metadata	Information needed to manage a resource, or relating to its creation	Instructions for "how to manage it"	Three subtypes, listed in the next three rows	See subtypes
- Technical	Technical information needed to decode and render digital files	Instructions that tell the computer "how to open and display it"	File type, file size, creation date/time, compression scheme	Resource management
- Maintenance (preservation)	Information that supports the long-term management and future migration of digital files	The file's "health record" and "maintenance manual"	Checksums, integrity checks, preservation event records	Resource maintenance
- Rights (permissions)	Intellectual property information attached to the content	The resource's "copyright statement" and "usage agreement"	Copyright status, license terms, rights holders	Permission control
Structural metadata	Describes the relationships between the parts of a resource	The resource's "table of contents" and "assembly instructions"	Document outline, table structure, video subtitle file	Content relationships
Markup language metadata	Integrates metadata with the content by flagging structural or semantic features in it	"Smart tags" and "formatting instructions" embedded in the text	Markers for paragraphs, headings, lists, names, dates	Content analysis

After reading this table, you may be thinking: there are so many types of metadata...

Compared with metadata, tags are probably the more familiar concept.

Whenever we scroll through Douyin or Bilibili, or browse Zhihu, the keywords prefixed with # are tags.

Baidu Encyclopedia defines tags as follows:

Tags are a way of organizing Internet content. They are highly relevant keywords that help people describe and classify content so that it is easy to retrieve and share. Tags transfer the right to organize content from website administrators to users, fully reflecting the bottom-up, participatory character of Web 2.0.

Simply put, tags are "sticky notes" affixed by users themselves to help content be better found and classified.

Tags are essentially a kind of "descriptive metadata". Unlike other metadata, however, tags are freer and more open:

Metadata usually has a strict structure and specifications, while tags do not need to follow a predefined structure.

Metadata is usually added by the system or professionals, while tags can be created freely by ordinary users.

Metadata tends to be objective descriptions, while tags can contain subjective judgments and personal understandings.

It is like official archives versus personal notes: both have value, but in different scenarios.

Dify, a large-model application development platform for developers, added metadata support about two months ago.

In Dify, metadata is divided into two categories:

Built-in metadata (automatically extracted and cannot be deleted or modified):

File name, file type, uploader, upload time, update time, file source, file size, word count, etc.

Custom metadata (user added, editable):

Content summary, document type (contract, report, manual, etc.), applicable industry, applicable region, applicable period, applicable entity, etc.

Dify lets you define metadata fields once at the knowledge base level and then set the corresponding values on each document in that knowledge base, so metadata is managed uniformly within the knowledge base.
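As a rough illustration of "define the fields once, fill in the values per document", here is a minimal Python sketch. It is not Dify's data model or API: the `KnowledgeBase` class, the two type names, and the field names are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical type names; a real platform may support others.
ALLOWED_TYPES = {"string": str, "number": (int, float)}


@dataclass
class KnowledgeBase:
    name: str
    schema: dict[str, str] = field(default_factory=dict)       # field name -> type name
    documents: dict[str, dict] = field(default_factory=dict)   # document name -> metadata values

    def add_field(self, field_name: str, type_name: str) -> None:
        """Declare a custom metadata field once, for the whole knowledge base."""
        if type_name not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {type_name}")
        self.schema[field_name] = type_name

    def set_metadata(self, doc_name: str, **values) -> None:
        """Set values on one document; only declared fields with matching types are accepted."""
        for key, value in values.items():
            if key not in self.schema:
                raise KeyError(f"{key!r} is not defined for knowledge base {self.name!r}")
            if not isinstance(value, ALLOWED_TYPES[self.schema[key]]):
                raise TypeError(f"{key!r} expects a {self.schema[key]}")
        self.documents.setdefault(doc_name, {}).update(values)


kb = KnowledgeBase("contracts")
kb.add_field("document_type", "string")
kb.add_field("applicable_year", "number")
kb.set_metadata("nda.pdf", document_type="contract", applicable_year=2025)
```

Because every document in the knowledge base shares one schema, a filter such as `document_type == "contract"` means the same thing for all of them.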

For example, developers can manually set metadata filters to ensure that users' questions are constrained within the specified knowledge range, thereby improving the security and relevance of the retrieval.

Dify can also let a large model automatically identify metadata that a user's question may contain. You only need to switch from manual mode to automatic mode and select a model. (However, automatic mode does not seem to expose logs of the metadata it actually extracted, so it is hard to tell whether it is working.)

Three more points about Dify's metadata:
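To picture what automatic mode is doing, here is a minimal Python sketch. It is not Dify's implementation: `extract_filters` is a hypothetical stand-in for the large-model call (it simply matches known metadata values against the words of the question), and the field names and documents are invented. The extracted filters are logged, which is one way to check whether the extraction is effective.

```python
import logging
import re

logger = logging.getLogger("metadata_filter")


def extract_filters(question: str, known_values: dict[str, list[str]]) -> dict[str, str]:
    """Stand-in for the model: pick the first known value of each field that appears in the question."""
    words = set(re.findall(r"\w+", question.lower()))
    filters = {}
    for field_name, values in known_values.items():
        for value in values:
            if value.lower() in words:
                filters[field_name] = value
                break
    # Log what was extracted so the behavior can be inspected later.
    logger.info("question=%r extracted_filters=%r", question, filters)
    return filters


def apply_filters(documents: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep only documents whose metadata matches every extracted filter."""
    return [d for d in documents
            if all(d["metadata"].get(k) == v for k, v in filters.items())]


KNOWN_VALUES = {"document_type": ["contract", "report", "manual"], "region": ["EU", "US"]}
DOCS = [
    {"name": "eu_nda.pdf", "metadata": {"document_type": "contract", "region": "EU"}},
    {"name": "us_nda.pdf", "metadata": {"document_type": "contract", "region": "US"}},
    {"name": "eu_sales_report.pdf", "metadata": {"document_type": "report", "region": "EU"}},
]

filters = extract_filters("What notice period does the EU contract require?", KNOWN_VALUES)
candidates = apply_filters(DOCS, filters)  # only eu_nda.pdf remains
```

Manual mode corresponds to skipping `extract_filters` and passing a fixed `filters` dictionary chosen by the developer.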

If multiple knowledge bases are added to the knowledge retrieval node, the metadata selection function becomes unavailable.
Dify's metadata is not exposed to actual users on the application side (for example, on the Q&A page).
Dify also supports optimizing , allowing more precise granularity control of RAG.

Now let's look at ima. ima is an intelligent workbench built around the knowledge base. It pays more attention to the end-user experience and puts tagging directly into the hands of users.

ima is divided into five main sections: Notes, Personal Knowledge Base, Shared Knowledge Base, Knowledge Base Plaza and Home Page. Depending on the usage scenario, ima provides a flexible tag and knowledge base selection mechanism:

In notes or on the home page: you can select multiple knowledge bases with @, but you cannot select tags.

Within a knowledge base: you can select multiple tags with @, but you cannot select other knowledge bases.

This design may seem restrictive, but it reflects a carefully considered user experience:

User intent clearly differs between scenarios (in notes the question is "which libraries to draw from", while in a knowledge base it is "what kind of information to find").
It avoids retrieval failures caused by inconsistent tags across knowledge bases: a document rarely carries tags from several knowledge bases at once, so a cross-library tag filter would often match nothing.
It reduces the cognitive burden of thinking about two dimensions (knowledge base and tags) at the same time.
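The second point can be sketched in a few lines of Python. This is only an illustration under assumptions: the knowledge bases, tags, and documents are invented, and the filter is assumed to require every selected tag (AND semantics), which may not be how ima combines tags.

```python
# Hypothetical knowledge bases, each with its own tag vocabulary.
documents = [
    {"name": "q2_roadmap.md", "kb": "product", "tags": {"roadmap", "2025"}},
    {"name": "hiring_plan.md", "kb": "hr", "tags": {"recruiting", "FY25"}},
]


def search_scope(docs: list[dict], kbs: set[str], tags: set[str]) -> list[str]:
    """Keep documents that sit in a selected knowledge base and carry every selected tag."""
    return [d["name"] for d in docs if d["kb"] in kbs and tags <= d["tags"]]


# Mixing tags that come from two different knowledge bases matches nothing,
# because no single document carries both.
cross_kb = search_scope(documents, {"product", "hr"}, {"roadmap", "recruiting"})

# Staying inside one knowledge base with its own tag works as expected.
single_kb = search_scope(documents, {"product"}, {"roadmap"})
```

Here `cross_kb` is empty while `single_kb` contains `q2_roadmap.md`, which is the failure mode the restriction avoids.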

In addition, ima also supports structural metadata (folders), allowing users to organize files through an intuitive hierarchical structure, providing another retrieval dimension.

Dify allows application developers and administrators to limit the scope of RAG searches through descriptive and administrative metadata, and allows users to tag file segments through markup language metadata for better segmentation.

ima allows end users of the application to limit the retrieval scope through tags and knowledge base selection, and to organize files in the knowledge base through structural metadata (folders).

Dify does not use structural metadata. ima uses it to organize documents in the knowledge base, but it is unclear whether it is also used at the RAG level.

So what if we applied structural metadata, such as a directory tree, to RAG itself?

The hierarchical nature of a directory contrasts sharply with the flatness of tags.

Users do not always set tags when they ask a question. So how can we improve retrieval accuracy without tags?

One idea is to pass the knowledge base's current directory structure to the large model together with the user's question, let the model select the most relevant directory branches, and then run RAG only over the files under those branches.

It is like looking something up in a book: you would not go through it page by page. You would first check the table of contents, find the chapters that might contain what you need, and then focus on those chapters.

The hierarchical relationship of the directory can provide richer semantic information than discrete tags, and is more reliable than letting AI extract tags from scratch.

Metadata can significantly improve RAG performance in four ways:

1. Use descriptive metadata to constrain the scope of RAG searches

Allow users to manually select metadata such as tags and file types to limit searches to a specific range.

For example, a user who wants to know "the work status of Department A this week" can select the two tags "Department A" and "Weekly Report", set the "Submission Date" to this week, and then ask "Help me summarize the work status of Department A this week".
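The Department A example might look like this in Python. It is a minimal sketch under assumptions: the document records, the `submitted` field, and the rule that a week runs from Monday to Sunday are all invented for illustration.

```python
from datetime import date, timedelta


def week_bounds(today: date) -> tuple[date, date]:
    """Return the Monday and Sunday of the week containing `today`."""
    monday = today - timedelta(days=today.weekday())
    return monday, monday + timedelta(days=6)


def in_scope(doc: dict, tags: set[str], start: date, end: date) -> bool:
    """A document qualifies if it carries every selected tag and was submitted in the range."""
    return tags <= doc["tags"] and start <= doc["submitted"] <= end


docs = [
    {"name": "dept_a_week25.md", "tags": {"Department A", "Weekly Report"}, "submitted": date(2025, 6, 17)},
    {"name": "dept_a_week24.md", "tags": {"Department A", "Weekly Report"}, "submitted": date(2025, 6, 10)},
    {"name": "dept_b_week25.md", "tags": {"Department B", "Weekly Report"}, "submitted": date(2025, 6, 17)},
]

start, end = week_bounds(date(2025, 6, 18))
scope = [d["name"] for d in docs if in_scope(d, {"Department A", "Weekly Report"}, start, end)]
# Only the documents in `scope` are handed to vector retrieval and then to the model.
```

The filter runs before similarity search, so last week's report and Department B's report never compete for a place in the context.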

2. Use structural metadata to add a recall path for RAG

The directory structure guides the AI to the most relevant content branches and narrows the search scope.

For example, when a user asks "What is the onboarding process for new employees?", the AI can first identify "HR/Recruitment Process" as the most relevant directory and then search only the documents under it, which should improve accuracy.
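A minimal Python sketch of this directory routing follows. Everything in it is an assumption made for illustration: the directory layout, the prompt wording, and the `choose` callable, which stands in for a real large-model call and is replaced here by a fake that always answers "3".

```python
from typing import Callable

# Hypothetical knowledge base layout: directory path -> files in that directory.
TREE = {
    "HR/Recruitment Process": ["onboarding.md", "interview_guide.md"],
    "HR/Benefits": ["insurance.md"],
    "Finance/Reimbursement": ["travel_expenses.md"],
}


def build_prompt(question: str, tree: dict[str, list[str]]) -> str:
    """Show the model a numbered list of directories next to the question."""
    outline = "\n".join(f"{i}. {path}" for i, path in enumerate(sorted(tree), start=1))
    return (
        "Pick the directories most likely to contain the answer.\n"
        "Reply with their numbers, comma-separated.\n\n"
        f"Directories:\n{outline}\n\nQuestion: {question}"
    )


def route(question: str, tree: dict[str, list[str]], choose: Callable[[str], str]) -> list[str]:
    """Ask the model for directory numbers and return the files under the chosen branches."""
    paths = sorted(tree)
    reply = choose(build_prompt(question, tree))
    files = []
    for token in reply.split(","):
        token = token.strip()
        # Ignore anything that is not a valid directory number.
        if token.isdigit() and 1 <= int(token) <= len(paths):
            path = paths[int(token) - 1]
            files.extend(f"{path}/{name}" for name in tree[path])
    return files


def fake_model(prompt: str) -> str:
    return "3"  # stand-in reply; "HR/Recruitment Process" is third in sorted order


files = route("What is the onboarding process for new employees?", TREE, fake_model)
```

Retrieval then runs only over `files`. Asking for numbers rather than free-form paths keeps the reply easy to validate, so a malformed answer degrades to an empty selection instead of a wrong directory.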

3. Use administrative metadata to implement RAG permission control

Label files with permission levels to ensure that users can only retrieve knowledge within their permission scope.

For example, a company can assign different permission levels to its documents (all employees, management, specific departments, and so on), and the system then filters the search scope automatically based on the user's identity.
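In code, this kind of permission filter could be sketched as follows. The audience labels, the user record, and the rule that managers also see management documents are assumptions for this example, not a description of any particular product.

```python
# Hypothetical permission labels stored as administrative metadata on each document.
docs = [
    {"name": "handbook.md", "audience": "all_employees"},
    {"name": "board_minutes.md", "audience": "management"},
    {"name": "salary_bands.md", "audience": "dept:hr"},
]


def allowed_audiences(user: dict) -> set[str]:
    """Derive the set of audiences a user belongs to from their identity."""
    audiences = {"all_employees", f"dept:{user['department']}"}
    if user.get("is_manager"):
        audiences.add("management")
    return audiences


def searchable(documents: list[dict], user: dict) -> list[str]:
    """Return the documents this user may retrieve from."""
    allowed = allowed_audiences(user)
    return [d["name"] for d in documents if d["audience"] in allowed]


engineer = {"department": "engineering", "is_manager": False}
hr_manager = {"department": "hr", "is_manager": True}

engineer_scope = searchable(docs, engineer)      # handbook only
hr_manager_scope = searchable(docs, hr_manager)  # all three documents
```

Applying the filter before retrieval, rather than after generation, means restricted text never enters the model's context in the first place.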

4. Optimize document segmentation using markup language metadata

Improve document segmentation through special markers, so that RAG can locate the right text segments.

For example, the system can let users mark poorly segmented content in the online preview and then re-segment it.
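One way to picture marker-driven segmentation is the Python sketch below. It rests on assumptions: the `<!-- split -->` marker is a hypothetical token that a user might insert from the preview, and Markdown-style headings are treated as natural segment boundaries.

```python
import re

SPLIT_MARKER = "<!-- split -->"   # hypothetical marker inserted by a user
HEADING = re.compile(r"#{1,6} ")  # a Markdown-style heading at the start of a line


def segment(text: str) -> list[str]:
    """Start a new segment at every heading and at every explicit split marker."""
    segments: list[str] = []
    current: list[str] = []

    def flush() -> None:
        chunk = "\n".join(current).strip()
        if chunk:
            segments.append(chunk)
        current.clear()

    for line in text.splitlines():
        if line.strip() == SPLIT_MARKER:
            flush()          # the marker itself is dropped
            continue
        if HEADING.match(line):
            flush()          # a heading opens a new segment
        current.append(line)
    flush()
    return segments


text = (
    "# Leave policy\n"
    "Annual leave is 15 days.\n"
    "<!-- split -->\n"
    "Sick leave needs a doctor's note.\n"
    "# Expenses\n"
    "Submit receipts monthly."
)
chunks = segment(text)
```

With the marker in place, the two leave rules land in separate segments, so a question about sick leave can retrieve just the relevant one.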

Last words

In this era of information explosion, the problem is no longer obtaining information but finding accurate and sufficient content within massive amounts of it.

Knowledge is not just something to manage with a knowledge base; it also has to be curated with our own understanding.

Stop imagining that a large model plus RAG can handle a knowledge base on its own. We need to bring metadata, tags, and users into the picture.

The next time you run into a RAG problem, look at your metadata and tags first. Are they ready?