Is your RAG becoming less and less accurate? What Dify and the ima knowledge base show about how metadata and tags help large models understand you better

Written by
Silas Grey
Updated on: June 18, 2025
Recommendation

In the era of information overload, how can large AI models understand user needs more accurately? This article explores the key role of metadata and tags in improving the accuracy of RAG systems.

Core content:
1. The importance of metadata and tags in information retrieval
2. Metadata classification and its application in resource discovery and management
3. The characteristics of tags and their role in content organization and retrieval

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Have you ever had this experience: the knowledge base holds more and more documents, yet its answers become less and less reliable, and RAG retrieves a pile of irrelevant content?

In this era of information explosion, we do not lack information. What we lack is the ability to find the "right information".

Metadata and tags may seem ordinary, but they can greatly enhance RAG capabilities.

This article discusses how they can assist the RAG system in truly understanding user intent and accurately finding the required information.

Metadata can be simply understood as "data that describes data".

Imagine you are holding a book in your hand. The title, author, publication date, number of pages - these are all metadata. Although they are not the main content of the book, they can help you quickly understand the basic information of the book and decide whether it is worth reading.

According to Understanding Metadata: What is Metadata, and What is it For? (2017), metadata can be broken down into the following categories:

| Metadata type | Definition | Plain-language explanation | Examples | Main application |
| --- | --- | --- | --- | --- |
| Descriptive metadata | Describes the content of a resource and helps find or understand it | A label that tells you "what this is" | Title, author, subject, genre, publication date | Resource discovery |
| Administrative metadata | Information needed to manage a resource or related to its creation | Instructions on "how to manage it" | Three sub-types: technical, preservation, rights | |
| Administrative: technical | Technical information needed to decode and render digital files | Instructions that tell the computer "how to open and display it" | File type, file size, creation date/time, compression scheme | Resource management |
| Administrative: preservation | Information that supports the long-term management and future migration of digital files | The file's "health record" and "maintenance manual" | Checksums, integrity checks, preservation event records | Resource maintenance |
| Administrative: rights | Intellectual property information attached to the content | The resource's "copyright statement" and "usage agreement" | Copyright status, license terms, rights holders | Permission control |
| Structural metadata | Describes the relationships between the parts of a resource | The "table of contents" and "assembly instructions" | Document table of contents, table structure, video subtitle files | Content relevance |
| Markup language metadata | Mixes metadata into the content and marks up its structural or semantic features | "Smart tags" and "formatting instructions" inside the text | Paragraph, heading, list, name, and date markers | Content analysis |

After reading this table, you may wonder, there are so many types of metadata...

Compared with metadata, we may be relatively familiar with the concept of tags.

Whenever we watch Douyin or Bilibili, or browse Zhihu, the keywords prefixed with # are tags.

Baidu Encyclopedia defines tags as follows:

Tags are a way of organizing Internet content. They are highly relevant keywords that help people easily describe and classify content for easy retrieval and sharing. Tags transfer the right to organize content from website administrators to users, fully reflecting the bottom-up and user-participated characteristics of Web 2.0.

Simply put, tags are "sticky notes" affixed by users themselves to help content be better found and classified.

Tags are essentially a type of "descriptive metadata". Unlike other metadata, however, tags are freer and more open:

  • Metadata usually has a strict structure and specifications, while tags do not need to follow a predefined structure.
  • Metadata is usually added by the system or professionals, while tags can be created freely by ordinary users.
  • Metadata tends to be objective descriptions, while tags can contain subjective judgments and personal understandings.

It is like official archives versus personal notes: both have value, but in different application scenarios.

Dify, a large model application development platform for developers, added metadata support about two months ago.

In Dify, metadata is divided into two categories:

Built-in metadata (automatically extracted and cannot be deleted or modified):

  • File name, file type, uploader, upload time, update time, file source, file size, word count, etc.

Custom metadata (user added, editable):

  • Content summary, document type (contract, report, manual, etc.), applicable industry, applicable region, applicable period, applicable entity, etc.

Dify allows you to configure metadata fields uniformly at the knowledge base level, and then set the corresponding metadata values on every document in that knowledge base. This design keeps metadata consistently managed within the knowledge base.

For example, developers can manually set metadata filters to ensure that users' questions are constrained within the specified knowledge range, thereby improving the security and relevance of the retrieval.
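
A manual metadata filter can be sketched roughly as follows. This is a minimal, generic illustration and not Dify's actual API: the `Chunk` class, the field names `doc_type` and `region`, and the similarity scores are all assumptions made up for the example. The idea is simply to drop everything outside the allowed metadata scope before ranking by relevance.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    score: float  # similarity score from the retriever (assumed to be given)
    metadata: dict = field(default_factory=dict)


def matches(metadata: dict, conditions: dict) -> bool:
    """True when every condition field equals the required value."""
    return all(metadata.get(key) == value for key, value in conditions.items())


def filtered_top_k(chunks, conditions, k=3):
    """Drop chunks outside the metadata scope, then rank the rest by score."""
    in_scope = [c for c in chunks if matches(c.metadata, conditions)]
    return sorted(in_scope, key=lambda c: c.score, reverse=True)[:k]


# Hypothetical chunks with hypothetical custom metadata.
chunks = [
    Chunk("Leave policy for 2025", 0.71, {"doc_type": "manual", "region": "CN"}),
    Chunk("Supplier contract template", 0.93, {"doc_type": "contract", "region": "CN"}),
    Chunk("Leave policy for EU staff", 0.88, {"doc_type": "manual", "region": "EU"}),
]

results = filtered_top_k(chunks, {"doc_type": "manual", "region": "CN"})
print([c.text for c in results])
```

Note that the two higher-scoring chunks are excluded even though they are more similar to the query, which is exactly what a scope constraint is meant to do.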

In addition, Dify can let a large model automatically identify metadata values that a user's question may contain. You only need to switch the filter from manual mode to automatic mode and then select a model. (However, automatic mode does not seem to expose logs of the metadata actually extracted, so it is hard to tell whether it is working.)

Three more points about Dify's metadata:

  1. If multiple knowledge bases are added to the knowledge retrieval node, the metadata filter is unavailable.
  2. Dify's metadata is not exposed to end users on the application side (for example, on the Q&A page).
  3. Dify also supports optimizing document segmentation, allowing finer-grained control over RAG.

Let’s look at ima. ima is an intelligent workbench with knowledge base as its core. It pays more attention to the end-user experience and puts the labeling capabilities directly into the hands of users.

ima is divided into five main sections: Notes, Personal Knowledge Base, Shared Knowledge Base, Knowledge Base Plaza and Home Page. Depending on the usage scenario, ima provides a flexible tag and knowledge base selection mechanism:

  • In notes/home page: you can select multiple knowledge bases using @, but you cannot select tags
  • Within a knowledge base: you can select multiple tags using @, but you cannot select other knowledge bases

This design may seem restrictive, but it is actually a deliberate user experience decision:

  1. User intent clearly differs between scenarios (in notes the question is "which knowledge bases to draw from"; inside a knowledge base it is "what kind of information to find")
  2. It avoids retrieval failures caused by inconsistent tags across knowledge bases: a document rarely carries tags from several knowledge bases at once, so combining such tags would often return nothing
  3. It spares users the cognitive burden of thinking about two dimensions (knowledge base and tags) at the same time
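
The second point can be illustrated with a small sketch. It assumes that selecting several tags means "documents that carry all of them"; the article does not say how ima actually combines tags, and the knowledge bases, file names, and tags below are invented for the example.

```python
# Hypothetical data: each knowledge base maintains its own tag vocabulary.
knowledge_bases = {
    "hr_kb": {
        "onboarding.pdf": {"HR", "Process"},
        "leave_policy.pdf": {"HR", "Policy"},
    },
    "sales_kb": {
        "q2_report.pdf": {"Sales", "Weekly Report"},
    },
}


def search_by_tags(kb_names, required_tags):
    """Return documents in the chosen knowledge bases that carry ALL required tags."""
    hits = []
    for name in kb_names:
        for doc, tags in knowledge_bases[name].items():
            if required_tags <= tags:  # subset test: every required tag is present
                hits.append(doc)
    return sorted(hits)


# Tags from one knowledge base's vocabulary: the filter finds a document.
within_one_kb = search_by_tags(["hr_kb"], {"HR", "Policy"})

# Tags mixed from two vocabularies: no single document carries both.
across_kbs = search_by_tags(["hr_kb", "sales_kb"], {"Policy", "Weekly Report"})

print(within_one_kb, across_kbs)
```

Under that assumption, mixing tags from different knowledge bases shrinks the candidate set to nothing, which is the failure mode the restriction sidesteps.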

In addition, ima also supports structural metadata (folders), allowing users to organize files through an intuitive hierarchical structure, providing another retrieval dimension.

Dify allows application developers and administrators to limit the scope of RAG searches through descriptive and administrative metadata, and allows users to tag file segments through markup language metadata for better segmentation.

ima allows end users of the application to limit the scope of RAG searches through tags (descriptive metadata) and to organize files in the knowledge base through structural metadata (folders).

Dify does not use structural metadata. ima uses it to organize documents in the knowledge base, but it is not clear whether it is also used at the RAG level.

So what if we use structural metadata on top of RAG? For example, a directory.

The hierarchical nature of a directory contrasts sharply with the flatness of tags.

Consider that users do not always set tags when they ask questions. So how can we improve retrieval accuracy without tags?

One idea is to pass the knowledge base's current directory structure to the large model together with the user's question, let the model select the most relevant directory branches, and then run RAG only over the files under those branches.

Just as you would not search an entire book page by page, you would first look at the table of contents, find the chapters that might contain what you need, and then focus on those chapters.

The hierarchical relationship of the directory can provide richer semantic information than discrete tags, and is more reliable than letting AI extract tags from scratch.
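
The directory-routing idea can be sketched as below. This is only an illustration under stated assumptions: the file paths are invented, `build_prompt` shows one possible prompt layout, and `choose_branches` is a keyword-overlap stand-in for the model call, not a real LLM. In practice the model's answer would replace that function.

```python
from pathlib import PurePosixPath

# Hypothetical knowledge base layout.
files = [
    "HR/Recruitment Process/onboarding_checklist.md",
    "HR/Recruitment Process/interview_guide.md",
    "HR/Benefits/leave_policy.md",
    "Finance/Reimbursement/travel_expenses.md",
]


def directory_branches(paths):
    """Collect every directory that appears in the file paths, sorted."""
    branches = set()
    for path in paths:
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                branches.add(str(parent))
    return sorted(branches)


def build_prompt(question, branches):
    """Prompt text that would be sent to the model along with the question."""
    listing = "\n".join(f"- {b}" for b in branches)
    return f"Question: {question}\nDirectories:\n{listing}\nReturn the most relevant directories."


def choose_branches(question, branches):
    """Stand-in for the model call: picks branches by keyword overlap, NOT a real LLM."""
    words = set(question.lower().replace("?", "").split())
    scored = [(len(words & set(b.lower().replace("/", " ").split())), b) for b in branches]
    best = max((score for score, _ in scored), default=0)
    return [b for score, b in scored if score == best and score > 0]


def files_under(paths, branches):
    """Keep only files located inside one of the chosen branches."""
    chosen = [PurePosixPath(b) for b in branches]
    return [p for p in paths if any(c in PurePosixPath(p).parents for c in chosen)]


question = "What is the onboarding process for new employees?"
branches = directory_branches(files)
selected = choose_branches(question, branches)
candidates = files_under(files, selected)
print(selected, candidates)
```

Only the files in `candidates` would then be passed to the normal vector retrieval step, so the directory acts as a coarse first-stage filter.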

Metadata can significantly improve RAG performance in four ways:

1. Use descriptive metadata to constrain the scope of RAG searches

Allow users to manually select metadata such as tags and file types to limit searches to a specific range.

For example, a user who wants to know "the work status of Department A this week" can select the two tags "Department A" and "Weekly Report", set the "Submission Date" to this week, and then ask "Help me summarize the work status of Department A this week".
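
As a rough sketch of that example, the scope can be computed from the selected tags plus a date range for "this week". The document records, tag names, and the `submitted` field are hypothetical, and the week is assumed to run from Monday to Sunday.

```python
from datetime import date, timedelta


def week_range(today: date):
    """Monday through Sunday of the week containing `today`."""
    monday = today - timedelta(days=today.weekday())
    return monday, monday + timedelta(days=6)


# Hypothetical documents with tags and a submission date.
documents = [
    {"name": "dept_a_weekly_0616.docx", "tags": {"Department A", "Weekly Report"}, "submitted": date(2025, 6, 16)},
    {"name": "dept_a_weekly_0609.docx", "tags": {"Department A", "Weekly Report"}, "submitted": date(2025, 6, 9)},
    {"name": "dept_b_weekly_0617.docx", "tags": {"Department B", "Weekly Report"}, "submitted": date(2025, 6, 17)},
    {"name": "dept_a_budget.xlsx", "tags": {"Department A", "Budget"}, "submitted": date(2025, 6, 17)},
]


def in_scope(doc, required_tags, start, end):
    """A document qualifies if it has all selected tags and falls in the date range."""
    return required_tags <= doc["tags"] and start <= doc["submitted"] <= end


start, end = week_range(date(2025, 6, 18))
scope = [d["name"] for d in documents if in_scope(d, {"Department A", "Weekly Report"}, start, end)]
print(start, end, scope)
```

RAG then searches only inside `scope`, so last week's report and other departments' reports cannot leak into the summary.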

2. Use structural metadata to add a recall path for RAG

The directory structure guides AI to identify the most relevant content branches and optimize the search scope.

For example, when a user asks "What is the onboarding process for new employees?", the AI can first identify "HR/Recruitment Process" as the most relevant directory and then search only the documents under it, which narrows the candidates and can improve accuracy.

3. Use administrative metadata to implement RAG permission control

Label files with permission levels to ensure that users can only retrieve knowledge within their permission scope.

For example, different permission levels are set for different documents within the company (all employees, management, specific departments, etc.), and the system will automatically filter the search scope based on the user's identity.
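
A minimal sketch of identity-based filtering, assuming each document carries an `allowed_groups` set and each user belongs to one or more groups. Both names, and the group labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset  # e.g. {"all_employees", "management"}


# Hypothetical documents labeled with the groups allowed to retrieve them.
documents = [
    {"name": "employee_handbook.pdf", "allowed_groups": {"all_employees"}},
    {"name": "board_minutes.pdf", "allowed_groups": {"management"}},
    {"name": "hr_salary_bands.xlsx", "allowed_groups": {"hr_department", "management"}},
]


def visible_documents(user, docs):
    """Keep documents whose allowed groups overlap the user's groups."""
    return [d["name"] for d in docs if d["allowed_groups"] & user.groups]


staff = User("li", frozenset({"all_employees"}))
manager = User("wang", frozenset({"all_employees", "management"}))

print(visible_documents(staff, documents))
print(visible_documents(manager, documents))
```

The important design point is that the filter runs before retrieval, so restricted content never reaches the model's context in the first place.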

4. Optimize document segmentation using markup language metadata

Improve document segmentation through special markers, so that RAG can locate the relevant text segments more precisely.

For example, the system can allow users to mark poorly segmented content in the online preview, and the system can then re-segment it.
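
One way to picture marker-driven re-segmentation is shown below. The marker string `<<<SPLIT>>>` is a made-up convention, not a feature of any particular product; the sketch only contrasts fixed-size chunking with splitting at user-placed markers.

```python
SPLIT_MARKER = "<<<SPLIT>>>"  # hypothetical marker a user inserts in the preview


def fixed_size_segments(text, size):
    """Naive chunking: cut every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def marker_segments(text, marker=SPLIT_MARKER):
    """Re-segment at the user's markers, dropping empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]


plain = "Refund policy: refunds within 7 days. Shipping policy: ships in 2 days."
marked = "Refund policy: refunds within 7 days." + SPLIT_MARKER + "Shipping policy: ships in 2 days."

naive = fixed_size_segments(plain, 30)
resegmented = marker_segments(marked)

print(naive)
print(resegmented)
```

The fixed-size version cuts sentences in the middle, while the marked version keeps each policy in its own segment, so a question about refunds retrieves one complete, self-contained chunk.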

Last words

In this era of information explosion, what we face is no longer the problem of obtaining information, but how to find accurate and sufficient content from massive amounts of information.

Knowledge needs more than a knowledge base to store it; it also needs our own understanding to curate it.

Stop assuming that a large model plus RAG can handle a knowledge base on its own. We need to bring metadata, tags, and users into the picture.

The next time you run into a RAG problem, check the metadata and tags first: are they ready?